Navigating the complexity of DNA methylation data poses substantial
challenges due to its intricate biological patterns. The
regionalpcs
package is conceived to address the substantial
need for enhanced summarization and interpretation at a regional level.
Traditional methodologies, while foundational, may not fully encapsulate
the biological nuances of methylation patterns, thereby potentially
yielding interpretations that may be suboptimal or veer towards
inaccuracies. This package introduces and utilizes regional principal
components (rPCs), designed to adeptly capture biologically relevant
signals embedded within DNA methylation data. Through the implementation
of rPCs, researchers can gain new insights into complex interactions and
effects in methylation data that might otherwise be missed.
The regionalpcs
package can be easily installed from
Bioconductor, providing you with the latest stable version suitable for
general use. Alternatively, for those interested in exploring or
utilizing the latest features and developments, the GitHub version can
be installed directly.
Install regionalpcs
from Bioconductor using the command
below. Ensure that your R version is compatible with the package version
available on Bioconductor for smooth installation and functionality.
To access the development version of regionalpcs
directly from GitHub, which might include new features or enhancements
not yet available in the Bioconductor version, use the following
commands. Note that the development version might be less stable than
the officially released version.
regionalpcs
R Package TutorialThis tutorial depends on several Bioconductor packages. These packages should be loaded at the beginning of the analysis.
library(regionalpcs)
library(GenomicRanges)
library(tidyr)
library(tibble)
library(dplyr)
library(minfiData)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
Let’s proceed to load the packages, briefly understanding their roles in this tutorial.
regionalpcs
: Primary package for summarizing and
interpreting DNA methylation data at a regional level.
GenomicRanges
: Facilitates representation and
manipulation of genomic intervals and variables defined along a
genome.
tidyr
, tibble
, dplyr
:
Assist in data tidying, representation, and manipulation.
minfiData
: Provides example Illumina 450k data,
aiding in the demonstration of regionalpcs
functionalities.
TxDb.Hsapiens.UCSC.hg19.knownGene
: Accommodates
transcriptome data, useful for annotating results.
Once the packages are loaded, we can start using the functions provided by each package.
In this tutorial, we’ll utilize a sample dataset,
MsetEx.sub
, which is a subset derived from Illumina’s Human
Methylation 450k dataset, specifically preprocessed to contain 600 CpGs
across 6 samples. The dataset is stored in a MethylSet
object, which is commonly used to represent methylation data.
The methylation beta values, denoting the proportion of methylated cytosines at a particular CpG site, will be extracted from this dataset for our subsequent analyses.
# Load the MethylSet data
data(MsetEx.sub)
# Display the first few rows of the dataset for a preliminary view
head(MsetEx.sub)
#> class: MethylSet
#> dim: 6 6
#> metadata(0):
#> assays(2): Meth Unmeth
#> rownames(6): cg00050873 cg00212031 ... cg00455876 cg01707559
#> rowData names(0):
#> colnames(6): 5723646052_R02C02 5723646052_R04C01 ... 5723646053_R05C02
#> 5723646053_R06C02
#> colData names(13): Sample_Name Sample_Well ... Basename filenames
#> Annotation
#> array: IlluminaHumanMethylation450k
#> annotation: ilmn12.hg19
#> Preprocessing
#> Method: Raw (no normalization or bg correction)
#> minfi version: 1.21.2
#> Manifest version: 0.4.0
# Extract methylation M-values from the MethylSet
# M-values are logit-transformed beta-values and are often used in differential
# methylation analysis for improved statistical performance.
mvals <- getM(MsetEx.sub)
# Display the extracted methylation beta values
head(mvals)
#> 5723646052_R02C02 5723646052_R04C01 5723646052_R05C02
#> cg00050873 3.502348 0.4414491 4.340695
#> cg00212031 -3.273751 0.9234662 -2.614777
#> cg00213748 2.076816 -0.1309465 1.260995
#> cg00214611 -3.438838 1.7463950 -2.270551
#> cg00455876 1.839010 -0.9927320 1.619479
#> cg01707559 -3.303987 -0.6433201 -3.540887
#> 5723646053_R04C02 5723646053_R05C02 5723646053_R06C02
#> cg00050873 0.24458355 -0.3219281 0.2744392
#> cg00212031 -0.21052257 -0.6861413 -0.1397595
#> cg00213748 -1.10373279 -1.6616553 -0.1270869
#> cg00214611 0.29029649 -0.2103599 -0.6138630
#> cg00455876 -0.09504721 -0.2854655 0.6361273
#> cg01707559 -0.74835377 -0.4678048 -1.1345421
Note that MsetEx.sub
provides a manageable slice of data
that we can utilize to illustrate the capabilities of
regionalpcs
without requiring extensive computational
resources.
Now that we have our dataset loaded and methylation values extracted,
let’s proceed with demonstrating the core functionalities of
regionalpcs
.
In this step, we’ll obtain the genomic coordinates of the CpG sites
in our methylation dataset using the 450k array probe annotations using
the minfi
package.
# Map the methylation data to genomic coordinates using the mapToGenome
# function. This creates a GenomicMethylSet (gset) which includes genomic
# position information for each probe.
gset <- mapToGenome(MsetEx.sub)
# Display the GenomicMethylSet object to observe the structure and initial
# entries.
gset
#> class: GenomicMethylSet
#> dim: 600 6
#> metadata(0):
#> assays(2): Meth Unmeth
#> rownames(600): cg01003813 cg03723195 ... cg03930849 cg08265308
#> rowData names(0):
#> colnames(6): 5723646052_R02C02 5723646052_R04C01 ... 5723646053_R05C02
#> 5723646053_R06C02
#> colData names(13): Sample_Name Sample_Well ... Basename filenames
#> Annotation
#> array: IlluminaHumanMethylation450k
#> annotation: ilmn12.hg19
#> Preprocessing
#> Method: Raw (no normalization or bg correction)
#> minfi version: 1.21.2
#> Manifest version: 0.4.0
# Convert the genomic position data into a GRanges object, enabling genomic
# range operations in subsequent analyses.
# The GRanges object (cpg_gr) provides a versatile structure for handling
# genomic coordinates in R/Bioconductor.
cpg_gr <- granges(gset)
# Display the GRanges object for a preliminary view of the genomic coordinates.
cpg_gr
#> GRanges object with 600 ranges and 0 metadata columns:
#> seqnames ranges strand
#> <Rle> <IRanges> <Rle>
#> cg01003813 chrX 2715058 *
#> cg03723195 chrX 2847354 *
#> cg00074638 chrX 2883307 *
#> cg01728536 chrX 7070451 *
#> cg01864404 chrX 8434367 *
#> ... ... ... ...
#> cg26983430 chrY 24549675 *
#> cg01757887 chrY 25114810 *
#> cg00061679 chrY 25314171 *
#> cg03930849 chrY 26716289 *
#> cg08265308 chrY 28555550 *
#> -------
#> seqinfo: 2 sequences from hg19 genome; no seqlengths
Gene regions, which include functional segments such as promoters,
gene bodies, and intergenic regions, play pivotal roles in gene
expression and regulation. Summarizing methylation patterns across these
regions can provide insights into potential gene regulatory mechanisms
and associations with phenotypes or disease states. Herein, we will
delve into how to succinctly summarize methylation data at these crucial
genomic segments using the regionalpcs
package.
First, let’s load the gene region annotations. Make sure to align the genomic builds of your annotations and methylation data.
# Obtain promoter regions
# The TxDb object 'txdb' facilitates the retrieval of transcript-based
# genomic annotations.
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
# Extracting promoter regions with a defined upstream and downstream window.
# This GRanges object 'promoters_gr' will be utilized to map and summarize
# methylation data in promoter regions.
promoters_gr <- suppressMessages(promoters(genes(txdb),
upstream=1000,
downstream=0))
# Display the GRanges object containing the genomic coordinates of promoter
# regions.
promoters_gr
#> GRanges object with 23056 ranges and 1 metadata column:
#> seqnames ranges strand | gene_id
#> <Rle> <IRanges> <Rle> | <character>
#> 1 chr19 58874215-58875214 - | 1
#> 10 chr8 18247755-18248754 + | 10
#> 100 chr20 43280377-43281376 - | 100
#> 1000 chr18 25757446-25758445 - | 1000
#> 10000 chr1 244006887-244007886 - | 10000
#> ... ... ... ... . ...
#> 9991 chr9 115095945-115096944 - | 9991
#> 9992 chr21 35735323-35736322 + | 9992
#> 9993 chr22 19109968-19110967 - | 9993
#> 9994 chr6 90538619-90539618 + | 9994
#> 9997 chr22 50964906-50965905 - | 9997
#> -------
#> seqinfo: 93 sequences (1 circular) from hg19 genome
Creating a region map, which systematically assigns CpGs to specific
gene regions, stands as a crucial precursor to gene-region summarization
using the regionalpcs
package. This mapping elucidates the
physical positioning of CpGs within particular gene regions,
facilitating our upcoming endeavors to comprehend how methylation varies
across distinct genomic segments. We’ll use the
create_region_map
function from the
regionalpcs
package. This function takes two genomic ranges
objects, cpg_gr
contains CpG positions and
genes_gr
contains gene region positions. Make sure both
positions are aligned to the same genome build (e.g. GrCH37,
CrCH38).
# get the region map using the regionalpcs function
region_map <- regionalpcs::create_region_map(cpg_gr=cpg_gr,
genes_gr=promoters_gr)
# Display the initial few rows of the region map.
head(region_map)
#> gene_id cpg_id
#> 1 100131434 cg00466309
#> 2 100133941 cg05230942
#> 3 100133957 cg00636562
#> 4 10084 cg01303569
#> 5 1069 cg01056373
#> 6 10857 cg01333849
Note: The second column of region_map
must contain values matching the rownames of your methylation
dataframe.
In this final section, we’ll summarize gene regions using Principal
Components (PCs) to capture the maximum variation. We’ll utilize the
compute_regional_pcs
function from the
regionalpcs
package for this.
Let’s calculate the regional PCs using our gene regions for demonstration purposes.
# Display head of region map
head(region_map)
#> gene_id cpg_id
#> 1 100131434 cg00466309
#> 2 100133941 cg05230942
#> 3 100133957 cg00636562
#> 4 10084 cg01303569
#> 5 1069 cg01056373
#> 6 10857 cg01333849
dim(region_map)
#> [1] 211 2
# Compute regional PCs
res <- compute_regional_pcs(meth=mvals, region_map=region_map, pc_method="gd")
#> Using Gavish-Donoho method
The function returns a list containing multiple elements. Let’s first look at what these elements are.
The first element (res$regional_pcs
) is a data frame
containing the calculated regional PCs.
# Extract regional PCs
regional_pcs <- res$regional_pcs
head(regional_pcs)[1:4]
#> 5723646052_R02C02 5723646052_R04C01 5723646052_R05C02
#> 100131434-PC1 -3.887949 1.82168108 -4.2354365
#> 100133941-PC1 1.100370 0.06719947 0.1932361
#> 100133957-PC1 -2.032026 0.19154436 -0.9193417
#> 10084-PC1 -5.011579 2.34574585 -4.2050727
#> 1069-PC1 -2.004785 1.20363622 -2.5436686
#> 10857-PC1 -1.986294 1.31482789 -2.0044911
#> 5723646053_R04C02
#> 100131434-PC1 2.1707679
#> 100133941-PC1 -0.7438658
#> 100133957-PC1 1.1903709
#> 10084-PC1 2.2586300
#> 1069-PC1 1.5851359
#> 10857-PC1 2.1369556
The output is a data frame with regional PCs for each region as rows and our samples as columns. This is our new representation of methylation values, now on a gene regional PC scale. We can feed these into downstream analyses as is.
The number of regional PCs representing each gene region was
determined by the Gavish-Donoho method. This method allows us to
identify PCs that capture actual signal in our data and not the noise
that is inherent in any dataset. To explore alternative methods, we can
change the pc_method
parameter.
# Count the number of unique gene regions and PCs
regions <- data.frame(gene_pc = rownames(regional_pcs)) |>
separate(gene_pc, into = c("gene", "pc"), sep = "-")
head(regions)
#> gene pc
#> 1 100131434 PC1
#> 2 100133941 PC1
#> 3 100133957 PC1
#> 4 10084 PC1
#> 5 1069 PC1
#> 6 10857 PC1
# number of genes that have been summarized
length(unique(regions$gene))
#> [1] 142
# how many of each PC did we get
table(regions$pc)
#>
#> PC1
#> 142
We have summarized each of our genes using just one PC. The number of PCs depends on three main factors: the number of samples, the number of CpGs in the gene region, and the noise in the methylation data.
By default, the compute_regional_pcs
function uses the
Gavish-Donoho method. However, we can also use the Marcenko-Pasteur
method by setting the pc_method
parameter:
# Using Marcenko-Pasteur method
mp_res <- compute_regional_pcs(mvals, region_map, pc_method = "mp")
#> Using Marchenko-Pastur method
# select the regional pcs
mp_regional_pcs <- mp_res$regional_pcs
# separate the genes from the pc numbers
mp_regions <- data.frame(gene_pc = rownames(mp_regional_pcs)) |>
separate(gene_pc, into = c("gene", "pc"), sep = "-")
head(mp_regions)
#> gene pc
#> 1 100131434 PC1
#> 2 100133941 PC1
#> 3 100133957 PC1
#> 4 10084 PC1
#> 5 1069 PC1
#> 6 10857 PC1
# number of genes that have been summarized
length(unique(mp_regions$gene))
#> [1] 142
# how many of each PC did we get
table(mp_regions$pc)
#>
#> PC1
#> 142
The Marcenko-Pasteur and the Gavish-Donoho methods are both based on Random Matrix Theory, and they aim to identify the number of significant PCs that capture the true signal in the data and not just the noise. However, these methods differ in how they select the number of significant PCs. The Marcenko-Pasteur method typically selects more PCs to represent a gene region compared to the Gavish-Donoho method. This may be due to the different ways in which the two methods estimate the noise level in the data.
Ultimately, the choice between the two methods depends on the specific needs and goals of the analysis. The Gavish-Donoho method tends to provide more conservative results, while the Marcenko-Pasteur method may capture more of the underlying signal in the data. Researchers should carefully consider their objectives and the characteristics of their data when selecting a method.
sessionInfo()
#> R version 4.4.2 (2024-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] parallel stats4 stats graphics grDevices utils datasets
#> [8] methods base
#>
#> other attached packages:
#> [1] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2
#> [2] GenomicFeatures_1.59.1
#> [3] AnnotationDbi_1.69.0
#> [4] minfiData_0.52.0
#> [5] IlluminaHumanMethylation450kanno.ilmn12.hg19_0.6.1
#> [6] IlluminaHumanMethylation450kmanifest_0.4.0
#> [7] minfi_1.53.1
#> [8] bumphunter_1.49.0
#> [9] locfit_1.5-9.10
#> [10] iterators_1.0.14
#> [11] foreach_1.5.2
#> [12] Biostrings_2.75.3
#> [13] XVector_0.47.1
#> [14] SummarizedExperiment_1.37.0
#> [15] Biobase_2.67.0
#> [16] MatrixGenerics_1.19.0
#> [17] matrixStats_1.4.1
#> [18] dplyr_1.1.4
#> [19] tibble_3.2.1
#> [20] tidyr_1.3.1
#> [21] GenomicRanges_1.59.1
#> [22] GenomeInfoDb_1.43.2
#> [23] IRanges_2.41.2
#> [24] S4Vectors_0.45.2
#> [25] BiocGenerics_0.53.3
#> [26] generics_0.1.3
#> [27] regionalpcs_1.5.0
#> [28] BiocStyle_2.35.0
#>
#> loaded via a namespace (and not attached):
#> [1] RColorBrewer_1.1-3 sys_3.4.3
#> [3] jsonlite_1.8.9 magrittr_2.0.3
#> [5] rmarkdown_2.29 BiocIO_1.17.1
#> [7] zlibbioc_1.52.0 vctrs_0.6.5
#> [9] multtest_2.63.0 memoise_2.0.1
#> [11] Rsamtools_2.23.1 DelayedMatrixStats_1.29.0
#> [13] RCurl_1.98-1.16 askpass_1.2.1
#> [15] htmltools_0.5.8.1 S4Arrays_1.7.1
#> [17] curl_6.0.1 Rhdf5lib_1.29.0
#> [19] SparseArray_1.7.2 rhdf5_2.51.1
#> [21] sass_0.4.9 nor1mix_1.3-3
#> [23] bslib_0.8.0 plyr_1.8.9
#> [25] cachem_1.1.0 buildtools_1.0.0
#> [27] GenomicAlignments_1.43.0 lifecycle_1.0.4
#> [29] pkgconfig_2.0.3 rsvd_1.0.5
#> [31] Matrix_1.7-1 R6_2.5.1
#> [33] fastmap_1.2.0 GenomeInfoDbData_1.2.13
#> [35] digest_0.6.37 colorspace_2.1-1
#> [37] siggenes_1.81.0 reshape_0.8.9
#> [39] dqrng_0.4.1 irlba_2.3.5.1
#> [41] RSQLite_2.3.9 beachmat_2.23.5
#> [43] base64_2.0.2 PCAtools_2.19.0
#> [45] RMTstat_0.3.1 httr_1.4.7
#> [47] abind_1.4-8 compiler_4.4.2
#> [49] beanplot_1.3.1 rngtools_1.5.2
#> [51] withr_3.0.2 bit64_4.5.2
#> [53] BiocParallel_1.41.0 DBI_1.2.3
#> [55] HDF5Array_1.35.2 MASS_7.3-61
#> [57] openssl_2.3.0 DelayedArray_0.33.3
#> [59] rjson_0.2.23 tools_4.4.2
#> [61] rentrez_1.2.3 glue_1.8.0
#> [63] quadprog_1.5-8 restfulr_0.0.15
#> [65] nlme_3.1-166 rhdf5filters_1.19.0
#> [67] grid_4.4.2 reshape2_1.4.4
#> [69] gtable_0.3.6 tzdb_0.4.0
#> [71] preprocessCore_1.69.0 data.table_1.16.4
#> [73] hms_1.1.3 ScaledMatrix_1.15.0
#> [75] BiocSingular_1.23.0 xml2_1.3.6
#> [77] stringr_1.5.1 ggrepel_0.9.6
#> [79] pillar_1.10.0 limma_3.63.2
#> [81] genefilter_1.89.0 splines_4.4.2
#> [83] lattice_0.22-6 survival_3.8-3
#> [85] rtracklayer_1.67.0 bit_4.5.0.1
#> [87] GEOquery_2.75.0 annotate_1.85.0
#> [89] tidyselect_1.2.1 maketools_1.3.1
#> [91] knitr_1.49 xfun_0.49
#> [93] scrime_1.3.5 statmod_1.5.0
#> [95] stringi_1.8.4 UCSC.utils_1.3.0
#> [97] yaml_2.3.10 evaluate_1.0.1
#> [99] codetools_0.2-20 BiocManager_1.30.25
#> [101] cli_3.6.3 xtable_1.8-4
#> [103] munsell_0.5.1 jquerylib_0.1.4
#> [105] Rcpp_1.0.13-1 png_0.1-8
#> [107] XML_3.99-0.17 ggplot2_3.5.1
#> [109] readr_2.1.5 blob_1.2.4
#> [111] mclust_6.1.1 doRNG_1.8.6
#> [113] sparseMatrixStats_1.19.0 bitops_1.0-9
#> [115] scales_1.3.0 illuminaio_0.49.0
#> [117] purrr_1.0.2 crayon_1.5.3
#> [119] rlang_1.1.4 cowplot_1.1.3
#> [121] KEGGREST_1.47.0