Most functions from methodical
take as input
RangedSummarizedExperiment objects with methylation data. If there are
many samples, there can be many millions or, in the case of WGBS data,
even billions of data points within a DNA methylation dataset. It can be
unfeasible to load all this data into memory at once. This problem can
be overcome by using DelayedArrays backed by HDF5 files, enabling data
to be read into memory only as needed. Methodical provides a suite of
functions for working with such DNA methylation
RangedSummarizedExperiments, including functions to extract methylation
values for sites overlapping genomic regions of interest
GRanges
, to liftover the methylation sites from one genome
build to another and to mask methylation sites overlapping certain
genomic regions, e.g. repeats.
Here we demonstrate this different functionality using a dataset downloaded from TumourMethData.
# Download RangedSummarizedExperiment with methylation data for prostate metastases from TumourMethData
mcrpc_wgbs_hg38_chr11 = TumourMethData::download_meth_dataset(dataset = "mcrpc_wgbs_hg38_chr11")
#> see ?TumourMethData and browseVignettes('TumourMethData') for documentation
#> loading from cache
We’ll first demonstrate how to extract methylation data for
individual CpG sites in mcrpc_wgbs_hg38_chr11 overlapping a supplied
GRanges object with extractGRangesMethSiteValues
, using the
gene body as an example.
# Create a GRanges with the hg38 genomic coordinates for the GSTP1, including
# 2 KB upstream of its designated start in Ensembl
gstp1_start_site_region <- GRanges("chr11:67581742-67586656:+")
# Extract methylation values for CpG sites overlapping GSTP1 gene body
gstp1_cpg_methylation <- extractGRangesMethSiteValues(
meth_rse = mcrpc_wgbs_hg38_chr11, genomic_regions = gstp1_start_site_region)
# View the first few rows and columns of the result.
# extractGRangesMethSiteValues returns a row for each methylation site and a
# separate column for each sample where row names give the coordinates of the
# methylation sites in character format.
gstp1_cpg_methylation[1:6, 1:6]
#> DTB_003 DTB_005 DTB_008 DTB_009 DTB_011 DTB_018
#> chr11:67581759 0.1764706 0.2682927 0.23076923 0.10714286 0.09677419 0.7045455
#> chr11:67581799 0.1944444 0.1470588 0.12500000 0.10344828 0.04166667 0.4871795
#> chr11:67581838 0.3157895 0.1538462 0.15789474 0.13333333 0.10000000 0.6923077
#> chr11:67581849 0.1388889 0.0000000 0.07692308 0.06896552 0.16129032 0.4000000
#> chr11:67581866 0.3488372 0.2000000 0.17073171 0.08823529 0.29032258 0.7575758
#> chr11:67581875 0.1590909 0.0000000 0.15384615 0.08823529 0.25000000 0.3142857
Next we’ll show how to summarize the methylation of CpGs over regions
of interest with summarizeRegionMethylation
, using CpG
islands as an example. summarizeRegionMethylation
processes
the the supplied genomic regions in chunks so that the methylation data
for CpG sites overlapping the regions of interest is not all loaded into
memory at once. The parameter max_sites_per_chunk
controls
the approximate number of CpG sites maximally read into memory at once
and defaults to floor(62500000/ncol(meth_rse)
. Several
chunks can be processed in parallel using BiocParallel via the
BPPARAM
argument which takes a BiocParallelParam object.
The number of workers indicated by BiocParallelParam determines the
number of chunks that will be processed in parallel. Some
experimentation may be needed to find the optimal choices for
max_sites_per_chunk
and the number of workers in terms of
speed and memory usage.
# Load CpG islands annotation for hg38
cpg_island_annotation <- annotatr::build_annotations(genome = "hg38", annotation = "hg38_cpgs")
#> Building CpG islands...
#> Building CpG shores...
#> Building CpG shelves...
#> Building inter-CpG-islands...
names(cpg_island_annotation) <- cpg_island_annotation$id
# Filter for annotation for chr11
cpg_island_annotation = cpg_island_annotation[seqnames(cpg_island_annotation) == "chr11"]
# Convert into a GRangesList with separate GRanges for islands, shores, shelves and inter island regions
cpg_island_annotation <- GRangesList(split(cpg_island_annotation, cpg_island_annotation$type))
# Create a BPPARAM class
BPPARAM = BiocParallel::bpparam()
# Summarize methylation levels for CpG islands
cpg_island_methylation <- summarizeRegionMethylation(
meth_rse = mcrpc_wgbs_hg38_chr11, genomic_regions = cpg_island_annotation$hg38_cpg_islands,
BPPARAM = BPPARAM, summary_function = colMeans)
#> There are some seqlevels from genomic_regions missing from meth_rse
# Print a few rows for the first few samples of the result
cpg_island_methylation[1000:1006, 1:6]
#> DTB_003 DTB_005 DTB_008 DTB_009 DTB_011
#> island:14913 0.124852221 0.009667637 0.008330832 0.014607967 0.039786356
#> island:14914 0.009844099 0.029795906 0.006258614 0.072477345 0.052693357
#> island:14915 0.001953125 0.004184060 0.000000000 0.000000000 0.003801843
#> island:14916 0.006799628 0.002174912 0.006262350 0.003554305 0.007567667
#> island:14917 0.901040137 0.834174923 0.819908342 0.865086671 0.806623071
#> island:14918 0.940876775 0.940707064 0.950919937 0.959274590 0.907710156
#> island:14919 0.026562413 0.024696883 0.019078006 0.017051253 0.034368160
#> DTB_018
#> island:14913 0.012036163
#> island:14914 0.050911768
#> island:14915 0.002777778
#> island:14916 0.004661724
#> island:14917 0.838317611
#> island:14918 0.930573772
#> island:14919 0.022032495
We can plot methylation values extracted from a genomic region for a
single sample using the plotMethylationValues()
function.
We’ll demonstrate this using the values we extracted in the region
surrounding the 5’ end of GSTP1 for the DTB_003 prostate metastasis
sample. We indicate the sample we want to plot with the
sample_name
parameter.
# Plot the methylation values along the GSTP1 gene body for one prostate metastasis sample.
gstp1_methylation_plot = plotMethylationValues(gstp1_cpg_methylation, sample_name = "DTB_003")
print(gstp1_methylation_plot)
Additionally, we can also annotate our plots using the
annotatePlot()
function. It uses a GRangeList provided with
the annotation_grl
parameter to create an annotation plot
showing the regions in the GRangesList which overlap the genomic region
displayed in the main plot. Each of the GRanges objects making up the
GRangesList is given a different colour in the annotation plot and the
names of these component GRanges are indicated. We can control the
colours used with the grl_colours
parameter if we don’t
want to use the default colours.
If we provide a GRanges object with the location of a transcription
start site to the reference_tss
, parameter, we can show the
distance of methylation sites upstream and downstream of this.
By default, the main plot and annotation plot are combined into a
single plot and returned. The annotation_plot_proportion
parameter sets the proportion of the total plot height dedicated to the
annotation plot. We can instead return the annotation plot by itself by
setting the annotation_plot_only
parameter to TRUE.
We’ll annotate the location of CpG islands, CpG shores, CpG shores
and inter CpG island regions for gstp1_methylation_plot
using the cpg_island_annotation
GRangesList we created.
# Annotate gstp1_methylation_plot with cpg_island_annotation
annotatePlot(meth_site_plot = gstp1_methylation_plot,
annotation_grl = cpg_island_annotation, annotation_plot_proportion = 0.3,
grl_colours = c("#DEEBF7", "#9ECAE1", "#4292C6", "#08519C"))
# Create same plot, except showing the distance to the GSTP1 start site on the x-axis
annotatePlot(meth_site_plot = gstp1_methylation_plot,
annotation_grl = cpg_island_annotation,
reference_tss = GRanges("chr11:67583742"),annotation_plot_proportion = 0.3,
grl_colours = c("#DEEBF7", "#9ECAE1", "#4292C6", "#08519C"))
# Return the annotation plot by itself
annotatePlot(meth_site_plot = gstp1_methylation_plot,
annotation_grl = cpg_island_annotation, annotation_plot_proportion = 0.3,
grl_colours = c("#DEEBF7", "#9ECAE1", "#4292C6", "#08519C"), annotation_plot_only = TRUE)
We may want to mask certain regions in a methylation
RangedSummarizedExperiment. With the maskRangesInRSE()
function, we can mask regions across all samples (which could be useful
for repetitive regions or polymorphic regions) or on a sample by sample
basis (which could be appropriate for different regions that are known
to be mutated in different tumour samples). All methylation values
within the masked regions will be set to NA
.
The mask_ranges
argument takes either a GRanges or
GRangesList with the regions that should be masked. If a GRanges object
is provided, all methylation sites overlapping these regions will be
masked across all samples. If a GRangesList is provided, the names of
the component GRanges should match sample names in the
RangedSummarizedExperiment and in each sample, the methylation sites
overlapping the regions in its corresponding GRangesList entry will be
masked.
We will demonstrate how to mask LINE repetitive regions across all samples.
# Download repetitive sequences from AnnotationHub and filter for LINE elements
repeat_annotation_hg38 <- AnnotationHub::AnnotationHub()[["AH99003"]]
#> downloading 1 resources
#> retrieving 1 resource
#> loading from cache
line_elements_hg38 <- repeat_annotation_hg38[repeat_annotation_hg38$repClass == "SINE"]
# Mask LINE elements in mcrpc_wgbs_hg38_chr11
mcrpc_wgbs_hg38_chr11_lines_masked <- maskRangesInRSE(rse = mcrpc_wgbs_hg38_chr11,
mask_ranges = line_elements_hg38)
# Extract the methylation values for one of the LINE elements in the
# unmasked and masked RSE
extractGRangesMethSiteValues(meth_rse = mcrpc_wgbs_hg38_chr11,
genomic_regions = line_elements_hg38[1000])[, 1:6]
#> [1] DTB_003 DTB_005 DTB_008 DTB_009 DTB_011 DTB_018
#> <0 rows> (or 0-length row.names)
extractGRangesMethSiteValues(meth_rse = mcrpc_wgbs_hg38_chr11_lines_masked,
genomic_regions = line_elements_hg38[1000])[, 1:6]
#> [1] DTB_003 DTB_005 DTB_008 DTB_009 DTB_011 DTB_018
#> <0 rows> (or 0-length row.names)
Sometimes we may want to work with a different genome build to that
used to construct a methylation RangedSummarizedExperiment. We can
easily liftover the genomic coordinates of the methylation sites using
liftoverMethRSE()
function. To do this, we need a liftover
chain file for the appropriate source and target genome builds and which
we provide to the chain
argument.
All methylation sites which cannot be mapped to the target genome
build and those which result in many-to-one mappings are removed. We
also need to decide whether we want to remove methylation sites in the
source genome build which map to multiple sites in the target genome
build. We do this using the remove_one_to_many_mapping
argument, which has a default value of TRUE. We can also remove any
regions which do not map to desired regions in the target genome, for
example CpG sites, by providing a GRanges object to the argument
permitted_target_regions
.
We will demonstrate how to liftover mcrpc_wgbs_hg38_chr11 to hg19.
# Create a DNAStringSet for chromosome11
chr11_dss = setNames(DNAStringSet(BSgenome.Hsapiens.UCSC.hg19[["chr11"]]), "chr11")
# Get CpG sites for hg19 for chromsome 11
hg19_cpgs <- methodical::extractMethSitesFromGenome(genome = chr11_dss)
# Download hg38 to hg19 liftover chain from AnnotationHub
hg38tohg19Chain <- AnnotationHub::AnnotationHub()[["AH14108"]]
#> downloading 1 resources
#> retrieving 1 resource
#> loading from cache
# Liftover mcrpc_wgbs_hg38_chr11 to mcrpc_wgbs_hg19_chr11
mcrpc_wgbs_hg19_chr11 <- liftoverMethRSE(meth_rse = mcrpc_wgbs_hg38_chr11, chain = hg38tohg19Chain,
remove_one_to_many_mapping = TRUE, permitted_target_regions = hg19_cpgs)
#> 42343 non-mapping sites removed
#> 0 one-to-many mapping sites removed
#> 1773 many-to-one mapping sites removed
#> 46319 sites not overlapping permitted target regions removed
# Compare the dimensions of mcrpc_wgbs_hg38_chr11 and mcrpc_wgbs_hg19_chr11.
# 1,423,050 methylation sites could not be lifted over from hg38 to hg19.
dim(mcrpc_wgbs_hg38_chr11)
#> [1] 1333114 100
dim(mcrpc_wgbs_hg19_chr11)
#> [1] 1286553 100
# chr1:921635 should be lifted over to chr1:857015 so confirm that they have
# the same methylation values in hg38 and hg19
rtracklayer::liftOver(GRanges("chr11:67581759"), hg38tohg19Chain)
#> GRangesList object of length 1:
#> [[1]]
#> GRanges object with 1 range and 0 metadata columns:
#> seqnames ranges strand
#> <Rle> <IRanges> <Rle>
#> [1] chr11 67349230 *
#> -------
#> seqinfo: 1 sequence from an unspecified genome; no seqlengths
extractGRangesMethSiteValues(mcrpc_wgbs_hg38_chr11, GRanges("chr11:67581759"))[, 1:8]
#> DTB_003 DTB_005 DTB_008 DTB_009 DTB_011 DTB_018
#> chr11:67581759 0.1764706 0.2682927 0.2307692 0.1071429 0.09677419 0.7045455
#> DTB_019 DTB_020
#> chr11:67581759 0.1538462 0.1612903
extractGRangesMethSiteValues(mcrpc_wgbs_hg19_chr11, GRanges("chr11:67349230"))[, 1:8]
#> DTB_003 DTB_005 DTB_008 DTB_009 DTB_011 DTB_018
#> chr11:67349230 0.1764706 0.2682927 0.2307692 0.1071429 0.09677419 0.7045455
#> DTB_019 DTB_020
#> chr11:67349230 0.1538462 0.1612903
sessionInfo()
#> R version 4.4.2 (2024-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] BSgenome.Hsapiens.UCSC.hg19_1.4.3 BSgenome_1.75.1
#> [3] BiocIO_1.17.1 Biostrings_2.75.3
#> [5] XVector_0.47.2 rtracklayer_1.67.0
#> [7] AnnotationHub_3.15.0 BiocFileCache_2.15.1
#> [9] dbplyr_2.5.0 rhdf5_2.51.2
#> [11] TumourMethData_1.4.0 DESeq2_1.47.1
#> [13] methodical_1.3.0 SummarizedExperiment_1.37.0
#> [15] Biobase_2.67.0 MatrixGenerics_1.19.1
#> [17] matrixStats_1.5.0 ggplot2_3.5.1
#> [19] GenomicRanges_1.59.1 GenomeInfoDb_1.43.3
#> [21] IRanges_2.41.2 S4Vectors_0.45.2
#> [23] BiocGenerics_0.53.5 generics_0.1.3
#> [25] BiocStyle_2.35.0
#>
#> loaded via a namespace (and not attached):
#> [1] DBI_1.2.3 bitops_1.0-9 rlang_1.1.5
#> [4] magrittr_2.0.3 compiler_4.4.2 RSQLite_2.3.9
#> [7] GenomicFeatures_1.59.1 reshape2_1.4.4 png_0.1-8
#> [10] vctrs_0.6.5 stringr_1.5.1 pkgconfig_2.0.3
#> [13] crayon_1.5.3 fastmap_1.2.0 labeling_0.4.3
#> [16] Rsamtools_2.23.1 rmarkdown_2.29 tzdb_0.4.0
#> [19] UCSC.utils_1.3.1 purrr_1.0.2 bit_4.5.0.1
#> [22] xfun_0.50 cachem_1.1.0 jsonlite_1.8.9
#> [25] blob_1.2.4 rhdf5filters_1.19.0 DelayedArray_0.33.4
#> [28] Rhdf5lib_1.29.0 BiocParallel_1.41.0 parallel_4.4.2
#> [31] R6_2.5.1 stringi_1.8.4 bslib_0.8.0
#> [34] jquerylib_0.1.4 iterators_1.0.14 Rcpp_1.0.14
#> [37] knitr_1.49 R.utils_2.12.3 readr_2.1.5
#> [40] Matrix_1.7-2 tidyselect_1.2.1 abind_1.4-8
#> [43] yaml_2.3.10 codetools_0.2-20 curl_6.2.0
#> [46] plyr_1.8.9 regioneR_1.39.0 lattice_0.22-6
#> [49] tibble_3.2.1 withr_3.0.2 KEGGREST_1.47.0
#> [52] evaluate_1.0.3 ExperimentHub_2.15.0 pillar_1.10.1
#> [55] BiocManager_1.30.25 filelock_1.0.3 foreach_1.5.2
#> [58] vroom_1.6.5 RCurl_1.98-1.16 hms_1.1.3
#> [61] BiocVersion_3.21.1 munsell_0.5.1 scales_1.3.0
#> [64] glue_1.8.0 maketools_1.3.1 tools_4.4.2
#> [67] sys_3.4.3 data.table_1.16.4 GenomicAlignments_1.43.0
#> [70] locfit_1.5-9.10 buildtools_1.0.0 XML_3.99-0.18
#> [73] cowplot_1.1.3 grid_4.4.2 AnnotationDbi_1.69.0
#> [76] colorspace_2.1-1 GenomeInfoDbData_1.2.13 HDF5Array_1.35.8
#> [79] restfulr_0.0.15 annotatr_1.33.0 cli_3.6.3
#> [82] rappdirs_0.3.3 S4Arrays_1.7.1 dplyr_1.1.4
#> [85] gtable_0.3.6 R.methodsS3_1.8.2 sass_0.4.9
#> [88] digest_0.6.37 SparseArray_1.7.4 farver_2.1.2
#> [91] rjson_0.2.23 memoise_2.0.1 htmltools_0.5.8.1
#> [94] R.oo_1.27.0 lifecycle_1.0.4 h5mread_0.99.4
#> [97] httr_1.4.7 mime_0.12 bit64_4.6.0-1