## try http:// if https:// URLs are not supported
if (!requireNamespace("BiocManager", quietly=TRUE))
install.packages("BiocManager")
BiocManager::install("BPRMeth")
## Or download from Github repository
# install.packages("devtools")
devtools::install_github("andreaskapou/BPRMeth", build_vignettes = TRUE)
DNA methylation is probably the best studied epigenomic mark, due to its well established heritability and widespread association with diseases. Yet its role in gene regulation, and the molecular mechanisms underpinning its association with diseases, are still imperfectly understood. While the methylation status of individual cytosines can sometimes be informative, several recent papers have shown that the functional role of DNA methylation is better captured by a quantitative analysis of the spatial variation of methylation across a genomic region.
The BPRMeth
package is a probabilistic method to
quantify explicit features of methylation profiles, in a way that would
make it easier to formally use such profiles in downstream modelling
efforts, such as predicting gene expression levels or clustering genomic
regions according to their methylation profiles. The original
implementation [1] has now been enhanced in two important ways: we
introduced a fast, variational inference approach which
enables the quantification of Bayesian posterior confidence measures on
the model, and we adapted the method to use several observation models,
making it suitable for a diverse range of platforms including
single-cell analyses and methylation
arrays. Technical details of the updated version are explained
in [2].
In addition to being a flexible tool for methylation data, the proposed framework is in principle deployable to other measurements with a similar structure, and indeed the method was already used for single-cell chromatin accessibility data in [3].
Mathematically, BPRMeth
is based on a basis function
generalised linear model. The basic idea is as follows: the methylation
profile associated to a genomic region D is defined as a (latent) function
f: D → (0, 1) which
takes as input the genomic coordinate along the region and returns the
propensity for that locus to be methylated. In order to enforce spatial
smoothness, and to obtain a compact representation for this function in
terms of interpretable features, we represent the profile function as a
linear combination of basis functions
f(x) = Φ(wTh(x))
where h(x) are the basis functions (Gaussian bells by default), and Φ is a probit transformation (Gaussian cumulative distribution function) needed in order to map the function output to the (0, 1) interval. The latent function is observed at specific loci through a noise model which encapsulates the experimental technology.
The optimal weight parameters w can be recovered either by Bayesian inference or maximum likelihood estimation, providing a set of quantitative features which can be used in downstream models: in [1] these features were used in a machine learning predictor of gene expression, and to cluster genomic regions according to their methylation profiles. Modelling details and mathematical derivations for the different models can be found online: http://rpubs.com/cakapourani.
The workflow diagram and functionalities of the BPRMeth
package for analysis of methylation profiles are shown in Figure 2.
To illustrate the functionality of the BPRMeth
package
we will use real datasets that are publicly available from the ENCODE
project consortium [4]. More specifically we will focus on the K562
immortalized cell line, with GEO: GSE27584 for the bulk RRBS data and
GEO: GSE33480 for the RNA-Seq data. We will use directly the
preprocessed files, where we have kept only the protein coding genes,
and for the purpose of this vignette we focus only on
chr12 and chr13. Full details of where
to download the data and how to preprocess them are included in the
Supplementary Material [1].
Due to its general approach, BPRMeth
can be used for a
wide range of technologies such as array based, bulk sequencing (both
RRBS and WGBS) and single cell sequencing methylation datasets. It is
only required that the data are in the right format and we call the
appropriate observation model depending on the type of technology. Since
the encode data are generated from bulk RRBS
experiments, the observation model for the BPRMeth
package
will be Binomial.
The BPRMeth
package provides methods for reading files
for methylation, annotation and expression data, however they do not
cover all possible data formats available for describing biological
datasets. The user can implement his own methods for reading files with
different formats, provided that he can create an object similar to what
is described below. First, we load the preprocessed methylation data as
follows:
The met_region
object contains the following important
slots:
met
: A list containing the methylation region, where
each element of the list is a different genomic region. More
specifically, each methylation promoter region is an Li × 3
dimensional matrix, where Li denotes the
number of CpGs found in region i. The columns contain the following
information:
anno
: Corresponding annotation data for each genomic
region as GRanges
object. The anno
slot is
only required for downstream analysis, e.g. to predict gene expression
levels from methylation profiles, since we need to map the genomic
regions with the gene IDs.Now let’s observe the format of the met_region
object.
For example, we can access the 20th promoter
methylation region as follows
## [,1] [,2] [,3]
## [1,] -0.8787 42 1
## [2,] -0.8779 42 0
## [3,] -0.8770 42 2
## [4,] -0.8759 42 0
## [5,] -0.8754 42 0
## [6,] -0.8667 9 1
Below are the annotation data
## GRanges object with 6 ranges and 2 metadata columns:
## seqnames ranges strand | id centre
## <Rle> <IRanges> <Rle> | <character> <numeric>
## [1] chr12 562529-576529 + | ENSG00000139044 569529
## [2] chr12 745147-759147 + | ENSG00000177406 752147
## [3] chr12 1051888-1065888 - | ENSG00000002016 1058888
## [4] chr12 1696331-1710331 - | ENSG00000171823 1703331
## [5] chr12 2020870-2034870 - | ENSG00000151062 2027870
## [6] chr12 2897118-2911118 + | ENSG00000004478 2904118
## -------
## seqinfo: 2 sequences from an unspecified genome; no seqlengths
and the total number of genomic regions are
## [1] 339
In the previous section for simplicity we provided a pre-processed
object which we then can directly use for inferring methylation
profiles. The steps required to create this object are really simple and
follow the diagram in Figure 2. First we need to read the methylation
data using the read_met
function. Also we need read
annotation data using the read_anno
file.
Note that the annotation file can contain any
genomic context: from promoters and gene bodies to enhancers,
Nanog regulatory regions and CTCF regions; hence BPRMeth
can be used for a plethora of analyses that want to take spatial genomic
correlations into account. Finally, the
create_region_object
will create the methylation regions
object which is the main object for storing methylation data. The
following lines of code provide an example of creating this object (
Do not run )
# Read bulk methylation data
met_dt <- read_met(file = "met_file", type = "bulk_seq")
# Read annotation file, which will create annotation regions as well
anno_dt <- read_anno(file = "anno_file")
# Create methylation regions that have CpG coverage
met_region <- create_region_object(met_dt = met_dt, anno_dt = anno_dt)
If the goal is to relate methylation profiles to gene expression levels, we need also to store expression data. Again, here we have pre-processed expression data which can be loaded as follows
The structure of the expression data is rather simple, it is a data.table with two columns as shown below
## id expr
## <char> <num>
## 1: ENSG00000230032 -3.321928
## 2: ENSG00000226210 3.833477
## 3: ENSG00000225143 -1.967380
## 4: ENSG00000234465 -3.321928
## 5: ENSG00000206114 -3.321928
## 6: ENSG00000235591 -3.321928
To create the expression data (which are log2 transformed) we could call ( Do not run )
For each genomic region, we will learn a methylation profile using the Binomial observation model with a specified number of Radial Basis Functions (RBFs) M. For a single input variable x, the RBF takes the form hj(x) = exp(−γ||x − μj||2), where μj denotes the location of the jth basis function in the input space and γ controls the spatial scale. The case when M = 0 is equivalent to learning the mena methylation rate for the given region (i.e. learn a constant function).
For our running example, we will create two RBF objects, one with 5 basis functions and the other with 0 basis functions denoting the mean methylation rate
# Create RBF basis object with 5 RBFs
basis_profile <- create_rbf_object(M = 5)
# Create RBF basis object with 0 RBFs, i.e. constant function
basis_mean <- create_rbf_object(M = 0)
The rbf
object contains information such as the centre
locations μj and the value
of the spatial scale parameter γ
## $M
## [1] 5
##
## $mus
## [1] -0.6666667 -0.3333333 0.0000000 0.3333333 0.6666667
##
## $gamma
## [1] 6.25
##
## $eq_spaced_mus
## [1] TRUE
##
## $whole_region
## [1] TRUE
##
## attr(,"class")
## [1] "rbf"
The γ is computed by the
number of bases M, however the
user can tune it according to his liking. Except from RBF basis, the
BPRMeth
pacakge provides polynomial and
Fourier basis functions which can be created with the
create_polynomial_object
and
create_fourier_object
functions, respectively.
Learning the methylation profiles is equivalent to inferring the model parameters W of the generalised linear regression model. These parameters can be considered as the extracted features which quantitate precisely notions of shape of a methylation profile.
For performing parameter inference, the BPRMeth
package
is enhanced to provide both maximum likelihood estimation and Bayesian
inference for the parameters. Specifically, for Bayesian estimation it
provides an efficient mean-field variational inference
(Variational Bayes) and a Gibbs sampling
algorithm. For computational efficiency one should choose the VB
implementation (Note that for Beta observation models, one should use
MLE to infer the profiles, since currently the package does not provide
Bayesian treatment for Beta likelihood).
To infer the methylation profiles using VB we need
to call the infer_profiles_vb
function
fit_profiles <- infer_profiles_vb(X = met_region$met, model = "binomial",
basis = basis_profile, is_parallel = FALSE)
fit_mean <- infer_profiles_vb(X = met_region$met, model = "binomial",
basis = basis_mean, is_parallel = FALSE)
This results in a infer_profiles
object whose most
important slot is the inferred weight matrix W
, below we
show the structure of the fit_profiles
object
## List of 9
## $ W : num [1:339, 1:6] 0.546 6.128 -0.235 -0.389 -0.172 ...
## ..- attr(*, "dimnames")=List of 2
## $ W_Sigma :List of 339
## $ alpha : num [1:339, 1] 3.5 3.5 3.5 3.5 3.5 3.5 3.5 3.5 3.5 3.5 ...
## $ beta : num [1:339, 1] 35 35 3.887 1.03 0.393 ...
## $ basis :List of 5
## ..- attr(*, "class")= chr "rbf"
## $ nll_feat : num [1:339, 1] 61.6 101 88.9 90 30.2 ...
## $ rmse_feat : num [1:339, 1] 0.1055 0.0922 0.2108 0.1311 0.2783 ...
## $ coverage_feat: int [1:339, 1] 29 26 17 41 13 47 16 16 38 21 ...
## $ lb_feat : num [1:339, 1] -506 -185.3 -149.6 -229 -98.7 ...
## - attr(*, "class")= chr [1:3] "infer_profiles_vb_binomial" "infer_profiles_vb" "infer_profiles"
Below we show an example promoter region together with the fitted methylation profiles, where the points represent the DNA methylation level of each CpG site. The shape of the methylation profiles is captured by the blue curve, whereas the red line denotes the mean methylation rate
# Choose promoter region 21 -> i.e. LEPREL2 gene
gene_id <- met_region$anno$id[21]
p <- plot_infer_profiles(region = 21, obj_prof = fit_profiles,
obj_mean = fit_mean, obs = met_region$met,
title = paste0("Gene ID ", gene_id))
print(p)
To quantitatively predict expression at each region, we construct a
regression model by taking as input the higher-order methylation
features learned from the infer_profiles_vb
. In addition to
these features, we consider two supplementary sources of information:
(1) the goodness of fit in RMSE and (2) the CpG density. For our
analysis an SVM regression model is considered. We will use 70% of the data for training, and we will
test the model’s performance on the remaining 30%.
All the aforementioned steps are assembled in the
predict_expr
wrapper function:
# Set seed for reproducible results
set.seed(22)
# Perform predictions using methylation profiles
pred_profile <- predict_expr(prof_obj = fit_profiles, expr = expr,
anno = met_region$anno, model_name = "svm",
fit_feature = "RMSE", cov_feature = TRUE,
is_summary = FALSE)
# Perform predictions using mean methylation rate
pred_mean <- predict_expr(prof_obj = fit_mean, expr = expr,
anno = met_region$anno, model_name = "svm",
is_summary = FALSE)
We can now compare the Pearson’s correlation coefficient r for both models and observe that the higher-order methylation features achieve test correlations twice as large compared to mean methylation rate. Note that someone should use cross-validaion to get a better estimate of the correlation performance.
## [1] "Profile r: 0.59"
## [1] "Mean r: 0.41"
Below in Figure @ref(fig:plotpred) we show a scatter plot of the predicted and measured expression values for the and of the K562 cell line.
g1 <- plot_predicted_expr(pred_obj = pred_profile,
title = "Methylation profile")
g2 <- plot_predicted_expr(pred_obj = pred_mean,
title = "Methylation rate")
g <- cowplot::plot_grid(g1, g2, ncol = 2, nrow = 1)
print(g)
Another application of the BPRMeth
model is to use the
higher-order methylation features to cluster DNA methylation patterns
across genomic regions and examine whether distinct methylation profiles
are associated to different gene expression levels. To cluster
methylation profiles, a mixture modelling approach is considered and we
perform inference either via EM algorithm (MLE estimates) or using the
mean-field variational inference (Variational Bayes VB) model. In this
example we will use the VB model to perform variational inference.
The BPRMeth
package provides the
cluster_profiles_vb
function for performing Bayesian
clustering, where the user needs to provide the number of clusters K, the methylation regions X, the observation model and a basis
object. For efficiency we run on 20 VB iterations.
# Set seed for reproducible results
set.seed(12)
# Create basis object
basis_obj <- create_rbf_object(M = 3)
# Set number of clusters K
K <- 4
# Perform clustering
cl_obj <- cluster_profiles_vb(X = met_region$met, K = K, model = "binomial",
alpha_0 = .5, beta_0 = .1,
basis = basis_obj, vb_max_iter = 20)
Figure @ref(fig:clusterplot) shows the fitted methylation profiles for each cluster.
The structure of the package makes it straighforward to work with
methylation data generated from different technologies. For example,
with methylation array data, the most common approach
is to use the Beta distribution to capture the methylation level from
the array experiment. Below we provide a minimal analysis step for
inferring methylation profiles on synthetic data; note that we just need
to define that model = beta
. (In some cases, researchers
work with M-values, where they transform the Beta values to (−inf , inf ), i.e. they make them Gaussian
random variables, if that is the case the BPRMeth
package
can handle this by setting model = gaussian
.)
Note that for the Beta model, we currently provide only
maximum likelihood estimation (MLE).
Similarly, someone might analyse single cell
methylation data, where now would need to set
model = bernoulli
as shown in the code snippet below using
again synthetic data.
basis_prof <- create_rbf_object(M = 5)
basis_mean <- create_rbf_object(M = 0)
bern_prof <- infer_profiles_vb(X = bernoulli_data, model = "bernoulli",
basis = basis_prof, is_parallel = FALSE)
bern_mean <- infer_profiles_vb(X = bernoulli_data, model = "bernoulli",
basis = basis_mean, is_parallel = FALSE)
p <- plot_infer_profiles(region = 60, obj_prof = bern_prof,
obj_mean = bern_mean, obs = bernoulli_data,
title = "Bernoulli profile")
print(p)
We can finally cluster the single-cell methylation profiles as follows
# Perform clustering
cl_obj <- cluster_profiles_vb(X = bernoulli_data, K = 3, model = "bernoulli",
basis = basis_obj, vb_max_iter = 40)
cluster_plot <- plot_cluster_profiles(cluster_obj = cl_obj)
print(cluster_plot)
This vignette was compiled using:
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] BPRMeth_1.33.0 GenomicRanges_1.59.1 GenomeInfoDb_1.43.2
## [4] IRanges_2.41.1 S4Vectors_0.45.2 BiocGenerics_0.53.3
## [7] generics_0.1.3 BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] sass_0.4.9 utf8_1.2.4 class_7.3-22
## [4] lattice_0.22-6 digest_0.6.37 magrittr_2.0.3
## [7] RColorBrewer_1.1-3 evaluate_1.0.1 grid_4.4.2
## [10] iterators_1.0.14 fastmap_1.2.0 Matrix_1.7-1
## [13] foreach_1.5.2 jsonlite_1.8.9 e1071_1.7-16
## [16] BiocManager_1.30.25 mgcv_1.9-1 httr_1.4.7
## [19] fansi_1.0.6 UCSC.utils_1.3.0 scales_1.3.0
## [22] codetools_0.2-20 jquerylib_0.1.4 cli_3.6.3
## [25] rlang_1.1.4 XVector_0.47.0 splines_4.4.2
## [28] cowplot_1.1.3 munsell_0.5.1 withr_3.0.2
## [31] cachem_1.1.0 yaml_2.3.10 tools_4.4.2
## [34] colorspace_2.1-1 ggplot2_3.5.1 GenomeInfoDbData_1.2.13
## [37] assertthat_0.2.1 buildtools_1.0.0 vctrs_0.6.5
## [40] R6_2.5.1 proxy_0.4-27 lifecycle_1.0.4
## [43] zlibbioc_1.52.0 pkgconfig_2.0.3 bslib_0.8.0
## [46] pillar_1.9.0 gtable_0.3.6 glue_1.8.0
## [49] data.table_1.16.2 Rcpp_1.0.13-1 xfun_0.49
## [52] tibble_3.2.1 sys_3.4.3 knitr_1.49
## [55] farver_2.1.2 nlme_3.1-166 htmltools_0.5.8.1
## [58] labeling_0.4.3 rmarkdown_2.29 maketools_1.3.1
## [61] matrixcalc_1.0-6 compiler_4.4.2
[1] Kapourani, C. A., & Sanguinetti, G. (2016). Higher order methylation features for clustering and prediction in epigenomic studies. Bioinformatics, 32(17), i405-i412.
[2] Kapourani, C. A. and Sanguinetti, G. (2018). BPRMeth: a flexible Bioconductor package for modelling methylation profiles. Bioinformatics, DOI: https://doi.org/10.1093/bioinformatics/bty129
[3] Clark, S.J., Argelaguet, R., Kapourani, C.A., Stubbs, T.M., Lee, H.J., Alda-Catalinas, C., Krueger, F., Sanguinetti, G., Kelsey, G., Marioni, J.C., Stegle, O. and Reik, W., (2018). scNMT-seq enables joint profiling of chromatin accessibility DNA methylation and transcription in single cells. Nature communications, 9(1), p.781.
[4] ENCODE Project Consortium. (2012). An integrated encyclopedia of DNA elements in the human genome. Nature, 489(7414), 57-74.
This package was developed at the University of Edinburgh in the School of Informatics, with support from Guido Sanguinetti.
This study was supported in part by the EPSRC Centre for Doctoral Training in Data Science, funded by the UK Engineering and Physical Sciences Research Council (grant EP/L016427/1) and the University of Edinburgh, and by the European Research Council through grant MLCS306999.