Cluster and impute scBS-seq data using Melissa

Installation

## try http:// if https:// URLs are not supported
if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install("Melissa")

## Or download from Github repository
# install.packages("devtools")
devtools::install_github("andreaskapou/Melissa", build_vignettes = TRUE)

Background

Measurements of DNA methylation at the single cell level are promising to revolutionise our understanding of epigenetic control of gene expression. Yet, intrinsic limitations of the technology result in very sparse coverage of CpG sites (around 5% to 20% coverage), effectively limiting the analysis repertoire to a semi-quantitative level. Melissa (MEthyLation Inference for Single cell Analysis) [1], is a Bayesian hierarchical method to quantify spatially-varying methylation profiles across genomic regions from single-cell bisulfite sequencing data (scBS-seq).

Melissa addresses the data sparsity issue by leveraging local correlations between neighbouring CpGs and similarity between individual cells (see Fig.@ref(fig:melissa)). The starting point is the definition of a set of genomic regions (e.g. genes or enhancers). Within each region, Melissa postulates a latent profile of methylation, a function mapping each CpG within the region to a number in [0, 1] which defines the probability of that CpG being methylated. To ensure spatial smoothness of the profile, Melissa uses a generalised linear model of basis function regression along the lines of [2,3] (with modified likelihood to account for single cell data). Local correlations are however often insufficient for regions with extremely sparse coverage, and these are quite common in scBS-seq data. Therefore, we share information across different cells by coupling the local GLM regressions through a shared prior distribution. In order to respect the (generally unknown) population structure that may be present within the cells assayed, we choose a (finite) Dirichlet mixture model prior.

`Melissa` model overview. Melissa combines a likelihood computed from single cell methylation profiles fitted to each genomic region using a supervised regression approach (bottom left) and an unsupervised Bayesian clustering prior (top left). The posterior distribution provides a methylome-based clustering (top right) and imputation (bottom right) of single cells.

Melissa model overview. Melissa combines a likelihood computed from single cell methylation profiles fitted to each genomic region using a supervised regression approach (bottom left) and an unsupervised Bayesian clustering prior (top left). The posterior distribution provides a methylome-based clustering (top right) and imputation (bottom right) of single cells.

Melissa analysis pipeline

Reading scBS-seq data

For reading, processing and filtering raw scBS-seq data consult the vignette “Process and filter scBS-seq data”, which provides a step by step tutorial on obtaining a melissa_data_obj object, which will be the input to the variational inference machinery for Melissa.

Loading synthetic data

To create a minimal working example, Melissa comes with already generated synthetic data of N = 200 cells and M = 100 genomic regions, consisting of K = 4 cell subpopulations.

suppressPackageStartupMessages(library(Melissa)) # Load package
dt_obj <- melissa_encode_dt   # Load synthetic data

The structure of the dt_obj is a list of three elements: met contains the methylation data, anno_region contains the genomic regions that we will run Melissa (since these are synthetic data, this element is NULL), opts contains metadata information about the data generating/processing procedure.

# Elements of `dt_obj` object
names(dt_obj)
## [1] "met"         "anno_region" "opts"

From the 2nd cell we can access the 50th genomic region using the following code, where the 1st column represents the relative CpG location within the region (scaled to [-1, 1] interval) and the 2nd column represents the methylation state: 1 = methylated and 0 = unmethylated. The matrix rownames contain the actual CpG coordinates, for real data this would be in the format: <chr>:<CpG location>.

head(dt_obj$met[[2]][[50]])
##            [,1] [,2]
## loc:-963 -0.963    1
## loc:-952 -0.952    1
## loc:-941 -0.941    1
## loc:-939 -0.939    1
## loc:-859 -0.859    1
## loc:-858 -0.858    0
# Number of cells
cat("Number of cells: ", length(dt_obj$met))
## Number of cells:  200
# Number of genomic regions in each cell
cat("Number of genomic regions: ", length(dt_obj$met[[1]]) )
## Number of genomic regions:  100

Creating basis object

For each genomic region we infer a methylation profile using a GLM of basis function regression along the lines of [1,2]. Again we use the functionality of BPRMeth to create a basis object, in our case it will be a Radial Basis Function (RBF), however, one can create polynomial and Fourier basis functions which can be created with the create_polynomial_object and create_fourier_object functions, respectively.

library(BPRMeth)
# Create RBF basis object with 4 RBFs
basis_obj <- create_rbf_object(M = 4)

The rbf object contains information such as the centre locations μj and the value of the spatial scale parameter γ

# Show the slots of the 'rbf' object
basis_obj
## $M
## [1] 4
## 
## $mus
## [1] -0.6 -0.2  0.2  0.6
## 
## $gamma
## [1] 4
## 
## $eq_spaced_mus
## [1] TRUE
## 
## $whole_region
## [1] TRUE
## 
## attr(,"class")
## [1] "rbf"

Clustering and imputing synthetic scBS-seq data

Now we are ready to perfrom inference using the Melissa model and jointly cluster cells based on their DNA methylation landscape and impute missing CpG methylation states.

Partitioning to training and test set (Optional)

To be able to evaluate the imputation performance of Melissa, we need ground truth labels of methylation states. To do so, we can partition the original dataset to training and test set, where a subset of CpGs will be used for training and the remaining sites for testing. For instance in a genomic region with 50 CpGs, 35 will be used for training and the remaining will be used for testing. We should highlight that the partitioning is at the genomic region level and not at the cell level, so we do not have to impute the methylome of a whole cell.

Note that this process is optional and is required only for being able to evaluate how well Melissa performs imputation. The user can skip the partitioning step and directly run Melissa on the whole dataset. For real scBS-seq data the user can use the inferred profiles to impute CpGs with no coverage at all (Check next section on a process of how to run Melissa to impute data for non-covered CpGs).

set.seed(15)
# Partition to training and test set
dt_obj <- partition_dataset(dt_obj = dt_obj, data_train_prcg = 0.2,
                            region_train_prcg = 1, cpg_train_prcg = 0.4, 
                            is_synth = TRUE)

This code snippet, will update the met element of the dt_obj to retain only the CpGs used as training set and the test set will be now stored in the met_test element in the same object. Note that during inference, Melissa will only have access to the training data and will ignore totally the test set. For more details on the usage of the additional parameters type ?partition_dataset.

Running Melissa model

Whether or not a subset of the data was used as test set, the command for running Melissa remains the same. We need to provide it with the (training) data X, the number of clusters K that we expect to find in the cell population, the basis function object and with initial values for the hyperparameters. In this example, we will mostly keep the default values for the (hyper)-parameters. Note that in real applications we should set vb_max_iter to a much larger value, e.g. 500, and vb_init_nstart around 10, to ensure we obtain the best initial values when running VB.

set.seed(15)
# Run Melissa with K = 4 clusters
melissa_obj <- melissa(X = dt_obj$met, K = 4, basis = basis_obj,
                       vb_max_iter = 20, vb_init_nstart = 1, 
                       is_parallel = FALSE)
##   |                                                                              |                                                                      |   0%  |                                                                              |====                                                                  |   6%  |                                                                              |========                                                              |  11%  |                                                                              |============                                                          |  17%  |                                                                              |================                                                      |  22%  |                                                                              |===================                                                   |  28%  |                                                                              |=======================                                               |  33%  |                                                                              |===========================                                           |  39%  |                                                                              |===============================                                       |  44%  |                                                                              |===================================                                   |  50%  |                                                                              |=======================================                               |  56%  |                                                                              |===========================================                           |  61%  |                                                                              |===============================================                       |  67%  |                                                                              |===================================================                   |  72%  |                                                                              |======================================================                |  78%  |                                                                              |==========================================================            |  83%  |                                                                              |==============================================================        |  89%  |                                                                              |==================================================================    |  94%
##   |                                                                              |======================================================================| 100%

Output summary

We can check the mixing proportions for each cluster, to obtain an estimate of the proportion of cells that are assigned to each cell subpopulation.

melissa_obj$pi_k
##  cluster1  cluster2  cluster3  cluster4 
## 0.1757428 0.4034653 0.1609033 0.2599009

The posterior probabilities of each cell belonging to a certain cluster (responsibilities) are stored in the r_nk element. Here each column represents a different cluster and each row a different cell (as shown by rownames of the matrix).

head(melissa_obj$r_nk)
##             cluster1      cluster2      cluster3      cluster4
## cell_1 4.219314e-162  1.000000e+00 1.513088e-157 1.721918e-144
## cell_2 5.567614e-139  1.000000e+00 1.315603e-193 8.414939e-139
## cell_3 3.148453e-185 6.767318e-180 4.201472e-231  1.000000e+00
## cell_4 4.289418e-194 3.487772e-191  1.000000e+00 2.035704e-196
## cell_5 1.125544e-173  1.000000e+00 2.510176e-224 1.881073e-194
## cell_6 2.994690e-204 2.777733e-200  1.000000e+00 2.493941e-180

The posterior mean methylation profile of each genomic region and cell subtype is stored in the W element. We can access the posterior weights of the 10th genomic region for the 2nd cell subpopulation. Here each element of the vector corresponds to a basis function, except the first element which corresponds to the bias term.

melissa_obj$W[10, , 2]
## [1] -0.7915553  1.7962355  1.0500203 -1.8842497  0.4380053

Plottting methylation profiles

We can plot methylation profiles of each cell subtype for specific genomic regions using the plot_melissa_profiles function.

# Plot profiles from all cell subtypes for genomic region 25
plot_melissa_profiles(melissa_obj = melissa_obj, region = 25, 
                      title = "Methylation profiles for region 25")

# Plot profiles from all cell subtypes for genomic region 90
plot_melissa_profiles(melissa_obj = melissa_obj, region = 90, 
                      title = "Methylation profiles for region 90")

Evaluating clustering performance

Since these are synthetic generated data, we have the ground truth labels for the assignment of each cell to the corresponding cluster, which is stored in dt_obj$opts$C_true. Let’s now evaluate the clustering performance both in terms of Adjusted Rand Index (ARI) and clustering assignment error metrics.

# Run clustering performance
melissa_obj <- eval_cluster_performance(melissa_obj, dt_obj$opts$C_true)
# ARI metric
cat("ARI: ", melissa_obj$clustering$ari)
## ARI:  1
# Clustering assignment error metric
cat("Clustering assignment error: ", melissa_obj$clustering$error)
## Clustering assignment error:  0

As we can observe Melissa clustered perfectly the synthetic data, which are also clearly separable from the example methylation profiles shown above.

Evaluating imputation performance

To evaluate the imputation performance of Melissa, we use the held out test set and compare the true methylation state of each CpG with the predicted methylation value, which is the evaluation of the latent methylation profile at the corresponding location.

imputation_obj <- impute_test_met(obj = melissa_obj, 
                                  test = dt_obj$met_test)

Now we use different metrics to evaluate the prediction performance, such as AUC, ROC curve and precision-recall curves and store them as elements in melissa_obj object.

melissa_obj <- eval_imputation_performance(obj = melissa_obj, 
                                           imputation_obj=imputation_obj)
# AUC 
cat("AUC: ", melissa_obj$imputation$auc)
## AUC:  0.8525672
# F-measure
cat("F-measure: ", melissa_obj$imputation$f_measure)
## F-measure:  0.7374896

Clustering and imputing real scBS-seq data

The above section, mostly focused on showing how we can evaluate Melissa’s clustering and imputation performance by partitioning the data in training and test set. In common scenarios, we want to use all covered CpGs to infer the methylation patterns and cluster the single cells based on their methylome. Then try and impute the methylation state of CpGs for which we have missing information, and are stored in another file. Below we show the steps how to perform this analysis.

# Observed binarised met file format:
# chr pos met_state 
# 1   11  0
# X   15  1

# To impute met file format:
# chr pos met_state
# 1   20  -1 
# X   25  -1
# X   44  0

# Imputed met file format:
# chr pos met_state predicted
# 1   20    -1      0.74
# X   25    -1      0.1
# X   44    0       0.05

# Directory of files (cells) with binarised methylation data. Each row 
# contains CpGs that we have coverage and will be used to infer 
# methylation profiles with Melissa.
met_dir <- "directory_of_observed_met_data"

# Directory of files (cells, filenames and their structure should 
# match with those in met_dir) to make predictions. Each row corresponds 
# to a CpG that we will impute its value. It can contain both CpGs 
# with and without CpG coverage. Those CpGs without coverage the third 
# column can be any value, e.g. -1. Melissa will create a new folder 
# with the same filenames, but including a new column named <predicted>, 
# corresponding the imputed value.
impute_met_dir <- "directory_of_cells_to_impute_CpGs"

# Annotation file name for creating genomic regions of interest
anno_file <- "name_of_anno_file"

# Create melissa data object
melissa_data <- create_melissa_data_obj(met_dir, anno_file)

# QC and filtering regions 
# By CpG coverage
melissa_data <- filter_by_cpg_coverage(melissa_data, min_cpgcov = 5)
# By methylation variability
melissa_data <- filter_by_variability(melissa_data, min_var = 0.1)
# By genomic coverage across cells
melissa_data <- filter_by_coverage_across_cells(melissa_data, 
                                                min_cell_cov_prcg = 0.3)

# Run Melissa
melissa_obj <- melissa(X = melissa_data$met, K = 2)

# Extract annotation region object
anno_region <- melissa_data$anno_region

# Peform imputation. This function will create a new folder with 
# imputed met values for each cell. Note that the new files will only 
# contain imputed values for CpGs that fall in `anno_region`, since 
# Melissa can predict values only within these regions.
out <- impute_met_files(met_dir = impute_met_dir, obj = melissa_obj,
                        anno_region = anno_region)

Session Info

This vignette was compiled using:

sessionInfo()
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Etc/UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] Melissa_1.23.0       BPRMeth_1.33.0       GenomicRanges_1.59.0
##  [4] GenomeInfoDb_1.43.1  IRanges_2.41.1       S4Vectors_0.45.2    
##  [7] BiocGenerics_0.53.3  generics_0.1.3       knitr_1.49          
## [10] BiocStyle_2.35.0    
## 
## loaded via a namespace (and not attached):
##  [1] gtable_0.3.6            xfun_0.49               bslib_0.8.0            
##  [4] ggplot2_3.5.1           lattice_0.22-6          vctrs_0.6.5            
##  [7] tools_4.4.2             tibble_3.2.1            fansi_1.0.6            
## [10] pkgconfig_2.0.3         Matrix_1.7-1            data.table_1.16.2      
## [13] RColorBrewer_1.1-3      matrixcalc_1.0-6        assertthat_0.2.1       
## [16] lifecycle_1.0.4         GenomeInfoDbData_1.2.13 truncnorm_1.0-9        
## [19] farver_2.1.2            compiler_4.4.2          MatrixModels_0.5-3     
## [22] mcmc_0.9-8              munsell_0.5.1           codetools_0.2-20       
## [25] SparseM_1.84-2          quantreg_5.99           htmltools_0.5.8.1      
## [28] sys_3.4.3               buildtools_1.0.0        sass_0.4.9             
## [31] yaml_2.3.10             pillar_1.9.0            jquerylib_0.1.4        
## [34] MASS_7.3-61             cachem_1.1.0            iterators_1.0.14       
## [37] mclust_6.1.1            foreach_1.5.2           digest_0.6.37          
## [40] mvtnorm_1.3-2           labeling_0.4.3          maketools_1.3.1        
## [43] splines_4.4.2           cowplot_1.1.3           fastmap_1.2.0          
## [46] grid_4.4.2              colorspace_2.1-1        cli_3.6.3              
## [49] magrittr_2.0.3          survival_3.7-0          utf8_1.2.4             
## [52] ROCR_1.0-11             withr_3.0.2             scales_1.3.0           
## [55] UCSC.utils_1.3.0        rmarkdown_2.29          XVector_0.47.0         
## [58] httr_1.4.7              coda_0.19-4.1           evaluate_1.0.1         
## [61] rlang_1.1.4             MCMCpack_1.7-1          Rcpp_1.0.13-1          
## [64] glue_1.8.0              BiocManager_1.30.25     jsonlite_1.8.9         
## [67] R6_2.5.1                zlibbioc_1.52.0

Bibliography

[1] Kapourani, C. A., & Sanguinetti, G. (2019). Melissa: Bayesian clustering and imputation of single cell methylomes. Genome Biology, 20(1), 61, DOI: https://doi.org/10.1186/s13059-019-1665-8

[2] Kapourani, C. A., & Sanguinetti, G. (2016). Higher order methylation features for clustering and prediction in epigenomic studies. Bioinformatics, 32(17), i405-i412, DOI: https://doi.org/10.1093/bioinformatics/btw432

[3] Kapourani, C. A. & Sanguinetti, G. (2018). BPRMeth: a flexible Bioconductor package for modelling methylation profiles. Bioinformatics, DOI: https://doi.org/10.1093/bioinformatics/bty129

Acknowledgements

This package was developed at the University of Edinburgh in the School of Informatics, with support from Guido Sanguinetti.

This study was supported in part by the EPSRC Centre for Doctoral Training in Data Science, funded by the UK Engineering and Physical Sciences Research Council (grant EP/L016427/1) and the University of Edinburgh.