## try http:// if https:// URLs are not supported
if (!requireNamespace("BiocManager", quietly=TRUE))
install.packages("BiocManager")
BiocManager::install("Melissa")
## Or download from Github repository
# install.packages("devtools")
devtools::install_github("andreaskapou/Melissa", build_vignettes = TRUE)
Measurements of DNA methylation at the single cell level are
promising to revolutionise our understanding of epigenetic control of
gene expression. Yet, intrinsic limitations of the technology result in
very sparse coverage of CpG sites (around 5% to 20% coverage),
effectively limiting the analysis repertoire to a semi-quantitative
level. Melissa
(MEthyLation Inference for Single cell
Analysis) [1], is a Bayesian hierarchical method to quantify
spatially-varying methylation profiles across genomic regions from
single-cell bisulfite sequencing data (scBS-seq). Melissa clusters
individual cells based on local methylation patterns, enabling the
discovery of epigenetic differences and similarities among individual
cells. The clustering also acts as an effective regularisation method
for imputation of methylation on unassayed CpG sites, enabling transfer
of information between individual cells.
Melissa
depends heavily on the BPRMeth
package [2, 3] for reading and processing bisulfite sequencing data. It
assumes that the data are first processed using Bismark [4], hence from
fastq and BAM files we will obtain a coverage file by
running the bismark_methylation_extractor
command as shown
below,
# Requires Bismark
bismark_methylation_extractor --comprehensive --merge_non_CpG \
--no_header --gzip --bedGraph input_file.bam
The format of the coverage file is the following
<chr> <start> <end> <met_prcg> <met_reads> <unmet_reads>
where each row corresponds to an observed CpG (i.e. we have at least one read mapped to this location). Note that CpGs with no coverage are not included in this file. This format however contains redundant information, hence we bring the scBS-seq files in the format that Melissa (and BPRMeth) require, which is
<chr> <start> <met_level>
where met_level
corresponds to the binary methylation
state, either 0 or 1. We can do this by calling the
binarise_files
helper function, which only requires the
input directory of the files and optionally the path to the output
directory. Each file of the indir
corresponds to a
different cell and it can also be compressed, e.g. in .gz
file format, to save storage space.
Note that the new binarised files will not be compressed, due to system commands not being portable inside the R package. To save storage space, the user could compress the files for instance in .gz file format by running in command line :
Now we are ready to process the binarised input files and create methylation regions using functions from the BPRMeth package. Briefly, the steps required to create this object are as follows.
read_anno
file. Note that the annotation file can contain
any genomic context: from promoters and gene bodies to
enhancers, Nanog regulatory regions and CTCF regions; hence
Melissa
can be used for a plethora of analyses that want to
take spatial genomic correlations into account.read_met
function. We will do this independently per
cell.create_region_object
will create the
methylation regions object which is the main object for storing
methylation data.The create_melissa_data_obj
is a wrapper function for
doing all the above steps at once. Note that this step
is important so read carefully the purpose of all the parameters to
obtain the right object for downstream analysis.
melissa_data <- create_melissa_data_obj(met_dir = "path_to_bin_met_dir",
anno_file = "anno_file", cov = 3)
The melissa_data$met
contains the methylation data whose
structure is a list of length N (number of cells), and each
element of this list is another list of length M (number of genomic regions). Each
entry in the inner list is an I × 2 matrix, where I are the number of CpGs, where the
1st column contains the (relative) CpG locations and the 2nd column
contains the methylation state: methylated or unmethylated.
It is often useful to save this object to file with the
saveRDS
function. The object can then be restored using the
readRDS
function. This allows us to conduct downstream
analysis without having to repeat the processing steps described
above.
Next we will filter genomic regions according to different criteria. Note that these steps and their combinations are all optional and depend on the downstream analysis you want to perform.
Genomic regions with really sparse coverage of CpGs are not
informative to infer methylation profiles. Hence, we only consider
genomic regions with at least min_cpgcov
CpG coverage in
each region. Note that this step will not actually remove any genomic
regions, it will only set to NA
those regions that have
coverage below the threshold.
Genomic regions that have not heterogeneity across different cells are often of no interest, for example if we were to use them for identifying cell subpopulations. This way we will both keep only the informative genomic regions and reduce the number of genomic regions for downstream analysis for efficiency.
Genomic regions that have coverage only on a handful of cells are not
powerful for sharing information across cells. For example, a specific
promoter that has observations in 5 out of the 100 cells, will not
contain enough to perform sharing of information, either for imputation
or clustering. Hence, regions that are are not covered in at least
min_cell_cov_prcg
of the cells are filtered out.
The Smallwood et al. (2014) [5] dataset can be downloaded with accession number GSE56879. For this dataset we used the already processed coverage files from Bismark. The filtering of cells that do not pass quality control (QC) was done according to the original studies. See supplementary information of [5] for IDs of cells that passed filtering.
The Angermueller et al. (2016) [6] dataset can be downloaded with accession number GSE74535. For this dataset we used the already processed coverage files from Bismark. The filtering of cells that do not pass quality control (QC) was done according to the original studies. See supplementary information of [5] for IDs of cells that passed filtering.
The analysis on the subsampled ENCODE WGBS data was performed on the bulk GM12878 (GEO GSE86765) and H1-hESC (GEO GSE80911) cell lines. The BAM files for these studies can be obtained directly from the ENCODE project portal.
#=================
# 1. Download BAM data
DATA_DIR="../encode/wgbs/"
# Download GM12878 cell line
wget -P ${DATA_DIR}GM12878/ https://www.encodeproject.org/files/ENCFF681ASN/@@download/ENCFF681ASN.bam
# Download H1-hESC cell line
wget -P ${DATA_DIR}H1hESC/ https://www.encodeproject.org/files/ENCFF546TLK/@@download/ENCFF546TLK.bam
Then we subsample the WGBS data from the BAM file, that is, we are
going to remove individual reads instead of individual
CpGs to take into account the nature of missing values of scBS-seq data.
To do so we will run the samtools view
command which
subsamples random lines from a BAM file. This way, we can generate
artificially 40 pseudo-single cells by keeping only 0.5% of the bulk
reads for each single cell.
data_dir="encode/wgbs/GM12878/SRR4235788.bam"
out_dir="encode/wgbs/GM12878/subsampled/GM12878"
for (( i=1; i <= 40; ++i ))
do
my_command="samtools view -s ${i}.005 -b $data_dir > ${out_dir}_${i}.bam"
eval $my_command
done
Finally, we run the bismark_methylation_extractor
command to obtain the methylation state of each covered CpG fomr the
resulting BAM files. The following command will result in files of
coverage
output and bedGraph
output.
data_dir="encode/wgbs/GM12878/subsampled/"
proc_dir="encode/wgbs/GM12878/processed/"
for (( i=1; i <= 40; ++i ))
do
my_command="bismark_methylation_extractor --ignore 2 --comprehensive --merge_non_CpG --no_header --multicore 4 -o $proc_dir --gzip --bedGraph ${data_dir}GM12878_${i}.bam"
eval $my_command
done
The analysis on the subsampled ENCODE RRBS data was performed again
on the bulk GM12878 H1-hESC cells lines. We can download the the raw
fastq
files from.
and search for GM12878
or H1-hESC
and
download the fastq files only for the 2nd replicate.
Next we run Bismark.
First run the bismark_genome_preparation
command to create
a genome indexing for hg19
.
After that we run the bismark
command which will create
alignment files in bam
format.
#=================
# 3. Run bismark
bismark --genome hg19/ encode/wgEncodeHaibMethylRrbsGm12878HaibRawDataRep2.fastq.gz
bismark --genome hg19/ encode/wgEncodeHaibMethylRrbsH1hescHaibRawDataRep2.fastq.gz
After this step, we follow the same process as we did for the bulk ENCODE WGBS data above.
This vignette was compiled using:
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] knitr_1.49 BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] digest_0.6.37 R6_2.5.1 fastmap_1.2.0
## [4] xfun_0.49 maketools_1.3.1 cachem_1.1.0
## [7] htmltools_0.5.8.1 rmarkdown_2.29 buildtools_1.0.0
## [10] lifecycle_1.0.4 cli_3.6.3 sass_0.4.9
## [13] jquerylib_0.1.4 compiler_4.4.2 sys_3.4.3
## [16] tools_4.4.2 evaluate_1.0.1 bslib_0.8.0
## [19] yaml_2.3.10 BiocManager_1.30.25 jsonlite_1.8.9
## [22] rlang_1.1.4
[1] Kapourani, C. A., & Sanguinetti, G. (2019). Melissa: Bayesian clustering and imputation of single cell methylomes. Genome Biology, 20(1), 61, DOI: https://doi.org/10.1186/s13059-019-1665-8
[2] Kapourani, C. A., & Sanguinetti, G. (2016). Higher order methylation features for clustering and prediction in epigenomic studies. Bioinformatics, 32(17), i405-i412, DOI: https://doi.org/10.1093/bioinformatics/btw432
[3] Kapourani, C. A. & Sanguinetti, G. (2018). BPRMeth: a flexible Bioconductor package for modelling methylation profiles. Bioinformatics, DOI: https://doi.org/10.1093/bioinformatics/bty129
[4] Krueger, F., & Andrews, S. R. (2011). Bismark: a flexible aligner and methylation caller for Bisulfite-Seq applications. Bioinformatics, 27(11), 1571-1572.
[5] Smallwood, S. A., Lee, H. J., Angermueller C., Krueger F., Saadeh H., Peat J., Andrews S. R., Stegle S., Reik W., and Kelsey G. (2014). Single-cell genome-wide bisulfite sequencing for assessing epigenetic heterogeneity. Nature methods, 11(8):817.
[6] Angermueller, C., Clark, S.J., Lee, H.J., Macaulay, I.C., Teng, M.J., Hu, T.X., Krueger, F., Smallwood, S.A., Ponting, C.P., Voet, T. and Kelsey, G. (2016). Parallel single-cell sequencing links transcriptional and epigenetic heterogeneity. Nature methods, 13(3), p.229.
This package was developed at the University of Edinburgh in the School of Informatics, with support from Guido Sanguinetti.
This study was supported in part by the EPSRC Centre for Doctoral Training in Data Science, funded by the UK Engineering and Physical Sciences Research Council (grant EP/L016427/1) and the University of Edinburgh.