Process and filter scBS-seq data

Installation

## try http:// if https:// URLs are not supported
if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install("Melissa")

## Or download from Github repository
# install.packages("devtools")
devtools::install_github("andreaskapou/Melissa", build_vignettes = TRUE)

Introduction

Measurements of DNA methylation at the single cell level are promising to revolutionise our understanding of epigenetic control of gene expression. Yet, intrinsic limitations of the technology result in very sparse coverage of CpG sites (around 5% to 20% coverage), effectively limiting the analysis repertoire to a semi-quantitative level. Melissa (MEthyLation Inference for Single cell Analysis) [1], is a Bayesian hierarchical method to quantify spatially-varying methylation profiles across genomic regions from single-cell bisulfite sequencing data (scBS-seq). Melissa clusters individual cells based on local methylation patterns, enabling the discovery of epigenetic differences and similarities among individual cells. The clustering also acts as an effective regularisation method for imputation of methylation on unassayed CpG sites, enabling transfer of information between individual cells.

Reading scBS-seq data

Convert Bismark coverage format

Melissa depends heavily on the BPRMeth package [2, 3] for reading and processing bisulfite sequencing data. It assumes that the data are first processed using Bismark [4], hence from fastq and BAM files we will obtain a coverage file by running the bismark_methylation_extractor command as shown below,

# Requires Bismark
bismark_methylation_extractor --comprehensive --merge_non_CpG \
  --no_header --gzip --bedGraph input_file.bam

The format of the coverage file is the following

<chr> <start> <end> <met_prcg> <met_reads> <unmet_reads>

where each row corresponds to an observed CpG (i.e. we have at least one read mapped to this location). Note that CpGs with no coverage are not included in this file. This format however contains redundant information, hence we bring the scBS-seq files in the format that Melissa (and BPRMeth) require, which is

<chr> <start> <met_level>

where met_level corresponds to the binary methylation state, either 0 or 1. We can do this by calling the binarise_files helper function, which only requires the input directory of the files and optionally the path to the output directory. Each file of the indir corresponds to a different cell and it can also be compressed, e.g. in .gz file format, to save storage space.

library(Melissa)
# Binarise scBS-seq data
binarise_files(indir = "path_to_met_files_dir")

Note that the new binarised files will not be compressed, due to system commands not being portable inside the R package. To save storage space, the user could compress the files for instance in .gz file format by running in command line :

gzip filenames

Create methylation regions

Now we are ready to process the binarised input files and create methylation regions using functions from the BPRMeth package. Briefly, the steps required to create this object are as follows.

  1. First we require annotation data using the read_anno file. Note that the annotation file can contain any genomic context: from promoters and gene bodies to enhancers, Nanog regulatory regions and CTCF regions; hence Melissa can be used for a plethora of analyses that want to take spatial genomic correlations into account.
  2. Next we need to read the methylation data using the read_met function. We will do this independently per cell.
  3. Finally, the create_region_object will create the methylation regions object which is the main object for storing methylation data.

The create_melissa_data_obj is a wrapper function for doing all the above steps at once. Note that this step is important so read carefully the purpose of all the parameters to obtain the right object for downstream analysis.

melissa_data <- create_melissa_data_obj(met_dir = "path_to_bin_met_dir",
                                        anno_file = "anno_file", cov = 3)

The melissa_data$met contains the methylation data whose structure is a list of length N (number of cells), and each element of this list is another list of length M (number of genomic regions). Each entry in the inner list is an I × 2 matrix, where I are the number of CpGs, where the 1st column contains the (relative) CpG locations and the 2nd column contains the methylation state: methylated or unmethylated.

Store object

It is often useful to save this object to file with the saveRDS function. The object can then be restored using the readRDS function. This allows us to conduct downstream analysis without having to repeat the processing steps described above.

saveRDS(file = "melissa_data_obj.rds", melissa_data)

Filtering genomic regions

Next we will filter genomic regions according to different criteria. Note that these steps and their combinations are all optional and depend on the downstream analysis you want to perform.

Filter by CpG coverage

Genomic regions with really sparse coverage of CpGs are not informative to infer methylation profiles. Hence, we only consider genomic regions with at least min_cpgcov CpG coverage in each region. Note that this step will not actually remove any genomic regions, it will only set to NA those regions that have coverage below the threshold.

melissa_data <- filter_by_cpg_coverage(melissa_data, min_cpgcov = 10)

Filter by mean methylation variability across cells

Genomic regions that have not heterogeneity across different cells are often of no interest, for example if we were to use them for identifying cell subpopulations. This way we will both keep only the informative genomic regions and reduce the number of genomic regions for downstream analysis for efficiency.

melissa_data <- filter_by_variability(melissa_data, min_var = 0.2)

Filter by genomic coverage across cells

Genomic regions that have coverage only on a handful of cells are not powerful for sharing information across cells. For example, a specific promoter that has observations in 5 out of the 100 cells, will not contain enough to perform sharing of information, either for imputation or clustering. Hence, regions that are are not covered in at least min_cell_cov_prcg of the cells are filtered out.

melissa_data <- filter_by_coverage_across_cells(melissa_data, 
                                                min_cell_cov_prcg = 0.5)

Store object

saveRDS(file = "melissa_data_obj_filtered.rds", melissa_data)

Case studies

Smallwood et al. (2014)

The Smallwood et al. (2014) [5] dataset can be downloaded with accession number GSE56879. For this dataset we used the already processed coverage files from Bismark. The filtering of cells that do not pass quality control (QC) was done according to the original studies. See supplementary information of [5] for IDs of cells that passed filtering.

Angermueller et al. (2016)

The Angermueller et al. (2016) [6] dataset can be downloaded with accession number GSE74535. For this dataset we used the already processed coverage files from Bismark. The filtering of cells that do not pass quality control (QC) was done according to the original studies. See supplementary information of [5] for IDs of cells that passed filtering.

Bulk WGBS Encode

The analysis on the subsampled ENCODE WGBS data was performed on the bulk GM12878 (GEO GSE86765) and H1-hESC (GEO GSE80911) cell lines. The BAM files for these studies can be obtained directly from the ENCODE project portal.

#=================
# 1. Download BAM data
DATA_DIR="../encode/wgbs/"
# Download GM12878 cell line
wget -P ${DATA_DIR}GM12878/ https://www.encodeproject.org/files/ENCFF681ASN/@@download/ENCFF681ASN.bam
# Download H1-hESC cell line
wget -P ${DATA_DIR}H1hESC/ https://www.encodeproject.org/files/ENCFF546TLK/@@download/ENCFF546TLK.bam

Then we subsample the WGBS data from the BAM file, that is, we are going to remove individual reads instead of individual CpGs to take into account the nature of missing values of scBS-seq data. To do so we will run the samtools view command which subsamples random lines from a BAM file. This way, we can generate artificially 40 pseudo-single cells by keeping only 0.5% of the bulk reads for each single cell.

data_dir="encode/wgbs/GM12878/SRR4235788.bam"
out_dir="encode/wgbs/GM12878/subsampled/GM12878"
for (( i=1; i <= 40; ++i ))
do
    my_command="samtools view -s ${i}.005 -b $data_dir > ${out_dir}_${i}.bam"
    eval $my_command
done

Finally, we run the bismark_methylation_extractor command to obtain the methylation state of each covered CpG fomr the resulting BAM files. The following command will result in files of coverage output and bedGraph output.

data_dir="encode/wgbs/GM12878/subsampled/"
proc_dir="encode/wgbs/GM12878/processed/"
for (( i=1; i <= 40; ++i ))
do
    my_command="bismark_methylation_extractor --ignore 2 --comprehensive --merge_non_CpG --no_header --multicore 4 -o $proc_dir --gzip --bedGraph ${data_dir}GM12878_${i}.bam"
    eval $my_command
done

Bulk RRBS Encode

The analysis on the subsampled ENCODE RRBS data was performed again on the bulk GM12878 H1-hESC cells lines. We can download the the raw fastq files from.

http://genome.ucsc.edu/cgi-bin/hgFileUi?db=hg19&g=wgEncodeHaibMethylRrbs

and search for GM12878 or H1-hESC and download the fastq files only for the 2nd replicate.

Next we run Bismark. First run the bismark_genome_preparation command to create a genome indexing for hg19.

bismark_genome_preparation hg19/ 

After that we run the bismark command which will create alignment files in bam format.

#=================
# 3. Run bismark
bismark --genome hg19/ encode/wgEncodeHaibMethylRrbsGm12878HaibRawDataRep2.fastq.gz
bismark --genome hg19/ encode/wgEncodeHaibMethylRrbsH1hescHaibRawDataRep2.fastq.gz

After this step, we follow the same process as we did for the bulk ENCODE WGBS data above.

Session Info

This vignette was compiled using:

sessionInfo()
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Etc/UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] knitr_1.49       BiocStyle_2.35.0
## 
## loaded via a namespace (and not attached):
##  [1] digest_0.6.37       R6_2.5.1            fastmap_1.2.0      
##  [4] xfun_0.49           maketools_1.3.1     cachem_1.1.0       
##  [7] htmltools_0.5.8.1   rmarkdown_2.29      buildtools_1.0.0   
## [10] lifecycle_1.0.4     cli_3.6.3           sass_0.4.9         
## [13] jquerylib_0.1.4     compiler_4.4.2      sys_3.4.3          
## [16] tools_4.4.2         evaluate_1.0.1      bslib_0.8.0        
## [19] yaml_2.3.10         BiocManager_1.30.25 jsonlite_1.8.9     
## [22] rlang_1.1.4

Bibliography

[1] Kapourani, C. A., & Sanguinetti, G. (2019). Melissa: Bayesian clustering and imputation of single cell methylomes. Genome Biology, 20(1), 61, DOI: https://doi.org/10.1186/s13059-019-1665-8

[2] Kapourani, C. A., & Sanguinetti, G. (2016). Higher order methylation features for clustering and prediction in epigenomic studies. Bioinformatics, 32(17), i405-i412, DOI: https://doi.org/10.1093/bioinformatics/btw432

[3] Kapourani, C. A. & Sanguinetti, G. (2018). BPRMeth: a flexible Bioconductor package for modelling methylation profiles. Bioinformatics, DOI: https://doi.org/10.1093/bioinformatics/bty129

[4] Krueger, F., & Andrews, S. R. (2011). Bismark: a flexible aligner and methylation caller for Bisulfite-Seq applications. Bioinformatics, 27(11), 1571-1572.

[5] Smallwood, S. A., Lee, H. J., Angermueller C., Krueger F., Saadeh H., Peat J., Andrews S. R., Stegle S., Reik W., and Kelsey G. (2014). Single-cell genome-wide bisulfite sequencing for assessing epigenetic heterogeneity. Nature methods, 11(8):817.

[6] Angermueller, C., Clark, S.J., Lee, H.J., Macaulay, I.C., Teng, M.J., Hu, T.X., Krueger, F., Smallwood, S.A., Ponting, C.P., Voet, T. and Kelsey, G. (2016). Parallel single-cell sequencing links transcriptional and epigenetic heterogeneity. Nature methods, 13(3), p.229.

Acknowledgements

This package was developed at the University of Edinburgh in the School of Informatics, with support from Guido Sanguinetti.

This study was supported in part by the EPSRC Centre for Doctoral Training in Data Science, funded by the UK Engineering and Physical Sciences Research Council (grant EP/L016427/1) and the University of Edinburgh.