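The scMerge algorithm performs batch effect removal and normalisation for single cell RNA-Seq data. It comprises three key components: the selection of negative control genes, the construction of pseudo-replicates, and the RUV normalisation that uses both.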
The purpose of this vignette is to illustrate some uses of scMerge and explain its key components.
We will load the scMerge package. We designed our package to be consistent with Bioconductor's popular single cell analysis framework, namely the SingleCellExperiment and scater packages.
We provide an illustrative mouse embryonic stem cell (mESC) dataset in our package, as well as a set of pre-computed stably expressed gene (SEG) lists to be used as negative control genes.
The full curated, unnormalised mESC data can be found here.
The scMerge package comes with a sub-sampled, two-batch version of this data (named “batch2” and “batch3” to be consistent with the full data).
In this mESC data, we pooled data from two different batches covering three different cell types. Using a PCA plot, we can see that despite strong separation of cell types, there is also a strong separation due to batch effects. The cell type and batch information is stored in the colData of example_sce.
The first major component of scMerge is to obtain negative controls for our normalisation. In this vignette, we will be using a set of pre-computed SEGs from single cell mouse data, made available through the segList_ensemblGeneID data in our package. For more information about the selection of negative controls and SEGs, please see Section select SEGs.
## single-cell stably expressed gene list
data("segList_ensemblGeneID", package = "scMerge")
head(segList_ensemblGeneID$mouse$mouse_scSEG)
#> [1] "ENSMUSG00000058835" "ENSMUSG00000026842" "ENSMUSG00000027671"
#> [4] "ENSMUSG00000020152" "ENSMUSG00000054693" "ENSMUSG00000049470"
The second major component of scMerge is to compute pseudo-replicates for cells so we can perform normalisation. We offer three major ways of computing this pseudo-replicate information: unsupervised, supervised and semi-supervised scMerge.
In unsupervised scMerge, we perform k-means clustering to obtain pseudo-replicates. This requires the user to supply a kmeansK vector, with each element indicating the number of clusters in the corresponding batch. For example, we know “batch2” and “batch3” both contain three cell types. Hence, kmeansK = c(3, 3) in this case.
scMerge_unsupervised <- scMerge(
sce_combine = example_sce,
ctl = segList_ensemblGeneID$mouse$mouse_scSEG,
kmeansK = c(3, 3),
assay_name = "scMerge_unsupervised")
#> Step 2: Performing RUV normalisation. This will take minutes to hours.
#> scMerge complete!
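To make the role of kmeansK more concrete, the following is a minimal base R sketch of clustering each batch separately with its own number of clusters. It is not scMerge's internal code: the data are simulated, the helper make_batch is hypothetical, and rows are treated as cells because that is what kmeans() clusters.

```r
set.seed(1)

# Toy data: each batch has 3 well-separated groups of 20 cells and 5 features
# (make_batch is a hypothetical helper used only for this illustration)
make_batch <- function(shift) {
  centres <- c(0, 10, 20)
  do.call(rbind, lapply(centres, function(m) {
    matrix(rnorm(20 * 5, mean = m + shift, sd = 0.5), nrow = 20)
  }))
}
batches <- list(batch2 = make_batch(0), batch3 = make_batch(3))

# One entry per batch, in the same order as the batches
kmeansK <- c(3, 3)

# Cluster each batch on its own, using the matching element of kmeansK
clusters <- Map(
  function(x, k) kmeans(x, centers = k, nstart = 10)$cluster,
  batches, kmeansK
)

# Number of clusters found in each batch
sapply(clusters, function(cl) length(unique(cl)))
```

The point of the sketch is only that the i-th element of kmeansK applies to the i-th batch, so batches with different numbers of cell types would get different values.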
We now construct the PCA plot again on our normalised data, coloured by cell type. We can observe a much better separation by cell type and less separation by batch.
By default, scMerge only uses 50% of the cells to perform k-means clustering. While this is sufficient for a satisfactory normalisation in most cases, users can choose to have all cells used in the k-means clustering.
In supervised scMerge, cell type information is available for all cells, and it can be used to create pseudo-replicates. This can be done through the cell_type argument in the scMerge function.
In semi-supervised scMerge, the user only has access to partial cell type information, but it is still possible to use this information to create pseudo-replicates. This can be done through the cell_type and cell_type_inc arguments in the scMerge function. cell_type_inc should contain a vector of indices indicating which elements in the cell_type vector should be used to perform semi-supervised scMerge.
scMerge_semisupervised1 <- scMerge(
sce_combine = example_sce,
ctl = segList_ensemblGeneID$mouse$mouse_scSEG,
kmeansK = c(3,3),
assay_name = "scMerge_semisupervised1",
cell_type = example_sce$cellTypes,
cell_type_inc = which(example_sce$cellTypes == "2i"),
cell_type_match = FALSE)
#> Step 2: Performing RUV normalisation. This will take minutes to hours.
#> scMerge complete!
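As a small illustration of what an index vector for cell_type_inc looks like, here is a base R sketch on a toy label vector. The labels other than "2i" and the NA entries are hypothetical and only stand in for cells whose type is unknown or not trusted; the vector is not taken from example_sce.

```r
# Hypothetical labels for eight cells; NA marks cells of unknown type
cell_type <- c("2i", "a2i", NA, "2i", "lif", NA, "2i", "a2i")

# Positions of the cells whose labels should be used
cell_type_inc <- which(cell_type == "2i")
cell_type_inc
#> [1] 1 4 7
```

Because which() drops NA comparisons, only the positions of cells explicitly labelled "2i" are returned.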
There is an alternative semi-supervised method to create pseudo-replicates for scMerge. It uses known cell type information to identify mutual nearest clusters, and is enabled via the cell_type and cell_type_match = TRUE options in the scMerge function.
scMerge_semisupervised2 <- scMerge(
sce_combine = example_sce,
ctl = segList_ensemblGeneID$mouse$mouse_scSEG,
kmeansK = c(3, 3),
assay_name = "scMerge_semisupervised2",
cell_type = example_sce$cellTypes,
cell_type_inc = NULL,
cell_type_match = TRUE)
#> Step 2: Performing RUV normalisation. This will take minutes to hours.
#> scMerge complete!
In simple terms, a negative control is a gene whose expression values are relatively constant across the datasets being merged. The concept of using negative control genes for normalisation is most widely used in the RUV method family (e.g. Gagnon-Bartsch & Speed (2012) and Risso et al. (2014)), and there exist multiple methods to find these negative controls. In our paper, we recommended SEGs as negative controls for scRNA-Seq data. SEGs can be found using either a data-adaptive computational method or external knowledge.
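The idea of "relatively constant expression" can be illustrated with a simple coefficient of variation on simulated data. This is only a sketch of the concept under toy assumptions: it is not the SEG index used by the package, and the gene names are hypothetical.

```r
set.seed(42)
n_cells <- 100

# Toy log-expression matrix (genes x cells) with hypothetical gene names
expr <- rbind(
  stable_1   = rnorm(n_cells, mean = 5, sd = 0.1),
  stable_2   = rnorm(n_cells, mean = 6, sd = 0.1),
  variable_1 = rnorm(n_cells, mean = 5, sd = 2),
  variable_2 = c(rnorm(50, mean = 2, sd = 0.5), rnorm(50, mean = 8, sd = 0.5))
)

# Coefficient of variation per gene: small values suggest stable expression
cv <- apply(expr, 1, sd) / rowMeans(expr)

# Keep the two most stable genes as candidate negative controls
candidate_controls <- names(sort(cv))[1:2]
candidate_controls
```

A real selection of negative controls would account for more than overall variability, which is why the package ships curated lists and a dedicated function.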
The scSEGIndex function can be used to calculate the SEG from a data matrix. The output of this function is a data.frame with a SEG index calculated for each gene. See Lin et al. (2018) for more details.
exprs_mat = SummarizedExperiment::assay(example_sce, 'counts')
result = scSEGIndex(exprs_mat = exprs_mat)
## SEG list in ensemblGene ID
data("segList_ensemblGeneID", package = "scMerge")
## SEG list in official gene symbols
data("segList", package = "scMerge")
## SEG list for human scRNA-Seq data
head(segList$human$human_scSEG)
#> [1] "AAR2" "AATF" "ABCF3" "ABHD2" "ABT1" "ACAP2"
## SEG list for human bulk microarray data
head(segList$human$bulkMicroarrayHK)
#> [1] "AATF" "ABL1" "ACAT2" "ACTB" "ACTG1" "ACTN4"
## SEG list for human bulk RNASeq data
head(segList$human$bulkRNAseqHK)
#> [1] "AAGAB" "AAMP" "AAR2" "AARS" "AARS2" "AARSD1"
Under most circumstances, scMerge is fast enough to be used on a personal laptop for moderately large data. However, we do recognise the difficulties associated with computation when dealing with larger data. To this end, we devised a fast version of scMerge. The major difference between the two versions lies in the noise estimation component, which uses singular value decomposition (SVD). To speed up scMerge, we use the BiocSingular package, which offers several faster SVD implementations. These approximate methods speed up scMerge while still giving a very accurate approximation of the noise structure in the data. The fast version is enabled via the option BSPARAM = IrlbaParam() or BSPARAM = RandomParam(). Additionally, svd_k is a parameter that controls the degree of approximation.
We recommend using this option when the number of cells in your single cell data is large. The speed advantage for large single cell data is much more dramatic than on a smaller dataset like the example mESC data, where the approximate SVD may bring no gain at all. For example, a single run of normal scMerge on a human pancreas dataset (23699 features and 4566 cells) takes about 10 minutes, whereas the sped-up version takes just under 4 minutes.
library(BiocSingular)
scMerge_fast <- scMerge(
sce_combine = example_sce,
ctl = segList_ensemblGeneID$mouse$mouse_scSEG,
kmeansK = c(3, 3),
assay_name = "scMerge_fast",
BSPARAM = IrlbaParam(),
svd_k = 20)
#> Step 2: Performing RUV normalisation. This will take minutes to hours.
#> Warning in (function (A, nv = 5, nu = nv, maxit = 1000, work = nv + 7, reorth =
#> TRUE, : You're computing too large a percentage of total singular values, use a
#> standard svd instead.
#> scMerge complete!
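## t1, t2, t3 and t4 are timestamps taken before and after the default and fast runs (the timing code is not shown)
paste("Normally, scMerge takes ", round(t2 - t1, 2), " seconds")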
#> [1] "Normally, scMerge takes 0.76 seconds"
paste("Fast version of scMerge takes ", round(t4 - t3, 2), " seconds")
#> [1] "Fast version of scMerge takes 8.36 seconds"
scMerge_fast = runPCA(scMerge_fast, exprs_values = "scMerge_fast")
scater::plotPCA(
scMerge_fast,
colour_by = "cellTypes",
shape_by = "batch") +
labs(title = "fast scMerge yields similar results to the default version")
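To give some intuition for why keeping only a limited number of singular components can still capture the structure of a matrix, here is a base R sketch of rank-k truncation. It is an illustration under stated assumptions, not the BiocSingular algorithm: base svd() computes the full decomposition and we then truncate it, whereas IrlbaParam() and RandomParam() rely on approximate methods. The helper rank_k_error is hypothetical, and k here is assumed to play a role analogous to svd_k.

```r
set.seed(7)

# Toy matrix with rank-3 structure plus a small amount of noise
u <- matrix(rnorm(200 * 3), nrow = 200)
v <- matrix(rnorm(50 * 3), nrow = 50)
x <- u %*% t(v) + matrix(rnorm(200 * 50, sd = 0.01), nrow = 200)

# Relative Frobenius error of the reconstruction from the leading k components
# (rank_k_error is a hypothetical helper for this illustration)
rank_k_error <- function(x, k) {
  s <- svd(x, nu = k, nv = k)
  approx <- s$u %*% diag(s$d[1:k], nrow = k) %*% t(s$v)
  norm(x - approx, type = "F") / norm(x, type = "F")
}

ks <- c(1, 2, 3, 10)
errors <- sapply(ks, function(k) rank_k_error(x, k))
round(errors, 4)
```

In this toy case the error drops sharply once k reaches the true rank and barely changes afterwards, which is the situation in which a small number of components is sufficient.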
scMerge is implemented with a parallelised computational option via the BiocParallel package. You can enable this option using the BPPARAM argument with a BiocParallelParam object that is suitable for your operating system.
Please note that any parallelisation would incur a small overhead. Hence we recommend you do not use parallelisation for small data.
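As a hedged sketch of choosing a back-end by operating system: MulticoreParam relies on forking, which is not available on Windows, so SnowParam is the usual alternative there. The choice of two workers is an arbitrary assumption, and the block only constructs the parameter object; it does not run scMerge.

```r
# Pick a back-end name that suits the current operating system
backend <- if (.Platform$OS.type == "windows") "SnowParam" else "MulticoreParam"

# Only build the object if BiocParallel is installed
if (requireNamespace("BiocParallel", quietly = TRUE)) {
  bpparam <- switch(
    backend,
    SnowParam = BiocParallel::SnowParam(workers = 2),
    MulticoreParam = BiocParallel::MulticoreParam(workers = 2)
  )
  # bpparam could then be supplied as scMerge(..., BPPARAM = bpparam)
  print(class(bpparam))
}
```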
scMerge also supports sparse array input, which can be very helpful in speeding up computations and saving RAM. scMerge does not perform internal matrix conversion, so you may use the following code as an example of converting a typical matrix to a sparse matrix before running scMerge.
library(Matrix)
library(DelayedArray)
sparse_input = example_sce
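## "CsparseMatrix" is the compressed sparse column class from Matrix; "dgeMatrix" is a dense class
assay(sparse_input, "counts") = as(counts(sparse_input), "CsparseMatrix")
assay(sparse_input, "logcounts") = as(logcounts(sparse_input), "CsparseMatrix")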
scMerge_sparse = scMerge(
sce_combine = sparse_input,
ctl = segList_ensemblGeneID$mouse$mouse_scSEG,
kmeansK = c(3, 3),
assay_name = "scMerge_sparse")
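To see what a sparse conversion buys in memory, the following sketch converts a simulated count matrix with the Matrix package and compares object sizes. The matrix and its roughly 90% zero rate are assumptions made for illustration; the actual saving depends on how sparse your data is.

```r
library(Matrix)
library(methods)
set.seed(3)

# Toy count matrix (genes x cells) in which roughly 90% of entries are zero
n_genes <- 2000
n_cells <- 50
values <- rbinom(n_genes * n_cells, size = 1, prob = 0.1) *
  rpois(n_genes * n_cells, lambda = 5)
dense_counts <- matrix(as.numeric(values), nrow = n_genes)

# "CsparseMatrix" is the virtual class for compressed sparse column matrices
sparse_counts <- as(dense_counts, "CsparseMatrix")
class(sparse_counts)

# Memory footprint in bytes of the dense and sparse representations
sizes <- c(
  dense = as.numeric(object.size(dense_counts)),
  sparse = as.numeric(object.size(sparse_counts))
)
sizes
```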
Bioconductor provides an infrastructure for out-of-memory computation through HDF5Array. In simple terms, we can load on-disk data into RAM, perform computations and write the results back to disk. This is particularly helpful when the data is too large for in-RAM computations. You may use the following code as an example of converting a typical matrix to an HDF5Array matrix before running scMerge.
library(HDF5Array)
library(DelayedArray)
DelayedArray:::set_verbose_block_processing(TRUE) ## To monitor block processing
hdf5_input = example_sce
assay(hdf5_input, "counts") = as(counts(hdf5_input), "HDF5Array")
assay(hdf5_input, "logcounts") = as(logcounts(hdf5_input), "HDF5Array")
scMerge_hdf5 = scMerge(
sce_combine = hdf5_input,
ctl = segList_ensemblGeneID$mouse$mouse_scSEG,
kmeansK = c(3, 3),
assay_name = "scMerge_hdf5")
Please check out our paper for detailed analysis and results on multiple scRNA-Seq data. https://doi.org/10.1073/pnas.1820006116.
citation("scMerge")
#> To cite scMerge in publications please use:
#>
#> Lin Y, Ghazanfar S, Wang K, Gagnon-Bartsch J, Lo K, Su X, Han Z,
#> Ormerod J, Speed T, Yang P, Yang J (2019). "scMerge leverages factor
#> analysis, stable expression, and pseudoreplication to merge multiple
#> single-cell RNA-seq datasets." _Proceedings of the National Academy
#> of Sciences_. doi:10.1073/pnas.1820006116
#> <https://doi.org/10.1073/pnas.1820006116>,
#> <http://www.pnas.org/lookup/doi/10.1073/pnas.1820006116>.
#>
#> A BibTeX entry for LaTeX users is
#>
#> @Article{,
#> title = {{scMerge leverages factor analysis, stable expression, and pseudoreplication to merge multiple single-cell RNA-seq datasets}},
#> author = {Yingxin Lin and Shila Ghazanfar and Kevin Wang and Johann Gagnon-Bartsch and Kitty Lo and Xianbin Su and Ze-Guang Han and John Ormerod and Terence Speed and Pengyi Yang and Jean Yang},
#> year = {2019},
#> journal = {Proceedings of the National Academy of Sciences},
#> doi = {https://doi.org/10.1073/pnas.1820006116},
#> url = {http://www.pnas.org/lookup/doi/10.1073/pnas.1820006116},
#> }
sessionInfo()
#> R version 4.4.2 (2024-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] BiocSingular_1.23.0 scater_1.35.0
#> [3] ggplot2_3.5.1 scuttle_1.17.0
#> [5] scMerge_1.23.0 SingleCellExperiment_1.29.1
#> [7] SummarizedExperiment_1.37.0 Biobase_2.67.0
#> [9] GenomicRanges_1.59.1 GenomeInfoDb_1.43.2
#> [11] IRanges_2.41.2 S4Vectors_0.45.2
#> [13] BiocGenerics_0.53.3 generics_0.1.3
#> [15] MatrixGenerics_1.19.0 matrixStats_1.5.0
#> [17] BiocStyle_2.35.0
#>
#> loaded via a namespace (and not attached):
#> [1] RColorBrewer_1.1-3 sys_3.4.3
#> [3] rstudioapi_0.17.1 jsonlite_1.8.9
#> [5] magrittr_2.0.3 ggbeeswarm_0.7.2
#> [7] farver_2.1.2 rmarkdown_2.29
#> [9] vctrs_0.6.5 DelayedMatrixStats_1.29.0
#> [11] base64enc_0.1-3 htmltools_0.5.8.1
#> [13] S4Arrays_1.7.1 curl_6.1.0
#> [15] BiocNeighbors_2.1.2 SparseArray_1.7.2
#> [17] Formula_1.2-5 sass_0.4.9
#> [19] StanHeaders_2.32.10 reldist_1.7-2
#> [21] KernSmooth_2.23-26 bslib_0.8.0
#> [23] htmlwidgets_1.6.4 cachem_1.1.0
#> [25] ResidualMatrix_1.17.0 sfsmisc_1.1-20
#> [27] buildtools_1.0.0 igraph_2.1.3
#> [29] lifecycle_1.0.4 startupmsg_0.9.7
#> [31] pkgconfig_2.0.3 M3Drop_1.33.0
#> [33] rsvd_1.0.5 Matrix_1.7-1
#> [35] R6_2.5.1 fastmap_1.2.0
#> [37] GenomeInfoDbData_1.2.13 digest_0.6.37
#> [39] numDeriv_2016.8-1.1 colorspace_2.1-1
#> [41] dqrng_0.4.1 irlba_2.3.5.1
#> [43] Hmisc_5.2-1 beachmat_2.23.5
#> [45] labeling_0.4.3 httr_1.4.7
#> [47] abind_1.4-8 mgcv_1.9-1
#> [49] compiler_4.4.2 withr_3.0.2
#> [51] htmlTable_2.4.3 backports_1.5.0
#> [53] inline_0.3.20 BiocParallel_1.41.0
#> [55] viridis_0.6.5 QuickJSR_1.4.0
#> [57] pkgbuild_1.4.5 gplots_3.2.0
#> [59] MASS_7.3-64 proxyC_0.4.1
#> [61] DelayedArray_0.33.3 bluster_1.17.0
#> [63] gtools_3.9.5 caTools_1.18.3
#> [65] loo_2.8.0 distr_2.9.5
#> [67] tools_4.4.2 vipor_0.4.7
#> [69] foreign_0.8-87 beeswarm_0.4.0
#> [71] nnet_7.3-20 glue_1.8.0
#> [73] batchelor_1.23.0 nlme_3.1-166
#> [75] cvTools_0.3.3 grid_4.4.2
#> [77] checkmate_2.3.2 cluster_2.1.8
#> [79] gtable_0.3.6 data.table_1.16.4
#> [81] metapod_1.15.0 ScaledMatrix_1.15.0
#> [83] XVector_0.47.2 ggrepel_0.9.6
#> [85] pillar_1.10.1 stringr_1.5.1
#> [87] limma_3.63.2 robustbase_0.99-4-1
#> [89] splines_4.4.2 lattice_0.22-6
#> [91] densEstBayes_1.0-2.2 ruv_0.9.7.1
#> [93] locfit_1.5-9.10 maketools_1.3.1
#> [95] knitr_1.49 gridExtra_2.3
#> [97] V8_6.0.0 edgeR_4.5.1
#> [99] xfun_0.50 statmod_1.5.0
#> [101] DEoptimR_1.1-3-1 rstan_2.32.6
#> [103] stringi_1.8.4 UCSC.utils_1.3.0
#> [105] yaml_2.3.10 evaluate_1.0.1
#> [107] codetools_0.2-20 bbmle_1.0.25.1
#> [109] tibble_3.2.1 BiocManager_1.30.25
#> [111] cli_3.6.3 RcppParallel_5.1.9
#> [113] rpart_4.1.24 munsell_0.5.1
#> [115] jquerylib_0.1.4 Rcpp_1.0.13-1
#> [117] bdsmatrix_1.3-7 parallel_4.4.2
#> [119] rstantools_2.4.0 scran_1.35.0
#> [121] sparseMatrixStats_1.19.0 bitops_1.0-9
#> [123] viridisLite_0.4.2 mvtnorm_1.3-2
#> [125] scales_1.3.0 crayon_1.5.3
#> [127] rlang_1.1.4