Data from different experimental platforms and/or batches exhibit systematic variation – i.e., batch effects. Therefore, when conducting joint analysis of data from different batches, a key first step is to align the datasets.
corralm
is a multi-table adaptation of correspondence
analysis designed for single-cell data, which applies multi-dimensional
optimized scaling and matrix factorization to compute integrated
embeddings across the datasets. These embeddings can then be used in
downstream analyses, such as clustering, cell type classification,
trajectory analysis, etc.
See the vignette for corral
for dimensionality reduction
of a single matrix of single-cell data.
We will use the SCMixology
datasets from the CellBench
package (Tian et al. 2019).
library(corral)
library(SingleCellExperiment)
library(ggplot2)
library(CellBench)
library(MultiAssayExperiment)
scmix_dat <- load_all_data()[1:3]
These datasets include a mixture of three lung cancer cell lines:
which was sequenced using three platforms:
## $sc_10x
## class: SingleCellExperiment
## dim: 16468 902
## metadata(3): scPipe Biomart log.exprs.offset
## assays(2): counts logcounts
## rownames(16468): ENSG00000272758 ENSG00000154678 ... ENSG00000054219
## ENSG00000137691
## rowData names(0):
## colnames(902): CELL_000001 CELL_000002 ... CELL_000955 CELL_000965
## colData names(14): unaligned aligned_unmapped ... cell_line sizeFactor
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
##
## $sc_celseq
## class: SingleCellExperiment
## dim: 28204 274
## metadata(3): scPipe Biomart log.exprs.offset
## assays(2): counts logcounts
## rownames(28204): ENSG00000281131 ENSG00000227456 ... ENSG00000148143
## ENSG00000226887
## rowData names(0):
## colnames(274): A1 A10 ... P8 P9
## colData names(15): unaligned aligned_unmapped ... cell_line sizeFactor
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
##
## $sc_dropseq
## class: SingleCellExperiment
## dim: 15127 225
## metadata(3): scPipe Biomart log.exprs.offset
## assays(2): counts logcounts
## rownames(15127): ENSG00000223849 ENSG00000225355 ... ENSG00000133789
## ENSG00000146674
## rowData names(0):
## colnames(225): CELL_000001 CELL_000002 ... CELL_000249 CELL_000302
## colData names(14): unaligned aligned_unmapped ... cell_line sizeFactor
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
Each sequencing platform captures a different set of genes. In order to apply this method, the matrices need to be matched by features (i.e., genes). We’ll find the intersect of the three datasets, then subset for that as we proceed.
First, we will prepare the data by: 1. adding to the colData the
sequencing platform (Method
in colData
for
each SCE), and 2. subsetting by the intersect of the genes.
platforms <- c('10X','CELseq2','Dropseq')
for(i in seq_along(scmix_dat)) {
colData(scmix_dat[[i]])$Method<- rep(platforms[i], ncol(scmix_dat[[i]]))
}
scmix_mae <- as(scmix_dat,'MultiAssayExperiment')
scmix_dat <- as.list(MultiAssayExperiment::experiments(MultiAssayExperiment::intersectRows(scmix_mae)))
corralm
can be applied to the following types of
objects:
splitby
(also see documentation of
corralm_matlist
for additional optional arguments that can
be passed), which is a character string for the attribute in
colData
that is tracking the batches. In our case, this
would be the “Method” attribute we just added. The output from this type
of input is the same SingleCellExperiment
, with the result
added to the reducedDim
slot under
corralm
.SingleCellExperiment
s, with the result added to the
reducedDim
slot under corralm
.u
,d
,v
) where v
contains a concatenated vector of the embeddings for the cellsExperimentList
does not require any specific
arguments. corralm
will identify the intersect of the rows,
and use these to match the matrices. The output will be the same as for
a list of matrices.For purposes of illustration, we will walk through using
corralm
with a single SCE, and with a list of matrices.
corralm
on a single SingleCellExperimentFirst, setting up the data to demonstrate this:
colData(scmix_dat[[2]])$non_ERCC_percent <- NULL
# need to remove this column so the objects can be concatenated
scmix_sce <- SingleCellExperiment::cbind(scmix_dat[[1]],
scmix_dat[[2]],
scmix_dat[[3]])
Running corralm
, and specifying the splitby
argument: (Note that the default is for the counts
matrix
to be used. To change this default, use the whichmat
argument.)
Visualizing the results:
plot_embedding_sce(sce = scmix_sce,
which_embedding = 'corralm',
color_attr = 'Method',
color_title = 'platform',
ellipse_attr = 'cell_line',
plot_title = 'corralm on scmix',
saveplot = FALSE)
## Warning: The following aesthetics were dropped during statistical transformation:
## colour.
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
## the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
## variable into a factor?
corralm
on a list of matricesAgain, preparing the data to be in this input format:
scmix_matlist <- sce2matlist(sce = scmix_sce,
splitby = 'Method',
whichmat = 'counts')
# for plotting purposes later, while we're here
platforms <- colData(scmix_sce)$Method
cell_lines <- colData(scmix_sce)$cell_line
Running corralm and visualizing output… (the embeddings are in the
v
matrix because these data are matched by genes in the
rows and have cells in the columns; if this were reversed, with cells in
the rows and genes/features in the column, then the cell embeddings
would instead be in the u
matrix.)
## corralm output summary==========================================
## Output "list" includes SVD output (u, d, v) & a table of the
## dimensions of the input matrices (batch_sizes)
## Variance explained----------------------------------------------
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## percent.Var.explained 0.03 0.02 0.01 0.01 0.01 0.01 0.01 0.01
## cumulative.Var.explained 0.03 0.05 0.05 0.06 0.07 0.07 0.08 0.08
##
## Dimensions of output elements-----------------------------------
## Singular values (d) :: 30
## Left singular vectors (u) :: 13575 30
## Right singular vectors (v) :: 1401 30
## See corralm help for details on each output element.
##
## Original batches & sizes (in order)-----------------------------
## 10X :: 902
## CELseq2 :: 274
## Dropseq :: 225
##
## Use plot_embedding to visualize; see docs for details.
## ================================================================
plot_embedding(embedding = scmix_corralm$v,
plot_title = 'corralm on scmix',
color_vec = platforms,
color_title = 'platform',
ellipse_vec = cell_lines,
saveplot = FALSE)
## Warning: The following aesthetics were dropped during statistical transformation:
## colour.
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
## the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
## variable into a factor?
As expected, we get the same results as above. (Note that in
performing SVD, the direction of the axes doesn’t matter and they may be
flipped between runs, as corral
and corralm
use irlba
to perform fast approximation.)
Scaled variance plots provide a simple and fast visual summary of the
integration of embeddings. It can be called using the
scal_var
function, and works on both corralm
objects and custom embeddings (with a vector indicating batch).
When integrating embedding representations across batches, measures for cluster evaluation are effective for assessing group compactness and recovery of cell populations via clustering. However, they do not directly assess how well dataset embeddings are integrated across batches. To focus specifically on batch integration, we developed and applied a heuristic scaled variance metric, which captures the relative dispersion of each batch with respect to the entire dataset. The scaled variance of component dimension d* for the subset of observations in batch b*, SVb*, d, is computed with: $$SV_{b^*,d} = \frac{\mathrm{Var}(\mathbf{E_{b=b^*,d=d^*}})}{\mathrm{Var}(\mathbf{E_{d=d^*}})}$$ where E is the matrix of embeddings, and b indexes the rows (observations by batch) while d indexes the columns to indicate which component dimension to evaluate.
When the datasets are well integrated, SV values for each batch are close to 1, indicating that each batch’s subset has similar dispersion as compared to the entire embedding. In contrast, if there is poorer integration, the scaled variance values will be more extreme away from 1 because the variance within batches will differ more from the variance overall. This metric is appropriate when the types of cells represented in different datasets are expected to be similar, but cannot account for situations where the expected distribution of cell types (and therefore, embeddings) is fundamentally different between batches.
## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] MultiAssayExperiment_1.31.5 CellBench_1.23.0
## [3] tibble_3.2.1 magrittr_2.0.3
## [5] scater_1.33.4 scuttle_1.15.5
## [7] DuoClustering2018_1.23.0 ggplot2_3.5.1
## [9] SingleCellExperiment_1.27.2 SummarizedExperiment_1.35.5
## [11] Biobase_2.67.0 GenomicRanges_1.57.2
## [13] GenomeInfoDb_1.41.2 IRanges_2.39.2
## [15] S4Vectors_0.43.2 BiocGenerics_0.53.0
## [17] MatrixGenerics_1.17.1 matrixStats_1.4.1
## [19] corral_1.17.0 gridExtra_2.3
## [21] BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] DBI_1.2.3 rlang_1.1.4 compiler_4.4.1
## [4] RSQLite_2.3.7 png_0.1-8 vctrs_0.6.5
## [7] maps_3.4.2 reshape2_1.4.4 stringr_1.5.1
## [10] pkgconfig_2.0.3 crayon_1.5.3 fastmap_1.2.0
## [13] dbplyr_2.5.0 XVector_0.45.0 labeling_0.4.3
## [16] utf8_1.2.4 rmarkdown_2.28 ggbeeswarm_0.7.2
## [19] UCSC.utils_1.1.0 purrr_1.0.2 bit_4.5.0
## [22] xfun_0.48 beachmat_2.23.0 zlibbioc_1.51.2
## [25] cachem_1.1.0 pals_1.9 jsonlite_1.8.9
## [28] blob_1.2.4 highr_0.11 DelayedArray_0.31.14
## [31] BiocParallel_1.39.0 parallel_4.4.1 irlba_2.3.5.1
## [34] R6_2.5.1 bslib_0.8.0 stringi_1.8.4
## [37] lubridate_1.9.3 jquerylib_0.1.4 assertthat_0.2.1
## [40] Rcpp_1.0.13 knitr_1.48 BiocBaseUtils_1.9.0
## [43] FNN_1.1.4.1 timechange_0.3.0 Matrix_1.7-1
## [46] tidyselect_1.2.1 dichromat_2.0-0.1 abind_1.4-8
## [49] yaml_2.3.10 viridis_0.6.5 codetools_0.2-20
## [52] curl_5.2.3 lattice_0.22-6 plyr_1.8.9
## [55] withr_3.0.2 KEGGREST_1.45.1 Rtsne_0.17
## [58] evaluate_1.0.1 BiocFileCache_2.15.0 ExperimentHub_2.13.1
## [61] mclust_6.1.1 Biostrings_2.75.0 pillar_1.9.0
## [64] BiocManager_1.30.25 filelock_1.0.3 generics_0.1.3
## [67] BiocVersion_3.21.1 munsell_0.5.1 scales_1.3.0
## [70] glue_1.8.0 mapproj_1.2.11 maketools_1.3.1
## [73] tools_4.4.1 AnnotationHub_3.15.0 BiocNeighbors_2.1.0
## [76] sys_3.4.3 data.table_1.16.2 ScaledMatrix_1.13.0
## [79] buildtools_1.0.0 grid_4.4.1 tidyr_1.3.1
## [82] AnnotationDbi_1.69.0 colorspace_2.1-1 GenomeInfoDbData_1.2.13
## [85] beeswarm_0.4.0 BiocSingular_1.23.0 vipor_0.4.7
## [88] rsvd_1.0.5 cli_3.6.3 rappdirs_0.3.3
## [91] fansi_1.0.6 ggthemes_5.1.0 S4Arrays_1.5.11
## [94] viridisLite_0.4.2 dplyr_1.1.4 uwot_0.2.2
## [97] gtable_0.3.6 sass_0.4.9 digest_0.6.37
## [100] ggrepel_0.9.6 SparseArray_1.5.45 farver_2.1.2
## [103] memoise_2.0.1 htmltools_0.5.8.1 lifecycle_1.0.4
## [106] httr_1.4.7 mime_0.12 transport_0.15-4
## [109] bit64_4.5.2