Single-cell ’omics analysis enables high-resolution characterization of heterogeneous populations of cells by quantifying measurements in individual cells and thus provides a fuller, more nuanced picture into the complexity and heterogeneity between cells. However, the data also present new and significant challenges as compared to previous approaches, especially as single-cell data are much larger and sparser than data generated from bulk sequencing methods. Dimension reduction is a key step in the single-cell analysis to address the high dimension and sparsity of these data, and to enable the application of more complex, computationally expensive downstream pipelines.
Correspondence analysis (CA) is a matrix factorization method, and is
similar to principal components analysis (PCA). Whereas PCA is designed
for application to continuous, approximately normally distributed data,
CA is appropriate for non-negative, count-based data that are in the
same additive scale. corral
implements CA for
dimensionality reduction of a single matrix of single-cell data.
See the vignette for corralm
for the multi-table
adaptation of CA for single-cell batch alignment/integration.
corral can be used with various types of input. When called on a
matrix (or other matrix-like object), it returns a list with the SVD
output, principal coordinates, and standard coordinates. When called on
a SingleCellExperiment,
it returns the SingleCellExperiment
with the corral embeddings in the reducedDim
slot named
corral
. To retrieve the full list output from a
SingleCellExperiment
input, the fullout
argument can be set to TRUE
.
We will use the Zhengmix4eq
dataset from the DuoClustering2018
package.
library(corral)
library(SingleCellExperiment)
library(ggplot2)
library(DuoClustering2018)
zm4eq.sce <- sce_full_Zhengmix4eq()
zm8eq <- sce_full_Zhengmix8eq()
This dataset includes approximately 4,000 pre-sorted and annotated cells of 4 types mixed by Duo et al. in approximately equal proportions (Duò, Robinson, and Soneson, n.d.). The cells were sampled from a “Massively parallel digital transcriptional profiling of single cells” (Zheng et al. 2017).
## class: SingleCellExperiment
## dim: 15568 3994
## metadata(1): log.exprs.offset
## assays(3): counts logcounts normcounts
## rownames(15568): ENSG00000237683 ENSG00000228327 ... ENSG00000215700
## ENSG00000215699
## rowData names(10): id symbol ... total_counts log10_total_counts
## colnames(3994): b.cells1147 b.cells6276 ... regulatory.t1084
## regulatory.t9696
## colData names(14): dataset barcode ... libsize.drop feature.drop
## reducedDimNames(2): PCA TSNE
## mainExpName: NULL
## altExpNames(0):
##
## b.cells cd14.monocytes naive.cytotoxic regulatory.t
## 999 1000 998 997
corral
on SingleCellExperimentWe will run corral
directly on the raw count data:
## class: SingleCellExperiment
## dim: 15568 3994
## metadata(1): log.exprs.offset
## assays(3): counts logcounts normcounts
## rownames(15568): ENSG00000237683 ENSG00000228327 ... ENSG00000215700
## ENSG00000215699
## rowData names(10): id symbol ... total_counts log10_total_counts
## colnames(3994): b.cells1147 b.cells6276 ... regulatory.t1084
## regulatory.t9696
## colData names(14): dataset barcode ... libsize.drop feature.drop
## reducedDimNames(3): PCA TSNE corral
## mainExpName: NULL
## altExpNames(0):
We can use plot_embedding
to visualize the output:
plot_embedding_sce(sce = zm4eq.sce,
which_embedding = 'corral',
plot_title = 'corral on Zhengmix4eq',
color_attr = 'phenoid',
color_title = 'cell type',
saveplot = FALSE)
Using the scater
package, we can also add and visualize
umap
and tsne
embeddings based on the
corral
output:
## Loading required package: scuttle
library(gridExtra) # so we can arrange the plots side by side
zm4eq.sce <- runUMAP(zm4eq.sce,
dimred = 'corral',
name = 'corral_UMAP')
zm4eq.sce <- runTSNE(zm4eq.sce,
dimred = 'corral',
name = 'corral_TSNE')
ggplot_umap <- plot_embedding_sce(sce = zm4eq.sce,
which_embedding = 'corral_UMAP',
plot_title = 'Zhengmix4eq corral with UMAP',
color_attr = 'phenoid',
color_title = 'cell type',
returngg = TRUE,
showplot = FALSE,
saveplot = FALSE)
ggplot_tsne <- plot_embedding_sce(sce = zm4eq.sce,
which_embedding = 'corral_TSNE',
plot_title = 'Zhengmix4eq corral with tSNE',
color_attr = 'phenoid',
color_title = 'cell type',
returngg = TRUE,
showplot = FALSE,
saveplot = FALSE)
gridExtra::grid.arrange(ggplot_umap, ggplot_tsne, ncol = 2)
The corral
embeddings stored in the
reducedDim
slot can be used in downstream analysis, such as
for clustering or trajectory analysis.
corral
can also be run on a
SummarizedExperiment
object.
corral
on matrixcorral
can also be performed on a matrix (or
matrix-like) input.
The output is in a list
format, including the SVD output
(u
,d
,v
), the standard coordinates
(SCu
,SCv
), and the principal coordinates
(PCu
,PCv
).
## corral output summary===========================================
## Output "list" includes standard coordinates (SCu, SCv),
## principal coordinates (PCu, PCv), & SVD output (u, d, v)
## Variance explained----------------------------------------------
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## percent.Var.explained 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00
## cumulative.Var.explained 0.01 0.02 0.02 0.02 0.03 0.03 0.03 0.03
##
## Dimensions of output elements-----------------------------------
## Singular values (d) :: 30
## Left singular vectors & coordinates (u, SCu, PCu) :: 15568 30
## Right singular vectors & coordinates (v, SCv, PCv) :: 3994 30
## See corral help for details on each output element.
## Use plot_embedding to visualize; see docs for details.
## ================================================================
We can use plot_embedding
to visualize the output: (the
embeddings are in the v
matrix because these data are by
genes in the rows and have cells in the columns; if this were reversed,
with cells in the rows and genes/features in the column, then the cell
embeddings would instead be in the u
matrix.)
celltype_vec <- zm4eq.sce$phenoid
plot_embedding(embedding = zm4eq.countcorral$v,
plot_title = 'corral on Zhengmix4eq',
color_vec = celltype_vec,
color_title = 'cell type',
saveplot = FALSE)
The output is the same as above with the
SingleCellExperiment
, and can be passed as the
low-dimension embedding for downstream analysis. Similarly, UMAP and
tSNE can be computed for visualization. (Note that in performing SVD,
the direction of the axes doesn’t matter so they may be flipped between
runs, as corral
and corralm
use
irlba
to perform fast approximation.)
Correspondence analysis is known to be sensitive to “rare objects” (Greenacre, 2013). Sometimes this can be beneficial because the method can detect small perturbations of rare populations. However, in other cases, a couple outlier cells can be allowed to exert undue influence on a particular dimension.
In the corral
manuscript, we describe three general
approaches, included below; see our manuscript for more details and
results. In this vignette we also present a fourth approach (Trimming
extreme values with smooth
mode)
rtype
)Standard correspondence analysis decomposes Pearson χ2 residuals, computed with the formula: $$r_{p; ij} = \frac{\mathrm{observed} - \mathrm{expected}}{\sqrt{\mathrm{expected}}} = \frac{p_{ij} - p_{i.} \ p_{.j}}{\sqrt{p_{i.} \ p_{.j}}}$$
where $p_{ij} = \frac{x_{ij}}{N}$, $N = \sum_{i=1}^m \sum_{j=1}^n x_{ij}$, $p_{i.} = \mathrm{row \ weights} = \sum_{i=1}^m p_{ij}$, and $p_{.j} = \mathrm{col \ weights} = \sum_{j=1}^n p_{ij}$.
In corral
, this is the default setting. It can also be
explicitly selected by setting rtype = 'standardized'
or
rtype = 'pearson'
.
Another χ2 residual is the Freeman-Tukey: $$r_{f; ij} = \sqrt{p_{ij}} + \sqrt{p_{ij} + \frac{1}{N}} - \sqrt{4 p_{i.} \ p_{.j} + \frac{1}{N}}$$
It is more robust to overdispersion than the Pearson residuals, and therefore outperforms standard CA in many scRNAseq datasets.
In corral
, this option can be selected by setting
rtype = 'freemantukey'
.
vst_mth
)Another approach for addressing overdispersed counts is to apply a variance stabilizing transformation. The options included in the package:
vst_mth = 'sqrt'
vst_mth = 'anscombe'
vst_mth = 'freemantukey'
**Note that
this option is different from setting the rtype
parameter
to 'freemantukey'
powdef_alpha
)To apply a smoothing effect to the χ2 residuals, another
approach is to transform the residual matrix by a power of α ∈ (0, 1). To achieve a “soft”
smoothing effect, we suggest α ∈ [0.9, 0.99]. This option is
controlled with the powdef_alpha
parameter, which takes the
default value of NULL
(not used). To set it, use this
parameter and set it equal to the desired value for α as a numeric. e.g.,
powdef_alpha = 0.95
would be including this option and
setting α = 0.95.
smooth
mode)One adaptation (not described in the manuscript) that addresses
unduly influential outliers is to apply an alternative smoothing
procedure that narrows the range of the χ2-transformed values by
symmetrically trimming the top n fraction of extreme values (n defaults to .01 and can be set with the
pct_trim
argument). Since the corral
matrix
pre-processing procedure transforms the values into standardized χ2 space, they can be
considered proportional to the significance of difference between
observed and expected abundance for a given gene in a given cell. This
approach differs from power deflation in that it only adjusts the most
extreme values, and explicitly so, whereas power deflation shifts the
distribution of all values to be less extreme.
This additional pre-processing step can be applied in
corral
by setting the smooth
argument to
TRUE
(it defaults to FALSE
), and this mode
only works with standardized and indexed residuals options.
Reduced dimension embeddings are often used to find relationships
between observations. corral
can be used to additionally
explore relationships between the features and the observations. By
plotting both embeddings into the same space, the biplot reveals
potential associations between the “rows” and “columns” – in this case,
cells and genes.
Influential features in a given dimension will be further from the origin, and features strongly associated to a particular cluster of observations will be close in terms of vector direction (e.g., cosine similarity).
gene_names <- rowData(zm4eq.sce)$symbol
biplot_corral(corral_obj = zm4eq.countcorral, color_vec = celltype_vec, text_vec = gene_names)
For example, in this case, in the general direction of the B cells, there is the MALAT1 gene, which is a B cell marker. Considering the side of the plot with the CD14 monocytes, by showing which genes are closer to which subpopulation, the biplot can help characterize the different groups of cells.
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] scater_1.35.0 scuttle_1.17.0
## [3] DuoClustering2018_1.24.0 ggplot2_3.5.1
## [5] SingleCellExperiment_1.29.1 SummarizedExperiment_1.37.0
## [7] Biobase_2.67.0 GenomicRanges_1.59.1
## [9] GenomeInfoDb_1.43.2 IRanges_2.41.1
## [11] S4Vectors_0.45.2 BiocGenerics_0.53.3
## [13] generics_0.1.3 MatrixGenerics_1.19.0
## [15] matrixStats_1.4.1 corral_1.17.0
## [17] gridExtra_2.3 BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] DBI_1.2.3 rlang_1.1.4
## [3] magrittr_2.0.3 compiler_4.4.2
## [5] RSQLite_2.3.8 png_0.1-8
## [7] vctrs_0.6.5 maps_3.4.2.1
## [9] reshape2_1.4.4 stringr_1.5.1
## [11] pkgconfig_2.0.3 crayon_1.5.3
## [13] fastmap_1.2.0 dbplyr_2.5.0
## [15] XVector_0.47.0 labeling_0.4.3
## [17] utf8_1.2.4 rmarkdown_2.29
## [19] ggbeeswarm_0.7.2 UCSC.utils_1.3.0
## [21] purrr_1.0.2 bit_4.5.0
## [23] xfun_0.49 MultiAssayExperiment_1.33.1
## [25] beachmat_2.23.2 zlibbioc_1.52.0
## [27] cachem_1.1.0 pals_1.9
## [29] jsonlite_1.8.9 blob_1.2.4
## [31] DelayedArray_0.33.2 BiocParallel_1.41.0
## [33] parallel_4.4.2 irlba_2.3.5.1
## [35] R6_2.5.1 bslib_0.8.0
## [37] stringi_1.8.4 jquerylib_0.1.4
## [39] Rcpp_1.0.13-1 knitr_1.49
## [41] FNN_1.1.4.1 Matrix_1.7-1
## [43] tidyselect_1.2.1 dichromat_2.0-0.1
## [45] abind_1.4-8 yaml_2.3.10
## [47] viridis_0.6.5 codetools_0.2-20
## [49] curl_6.0.1 lattice_0.22-6
## [51] tibble_3.2.1 plyr_1.8.9
## [53] withr_3.0.2 KEGGREST_1.47.0
## [55] Rtsne_0.17 evaluate_1.0.1
## [57] BiocFileCache_2.15.0 ExperimentHub_2.15.0
## [59] mclust_6.1.1 Biostrings_2.75.1
## [61] pillar_1.9.0 BiocManager_1.30.25
## [63] filelock_1.0.3 BiocVersion_3.21.1
## [65] munsell_0.5.1 scales_1.3.0
## [67] glue_1.8.0 mapproj_1.2.11
## [69] maketools_1.3.1 tools_4.4.2
## [71] AnnotationHub_3.15.0 BiocNeighbors_2.1.1
## [73] sys_3.4.3 data.table_1.16.2
## [75] ScaledMatrix_1.15.0 buildtools_1.0.0
## [77] grid_4.4.2 tidyr_1.3.1
## [79] AnnotationDbi_1.69.0 colorspace_2.1-1
## [81] GenomeInfoDbData_1.2.13 beeswarm_0.4.0
## [83] BiocSingular_1.23.0 vipor_0.4.7
## [85] rsvd_1.0.5 cli_3.6.3
## [87] rappdirs_0.3.3 fansi_1.0.6
## [89] ggthemes_5.1.0 S4Arrays_1.7.1
## [91] viridisLite_0.4.2 dplyr_1.1.4
## [93] uwot_0.2.2 gtable_0.3.6
## [95] sass_0.4.9 digest_0.6.37
## [97] ggrepel_0.9.6 SparseArray_1.7.2
## [99] farver_2.1.2 memoise_2.0.1
## [101] htmltools_0.5.8.1 lifecycle_1.0.4
## [103] httr_1.4.7 mime_0.12
## [105] transport_0.15-4 bit64_4.5.2