CEllular Latent Dirichlet Allocation (celda) is a collection of Bayesian hierarchical models to perform feature and cell bi-clustering for count data generated by single-cell platforms. This algorithm is an extension of the Latent Dirichlet Allocation (LDA) topic modeling framework that has been popular in text mining applications and has shown good performance with sparse data. celda simultaneously clusters features (i.e. gene expression) into modules based on co-expression patterns across cells and cells into subpopulations based on the probabilities of the feature modules within each cell.
Starting from Bioconductor release 3.12 (celda
version
1.6.0), celda
makes use of SingleCellExperiment
(SCE) objects for storing data and results. In this vignette we will
demonstrate how to use celda to perform cell and feature clustering with
a simple, small simulated dataset. This vignette does not include
upstream importing of data, quality control, or filtering. To see a more
complete analysis of larger real-world datasets, visit camplab.net/celda for
additional vignettes.
celda can be installed from Bioconductor:
if (!requireNamespace("BiocManager", quietly = TRUE)) {
install.packages("BiocManager")
}
BiocManager::install("celda")
To load the package, type the following:
A complete list of help files are accessible using the help command
with the package
option.
To see the latest updates and releases or to post a bug, see our GitHub page at https://github.com/campbio/celda. To ask questions about running celda, post a thread on Bioconductor support site at https://support.bioconductor.org/.
celda will take a matrix of counts where each row is a feature, each column is a cell, and each entry in the matrix is the number of counts of each feature in each cell. To illustrate the utility of celda, we will apply it to a simulated dataset.
In the function simulateCells
, the K
parameter designates the number of cell clusters, the L
parameter determines the number of feature modules, the
S parameter determines the number of samples in the
simulated dataset, the G parameter determines the
number of features to be simulated, and CRange
specifies the lower and upper bounds of the number of cells to be
generated in each sample.
To simulate a dataset of 5 samples with 5 cell populations, 10
feature modules, 200 features, and between 30 to 50 cells per sample
using celda_CG
model:
The counts
assay slot in simsce
contains
the counts matrix. The dimensions of counts matrix:
## [1] 200 207
Columns celda_sample_label
and
celda_cell_cluster
in colData(simsce)
contain
sample labels and celda cell population cluster labels. Here are the
numbers of cells in each subpopulation and in each sample:
##
## 1 2 3 4 5
## 42 44 40 47 34
##
## Sample_1 Sample_2 Sample_3 Sample_4 Sample_5
## 43 48 45 40 31
Column celda_feature_module
in
rowData(simsce)
contains feature module labels. Here is the
number of features in each feature module:
##
## 1 2 3 4 5 6 7 8 9 10
## 23 39 17 15 21 22 19 12 4 28
A simple heuristic feature selection is performed to reduce the size
of features used for clustering. To speed up the process, only features
with at least 3 counts in at least 3 cells are included in downstream
clustering for this data. A subset SingleCellExperiment
object with filtered features is stored in
altExp(simsce, "featureSubset")
slot by default.
If the number of features is still too large, then a smaller subset of features can be obtained by selecting the top number of most variable genes. For an example code, see the PBMC3K tutorial in the online celda documentation.
There are currently three models within celda package:
celda_C
will cluster cells, celda_G
will
cluster features, and celda_CG
will simultaneously cluster
cells and features. Within the functions the K
parameter
will be the number of cell populations to be estimated, while the
L
parameter will be the number of feature modules to be
estimated in the output model.
Here is a comparison between the true cluster labels and the estimated cluster labels.
##
## 1 2 3 4 5
## 1 0 44 0 0 0
## 2 42 0 0 0 0
## 3 0 0 40 0 0
## 4 0 0 0 47 0
## 5 0 0 0 0 34
##
## 1 2 3 4 5 6 7 8 9 10
## 1 0 32 0 0 0 0 0 0 0 0
## 2 19 0 0 0 0 0 0 0 0 0
## 3 0 0 15 0 0 0 0 0 0 0
## 4 0 0 0 13 0 0 0 0 0 0
## 5 0 0 0 0 21 0 0 0 0 0
## 6 0 0 0 0 0 19 0 0 0 0
## 7 0 0 0 0 0 0 0 12 0 0
## 8 0 0 0 0 0 0 17 0 0 0
## 9 0 0 0 0 0 0 0 0 3 0
## 10 0 0 0 0 0 0 0 0 0 20
celda contains its own wrapper function for tSNE and UMAP called
celdaTsne
and celdaUmap
, respectively. Both of
these functions can be used to embed cells into 2-dimensions. The output
can be used in the downstream plotting functions
plotDimReduceCluster
, plotDimReduceModule
, and
plotDimReduceFeature
to show cell population clusters,
module probabilities, and expression of individual features,
respectively.
The clustering results can be viewed with a heatmap of the normalized
counts using the function celdaHeatmap
. The top
nfeatures
in each module will be selected according to the
factorized module probability matrix.
The relationships between feature modules and cell populations can be
visualized with celdaProbabilityMap
. The absolute
probabilities of each feature module in each cellular subpopulation is
shown on the left. The normalized and z-scored expression of each module
in each cell population is shown on the right.
moduleHeatmap
creates a heatmap using only the features
from a specific feature module. Cells are ordered from those with the
lowest probability of the module to the highest. If more than one module
is used, then cells will be ordered by the probabilities of the first
module.
In the previous example, the best K
(the number of cell
clusters) and L
(the number of feature modules) was already
known. However, the optimal K
and L
for each
new dataset will likely not be known beforehand and multiple choices of
K
and L
may need to be tried and compared.
celda offers two sets of functions to determine the optimum
K
and L
,
recursiveSplitModule
/recursiveSplitCell
, and
celdaGridSearch
.
Functions recursiveSplitModule
and
recursiveSplitCell
offer a fast method to generate a celda
model with optimum K
and L
. First,
recursiveSplitModule
is used to determine the optimal
L
. recursiveSplitModule
first splits features
into however many modules are specified in initialL
. The
module labels are then recursively split in a way that would generate
the highest log-likelihood, all the way up to maxL
.
Perplexity is a statistical measure of how well a probability model
can predict new data. Lower perplexity indicates a better model. The
perplexity of each model can be visualized with
plotGridSearchPerplexity
. In general, visual inspection of
the plot can be used to select the optimal number of modules
(L
) or cell populations (K
) by identifying the
“elbow” - where the rate of decrease in the perplexity starts to drop
off.
In this example, the perplexity for L
stops decreasing
at L = 10, thus L = 10 would be a good choice. Sometimes the perplexity
alone does not show a clear elbow or “leveling off”. However, the rate
of perplexity change (RPC) can be more informative to determine when
adding new modules does not add much additional information Zhao
et al., 2015). An RPC closer to zero indicates that the addition of
new modules or cell clusters is not substantially decreasing the
perplexity. The RPC of models can be visualized using function
plotRPC
:
Once you have identified the optimal L
(in this case, L
is selected to be 10), the module labels are used for initialization in
recursiveSplitCell
. Similarly to
recursiveSplitModule
, cells are initially split into a
small number of subpopulations, and the subpopulations are recursively
split up.
moduleSplitSelect <- subsetCeldaList(moduleSplit, params = list(L = 10))
cellSplit <- recursiveSplitCell(moduleSplitSelect,
initialK = 3,
maxK = 12,
yInit = celdaModules(moduleSplitSelect))
In this plot, the perplexity for K stops decreasing at K = 5, with a
final K/L combination of K = 5, L = 10. Generally, this method can be
used to pick a reasonable L
and a potential range of
K
. However, manual review of specific selections of
K
is often required to ensure results are biologically
coherent.
Once users have chosen the K/L parameters for further analysis, the
subsetCeldaList
function can be used to subset the celda
list SCE object to a single model SCE object.
Alternativley to recursive splitting, celda is able to run multiple
combinations of K and L with multiple chains in parallel via the
celdaGridSearch
function.
cgs <- celdaGridSearch(simsce,
paramsTest = list(K = seq(4, 6), L = seq(9, 11)),
cores = 1,
model = "celda_CG",
nchains = 2,
maxIter = 100,
verbose = FALSE,
bestOnly = TRUE)
Setting verbose
to TRUE
will print the
output of each model to a text file. These results can be visualized
with plotGridSearchPerplexity
. The major goal is to pick
the lowest K
and L
combination with relatively
good perplexity. In general, visual inspection of the plot can be used
to select the number of modules (L
) or cell populations
(K
) where the rate of decrease in the perplexity starts to
drop off. bestOnly = TRUE
indicates that only the chain
with the best log likelihood will be returned for each K/L
combination.
In this example, the perplexity for L
stops decreasing
at L = 10 for the majority of K
values. For the line
corresponding to L = 10, the perplexity stops decreasing at K = 5. Thus
L = 10 and K = 5 would be a good choice. Again, manual review of
specific selections of K is often be required to ensure results are
biologically coherent.
Once users have chosen the K/L parameters for further analysis, the
subsetCeldaList
function can be used to subset the celda
list SCE object to a single model SCE object.
If the “bestOnly” parameter is set to FALSE in the
celdaGridSearch
, then the selectBestModel
function can be used to select the chains with the lowest log
likelihoods within each combination of parameters. Alternatively, users
can select a specific chain by specifying the index within the
subsetCeldaList
function.
cgs <- celdaGridSearch(simsce,
paramsTest = list(K = seq(4, 6), L = seq(9, 11)),
cores = 1,
model = "celda_CG",
nchains = 2,
maxIter = 100,
verbose = FALSE,
bestOnly = FALSE)
cgs <- resamplePerplexity(cgs, celdaList = cgs, resample = 2)
cgsK5L10 <- subsetCeldaList(cgs, params = list(K = 5, L = 10))
sce <- selectBestModel(cgsK5L10)
celda also contains several utility functions for the users’ convenience.
featureModuleLookup
can be used to look up the module a
specific feature was clustered to.
## Gene_99
## 4
recodeClusterZ
and recodeClusterY
allows
the user to recode the cell and feature cluster labels,
respectively.
The model prior to reordering cell labels compared to after reordering cell labels:
##
## 1 2 3 4 5
## 1 0 44 0 0 0
## 2 42 0 0 0 0
## 3 0 0 40 0 0
## 4 0 0 0 47 0
## 5 0 0 0 0 34
## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] celda_1.23.0 Matrix_1.7-1
## [3] SingleCellExperiment_1.27.2 SummarizedExperiment_1.35.5
## [5] Biobase_2.65.1 GenomicRanges_1.57.2
## [7] GenomeInfoDb_1.41.2 IRanges_2.39.2
## [9] S4Vectors_0.43.2 BiocGenerics_0.51.3
## [11] MatrixGenerics_1.17.1 matrixStats_1.4.1
## [13] BiocStyle_2.33.1
##
## loaded via a namespace (and not attached):
## [1] gridExtra_2.3 rlang_1.1.4 magrittr_2.0.3
## [4] clue_0.3-65 GetoptLong_1.0.5 compiler_4.4.1
## [7] png_0.1-8 vctrs_0.6.5 reshape2_1.4.4
## [10] combinat_0.0-8 stringr_1.5.1 pkgconfig_2.0.3
## [13] shape_1.4.6.1 crayon_1.5.3 fastmap_1.2.0
## [16] magick_2.8.5 XVector_0.45.0 labeling_0.4.3
## [19] utf8_1.2.4 rmarkdown_2.28 UCSC.utils_1.1.0
## [22] xfun_0.48 WriteXLS_6.7.0 zlibbioc_1.51.2
## [25] cachem_1.1.0 jsonlite_1.8.9 highr_0.11
## [28] DelayedArray_0.31.14 cluster_2.1.6 irlba_2.3.5.1
## [31] parallel_4.4.1 R6_2.5.1 bslib_0.8.0
## [34] stringi_1.8.4 RColorBrewer_1.1-3 MCMCprecision_0.4.0
## [37] jquerylib_0.1.4 Rcpp_1.0.13 iterators_1.0.14
## [40] knitr_1.48 FNN_1.1.4.1 tidyselect_1.2.1
## [43] abind_1.4-8 yaml_2.3.10 enrichR_3.2
## [46] doParallel_1.0.17 codetools_0.2-20 curl_5.2.3
## [49] lattice_0.22-6 tibble_3.2.1 plyr_1.8.9
## [52] withr_3.0.2 evaluate_1.0.1 Rtsne_0.17
## [55] RcppEigen_0.3.4.0.2 circlize_0.4.16 pillar_1.9.0
## [58] BiocManager_1.30.25 foreach_1.5.2 generics_0.1.3
## [61] ggplot2_3.5.1 munsell_0.5.1 scales_1.3.0
## [64] glue_1.8.0 maketools_1.3.1 tools_4.4.1
## [67] sys_3.4.3 data.table_1.16.2 buildtools_1.0.0
## [70] Cairo_1.6-2 grid_4.4.1 colorspace_2.1-1
## [73] GenomeInfoDbData_1.2.13 cli_3.6.3 fansi_1.0.6
## [76] S4Arrays_1.5.11 ComplexHeatmap_2.21.1 dplyr_1.1.4
## [79] uwot_0.2.2 gtable_0.3.6 sass_0.4.9
## [82] digest_0.6.37 SparseArray_1.5.45 ggrepel_0.9.6
## [85] rjson_0.2.23 farver_2.1.2 htmltools_0.5.8.1
## [88] lifecycle_1.0.4 httr_1.4.7 GlobalOptions_0.1.2