This tutorial shows how to apply CDI to select optimal clustering labels among different candidate cell labels for UMI counts of single-cell RNA-sequencing dataset. Sections 1-3 are for datasets from one batch; Section 4 is for datasets from multiple batches. Datasets used in this tutorial were simulated for illustration purpose.
CDI can be installed from Bioconductor:
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("CDI")
The latest version can be installed from Github:
CDI provides simulated toy single-cell RNA-seq UMI count matrices for illustration purpose. To use other datasets, please change the code in the following two blocks. Filtering of cells/genes can be applied before feature gene selection; normalization/imputation is not applicable here, as CDI models the raw UMI counts.
Inputs:
## [1] 3000 2000
## one_batch_matrix_celltype
## type1 type2 type3 type4 type5
## 400 400 400 400 400
Here we provide 12 label sets generated from PCA+KMeans and Seurat v3.1.5, where the number of clusters range from 2 to 7.
For better visualization and comparison, the column names of this data frame indicate the clustering method and the number of clusters. For example, “KMeans_k5” refers to the set of cell labels generated by KMeans with 5 clusters.
data(one_batch_matrix_label_df, package = "CDI")
knitr::kable(head(one_batch_matrix_label_df[,c("KMeans_k2", "KMeans_k4", "Seurat_k2", "Seurat_k3")], 3))
KMeans_k2 | KMeans_k4 | Seurat_k2 | Seurat_k3 |
---|---|---|---|
2 | 3 | 1 | 2 |
2 | 3 | 1 | 2 |
2 | 3 | 1 | 2 |
feature_gene_selection
Feature genes are those genes that express differently across cell types. Several methods are available for feature gene selection. Here we propose a new method called working dispersion score (WDS). We select genes with highest WDS as the feature genes.
calculate_CDI
First, calculate the size factor of each gene with
size_factor
. The output of size_factor
will be
one of the inputs of calculate_CDI
.
Second, calculate CDI for each candidate set of cell labels:
start_time <- Sys.time()
CDI_return1 <- calculate_CDI(
X = sub_one_batch_matrix,
cand_lab_df = one_batch_matrix_label_df,
batch_label = NULL,
cell_size_factor = one_batch_matrix_size_factor)
end_time <- Sys.time()
print(difftime(end_time, start_time))
## Time difference of 35.38791 secs
Label_name | Cluster_method | CDI_AIC | CDI_BIC | neg_llk_val | N_cluster | |
---|---|---|---|---|---|---|
KMeans_k2 | KMeans_k2 | KMeans | 1135287 | 1146489 | 565643.4 | 2 |
KMeans_k3 | KMeans_k3 | KMeans | 1120096 | 1136899 | 557047.9 | 3 |
KMeans_k4 | KMeans_k4 | KMeans | 1112561 | 1134964 | 552280.4 | 4 |
KMeans_k5 | KMeans_k5 | KMeans | 1101970 | 1129974 | 545984.9 | 5 |
KMeans_k6 | KMeans_k6 | KMeans | 1103150 | 1136756 | 545575.1 | 6 |
KMeans_k7 | KMeans_k7 | KMeans | 1105022 | 1144229 | 545511.2 | 7 |
Seurat_k2 | Seurat_k2 | Seurat | 1129063 | 1140264 | 562531.3 | 2 |
Seurat_k3 | Seurat_k3 | Seurat | 1120002 | 1136805 | 557001.2 | 3 |
Seurat_k4 | Seurat_k4 | Seurat | 1109115 | 1131518 | 550557.3 | 4 |
Seurat_k5 | Seurat_k5 | Seurat | 1102321 | 1130325 | 546160.4 | 5 |
Seurat_k6 | Seurat_k6 | Seurat | 1103152 | 1136751 | 545575.9 | 6 |
Seurat_k7 | Seurat_k7 | Seurat | 1104099 | 1143292 | 545049.6 | 7 |
CDI counts the number of unique characters in each label set as the
“N_cluster”. If the column names of label set data frame are provided
with the format “[ClusteringMethod]_k[NumberOfClusters]” such as
“KMeans_K5, calculate_CDI
will extract
the”[ClusteringMethod]” as the Cluster_method. The clustering method can
also be provided in the argument clustering_method
for each
label set.
The outputs of calculate_CDI
include CDI-AIC and
CDI-BIC. CDI calculates AIC and BIC of cell-type-specific gene-specific
NB model for UMI counts, where the cell types are based on each
candidate label set, and only the selected subset of genes are
considered. Whether to use CDI-AIC or CDI-BIC depend on the goals. We
suggest using CDI-BIC to select optimal main cell types and using
CDI-AIC to select optimal subtypes, because BIC puts more penalty on the
complexity of models (number of clusters).
Output visualization
The outputs of CDI are demonstrated with a lineplot. The x-axis is for the number of clusters. The y-axis is for the CDI values. Different colors represent different clustering methods: The orange line represents label sets from Seurat; the blue line represents label sets from K-Means. The red triangle represents the optimal clustering result corresponding to the smallest CDI value.
The cell population in this simulated dataset doesn’t have a hierarchical structure. We use CDI-BIC to select the optimal set of cell labels. The label set “KMeans_k5” has the smallest CDI-BIC and is selected as the optimal.
A benchmark label set refers to human annotated cell type label set from biomarkers for real data or true label set for simulated data. If such label set is available, we can also compare the optimal clustering label set selected by CDI with it. Here we provide the heatmap of contingency table between benchmark label set and the optimal clustering label set selected by CDI. Each row represents a cell type in the benchmark label set, and each column represents a cluster in the optimal clustering label set selected by CDI. Each rectangle is color-scaled by the proportion of the cells in the given cluster coming from the benchmark types. Each column sums to 1.
contingency_heatmap(benchmark_label = one_batch_matrix_celltype,
candidate_label = one_batch_matrix_label_df$KMeans_k5,
rename_candidate_clusters = TRUE,
candidate_cluster_names = paste0('cluster', seq_len(length(unique(one_batch_matrix_label_df$KMeans_k5)))))
We can also calculate the CDI for the benchmark label set as the following:
benchmark_return1 <- calculate_CDI(X = sub_one_batch_matrix,
cand_lab_df = one_batch_matrix_celltype,
batch_label = NULL,
cell_size_factor = one_batch_matrix_size_factor)
The results of benchmark label set can be added to the CDI lineplot:
CDI_lineplot(cdi_dataframe = CDI_return1,
cdi_type = "CDI_BIC",
benchmark_celltype_cdi = benchmark_return1,
benchmark_celltype_ncluster = length(unique(one_batch_matrix_celltype)))
The purple star represents the CDI value for the benchmark label set. The result shows that the benchmark label set achieves lowest CDI-BIC among all label sets. Therefore, if the benchmark label set is available as one candidate label set, it will be selected by CDI-BIC as the optimal.
CDI accepts datasets from multiple batches. The candidate label sets can be obtained after batch effect correction, but CDI will be calculated based on raw UMI count matrices. The procedure of applying CDI to multi-batch datasets is similar to that of one batch, except for one extra input of batch labels.
## two_batch_matrix_celltype
## type1 type2 type3 type4 type5
## 400 400 400 400 400
## two_batch_matrix_batch
## batch1 batch2
## 1000 1000
## [1] 3000 2000
data(two_batch_matrix_label_df, package = "CDI")
knitr::kable(head(two_batch_matrix_label_df[,c("KMeans_k5", "KMeans_k6", "Seurat_k5", "Seurat_k6")], 3))
KMeans_k5 | KMeans_k6 | Seurat_k5 | Seurat_k6 |
---|---|---|---|
3 | 3 | 2 | 2 |
3 | 3 | 2 | 2 |
3 | 3 | 2 | 2 |
two_batch_matrix_size_factor <- size_factor(two_batch_matrix)
start_time <- Sys.time()
CDI_return2 <- calculate_CDI(
X = sub_two_batch_matrix,
cand_lab_df = two_batch_matrix_label_df,
batch_label = two_batch_matrix_batch,
cell_size_factor = two_batch_matrix_size_factor)
end_time <- Sys.time()
print(difftime(end_time, start_time))
## Time difference of 3.508361 mins
Label_name | Cluster_method | CDI_AIC | CDI_BIC | neg_llk_val | N_cluster | |
---|---|---|---|---|---|---|
KMeans_k3 | KMeans_k3 | KMeans | 1116010 | 1133295 | 554919.1 | 3 |
KMeans_k4 | KMeans_k4 | KMeans | 1090471 | 1114398 | 540963.7 | 4 |
KMeans_k5 | KMeans_k5 | KMeans | 1094696 | 1124000 | 542115.9 | 5 |
KMeans_k6 | KMeans_k6 | KMeans | 1084735 | 1119774 | 536111.6 | 6 |
KMeans_k7 | KMeans_k7 | KMeans | 1085682 | 1125975 | 535646.8 | 7 |
KMeans_k8 | KMeans_k8 | KMeans | 1086246 | 1132224 | 534910.8 | 8 |
KMeans_k9 | KMeans_k9 | KMeans | 1086179 | 1137386 | 533943.5 | 9 |
KMeans_k10 | KMeans_k10 | KMeans | 1089714 | 1146373 | 534741.1 | 10 |
KMeans_k11 | KMeans_k11 | KMeans | 1090071 | 1151995 | 533979.7 | 11 |
KMeans_k12 | KMeans_k12 | KMeans | 1091465 | 1159185 | 533640.6 | 12 |
KMeans_k13 | KMeans_k13 | KMeans | 1093719 | 1167259 | 533729.7 | 13 |
KMeans_k14 | KMeans_k14 | KMeans | 1093773 | 1172530 | 532822.4 | 14 |
KMeans_k15 | KMeans_k15 | KMeans | 1097256 | 1182247 | 533451.8 | 15 |
Seurat_k3 | Seurat_k3 | Seurat | 1098939 | 1117153 | 546217.3 | 3 |
Seurat_k4 | Seurat_k4 | Seurat | 1090520 | 1114458 | 540986.0 | 4 |
Seurat_k5 | Seurat_k5 | Seurat | 1083559 | 1113412 | 536449.6 | 5 |
Seurat_k6 | Seurat_k6 | Seurat | 1084476 | 1119538 | 535978.0 | 6 |
Seurat_k7 | Seurat_k7 | Seurat | 1084354 | 1124707 | 534970.9 | 7 |
Seurat_k8 | Seurat_k8 | Seurat | 1086408 | 1132145 | 535037.9 | 8 |
Seurat_k9 | Seurat_k9 | Seurat | 1087299 | 1138390 | 534527.4 | 9 |
Seurat_k10 | Seurat_k10 | Seurat | 1087046 | 1143337 | 533471.2 | 10 |
Seurat_k11 | Seurat_k11 | Seurat | 1089098 | 1150966 | 533501.1 | 11 |
Seurat_k12 | Seurat_k12 | Seurat | 1090061 | 1157517 | 532984.3 | 12 |
Seurat_k13 | Seurat_k13 | Seurat | 1090988 | 1164020 | 532449.9 | 13 |
Seurat_k14 | Seurat_k14 | Seurat | 1091273 | 1170009 | 531572.7 | 14 |
Seurat_k15 | Seurat_k15 | Seurat | 1091733 | 1176071 | 530800.4 | 15 |
Calculate the benchmark label set (if available) CDI:
benchmark_return <- calculate_CDI(
X = sub_two_batch_matrix,
cand_lab_df = two_batch_matrix_celltype,
batch_label = two_batch_matrix_batch,
cell_size_factor = two_batch_matrix_size_factor)
Output visualization:
CDI_lineplot(cdi_dataframe = CDI_return2,
cdi_type = "CDI_BIC",
benchmark_celltype_cdi = benchmark_return,
benchmark_celltype_ncluster = length(unique(two_batch_matrix_celltype)))
Compare the optimal clustering label set selected by CDI with the benchmark cell type labels:
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] CDI_1.5.0 BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] RColorBrewer_1.1-3 sys_3.4.3
## [3] jsonlite_1.8.9 magrittr_2.0.3
## [5] spatstat.utils_3.1-1 farver_2.1.2
## [7] rmarkdown_2.29 zlibbioc_1.52.0
## [9] vctrs_0.6.5 ROCR_1.0-11
## [11] spatstat.explore_3.3-3 S4Arrays_1.7.1
## [13] htmltools_0.5.8.1 SparseArray_1.7.2
## [15] sass_0.4.9 sctransform_0.4.1
## [17] parallelly_1.39.0 KernSmooth_2.23-24
## [19] bslib_0.8.0 htmlwidgets_1.6.4
## [21] ica_1.0-3 plyr_1.8.9
## [23] plotly_4.10.4 zoo_1.8-12
## [25] cachem_1.1.0 buildtools_1.0.0
## [27] igraph_2.1.1 mime_0.12
## [29] lifecycle_1.0.4 pkgconfig_2.0.3
## [31] Matrix_1.7-1 R6_2.5.1
## [33] fastmap_1.2.0 GenomeInfoDbData_1.2.13
## [35] MatrixGenerics_1.19.0 fitdistrplus_1.2-1
## [37] future_1.34.0 shiny_1.9.1
## [39] digest_0.6.37 colorspace_2.1-1
## [41] S4Vectors_0.45.2 patchwork_1.3.0
## [43] Seurat_5.1.0 tensor_1.5
## [45] RSpectra_0.16-2 irlba_2.3.5.1
## [47] GenomicRanges_1.59.1 labeling_0.4.3
## [49] progressr_0.15.1 fansi_1.0.6
## [51] spatstat.sparse_3.1-0 httr_1.4.7
## [53] polyclip_1.10-7 abind_1.4-8
## [55] compiler_4.4.2 withr_3.0.2
## [57] BiocParallel_1.41.0 fastDummies_1.7.4
## [59] MASS_7.3-61 DelayedArray_0.33.2
## [61] ggsci_3.2.0 tools_4.4.2
## [63] lmtest_0.9-40 httpuv_1.6.15
## [65] future.apply_1.11.3 goftest_1.2-3
## [67] glue_1.8.0 nlme_3.1-166
## [69] promises_1.3.2 grid_4.4.2
## [71] Rtsne_0.17 cluster_2.1.6
## [73] reshape2_1.4.4 generics_0.1.3
## [75] gtable_0.3.6 spatstat.data_3.1-4
## [77] tidyr_1.3.1 data.table_1.16.2
## [79] XVector_0.47.0 sp_2.1-4
## [81] utf8_1.2.4 BiocGenerics_0.53.3
## [83] spatstat.geom_3.3-4 RcppAnnoy_0.0.22
## [85] ggrepel_0.9.6 RANN_2.6.2
## [87] pillar_1.9.0 stringr_1.5.1
## [89] spam_2.11-0 RcppHNSW_0.6.0
## [91] later_1.4.1 splines_4.4.2
## [93] dplyr_1.1.4 lattice_0.22-6
## [95] survival_3.7-0 deldir_2.0-4
## [97] tidyselect_1.2.1 SingleCellExperiment_1.29.1
## [99] maketools_1.3.1 miniUI_0.1.1.1
## [101] pbapply_1.7-2 knitr_1.49
## [103] gridExtra_2.3 IRanges_2.41.1
## [105] SummarizedExperiment_1.37.0 scattermore_1.2
## [107] stats4_4.4.2 xfun_0.49
## [109] Biobase_2.67.0 matrixStats_1.4.1
## [111] UCSC.utils_1.3.0 stringi_1.8.4
## [113] lazyeval_0.2.2 yaml_2.3.10
## [115] evaluate_1.0.1 codetools_0.2-20
## [117] tibble_3.2.1 BiocManager_1.30.25
## [119] cli_3.6.3 uwot_0.2.2
## [121] xtable_1.8-4 reticulate_1.40.0
## [123] munsell_0.5.1 jquerylib_0.1.4
## [125] GenomeInfoDb_1.43.2 Rcpp_1.0.13-1
## [127] globals_0.16.3 spatstat.random_3.3-2
## [129] png_0.1-8 spatstat.univar_3.1-1
## [131] parallel_4.4.2 ggplot2_3.5.1
## [133] dotCall64_1.2 listenv_0.9.1
## [135] viridisLite_0.4.2 scales_1.3.0
## [137] ggridges_0.5.6 crayon_1.5.3
## [139] SeuratObject_5.0.2 leiden_0.4.3.1
## [141] purrr_1.0.2 rlang_1.1.4
## [143] cowplot_1.1.3