Introduction

This tutorial shows how to apply CDI to select optimal clustering labels among different candidate cell labels for UMI counts of single-cell RNA-sequencing dataset. Sections 1-3 are for datasets from one batch; Section 4 is for datasets from multiple batches. Datasets used in this tutorial were simulated for illustration purpose.

Installation

CDI can be installed from Bioconductor:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("CDI")

The latest version can be installed from Github:

if (!requireNamespace("remotes", quietly = TRUE))
    install.packages("remotes")
remotes::install_github("jichunxie/CDI", build_vignettes = TRUE)

Load datasets

CDI provides simulated toy single-cell RNA-seq UMI count matrices for illustration purpose. To use other datasets, please change the code in the following two blocks. Filtering of cells/genes can be applied before feature gene selection; normalization/imputation is not applicable here, as CDI models the raw UMI counts.

Inputs:

one_batch_matrix: a single-cell RNA-seq UMI count matrix from one batch (each row represents a gene and each column represent a cell). To apply CDI to datasets from multiple batches, please check section 4.
one_batch_matrix_label_df: the label sets of the one_batch_matrix, where each row represents a cell, and each column represents a set of cell labels generated by existing clustering methods such as K-Means.
one_batch_matrix_celltype: a vector of characters indicating the benchmark cell types of cells in one_batch_matrix (not necessary except for evaluation purpose).

data(one_batch_matrix, package = "CDI")
dim(one_batch_matrix)

## [1] 3000 2000

data(one_batch_matrix_celltype, package = "CDI")
table(one_batch_matrix_celltype)

## one_batch_matrix_celltype
## type1 type2 type3 type4 type5 
##   400   400   400   400   400

Here we provide 12 label sets generated from PCA+KMeans and Seurat v3.1.5, where the number of clusters range from 2 to 7.

For better visualization and comparison, the column names of this data frame indicate the clustering method and the number of clusters. For example, “KMeans_k5” refers to the set of cell labels generated by KMeans with 5 clusters.

data(one_batch_matrix_label_df, package = "CDI")
knitr::kable(head(one_batch_matrix_label_df[,c("KMeans_k2", "KMeans_k4", "Seurat_k2", "Seurat_k3")], 3))

KMeans_k2	KMeans_k4	Seurat_k2	Seurat_k3
2	3	1	2
2	3	1	2
2	3	1	2

Select feature genes with `feature_gene_selection`

Feature genes are those genes that express differently across cell types. Several methods are available for feature gene selection. Here we propose a new method called working dispersion score (WDS). We select genes with highest WDS as the feature genes.

feature_gene_indx <- feature_gene_selection(
    X = one_batch_matrix, 
    batch_label = NULL, 
    method = "wds", 
    nfeature = 500)
sub_one_batch_matrix <- one_batch_matrix[feature_gene_indx,]

`calculate_CDI`

First, calculate the size factor of each gene with size_factor. The output of size_factor will be one of the inputs of calculate_CDI.

one_batch_matrix_size_factor <- size_factor(X = one_batch_matrix)

Second, calculate CDI for each candidate set of cell labels:

start_time <- Sys.time()
CDI_return1 <- calculate_CDI(
    X = sub_one_batch_matrix, 
    cand_lab_df = one_batch_matrix_label_df, 
    batch_label = NULL, 
    cell_size_factor = one_batch_matrix_size_factor)
end_time <- Sys.time()
print(difftime(end_time, start_time))

## Time difference of 36.45411 secs

Select the optimal label set with minimum CDI-AIC/CDI-BIC

knitr::kable(CDI_return1)

	Label_name	Cluster_method	CDI_AIC	CDI_BIC	neg_llk_val	N_cluster
KMeans_k2	KMeans_k2	KMeans	1135287	1146489	565643.4	2
KMeans_k3	KMeans_k3	KMeans	1120096	1136899	557047.9	3
KMeans_k4	KMeans_k4	KMeans	1112561	1134964	552280.4	4
KMeans_k5	KMeans_k5	KMeans	1101970	1129974	545984.9	5
KMeans_k6	KMeans_k6	KMeans	1103150	1136756	545575.1	6
KMeans_k7	KMeans_k7	KMeans	1105022	1144229	545511.2	7
Seurat_k2	Seurat_k2	Seurat	1129063	1140264	562531.3	2
Seurat_k3	Seurat_k3	Seurat	1120002	1136805	557001.2	3
Seurat_k4	Seurat_k4	Seurat	1109115	1131518	550557.3	4
Seurat_k5	Seurat_k5	Seurat	1102321	1130325	546160.4	5
Seurat_k6	Seurat_k6	Seurat	1103152	1136751	545575.9	6
Seurat_k7	Seurat_k7	Seurat	1104099	1143292	545049.6	7

CDI counts the number of unique characters in each label set as the “N_cluster”. If the column names of label set data frame are provided with the format “[ClusteringMethod]_k[NumberOfClusters]” such as “KMeans_K5, calculate_CDI will extract the”[ClusteringMethod]” as the Cluster_method. The clustering method can also be provided in the argument clustering_method for each label set.

The outputs of calculate_CDI include CDI-AIC and CDI-BIC. CDI calculates AIC and BIC of cell-type-specific gene-specific NB model for UMI counts, where the cell types are based on each candidate label set, and only the selected subset of genes are considered. Whether to use CDI-AIC or CDI-BIC depend on the goals. We suggest using CDI-BIC to select optimal main cell types and using CDI-AIC to select optimal subtypes, because BIC puts more penalty on the complexity of models (number of clusters).

Output visualization

The outputs of CDI are demonstrated with a lineplot. The x-axis is for the number of clusters. The y-axis is for the CDI values. Different colors represent different clustering methods: The orange line represents label sets from Seurat; the blue line represents label sets from K-Means. The red triangle represents the optimal clustering result corresponding to the smallest CDI value.

The cell population in this simulated dataset doesn’t have a hierarchical structure. We use CDI-BIC to select the optimal set of cell labels. The label set “KMeans_k5” has the smallest CDI-BIC and is selected as the optimal.

CDI_lineplot(cdi_dataframe = CDI_return1, cdi_type = "CDI_BIC")

A benchmark label set refers to human annotated cell type label set from biomarkers for real data or true label set for simulated data. If such label set is available, we can also compare the optimal clustering label set selected by CDI with it. Here we provide the heatmap of contingency table between benchmark label set and the optimal clustering label set selected by CDI. Each row represents a cell type in the benchmark label set, and each column represents a cluster in the optimal clustering label set selected by CDI. Each rectangle is color-scaled by the proportion of the cells in the given cluster coming from the benchmark types. Each column sums to 1.

contingency_heatmap(benchmark_label = one_batch_matrix_celltype,
    candidate_label = one_batch_matrix_label_df$KMeans_k5,
    rename_candidate_clusters = TRUE,
    candidate_cluster_names = paste0('cluster', seq_len(length(unique(one_batch_matrix_label_df$KMeans_k5)))))

We can also calculate the CDI for the benchmark label set as the following:

benchmark_return1 <- calculate_CDI(X = sub_one_batch_matrix,
    cand_lab_df = one_batch_matrix_celltype, 
    batch_label = NULL, 
    cell_size_factor = one_batch_matrix_size_factor)

The results of benchmark label set can be added to the CDI lineplot:

CDI_lineplot(cdi_dataframe = CDI_return1,
    cdi_type = "CDI_BIC",
    benchmark_celltype_cdi = benchmark_return1,
    benchmark_celltype_ncluster = length(unique(one_batch_matrix_celltype)))

The purple star represents the CDI value for the benchmark label set. The result shows that the benchmark label set achieves lowest CDI-BIC among all label sets. Therefore, if the benchmark label set is available as one candidate label set, it will be selected by CDI-BIC as the optimal.

Apply CDI to datasets from multiple batches

CDI accepts datasets from multiple batches. The candidate label sets can be obtained after batch effect correction, but CDI will be calculated based on raw UMI count matrices. The procedure of applying CDI to multi-batch datasets is similar to that of one batch, except for one extra input of batch labels.

Inputs:

two_batch_matrix: a single-cell RNA-seq UMI count matrix from the multiple batches (each row represents a gene and each column represent a cell).
two_batch_matrix_batch: a vector of characters indicating the batch labels of cells in two_batch_matrix. The number of batches can be greater than 2.
two_batch_matrix_label_df: the label sets of the two_batch_matrix, where each row represents a cell, and each column represents a label set generated by existing clustering methods such as K-Means after batch effect correction.
two_batch_matrix_celltype: a vector of characters indicating the benchmark cell types of cells in two_batch_matrix (not necessary unless for evaluation purpose).

Data loading

data(two_batch_matrix_celltype, package = "CDI")
table(two_batch_matrix_celltype)

## two_batch_matrix_celltype
## type1 type2 type3 type4 type5 
##   400   400   400   400   400

data(two_batch_matrix_batch, package = "CDI")
table(two_batch_matrix_batch)

## two_batch_matrix_batch
## batch1 batch2 
##   1000   1000

data(two_batch_matrix, package = "CDI")
dim(two_batch_matrix)

## [1] 3000 2000

data(two_batch_matrix_label_df, package = "CDI")
knitr::kable(head(two_batch_matrix_label_df[,c("KMeans_k5", "KMeans_k6", "Seurat_k5", "Seurat_k6")], 3))

KMeans_k5	KMeans_k6	Seurat_k5	Seurat_k6
3	3	2	2
3	3	2	2
3	3	2	2

Feature gene selection

feature_gene_indx <- feature_gene_selection(
    X = two_batch_matrix,
    batch_label = two_batch_matrix_batch,
    method = "wds",
    nfeature = 500)
sub_two_batch_matrix <- two_batch_matrix[feature_gene_indx,]

Calculate CDI

two_batch_matrix_size_factor <- size_factor(two_batch_matrix)

start_time <- Sys.time()
CDI_return2 <- calculate_CDI(
    X = sub_two_batch_matrix,
    cand_lab_df = two_batch_matrix_label_df, 
    batch_label = two_batch_matrix_batch,
    cell_size_factor = two_batch_matrix_size_factor)
end_time <- Sys.time()
print(difftime(end_time, start_time))

## Time difference of 3.613106 mins

knitr::kable(CDI_return2)

	Label_name	Cluster_method	CDI_AIC	CDI_BIC	neg_llk_val	N_cluster
KMeans_k3	KMeans_k3	KMeans	1116010	1133295	554919.1	3
KMeans_k4	KMeans_k4	KMeans	1090471	1114398	540963.7	4
KMeans_k5	KMeans_k5	KMeans	1094696	1124000	542115.9	5
KMeans_k6	KMeans_k6	KMeans	1084735	1119774	536111.6	6
KMeans_k7	KMeans_k7	KMeans	1085682	1125975	535646.8	7
KMeans_k8	KMeans_k8	KMeans	1086246	1132224	534910.8	8
KMeans_k9	KMeans_k9	KMeans	1086179	1137386	533943.5	9
KMeans_k10	KMeans_k10	KMeans	1089714	1146373	534741.1	10
KMeans_k11	KMeans_k11	KMeans	1090071	1151995	533979.7	11
KMeans_k12	KMeans_k12	KMeans	1091465	1159185	533640.6	12
KMeans_k13	KMeans_k13	KMeans	1093719	1167259	533729.7	13
KMeans_k14	KMeans_k14	KMeans	1093773	1172530	532822.4	14
KMeans_k15	KMeans_k15	KMeans	1097256	1182247	533451.8	15
Seurat_k3	Seurat_k3	Seurat	1098939	1117153	546217.3	3
Seurat_k4	Seurat_k4	Seurat	1090520	1114458	540986.0	4
Seurat_k5	Seurat_k5	Seurat	1083559	1113412	536449.6	5
Seurat_k6	Seurat_k6	Seurat	1084476	1119538	535978.0	6
Seurat_k7	Seurat_k7	Seurat	1084354	1124707	534970.9	7
Seurat_k8	Seurat_k8	Seurat	1086408	1132145	535037.9	8
Seurat_k9	Seurat_k9	Seurat	1087299	1138390	534527.4	9
Seurat_k10	Seurat_k10	Seurat	1087046	1143337	533471.2	10
Seurat_k11	Seurat_k11	Seurat	1089098	1150966	533501.1	11
Seurat_k12	Seurat_k12	Seurat	1090061	1157517	532984.3	12
Seurat_k13	Seurat_k13	Seurat	1090988	1164020	532449.9	13
Seurat_k14	Seurat_k14	Seurat	1091273	1170009	531572.7	14
Seurat_k15	Seurat_k15	Seurat	1091733	1176071	530800.4	15

Calculate the benchmark label set (if available) CDI:

benchmark_return <- calculate_CDI(
    X = sub_two_batch_matrix,
    cand_lab_df = two_batch_matrix_celltype,
    batch_label = two_batch_matrix_batch,
    cell_size_factor = two_batch_matrix_size_factor)

Output visualization:

CDI_lineplot(cdi_dataframe = CDI_return2,
    cdi_type = "CDI_BIC",
    benchmark_celltype_cdi = benchmark_return,
    benchmark_celltype_ncluster = length(unique(two_batch_matrix_celltype)))

Compare the optimal clustering label set selected by CDI with the benchmark cell type labels:

contingency_heatmap(
    benchmark_label = two_batch_matrix_celltype,
    candidate_label = two_batch_matrix_label_df$Seurat_k5,
    rename_candidate_clusters = TRUE,
    candidate_cluster_names = paste0('cluster', seq_len(length(unique(one_batch_matrix_label_df$Seurat_k5)))))

Session Information

sessionInfo()

## R version 4.4.3 (2025-02-28)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.2 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Etc/UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] CDI_1.5.0        BiocStyle_2.35.0
## 
## loaded via a namespace (and not attached):
##   [1] RColorBrewer_1.1-3          sys_3.4.3                  
##   [3] jsonlite_2.0.0              magrittr_2.0.3             
##   [5] spatstat.utils_3.1-3        farver_2.1.2               
##   [7] rmarkdown_2.29              vctrs_0.6.5                
##   [9] ROCR_1.0-11                 spatstat.explore_3.4-2     
##  [11] S4Arrays_1.7.3              htmltools_0.5.8.1          
##  [13] SparseArray_1.7.7           sass_0.4.9                 
##  [15] sctransform_0.4.1           parallelly_1.43.0          
##  [17] KernSmooth_2.23-26          bslib_0.9.0                
##  [19] htmlwidgets_1.6.4           ica_1.0-3                  
##  [21] plyr_1.8.9                  plotly_4.10.4              
##  [23] zoo_1.8-13                  cachem_1.1.0               
##  [25] buildtools_1.0.0            igraph_2.1.4               
##  [27] mime_0.13                   lifecycle_1.0.4            
##  [29] pkgconfig_2.0.3             Matrix_1.7-3               
##  [31] R6_2.6.1                    fastmap_1.2.0              
##  [33] GenomeInfoDbData_1.2.13     MatrixGenerics_1.19.1      
##  [35] fitdistrplus_1.2-2          future_1.34.0              
##  [37] shiny_1.10.0                digest_0.6.37              
##  [39] colorspace_2.1-1            patchwork_1.3.0            
##  [41] S4Vectors_0.45.4            Seurat_5.2.1               
##  [43] tensor_1.5                  RSpectra_0.16-2            
##  [45] irlba_2.3.5.1               GenomicRanges_1.59.1       
##  [47] labeling_0.4.3              progressr_0.15.1           
##  [49] spatstat.sparse_3.1-0       httr_1.4.7                 
##  [51] polyclip_1.10-7             abind_1.4-8                
##  [53] compiler_4.4.3              withr_3.0.2                
##  [55] BiocParallel_1.41.2         fastDummies_1.7.5          
##  [57] MASS_7.3-65                 DelayedArray_0.33.6        
##  [59] ggsci_3.2.0                 tools_4.4.3                
##  [61] lmtest_0.9-40               httpuv_1.6.15              
##  [63] future.apply_1.11.3         goftest_1.2-3              
##  [65] glue_1.8.0                  nlme_3.1-167               
##  [67] promises_1.3.2              grid_4.4.3                 
##  [69] Rtsne_0.17                  cluster_2.1.8.1            
##  [71] reshape2_1.4.4              generics_0.1.3             
##  [73] gtable_0.3.6                spatstat.data_3.1-6        
##  [75] tidyr_1.3.1                 data.table_1.17.0          
##  [77] XVector_0.47.2              sp_2.2-0                   
##  [79] BiocGenerics_0.53.6         spatstat.geom_3.3-6        
##  [81] RcppAnnoy_0.0.22            ggrepel_0.9.6              
##  [83] RANN_2.6.2                  pillar_1.10.1              
##  [85] stringr_1.5.1               spam_2.11-1                
##  [87] RcppHNSW_0.6.0              later_1.4.1                
##  [89] splines_4.4.3               dplyr_1.1.4                
##  [91] lattice_0.22-6              survival_3.8-3             
##  [93] deldir_2.0-4                tidyselect_1.2.1           
##  [95] SingleCellExperiment_1.29.2 maketools_1.3.2            
##  [97] miniUI_0.1.1.1              pbapply_1.7-2              
##  [99] knitr_1.50                  gridExtra_2.3              
## [101] IRanges_2.41.3              SummarizedExperiment_1.37.0
## [103] scattermore_1.2             stats4_4.4.3               
## [105] xfun_0.51                   Biobase_2.67.0             
## [107] matrixStats_1.5.0           UCSC.utils_1.3.1           
## [109] stringi_1.8.7               lazyeval_0.2.2             
## [111] yaml_2.3.10                 evaluate_1.0.3             
## [113] codetools_0.2-20            tibble_3.2.1               
## [115] BiocManager_1.30.25         cli_3.6.4                  
## [117] uwot_0.2.3                  xtable_1.8-4               
## [119] reticulate_1.42.0           munsell_0.5.1              
## [121] jquerylib_0.1.4             GenomeInfoDb_1.43.4        
## [123] Rcpp_1.0.14                 globals_0.16.3             
## [125] spatstat.random_3.3-3       png_0.1-8                  
## [127] spatstat.univar_3.1-2       parallel_4.4.3             
## [129] ggplot2_3.5.1               dotCall64_1.2              
## [131] listenv_0.9.1               viridisLite_0.4.2          
## [133] scales_1.3.0                ggridges_0.5.6             
## [135] crayon_1.5.3                SeuratObject_5.0.2         
## [137] purrr_1.0.4                 rlang_1.1.5                
## [139] cowplot_1.1.3

Clustering Deviation Index (CDI) Tutorial