Clustering Deviation Index (CDI) Tutorial

Introduction

This tutorial shows how to apply CDI to select optimal clustering labels among different candidate cell labels for UMI counts of single-cell RNA-sequencing dataset. Sections 1-3 are for datasets from one batch; Section 4 is for datasets from multiple batches. Datasets used in this tutorial were simulated for illustration purpose.

Installation

CDI can be installed from Bioconductor:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("CDI")

The latest version can be installed from Github:

if (!requireNamespace("remotes", quietly = TRUE))
    install.packages("remotes")
remotes::install_github("jichunxie/CDI", build_vignettes = TRUE) 

Load datasets

CDI provides simulated toy single-cell RNA-seq UMI count matrices for illustration purpose. To use other datasets, please change the code in the following two blocks. Filtering of cells/genes can be applied before feature gene selection; normalization/imputation is not applicable here, as CDI models the raw UMI counts.

Inputs:

  • one_batch_matrix: a single-cell RNA-seq UMI count matrix from one batch (each row represents a gene and each column represent a cell). To apply CDI to datasets from multiple batches, please check section 4.
  • one_batch_matrix_label_df: the label sets of the one_batch_matrix, where each row represents a cell, and each column represents a set of cell labels generated by existing clustering methods such as K-Means.
  • one_batch_matrix_celltype: a vector of characters indicating the benchmark cell types of cells in one_batch_matrix (not necessary except for evaluation purpose).
data(one_batch_matrix, package = "CDI")
dim(one_batch_matrix)
## [1] 3000 2000
data(one_batch_matrix_celltype, package = "CDI")
table(one_batch_matrix_celltype)
## one_batch_matrix_celltype
## type1 type2 type3 type4 type5 
##   400   400   400   400   400

Here we provide 12 label sets generated from PCA+KMeans and Seurat v3.1.5, where the number of clusters range from 2 to 7.

For better visualization and comparison, the column names of this data frame indicate the clustering method and the number of clusters. For example, “KMeans_k5” refers to the set of cell labels generated by KMeans with 5 clusters.

data(one_batch_matrix_label_df, package = "CDI")
knitr::kable(head(one_batch_matrix_label_df[,c("KMeans_k2", "KMeans_k4", "Seurat_k2", "Seurat_k3")], 3))
KMeans_k2 KMeans_k4 Seurat_k2 Seurat_k3
2 3 1 2
2 3 1 2
2 3 1 2

Select feature genes with feature_gene_selection

Feature genes are those genes that express differently across cell types. Several methods are available for feature gene selection. Here we propose a new method called working dispersion score (WDS). We select genes with highest WDS as the feature genes.

feature_gene_indx <- feature_gene_selection(
    X = one_batch_matrix, 
    batch_label = NULL, 
    method = "wds", 
    nfeature = 500)
sub_one_batch_matrix <- one_batch_matrix[feature_gene_indx,]

calculate_CDI

First, calculate the size factor of each gene with size_factor. The output of size_factor will be one of the inputs of calculate_CDI.

one_batch_matrix_size_factor <- size_factor(X = one_batch_matrix)

Second, calculate CDI for each candidate set of cell labels:

start_time <- Sys.time()
CDI_return1 <- calculate_CDI(
    X = sub_one_batch_matrix, 
    cand_lab_df = one_batch_matrix_label_df, 
    batch_label = NULL, 
    cell_size_factor = one_batch_matrix_size_factor)
end_time <- Sys.time()
print(difftime(end_time, start_time))
## Time difference of 35.38791 secs

Select the optimal label set with minimum CDI-AIC/CDI-BIC

knitr::kable(CDI_return1)
Label_name Cluster_method CDI_AIC CDI_BIC neg_llk_val N_cluster
KMeans_k2 KMeans_k2 KMeans 1135287 1146489 565643.4 2
KMeans_k3 KMeans_k3 KMeans 1120096 1136899 557047.9 3
KMeans_k4 KMeans_k4 KMeans 1112561 1134964 552280.4 4
KMeans_k5 KMeans_k5 KMeans 1101970 1129974 545984.9 5
KMeans_k6 KMeans_k6 KMeans 1103150 1136756 545575.1 6
KMeans_k7 KMeans_k7 KMeans 1105022 1144229 545511.2 7
Seurat_k2 Seurat_k2 Seurat 1129063 1140264 562531.3 2
Seurat_k3 Seurat_k3 Seurat 1120002 1136805 557001.2 3
Seurat_k4 Seurat_k4 Seurat 1109115 1131518 550557.3 4
Seurat_k5 Seurat_k5 Seurat 1102321 1130325 546160.4 5
Seurat_k6 Seurat_k6 Seurat 1103152 1136751 545575.9 6
Seurat_k7 Seurat_k7 Seurat 1104099 1143292 545049.6 7

CDI counts the number of unique characters in each label set as the “N_cluster”. If the column names of label set data frame are provided with the format “[ClusteringMethod]_k[NumberOfClusters]” such as “KMeans_K5, calculate_CDI will extract the”[ClusteringMethod]” as the Cluster_method. The clustering method can also be provided in the argument clustering_method for each label set.

The outputs of calculate_CDI include CDI-AIC and CDI-BIC. CDI calculates AIC and BIC of cell-type-specific gene-specific NB model for UMI counts, where the cell types are based on each candidate label set, and only the selected subset of genes are considered. Whether to use CDI-AIC or CDI-BIC depend on the goals. We suggest using CDI-BIC to select optimal main cell types and using CDI-AIC to select optimal subtypes, because BIC puts more penalty on the complexity of models (number of clusters).

Output visualization

The outputs of CDI are demonstrated with a lineplot. The x-axis is for the number of clusters. The y-axis is for the CDI values. Different colors represent different clustering methods: The orange line represents label sets from Seurat; the blue line represents label sets from K-Means. The red triangle represents the optimal clustering result corresponding to the smallest CDI value.

The cell population in this simulated dataset doesn’t have a hierarchical structure. We use CDI-BIC to select the optimal set of cell labels. The label set “KMeans_k5” has the smallest CDI-BIC and is selected as the optimal.

CDI_lineplot(cdi_dataframe = CDI_return1, cdi_type = "CDI_BIC")

A benchmark label set refers to human annotated cell type label set from biomarkers for real data or true label set for simulated data. If such label set is available, we can also compare the optimal clustering label set selected by CDI with it. Here we provide the heatmap of contingency table between benchmark label set and the optimal clustering label set selected by CDI. Each row represents a cell type in the benchmark label set, and each column represents a cluster in the optimal clustering label set selected by CDI. Each rectangle is color-scaled by the proportion of the cells in the given cluster coming from the benchmark types. Each column sums to 1.

contingency_heatmap(benchmark_label = one_batch_matrix_celltype,
    candidate_label = one_batch_matrix_label_df$KMeans_k5,
    rename_candidate_clusters = TRUE,
    candidate_cluster_names = paste0('cluster', seq_len(length(unique(one_batch_matrix_label_df$KMeans_k5)))))

We can also calculate the CDI for the benchmark label set as the following:

benchmark_return1 <- calculate_CDI(X = sub_one_batch_matrix,
    cand_lab_df = one_batch_matrix_celltype, 
    batch_label = NULL, 
    cell_size_factor = one_batch_matrix_size_factor)

The results of benchmark label set can be added to the CDI lineplot:

CDI_lineplot(cdi_dataframe = CDI_return1,
    cdi_type = "CDI_BIC",
    benchmark_celltype_cdi = benchmark_return1,
    benchmark_celltype_ncluster = length(unique(one_batch_matrix_celltype)))

The purple star represents the CDI value for the benchmark label set. The result shows that the benchmark label set achieves lowest CDI-BIC among all label sets. Therefore, if the benchmark label set is available as one candidate label set, it will be selected by CDI-BIC as the optimal.

Apply CDI to datasets from multiple batches

CDI accepts datasets from multiple batches. The candidate label sets can be obtained after batch effect correction, but CDI will be calculated based on raw UMI count matrices. The procedure of applying CDI to multi-batch datasets is similar to that of one batch, except for one extra input of batch labels.

Inputs:

  • two_batch_matrix: a single-cell RNA-seq UMI count matrix from the multiple batches (each row represents a gene and each column represent a cell).
  • two_batch_matrix_batch: a vector of characters indicating the batch labels of cells in two_batch_matrix. The number of batches can be greater than 2.
  • two_batch_matrix_label_df: the label sets of the two_batch_matrix, where each row represents a cell, and each column represents a label set generated by existing clustering methods such as K-Means after batch effect correction.
  • two_batch_matrix_celltype: a vector of characters indicating the benchmark cell types of cells in two_batch_matrix (not necessary unless for evaluation purpose).

Data loading

data(two_batch_matrix_celltype, package = "CDI")
table(two_batch_matrix_celltype)
## two_batch_matrix_celltype
## type1 type2 type3 type4 type5 
##   400   400   400   400   400
data(two_batch_matrix_batch, package = "CDI")
table(two_batch_matrix_batch)
## two_batch_matrix_batch
## batch1 batch2 
##   1000   1000
data(two_batch_matrix, package = "CDI")
dim(two_batch_matrix)
## [1] 3000 2000
data(two_batch_matrix_label_df, package = "CDI")
knitr::kable(head(two_batch_matrix_label_df[,c("KMeans_k5", "KMeans_k6", "Seurat_k5", "Seurat_k6")], 3))
KMeans_k5 KMeans_k6 Seurat_k5 Seurat_k6
3 3 2 2
3 3 2 2
3 3 2 2

Feature gene selection

feature_gene_indx <- feature_gene_selection(
    X = two_batch_matrix,
    batch_label = two_batch_matrix_batch,
    method = "wds",
    nfeature = 500)
sub_two_batch_matrix <- two_batch_matrix[feature_gene_indx,]

Calculate CDI

two_batch_matrix_size_factor <- size_factor(two_batch_matrix)

start_time <- Sys.time()
CDI_return2 <- calculate_CDI(
    X = sub_two_batch_matrix,
    cand_lab_df = two_batch_matrix_label_df, 
    batch_label = two_batch_matrix_batch,
    cell_size_factor = two_batch_matrix_size_factor)
end_time <- Sys.time()
print(difftime(end_time, start_time))
## Time difference of 3.508361 mins
knitr::kable(CDI_return2)
Label_name Cluster_method CDI_AIC CDI_BIC neg_llk_val N_cluster
KMeans_k3 KMeans_k3 KMeans 1116010 1133295 554919.1 3
KMeans_k4 KMeans_k4 KMeans 1090471 1114398 540963.7 4
KMeans_k5 KMeans_k5 KMeans 1094696 1124000 542115.9 5
KMeans_k6 KMeans_k6 KMeans 1084735 1119774 536111.6 6
KMeans_k7 KMeans_k7 KMeans 1085682 1125975 535646.8 7
KMeans_k8 KMeans_k8 KMeans 1086246 1132224 534910.8 8
KMeans_k9 KMeans_k9 KMeans 1086179 1137386 533943.5 9
KMeans_k10 KMeans_k10 KMeans 1089714 1146373 534741.1 10
KMeans_k11 KMeans_k11 KMeans 1090071 1151995 533979.7 11
KMeans_k12 KMeans_k12 KMeans 1091465 1159185 533640.6 12
KMeans_k13 KMeans_k13 KMeans 1093719 1167259 533729.7 13
KMeans_k14 KMeans_k14 KMeans 1093773 1172530 532822.4 14
KMeans_k15 KMeans_k15 KMeans 1097256 1182247 533451.8 15
Seurat_k3 Seurat_k3 Seurat 1098939 1117153 546217.3 3
Seurat_k4 Seurat_k4 Seurat 1090520 1114458 540986.0 4
Seurat_k5 Seurat_k5 Seurat 1083559 1113412 536449.6 5
Seurat_k6 Seurat_k6 Seurat 1084476 1119538 535978.0 6
Seurat_k7 Seurat_k7 Seurat 1084354 1124707 534970.9 7
Seurat_k8 Seurat_k8 Seurat 1086408 1132145 535037.9 8
Seurat_k9 Seurat_k9 Seurat 1087299 1138390 534527.4 9
Seurat_k10 Seurat_k10 Seurat 1087046 1143337 533471.2 10
Seurat_k11 Seurat_k11 Seurat 1089098 1150966 533501.1 11
Seurat_k12 Seurat_k12 Seurat 1090061 1157517 532984.3 12
Seurat_k13 Seurat_k13 Seurat 1090988 1164020 532449.9 13
Seurat_k14 Seurat_k14 Seurat 1091273 1170009 531572.7 14
Seurat_k15 Seurat_k15 Seurat 1091733 1176071 530800.4 15

Calculate the benchmark label set (if available) CDI:

benchmark_return <- calculate_CDI(
    X = sub_two_batch_matrix,
    cand_lab_df = two_batch_matrix_celltype,
    batch_label = two_batch_matrix_batch,
    cell_size_factor = two_batch_matrix_size_factor)

Output visualization:

CDI_lineplot(cdi_dataframe = CDI_return2,
    cdi_type = "CDI_BIC",
    benchmark_celltype_cdi = benchmark_return,
    benchmark_celltype_ncluster = length(unique(two_batch_matrix_celltype))) 

Compare the optimal clustering label set selected by CDI with the benchmark cell type labels:

contingency_heatmap(
    benchmark_label = two_batch_matrix_celltype,
    candidate_label = two_batch_matrix_label_df$Seurat_k5,
    rename_candidate_clusters = TRUE,
    candidate_cluster_names = paste0('cluster', seq_len(length(unique(one_batch_matrix_label_df$Seurat_k5)))))

Session Information

sessionInfo()
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Etc/UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] CDI_1.5.0        BiocStyle_2.35.0
## 
## loaded via a namespace (and not attached):
##   [1] RColorBrewer_1.1-3          sys_3.4.3                  
##   [3] jsonlite_1.8.9              magrittr_2.0.3             
##   [5] spatstat.utils_3.1-1        farver_2.1.2               
##   [7] rmarkdown_2.29              zlibbioc_1.52.0            
##   [9] vctrs_0.6.5                 ROCR_1.0-11                
##  [11] spatstat.explore_3.3-3      S4Arrays_1.7.1             
##  [13] htmltools_0.5.8.1           SparseArray_1.7.2          
##  [15] sass_0.4.9                  sctransform_0.4.1          
##  [17] parallelly_1.39.0           KernSmooth_2.23-24         
##  [19] bslib_0.8.0                 htmlwidgets_1.6.4          
##  [21] ica_1.0-3                   plyr_1.8.9                 
##  [23] plotly_4.10.4               zoo_1.8-12                 
##  [25] cachem_1.1.0                buildtools_1.0.0           
##  [27] igraph_2.1.1                mime_0.12                  
##  [29] lifecycle_1.0.4             pkgconfig_2.0.3            
##  [31] Matrix_1.7-1                R6_2.5.1                   
##  [33] fastmap_1.2.0               GenomeInfoDbData_1.2.13    
##  [35] MatrixGenerics_1.19.0       fitdistrplus_1.2-1         
##  [37] future_1.34.0               shiny_1.9.1                
##  [39] digest_0.6.37               colorspace_2.1-1           
##  [41] S4Vectors_0.45.2            patchwork_1.3.0            
##  [43] Seurat_5.1.0                tensor_1.5                 
##  [45] RSpectra_0.16-2             irlba_2.3.5.1              
##  [47] GenomicRanges_1.59.1        labeling_0.4.3             
##  [49] progressr_0.15.1            fansi_1.0.6                
##  [51] spatstat.sparse_3.1-0       httr_1.4.7                 
##  [53] polyclip_1.10-7             abind_1.4-8                
##  [55] compiler_4.4.2              withr_3.0.2                
##  [57] BiocParallel_1.41.0         fastDummies_1.7.4          
##  [59] MASS_7.3-61                 DelayedArray_0.33.2        
##  [61] ggsci_3.2.0                 tools_4.4.2                
##  [63] lmtest_0.9-40               httpuv_1.6.15              
##  [65] future.apply_1.11.3         goftest_1.2-3              
##  [67] glue_1.8.0                  nlme_3.1-166               
##  [69] promises_1.3.2              grid_4.4.2                 
##  [71] Rtsne_0.17                  cluster_2.1.6              
##  [73] reshape2_1.4.4              generics_0.1.3             
##  [75] gtable_0.3.6                spatstat.data_3.1-4        
##  [77] tidyr_1.3.1                 data.table_1.16.2          
##  [79] XVector_0.47.0              sp_2.1-4                   
##  [81] utf8_1.2.4                  BiocGenerics_0.53.3        
##  [83] spatstat.geom_3.3-4         RcppAnnoy_0.0.22           
##  [85] ggrepel_0.9.6               RANN_2.6.2                 
##  [87] pillar_1.9.0                stringr_1.5.1              
##  [89] spam_2.11-0                 RcppHNSW_0.6.0             
##  [91] later_1.4.1                 splines_4.4.2              
##  [93] dplyr_1.1.4                 lattice_0.22-6             
##  [95] survival_3.7-0              deldir_2.0-4               
##  [97] tidyselect_1.2.1            SingleCellExperiment_1.29.1
##  [99] maketools_1.3.1             miniUI_0.1.1.1             
## [101] pbapply_1.7-2               knitr_1.49                 
## [103] gridExtra_2.3               IRanges_2.41.1             
## [105] SummarizedExperiment_1.37.0 scattermore_1.2            
## [107] stats4_4.4.2                xfun_0.49                  
## [109] Biobase_2.67.0              matrixStats_1.4.1          
## [111] UCSC.utils_1.3.0            stringi_1.8.4              
## [113] lazyeval_0.2.2              yaml_2.3.10                
## [115] evaluate_1.0.1              codetools_0.2-20           
## [117] tibble_3.2.1                BiocManager_1.30.25        
## [119] cli_3.6.3                   uwot_0.2.2                 
## [121] xtable_1.8-4                reticulate_1.40.0          
## [123] munsell_0.5.1               jquerylib_0.1.4            
## [125] GenomeInfoDb_1.43.2         Rcpp_1.0.13-1              
## [127] globals_0.16.3              spatstat.random_3.3-2      
## [129] png_0.1-8                   spatstat.univar_3.1-1      
## [131] parallel_4.4.2              ggplot2_3.5.1              
## [133] dotCall64_1.2               listenv_0.9.1              
## [135] viridisLite_0.4.2           scales_1.3.0               
## [137] ggridges_0.5.6              crayon_1.5.3               
## [139] SeuratObject_5.0.2          leiden_0.4.3.1             
## [141] purrr_1.0.2                 rlang_1.1.4                
## [143] cowplot_1.1.3