Assorted clustering diagnostics

Introduction

The bluster package provides a few diagnostics for quantitatively examining the cluster output. These diagnostics We will demonstrate on another dataset from the scRNAseq package, clustered with graph-based methods via the clusterRows() generic as described in the previous vignette.

library(scRNAseq)
sce <- GrunPancreasData()

# Quality control to remove bad cells.
library(scuttle)
qcstats <- perCellQCMetrics(sce)
qcfilter <- quickPerCellQC(qcstats, sub.fields="altexps_ERCC_percent")
sce <- sce[,!qcfilter$discard]

# Normalization by library size.
sce <- logNormCounts(sce)

# Feature selection.
library(scran)
dec <- modelGeneVar(sce)
hvgs <- getTopHVGs(dec, n=1000)

# Dimensionality reduction.
set.seed(1000)
library(scater)
sce <- runPCA(sce, ncomponents=20, subset_row=hvgs)

# Clustering.
library(bluster)
mat <- reducedDim(sce)
clust.info <- clusterRows(mat, NNGraphParam(), full=TRUE)
clusters <- clust.info$clusters
table(clusters)
## clusters
##   1   2   3   4   5   6   7   8   9  10  11  12 
## 285 171 161  59 174  49  70 137  69  65  28  23

Computing the silhouette width

The silhouette width is a standard metric to quantify the separation between clusters generated by any procedure. A cell with a large positive width is closer to other cells from the same cluster compared to cells from different clusters. On the other hand, low or negative widths indicate that cells from different clusters are not well separated.

The exact silhouette calculation is rather computationally intensive so bluster implements an approximation instead. This is provided in the approxSilhouette() function, which returns the width for each cell and its closest (non-self) cluster. Clusters consisting of cells with lower widths may warrant some more care during interpretation.

sil <- approxSilhouette(mat, clusters)
sil
## DataFrame with 1291 rows and 3 columns
##             cluster    other     width
##            <factor> <factor> <numeric>
## D2ex_1            1        7  0.375510
## D2ex_2            1        7  0.370596
## D2ex_3            1        7  0.409271
## D2ex_4            5        3  0.257069
## D2ex_5            5        3  0.249883
## ...             ...      ...       ...
## D17TGFB_91        6        2  0.126607
## D17TGFB_92        1        5  0.403106
## D17TGFB_93        1        7  0.384365
## D17TGFB_94        6        2  0.243933
## D17TGFB_95        6        2  0.204860
boxplot(split(sil$width, clusters))

The function also returns the identity of the closest “other” cluster for each cell. This can be helpful to identify which clusters are easily confused to each other, based on how many of one cluster’s cells are closer to the other cluster.

best.choice <- ifelse(sil$width > 0, clusters, sil$other)
table(Assigned=clusters, Closest=best.choice)
##         Closest
## Assigned   1   2   3   4   5   6   7   8   9  10  11  12
##       1  275   0   7   0   3   0   0   0   0   0   0   0
##       2    0 171   0   0   0   0   0   0   0   0   0   0
##       3    0   0 150   0  11   0   0   0   0   0   0   0
##       4    0   0   0  59   0   0   0   0   0   0   0   0
##       5    0   0  21   0 153   0   0   0   0   0   0   0
##       6    0  26   0   0   0  23   0   0   0   0   0   0
##       7   14   0   8  10   2   0  18   0   7   2   8   1
##       8    0   0   0   1   0   0   0 136   0   0   0   0
##       9    0   0   0   0   0   0   0  10  58   0   0   1
##       10   0   0   0   0   0   0   0   0   0  65   0   0
##       11   0   0   0   0   0   0   0   0   0   0  28   0
##       12   0   0   0   0   0   0   0   0   0   0   0  23

Computing the neighborhood purity

Another diagnostic uses the percentage of neighbors for each cell that belong to the same cluster. Well-separated clusters should exhibit high percentages (i.e., “purities”) as cells from different clusters do not mix. Low purities are symptomatic of overclustering where cluster boundaries become more ambiguous.

The neighborPurity() function computes the purity of the neighborhood for each cell. Clusters with systematically low purities may warrant some more care during interpretation. By default, we perform some weighting so that large clusters do not have large purities simply because there are few cells assigned to other clusters in the dataset.

pure <- neighborPurity(mat, clusters)
pure
## DataFrame with 1291 rows and 2 columns
##               purity  maximum
##            <numeric> <factor>
## D2ex_1      1.000000        1
## D2ex_2      1.000000        1
## D2ex_3      1.000000        1
## D2ex_4      0.951603        5
## D2ex_5      1.000000        5
## ...              ...      ...
## D17TGFB_91  0.959277        6
## D17TGFB_92  1.000000        1
## D17TGFB_93  1.000000        1
## D17TGFB_94  0.986538        6
## D17TGFB_95  0.898817        6
boxplot(split(pure$purity, clusters))

The function also returns the identity of the other cluster with the highest percentage. This can again be useful to identify the relationships between clusters based on which pairs have the greatest intermingling in their neighborhoods.

table(Assigned=clusters, Max=pure$maximum)
##         Max
## Assigned   1   2   3   4   5   6   7   8   9  10  11  12
##       1  279   0   0   0   0   0   6   0   0   0   0   0
##       2    0 170   0   0   0   0   0   0   0   0   0   1
##       3    0   0 158   0   3   0   0   0   0   0   0   0
##       4    0   0   0  59   0   0   0   0   0   0   0   0
##       5    1   0   6   0 165   0   2   0   0   0   0   0
##       6    0  18   0   0   0  31   0   0   0   0   0   0
##       7    0   0   0   1   0   0  67   0   1   0   1   0
##       8    0   0   0   0   0   0   0 136   1   0   0   0
##       9    0   0   0   0   0   0   0   5  64   0   0   0
##       10   0   0   0   0   0   0   0   0   0  65   0   0
##       11   0   0   0   0   0   0   0   0   0   0  28   0
##       12   0   0   0   0   0   0   0   0   0   0   0  23

Computing per-cluster RMSD

The root-mean-squared deviation (RMSD) for each cluster represents the dispersion of cells within each cluster. A large RMSD value indicates that a cluster has high internal heterogeneity, making it a good candidate for further subclustering.

rmsd <- clusterRMSD(mat, clusters)
barplot(rmsd)

Alternatively, we can compute the within-cluster sum of squares (WCSS), a metric commonly seen in k-means clustering. One could pick a “sensible” choice for k by computing the WCSS for a range of values and picking the value the WCSS begins to plateau - see Section @ref(clustering-parameter-sweeps) for more details.

clusterRMSD(mat, clusters, sum=TRUE)
##         1         2         3         4         5         6         7         8 
## 60368.794  9537.387 26566.200 19009.475 30701.306  8244.534 31829.182 14682.089 
##         9        10        11        12 
## 10652.096  6094.771  2644.843  1956.403

Computing graph modularity

For graph-based methods, we can compute the cluster modularity within clusters and between pairs of clusters. Specifically, we examine the ratio of observed to expected edge weights for each pair of clusters (closely related to the modularity score used in many cluster_* functions from igraph). We would usually expect to see high observed weights between cells in the same cluster with minimal weights between clusters, indicating that the clusters are well-separated. Large off-diagonal entries indicate that the corresponding pair of clusters are closely related.

g <- clust.info$objects$graph
ratio <- pairwiseModularity(g, clusters, as.ratio=TRUE)

library(pheatmap)
pheatmap(log10(ratio+1), cluster_cols=FALSE, cluster_rows=FALSE,
    col=rev(heat.colors(100)))

This may be better visualized with a force-directed layout:

cluster.gr <- igraph::graph_from_adjacency_matrix(log2(ratio+1), 
    mode="upper", weighted=TRUE, diag=FALSE)

# Increasing the weight to increase the visibility of the lines.
set.seed(1100101)
plot(cluster.gr, edge.width=igraph::E(cluster.gr)$weight*5,
    layout=igraph::layout_with_lgl)

We can also tune the resolution of the clustering post hoc with the mergeCommunities() function. This will iteratively merge the most closely related pair of clusters together until the desired number of clusters is reached. For example, if we wanted to whittle down the number of clusters to 10, we could do:

merged <- mergeCommunities(g, clusters, number=10)
table(merged)
## merged
##   1   2   3   5   6   7   9  10  11  12 
## 285 171 161 174  49 129 206  65  28  23

Comparing two clusterings

To compare two clusterings, the pairwiseRand() function computes the adjusted Rand index (ARI). High ARIs indicate that the two clusterings are similar with respect to how they partition the observations, and an ARI of 1 means that the clusterings are identical.

hclusters <- clusterRows(mat, HclustParam(cut.dynamic=TRUE))
pairwiseRand(clusters, hclusters, mode="index")
## [1] 0.4512108

Of course, a single number is not particularly useful, so clusterRand() also provides the capability to break down the ARI into its contributions from each cluster or cluster pair. Specifically, for each cluster or cluster pair in a “reference” clustering (here, clusters), we see whether it is preserved in the “alternative” clustering (here, hclusters). Large values on the diagonal indicate that the reference cluster is recapitulated; large values off the diagonal indicate that the separation between the corresponding pair of clusters is also maintained. Conversely, low diagonal values indicate that the corresponding cluster is fragmented in the alternative, and low off-diagonal values can be used as a diagnostic for loss of separation.

ratio <- pairwiseRand(clusters, hclusters, mode="ratio")

library(pheatmap)
pheatmap(ratio, cluster_cols=FALSE, cluster_rows=FALSE,
    col=viridis::viridis(100), breaks=seq(-1, 1, length=101))

Explicit mappings between two clusterings can be performed using linkClusters() (see Section @ref(linking-clusters)). Alternatively, we can quantify the degree of “nesting” of one clustering within another with nestedClusters(); this can be useful for verifying that the higher-resolution clustering is indeed nested within its coarser counterpart.

Bootstrapping cluster stability

We can use bootstrapping to evaluate the effect of sampling noise on the stability of a clustering procedure. The bootstrapStability() function will return the ARI of the original clusters against those generated from bootstrap replicates, averaged across multiple bootstrap iterations. High values indicate that the clustering is robust to sample noise.

set.seed(1001010)
ari <-bootstrapStability(mat, clusters=clusters, 
    mode="index", BLUSPARAM=NNGraphParam())
ari
## [1] 0.6824535

Advanced users may also set mode="ratio" to obtain a more detailed breakdown of the effect of noise on each cluster (pair).

set.seed(1001010)
ratio <-bootstrapStability(mat, clusters=clusters,
    mode="ratio", BLUSPARAM=NNGraphParam())

library(pheatmap)
pheatmap(ratio, cluster_cols=FALSE, cluster_rows=FALSE,
    col=viridis::viridis(100), breaks=seq(-1, 1, length=101))

Clustering parameter sweeps

The clusterSweep() function provides a convenient way to test multiple combinations of parameter settings. Given a BlusterParam object and a set of values for each parameter, the function will repeat the clustering ith each combination of parameters. The example below uses graph-based clustering with a variety of k as well as different community detection algorithms.

combinations <- clusterSweep(mat, BLUSPARAM=SNNGraphParam(),
    k=c(5L, 10L, 15L, 20L), cluster.fun=c("walktrap", "louvain", "infomap"))

This yields a list containing all clusterings and the corresponding parameter combinations used to generate them. The function will attempt to generate some sensible name for each combination, though this may require some manual curation for large numbers of parameters.

colnames(combinations$clusters)
##  [1] "k.5_cluster.fun.walktrap"  "k.10_cluster.fun.walktrap"
##  [3] "k.15_cluster.fun.walktrap" "k.20_cluster.fun.walktrap"
##  [5] "k.5_cluster.fun.louvain"   "k.10_cluster.fun.louvain" 
##  [7] "k.15_cluster.fun.louvain"  "k.20_cluster.fun.louvain" 
##  [9] "k.5_cluster.fun.infomap"   "k.10_cluster.fun.infomap" 
## [11] "k.15_cluster.fun.infomap"  "k.20_cluster.fun.infomap"
combinations$parameters
## DataFrame with 12 rows and 2 columns
##                                   k cluster.fun
##                           <integer> <character>
## k.5_cluster.fun.walktrap          5    walktrap
## k.10_cluster.fun.walktrap        10    walktrap
## k.15_cluster.fun.walktrap        15    walktrap
## k.20_cluster.fun.walktrap        20    walktrap
## k.5_cluster.fun.louvain           5     louvain
## ...                             ...         ...
## k.20_cluster.fun.louvain         20     louvain
## k.5_cluster.fun.infomap           5     infomap
## k.10_cluster.fun.infomap         10     infomap
## k.15_cluster.fun.infomap         15     infomap
## k.20_cluster.fun.infomap         20     infomap

We can combine this with some of the metrics defined above to quantify cluster separation as a function of the clustering parameters. This allows us to quickly determine which parameters have a noticeable impact on the results. In principle, we could choose the clustering with the greatest separation for further analysis; however, this tends to be disappointing as it often favors overly broad clusters.

set.seed(10)
nclusters <- 3:25
kcombos <- clusterSweep(mat, BLUSPARAM=KmeansParam(centers=5), centers=nclusters)

sil <- vapply(as.list(kcombos$clusters), function(x) mean(approxSilhouette(mat, x)$width), 0)
plot(nclusters, sil, xlab="Number of clusters", ylab="Average silhouette width")

pur <- vapply(as.list(kcombos$clusters), function(x) mean(neighborPurity(mat, x)$purity), 0)
plot(nclusters, pur, xlab="Number of clusters", ylab="Average purity")

wcss <- vapply(as.list(kcombos$clusters), function(x) sum(clusterRMSD(mat, x, sum=TRUE)), 0)
plot(nclusters, wcss, xlab="Number of clusters", ylab="Within-cluster sum of squares")

Linking clusters

If we have many clusterings, we can identify corresponding clusters with the linkClusters() function. This constructs a graph where edges are formed between pairs of clusters from different clusterings, based on the number of cells assigned to both clusters. Re-using some of the clusterings from our previous sweep, we might do:

linked <- linkClusters(
    list(
        walktrap=combinations$clusters$k.10_cluster.fun.walktrap,
        louvain=combinations$clusters$k.10_cluster.fun.louvain,
        infomap=combinations$clusters$k.10_cluster.fun.infomap
    )
)
linked
## IGRAPH 77309e1 UNW- 46 91 -- 
## + attr: name (v/c), weight (e/n)
## + edges from 77309e1 (vertex names):
##  [1] walktrap.1 --louvain.1  walktrap.7 --louvain.1  walktrap.1 --louvain.2 
##  [4] walktrap.3 --louvain.2  walktrap.5 --louvain.2  walktrap.3 --louvain.3 
##  [7] walktrap.5 --louvain.3  walktrap.7 --louvain.3  walktrap.11--louvain.3 
## [10] walktrap.3 --louvain.4  walktrap.4 --louvain.5  walktrap.5 --louvain.5 
## [13] walktrap.7 --louvain.5  walktrap.6 --louvain.6  walktrap.8 --louvain.6 
## [16] walktrap.10--louvain.7  walktrap.2 --louvain.8  walktrap.6 --louvain.8 
## [19] walktrap.3 --louvain.9  walktrap.5 --louvain.9  walktrap.8 --louvain.10
## [22] walktrap.9 --louvain.10 walktrap.12--louvain.11 walktrap.1 --louvain.12
## + ... omitted several edges

This can be used to visualize the relationships between clusters, or to identify metaclusters across clusterings with community detection algorithms:

meta <- igraph::cluster_walktrap(linked)
plot(linked, mark.groups=meta)

By default, the edge weights are computed by dividing the number of shared cells with the smaller of the total number of cells in either cluster. This favors strong edges between a large cluster in one clustering and smaller subcluster in another (finer) clustering. Alternative weighting schemes will favour a 1:1 mapping between clusterings, which can be easier to interpret.

Comparing multiple clusterings

The compareClusterings() function will return a symmetric matrix of the ARIs between pairs of different clusterings. This is helpful for visualizing the relationships between different clusterings, e.g., to see which parameters most contribute to differences between clusterings.

aris <- compareClusterings(combinations$clusters)
g <- igraph::graph.adjacency(aris, mode="undirected", weighted=TRUE)
meta2 <- igraph::cluster_walktrap(g)
plot(g, mark.groups=meta2)

We can also identify groups of clusterings, typically corresponding to parameter combinations that yield more-or-less similar results. This allows us to prune out combinations that are largely redundant prior to downstream analyses.

Session information

sessionInfo()
## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Etc/UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] pheatmap_1.0.12             bluster_1.17.0             
##  [3] scater_1.33.4               ggplot2_3.5.1              
##  [5] scran_1.33.2                scuttle_1.15.5             
##  [7] scRNAseq_2.19.1             SingleCellExperiment_1.27.2
##  [9] SummarizedExperiment_1.35.5 Biobase_2.65.1             
## [11] GenomicRanges_1.57.2        GenomeInfoDb_1.41.2        
## [13] IRanges_2.39.2              S4Vectors_0.43.2           
## [15] BiocGenerics_0.51.3         MatrixGenerics_1.17.1      
## [17] matrixStats_1.4.1           BiocStyle_2.33.1           
## 
## loaded via a namespace (and not attached):
##   [1] BiocIO_1.15.2            bitops_1.0-9             filelock_1.0.3          
##   [4] tibble_3.2.1             XML_3.99-0.17            lifecycle_1.0.4         
##   [7] httr2_1.0.5              edgeR_4.3.21             doParallel_1.0.17       
##  [10] lattice_0.22-6           ensembldb_2.29.1         alabaster.base_1.7.0    
##  [13] magrittr_2.0.3           limma_3.61.12            sass_0.4.9              
##  [16] rmarkdown_2.28           jquerylib_0.1.4          yaml_2.3.10             
##  [19] metapod_1.13.0           DBI_1.2.3                buildtools_1.0.0        
##  [22] RColorBrewer_1.1-3       abind_1.4-8              zlibbioc_1.51.2         
##  [25] AnnotationFilter_1.31.0  RCurl_1.98-1.16          rappdirs_0.3.3          
##  [28] GenomeInfoDbData_1.2.13  ggrepel_0.9.6            irlba_2.3.5.1           
##  [31] kohonen_3.0.12           alabaster.sce_1.7.0      maketools_1.3.1         
##  [34] dqrng_0.4.1              codetools_0.2-20         DelayedArray_0.31.14    
##  [37] tidyselect_1.2.1         UCSC.utils_1.1.0         farver_2.1.2            
##  [40] gmp_0.7-5                ScaledMatrix_1.13.0      viridis_0.6.5           
##  [43] BiocFileCache_2.13.2     dynamicTreeCut_1.63-1    GenomicAlignments_1.41.0
##  [46] jsonlite_1.8.9           BiocNeighbors_1.99.3     iterators_1.0.14        
##  [49] foreach_1.5.2            tools_4.4.1              Rcpp_1.0.13             
##  [52] glue_1.8.0               gridExtra_2.3            SparseArray_1.5.45      
##  [55] xfun_0.48                dplyr_1.1.4              HDF5Array_1.33.8        
##  [58] gypsum_1.1.6             withr_3.0.2              BiocManager_1.30.25     
##  [61] fastmap_1.2.0            rhdf5filters_1.17.0      fansi_1.0.6             
##  [64] digest_0.6.37            rsvd_1.0.5               R6_2.5.1                
##  [67] colorspace_2.1-1         RSQLite_2.3.7            utf8_1.2.4              
##  [70] generics_0.1.3           rtracklayer_1.65.0       FNN_1.1.4.1             
##  [73] httr_1.4.7               S4Arrays_1.5.11          uwot_0.2.2              
##  [76] pkgconfig_2.0.3          gtable_0.3.6             blob_1.2.4              
##  [79] XVector_0.45.0           sys_3.4.3                htmltools_0.5.8.1       
##  [82] ProtGenerics_1.37.1      scales_1.3.0             alabaster.matrix_1.7.0  
##  [85] ClusterR_1.3.3           png_0.1-8                knitr_1.48              
##  [88] rjson_0.2.23             curl_5.2.3               cachem_1.1.0            
##  [91] rhdf5_2.49.0             BiocVersion_3.20.0       parallel_4.4.1          
##  [94] vipor_0.4.7              AnnotationDbi_1.69.0     restfulr_0.0.15         
##  [97] pillar_1.9.0             grid_4.4.1               alabaster.schemas_1.7.0 
## [100] vctrs_0.6.5              BiocSingular_1.21.4      dbplyr_2.5.0            
## [103] beachmat_2.21.9          cluster_2.1.6            beeswarm_0.4.0          
## [106] evaluate_1.0.1           GenomicFeatures_1.57.1   cli_3.6.3               
## [109] locfit_1.5-9.10          compiler_4.4.1           Rsamtools_2.21.2        
## [112] rlang_1.1.4              crayon_1.5.3             mbkmeans_1.21.0         
## [115] labeling_0.4.3           ggbeeswarm_0.7.2         alabaster.se_1.7.0      
## [118] viridisLite_0.4.2        BiocParallel_1.39.0      munsell_0.5.1           
## [121] Biostrings_2.73.2        lazyeval_0.2.2           Matrix_1.7-1            
## [124] ExperimentHub_2.13.1     benchmarkme_1.0.8        bit64_4.5.2             
## [127] Rhdf5lib_1.27.0          KEGGREST_1.45.1          statmod_1.5.0           
## [130] alabaster.ranges_1.7.0   highr_0.11               apcluster_1.4.13        
## [133] AnnotationHub_3.15.0     igraph_2.1.1             memoise_2.0.1           
## [136] bslib_0.8.0              benchmarkmeData_1.0.4    bit_4.5.0