The bluster
package provides a few diagnostics for quantitatively examining the
cluster output. These diagnostics We will demonstrate on another dataset
from the scRNAseq
package, clustered with graph-based methods via the
clusterRows()
generic as described in the previous
vignette.
library(scRNAseq)
sce <- GrunPancreasData()
# Quality control to remove bad cells.
library(scuttle)
qcstats <- perCellQCMetrics(sce)
qcfilter <- quickPerCellQC(qcstats, sub.fields="altexps_ERCC_percent")
sce <- sce[,!qcfilter$discard]
# Normalization by library size.
sce <- logNormCounts(sce)
# Feature selection.
library(scran)
dec <- modelGeneVar(sce)
hvgs <- getTopHVGs(dec, n=1000)
# Dimensionality reduction.
set.seed(1000)
library(scater)
sce <- runPCA(sce, ncomponents=20, subset_row=hvgs)
# Clustering.
library(bluster)
mat <- reducedDim(sce)
clust.info <- clusterRows(mat, NNGraphParam(), full=TRUE)
clusters <- clust.info$clusters
table(clusters)
## clusters
## 1 2 3 4 5 6 7 8 9 10 11 12
## 285 171 161 59 174 49 70 137 69 65 28 23
The silhouette width is a standard metric to quantify the separation between clusters generated by any procedure. A cell with a large positive width is closer to other cells from the same cluster compared to cells from different clusters. On the other hand, low or negative widths indicate that cells from different clusters are not well separated.
The exact silhouette calculation is rather computationally intensive
so bluster
implements an approximation instead. This is provided in the
approxSilhouette()
function, which returns the width for
each cell and its closest (non-self) cluster. Clusters consisting of
cells with lower widths may warrant some more care during
interpretation.
## DataFrame with 1291 rows and 3 columns
## cluster other width
## <factor> <factor> <numeric>
## D2ex_1 1 7 0.375510
## D2ex_2 1 7 0.370596
## D2ex_3 1 7 0.409271
## D2ex_4 5 3 0.257069
## D2ex_5 5 3 0.249883
## ... ... ... ...
## D17TGFB_91 6 2 0.126607
## D17TGFB_92 1 5 0.403106
## D17TGFB_93 1 7 0.384365
## D17TGFB_94 6 2 0.243933
## D17TGFB_95 6 2 0.204860
The function also returns the identity of the closest “other” cluster for each cell. This can be helpful to identify which clusters are easily confused to each other, based on how many of one cluster’s cells are closer to the other cluster.
best.choice <- ifelse(sil$width > 0, clusters, sil$other)
table(Assigned=clusters, Closest=best.choice)
## Closest
## Assigned 1 2 3 4 5 6 7 8 9 10 11 12
## 1 275 0 7 0 3 0 0 0 0 0 0 0
## 2 0 171 0 0 0 0 0 0 0 0 0 0
## 3 0 0 150 0 11 0 0 0 0 0 0 0
## 4 0 0 0 59 0 0 0 0 0 0 0 0
## 5 0 0 21 0 153 0 0 0 0 0 0 0
## 6 0 26 0 0 0 23 0 0 0 0 0 0
## 7 14 0 8 10 2 0 18 0 7 2 8 1
## 8 0 0 0 1 0 0 0 136 0 0 0 0
## 9 0 0 0 0 0 0 0 10 58 0 0 1
## 10 0 0 0 0 0 0 0 0 0 65 0 0
## 11 0 0 0 0 0 0 0 0 0 0 28 0
## 12 0 0 0 0 0 0 0 0 0 0 0 23
Another diagnostic uses the percentage of neighbors for each cell that belong to the same cluster. Well-separated clusters should exhibit high percentages (i.e., “purities”) as cells from different clusters do not mix. Low purities are symptomatic of overclustering where cluster boundaries become more ambiguous.
The neighborPurity()
function computes the purity of the
neighborhood for each cell. Clusters with systematically low purities
may warrant some more care during interpretation. By default, we perform
some weighting so that large clusters do not have large purities simply
because there are few cells assigned to other clusters in the
dataset.
## DataFrame with 1291 rows and 2 columns
## purity maximum
## <numeric> <factor>
## D2ex_1 1.000000 1
## D2ex_2 1.000000 1
## D2ex_3 1.000000 1
## D2ex_4 0.951603 5
## D2ex_5 1.000000 5
## ... ... ...
## D17TGFB_91 0.959277 6
## D17TGFB_92 1.000000 1
## D17TGFB_93 1.000000 1
## D17TGFB_94 0.986538 6
## D17TGFB_95 0.898817 6
The function also returns the identity of the other cluster with the highest percentage. This can again be useful to identify the relationships between clusters based on which pairs have the greatest intermingling in their neighborhoods.
## Max
## Assigned 1 2 3 4 5 6 7 8 9 10 11 12
## 1 279 0 0 0 0 0 6 0 0 0 0 0
## 2 0 170 0 0 0 0 0 0 0 0 0 1
## 3 0 0 158 0 3 0 0 0 0 0 0 0
## 4 0 0 0 59 0 0 0 0 0 0 0 0
## 5 1 0 6 0 165 0 2 0 0 0 0 0
## 6 0 18 0 0 0 31 0 0 0 0 0 0
## 7 0 0 0 1 0 0 67 0 1 0 1 0
## 8 0 0 0 0 0 0 0 136 1 0 0 0
## 9 0 0 0 0 0 0 0 5 64 0 0 0
## 10 0 0 0 0 0 0 0 0 0 65 0 0
## 11 0 0 0 0 0 0 0 0 0 0 28 0
## 12 0 0 0 0 0 0 0 0 0 0 0 23
The root-mean-squared deviation (RMSD) for each cluster represents the dispersion of cells within each cluster. A large RMSD value indicates that a cluster has high internal heterogeneity, making it a good candidate for further subclustering.
Alternatively, we can compute the within-cluster sum of squares (WCSS), a metric commonly seen in k-means clustering. One could pick a “sensible” choice for k by computing the WCSS for a range of values and picking the value the WCSS begins to plateau - see Section @ref(clustering-parameter-sweeps) for more details.
## 1 2 3 4 5 6 7 8
## 60368.794 9537.387 26566.200 19009.475 30701.306 8244.534 31829.182 14682.089
## 9 10 11 12
## 10652.096 6094.771 2644.843 1956.403
For graph-based methods, we can compute the cluster modularity within
clusters and between pairs of clusters. Specifically, we examine the
ratio of observed to expected edge weights for each pair of clusters
(closely related to the modularity score used in many
cluster_*
functions from igraph). We
would usually expect to see high observed weights between cells in the
same cluster with minimal weights between clusters, indicating that the
clusters are well-separated. Large off-diagonal entries indicate that
the corresponding pair of clusters are closely related.
g <- clust.info$objects$graph
ratio <- pairwiseModularity(g, clusters, as.ratio=TRUE)
library(pheatmap)
pheatmap(log10(ratio+1), cluster_cols=FALSE, cluster_rows=FALSE,
col=rev(heat.colors(100)))
This may be better visualized with a force-directed layout:
cluster.gr <- igraph::graph_from_adjacency_matrix(log2(ratio+1),
mode="upper", weighted=TRUE, diag=FALSE)
# Increasing the weight to increase the visibility of the lines.
set.seed(1100101)
plot(cluster.gr, edge.width=igraph::E(cluster.gr)$weight*5,
layout=igraph::layout_with_lgl)
We can also tune the resolution of the clustering post hoc
with the mergeCommunities()
function. This will iteratively
merge the most closely related pair of clusters together until the
desired number of clusters is reached. For example, if we wanted to
whittle down the number of clusters to 10, we could do:
## merged
## 1 2 3 5 6 7 9 10 11 12
## 285 171 161 174 49 129 206 65 28 23
To compare two clusterings, the pairwiseRand()
function
computes the adjusted Rand index (ARI). High ARIs indicate that the two
clusterings are similar with respect to how they partition the
observations, and an ARI of 1 means that the clusterings are
identical.
hclusters <- clusterRows(mat, HclustParam(cut.dynamic=TRUE))
pairwiseRand(clusters, hclusters, mode="index")
## [1] 0.4512108
Of course, a single number is not particularly useful, so
clusterRand()
also provides the capability to break down
the ARI into its contributions from each cluster or cluster pair.
Specifically, for each cluster or cluster pair in a “reference”
clustering (here, clusters
), we see whether it is preserved
in the “alternative” clustering (here, hclusters
). Large
values on the diagonal indicate that the reference cluster is
recapitulated; large values off the diagonal indicate that the
separation between the corresponding pair of clusters is also
maintained. Conversely, low diagonal values indicate that the
corresponding cluster is fragmented in the alternative, and low
off-diagonal values can be used as a diagnostic for loss of
separation.
ratio <- pairwiseRand(clusters, hclusters, mode="ratio")
library(pheatmap)
pheatmap(ratio, cluster_cols=FALSE, cluster_rows=FALSE,
col=viridis::viridis(100), breaks=seq(-1, 1, length=101))
Explicit mappings between two clusterings can be performed using
linkClusters()
(see Section @ref(linking-clusters)).
Alternatively, we can quantify the degree of “nesting” of one clustering
within another with nestedClusters()
; this can be useful
for verifying that the higher-resolution clustering is indeed nested
within its coarser counterpart.
We can use bootstrapping to evaluate the effect of sampling noise on
the stability of a clustering procedure. The
bootstrapStability()
function will return the ARI of the
original clusters against those generated from bootstrap replicates,
averaged across multiple bootstrap iterations. High values indicate that
the clustering is robust to sample noise.
set.seed(1001010)
ari <-bootstrapStability(mat, clusters=clusters,
mode="index", BLUSPARAM=NNGraphParam())
ari
## [1] 0.6824535
Advanced users may also set mode="ratio"
to obtain a
more detailed breakdown of the effect of noise on each cluster
(pair).
The clusterSweep()
function provides a convenient way to
test multiple combinations of parameter settings. Given a
BlusterParam
object and a set of values for each parameter,
the function will repeat the clustering ith each combination of
parameters. The example below uses graph-based clustering with a variety
of k
as well as different community detection
algorithms.
combinations <- clusterSweep(mat, BLUSPARAM=SNNGraphParam(),
k=c(5L, 10L, 15L, 20L), cluster.fun=c("walktrap", "louvain", "infomap"))
This yields a list containing all clusterings and the corresponding parameter combinations used to generate them. The function will attempt to generate some sensible name for each combination, though this may require some manual curation for large numbers of parameters.
## [1] "k.5_cluster.fun.walktrap" "k.10_cluster.fun.walktrap"
## [3] "k.15_cluster.fun.walktrap" "k.20_cluster.fun.walktrap"
## [5] "k.5_cluster.fun.louvain" "k.10_cluster.fun.louvain"
## [7] "k.15_cluster.fun.louvain" "k.20_cluster.fun.louvain"
## [9] "k.5_cluster.fun.infomap" "k.10_cluster.fun.infomap"
## [11] "k.15_cluster.fun.infomap" "k.20_cluster.fun.infomap"
## DataFrame with 12 rows and 2 columns
## k cluster.fun
## <integer> <character>
## k.5_cluster.fun.walktrap 5 walktrap
## k.10_cluster.fun.walktrap 10 walktrap
## k.15_cluster.fun.walktrap 15 walktrap
## k.20_cluster.fun.walktrap 20 walktrap
## k.5_cluster.fun.louvain 5 louvain
## ... ... ...
## k.20_cluster.fun.louvain 20 louvain
## k.5_cluster.fun.infomap 5 infomap
## k.10_cluster.fun.infomap 10 infomap
## k.15_cluster.fun.infomap 15 infomap
## k.20_cluster.fun.infomap 20 infomap
We can combine this with some of the metrics defined above to quantify cluster separation as a function of the clustering parameters. This allows us to quickly determine which parameters have a noticeable impact on the results. In principle, we could choose the clustering with the greatest separation for further analysis; however, this tends to be disappointing as it often favors overly broad clusters.
set.seed(10)
nclusters <- 3:25
kcombos <- clusterSweep(mat, BLUSPARAM=KmeansParam(centers=5), centers=nclusters)
sil <- vapply(as.list(kcombos$clusters), function(x) mean(approxSilhouette(mat, x)$width), 0)
plot(nclusters, sil, xlab="Number of clusters", ylab="Average silhouette width")
If we have many clusterings, we can identify corresponding clusters
with the linkClusters()
function. This constructs a graph
where edges are formed between pairs of clusters from different
clusterings, based on the number of cells assigned to both clusters.
Re-using some of the clusterings from our previous sweep, we might
do:
linked <- linkClusters(
list(
walktrap=combinations$clusters$k.10_cluster.fun.walktrap,
louvain=combinations$clusters$k.10_cluster.fun.louvain,
infomap=combinations$clusters$k.10_cluster.fun.infomap
)
)
linked
## IGRAPH 77309e1 UNW- 46 91 --
## + attr: name (v/c), weight (e/n)
## + edges from 77309e1 (vertex names):
## [1] walktrap.1 --louvain.1 walktrap.7 --louvain.1 walktrap.1 --louvain.2
## [4] walktrap.3 --louvain.2 walktrap.5 --louvain.2 walktrap.3 --louvain.3
## [7] walktrap.5 --louvain.3 walktrap.7 --louvain.3 walktrap.11--louvain.3
## [10] walktrap.3 --louvain.4 walktrap.4 --louvain.5 walktrap.5 --louvain.5
## [13] walktrap.7 --louvain.5 walktrap.6 --louvain.6 walktrap.8 --louvain.6
## [16] walktrap.10--louvain.7 walktrap.2 --louvain.8 walktrap.6 --louvain.8
## [19] walktrap.3 --louvain.9 walktrap.5 --louvain.9 walktrap.8 --louvain.10
## [22] walktrap.9 --louvain.10 walktrap.12--louvain.11 walktrap.1 --louvain.12
## + ... omitted several edges
This can be used to visualize the relationships between clusters, or to identify metaclusters across clusterings with community detection algorithms:
By default, the edge weights are computed by dividing the number of shared cells with the smaller of the total number of cells in either cluster. This favors strong edges between a large cluster in one clustering and smaller subcluster in another (finer) clustering. Alternative weighting schemes will favour a 1:1 mapping between clusterings, which can be easier to interpret.
The compareClusterings()
function will return a
symmetric matrix of the ARIs between pairs of different clusterings.
This is helpful for visualizing the relationships between different
clusterings, e.g., to see which parameters most contribute to
differences between clusterings.
aris <- compareClusterings(combinations$clusters)
g <- igraph::graph.adjacency(aris, mode="undirected", weighted=TRUE)
meta2 <- igraph::cluster_walktrap(g)
plot(g, mark.groups=meta2)
We can also identify groups of clusterings, typically corresponding to parameter combinations that yield more-or-less similar results. This allows us to prune out combinations that are largely redundant prior to downstream analyses.
## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] pheatmap_1.0.12 bluster_1.17.0
## [3] scater_1.33.4 ggplot2_3.5.1
## [5] scran_1.33.2 scuttle_1.15.5
## [7] scRNAseq_2.19.1 SingleCellExperiment_1.27.2
## [9] SummarizedExperiment_1.35.5 Biobase_2.65.1
## [11] GenomicRanges_1.57.2 GenomeInfoDb_1.41.2
## [13] IRanges_2.39.2 S4Vectors_0.43.2
## [15] BiocGenerics_0.51.3 MatrixGenerics_1.17.1
## [17] matrixStats_1.4.1 BiocStyle_2.33.1
##
## loaded via a namespace (and not attached):
## [1] BiocIO_1.15.2 bitops_1.0-9 filelock_1.0.3
## [4] tibble_3.2.1 XML_3.99-0.17 lifecycle_1.0.4
## [7] httr2_1.0.5 edgeR_4.3.21 doParallel_1.0.17
## [10] lattice_0.22-6 ensembldb_2.29.1 alabaster.base_1.7.0
## [13] magrittr_2.0.3 limma_3.61.12 sass_0.4.9
## [16] rmarkdown_2.28 jquerylib_0.1.4 yaml_2.3.10
## [19] metapod_1.13.0 DBI_1.2.3 buildtools_1.0.0
## [22] RColorBrewer_1.1-3 abind_1.4-8 zlibbioc_1.51.2
## [25] AnnotationFilter_1.31.0 RCurl_1.98-1.16 rappdirs_0.3.3
## [28] GenomeInfoDbData_1.2.13 ggrepel_0.9.6 irlba_2.3.5.1
## [31] kohonen_3.0.12 alabaster.sce_1.7.0 maketools_1.3.1
## [34] dqrng_0.4.1 codetools_0.2-20 DelayedArray_0.31.14
## [37] tidyselect_1.2.1 UCSC.utils_1.1.0 farver_2.1.2
## [40] gmp_0.7-5 ScaledMatrix_1.13.0 viridis_0.6.5
## [43] BiocFileCache_2.13.2 dynamicTreeCut_1.63-1 GenomicAlignments_1.41.0
## [46] jsonlite_1.8.9 BiocNeighbors_1.99.3 iterators_1.0.14
## [49] foreach_1.5.2 tools_4.4.1 Rcpp_1.0.13
## [52] glue_1.8.0 gridExtra_2.3 SparseArray_1.5.45
## [55] xfun_0.48 dplyr_1.1.4 HDF5Array_1.33.8
## [58] gypsum_1.1.6 withr_3.0.2 BiocManager_1.30.25
## [61] fastmap_1.2.0 rhdf5filters_1.17.0 fansi_1.0.6
## [64] digest_0.6.37 rsvd_1.0.5 R6_2.5.1
## [67] colorspace_2.1-1 RSQLite_2.3.7 utf8_1.2.4
## [70] generics_0.1.3 rtracklayer_1.65.0 FNN_1.1.4.1
## [73] httr_1.4.7 S4Arrays_1.5.11 uwot_0.2.2
## [76] pkgconfig_2.0.3 gtable_0.3.6 blob_1.2.4
## [79] XVector_0.45.0 sys_3.4.3 htmltools_0.5.8.1
## [82] ProtGenerics_1.37.1 scales_1.3.0 alabaster.matrix_1.7.0
## [85] ClusterR_1.3.3 png_0.1-8 knitr_1.48
## [88] rjson_0.2.23 curl_5.2.3 cachem_1.1.0
## [91] rhdf5_2.49.0 BiocVersion_3.20.0 parallel_4.4.1
## [94] vipor_0.4.7 AnnotationDbi_1.69.0 restfulr_0.0.15
## [97] pillar_1.9.0 grid_4.4.1 alabaster.schemas_1.7.0
## [100] vctrs_0.6.5 BiocSingular_1.21.4 dbplyr_2.5.0
## [103] beachmat_2.21.9 cluster_2.1.6 beeswarm_0.4.0
## [106] evaluate_1.0.1 GenomicFeatures_1.57.1 cli_3.6.3
## [109] locfit_1.5-9.10 compiler_4.4.1 Rsamtools_2.21.2
## [112] rlang_1.1.4 crayon_1.5.3 mbkmeans_1.21.0
## [115] labeling_0.4.3 ggbeeswarm_0.7.2 alabaster.se_1.7.0
## [118] viridisLite_0.4.2 BiocParallel_1.39.0 munsell_0.5.1
## [121] Biostrings_2.73.2 lazyeval_0.2.2 Matrix_1.7-1
## [124] ExperimentHub_2.13.1 benchmarkme_1.0.8 bit64_4.5.2
## [127] Rhdf5lib_1.27.0 KEGGREST_1.45.1 statmod_1.5.0
## [130] alabaster.ranges_1.7.0 highr_0.11 apcluster_1.4.13
## [133] AnnotationHub_3.15.0 igraph_2.1.1 memoise_2.0.1
## [136] bslib_0.8.0 benchmarkmeData_1.0.4 bit_4.5.0