UMAP is commonly used in scRNA-seq data analysis as a visualization
tool projecting high dimensional data onto 2 dimensions to visualize
cell clustering. However, UMAP is prone to showing spurious clustering
and distorting distances (Chari, Banerjee, and
Pachter 2021). Moreover, UMAP shows clustering that seems to
correspond to graph-based clusters from Louvain and Leiden because the k
nearest neighbor graph is used in both clustering and UMAP. We have
developed concordex
as a quantitative alternative to UMAP
cluster visualization without the misleading problems of UMAP. This
package is the R implementation of the original Python command line
tool.
In a nutshell, concordex
finds the proportion of cells
among the k-nearest neighbors of each cell with the same cluster or
label as the cell itself. This is computed across all labels and the
average of all labels is returned as a metric that indicates the quality
of clustering. If the clustering separates cells well, then the observed
similarity matrix should be diagonal dominant.
In this vignette, we demonstrate the usage of concordex
on a human peripheral blood mononuclear cells (PBMC) scRNA-seq dataset
from 10X Genomics. The data is loaded as a
SingleCellExperiment
object.
sce <- TENxPBMCData("pbmc3k")
#> see ?TENxPBMCData and browseVignettes('TENxPBMCData') for documentation
#> downloading 1 resources
#> retrieving 1 resource
#> loading from cache
Here we plot the standard QC metrics: total number of UMIs detected
per cell (nCounts
), number of genes detected
(nGenes
), and percentage of UMIs from mitochondrially
encoded genes (pct_mito
).
sce$nCounts <- colSums(counts(sce))
sce$nGenes <- colSums(counts(sce) > 0)
mito_inds <- grepl("^MT-", rowData(sce)$Symbol_TENx)
sce$pct_mito <- colSums(counts(sce)[mito_inds,])/sce$nCounts * 100
p1 <- plotColData(sce, x = "nCounts", y = "nGenes") +
geom_density2d()
p2 <- plotColData(sce, x = "nCounts", y = "pct_mito") +
geom_density2d()
p1 + p2
Remove the outliers and cells with high percentage of mitochondrial counts as the high percentage is not expected biologically from the cell type:
Then normalize the data:
For simplicity, the top 500 highly variable genes are used to perform PCA:
See the number of PCs to use later from the elbow plot:
Percentage of variance explained drops sharply from PC1 to PC5, and
definitely levels off after PC10, so we use the top 10 PCs for
clustering here. The graph based Leiden clustering uses a k nearest
neighbor graph. For demonstration here, we use k = 10
.
set.seed(29)
sce$cluster <- clusterRows(reducedDim(sce, "PCA")[,seq_len(10)],
NNGraphParam(k = 10, cluster.fun = "leiden",
cluster.args = list(
objective_function = "modularity"
)))
See what the clusters look like in PCA space:
plotPCA(sce, color_by = "cluster", ncomponents = 4)
#> Warning in data.frame(gg1$all, df_to_plot[, -reddim_cols]): row names were
#> found from a short variable and have been discarded
Some of the clusters seem well-separated along the first 4 PCs.
Since UMAP is commonly used to visualize the clusters, we plot UMAP
here although we don’t recommend UMAP because it’s prone to showing
spurious clusters and distorting distances. UMAP also uses a k nearest
neighbor graph, and we use the same k = 10
here:
For the most part, the clusters are clearly separated on UMAP.
concordex
Since UMAP is prone to showing spurious clusters, we’ll see what the
concordex
metric says about the clustering and whether it
agrees with UMAP visualization. Here we explicitly obtain the k nearest
neighbor graph, as clustering and UMAP above did not store the graph
itself.
The result here is a list of two n
(number of cell) by
k
matrices. The first is the indices of each cell’s
neighbors, as in an adjacency list that can be matrix here due to the
fixed number of neighbors, and the second is the distances between each
cell and its neighbors. For concordex
, only the first
matrix is relevant. An adjacency matrix, either sparse of dense, as
stored in the Seurat
object, can also be used. Here the
cluster labels are permuted 100 times.
Here the argument compute_similarity
indicates that we
concordex will return the cluster-cluster similarity matrix. The entries
in this matrix itself represent the proportion of cells with each label
in the neighborhood of other cells with the same label.
sim <- attr(res, "similarity")
round(sim, 2)
#> 1 2 3 4 5 6 7 8
#> 1 0.80 0.00 0.00 0.00 0.03 0.00 0.18 0.00
#> 2 0.01 0.98 0.00 0.00 0.00 0.00 0.00 0.00
#> 3 0.00 0.00 0.97 0.00 0.00 0.03 0.00 0.00
#> 4 0.00 0.00 0.00 0.94 0.06 0.00 0.00 0.00
#> 5 0.15 0.00 0.00 0.02 0.76 0.00 0.07 0.00
#> 6 0.00 0.00 0.07 0.00 0.00 0.93 0.00 0.00
#> 7 0.12 0.00 0.00 0.00 0.00 0.00 0.88 0.00
#> 8 0.00 0.00 0.28 0.29 0.00 0.08 0.00 0.33
sessionInfo()
#> R version 4.4.2 (2024-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] patchwork_1.3.0 scater_1.35.0
#> [3] ggplot2_3.5.1 scuttle_1.17.0
#> [5] bluster_1.17.0 BiocNeighbors_2.1.1
#> [7] TENxPBMCData_1.24.0 HDF5Array_1.35.2
#> [9] rhdf5_2.51.0 DelayedArray_0.33.3
#> [11] SparseArray_1.7.2 S4Arrays_1.7.1
#> [13] abind_1.4-8 Matrix_1.7-1
#> [15] SingleCellExperiment_1.29.1 SummarizedExperiment_1.37.0
#> [17] Biobase_2.67.0 GenomicRanges_1.59.1
#> [19] GenomeInfoDb_1.43.2 IRanges_2.41.2
#> [21] S4Vectors_0.45.2 BiocGenerics_0.53.3
#> [23] generics_0.1.3 MatrixGenerics_1.19.0
#> [25] matrixStats_1.4.1 concordexR_1.7.0
#> [27] BiocStyle_2.35.0
#>
#> loaded via a namespace (and not attached):
#> [1] DBI_1.2.3 gridExtra_2.3 rlang_1.1.4
#> [4] magrittr_2.0.3 compiler_4.4.2 RSQLite_2.3.9
#> [7] png_0.1-8 vctrs_0.6.5 pkgconfig_2.0.3
#> [10] SpatialExperiment_1.17.0 crayon_1.5.3 fastmap_1.2.0
#> [13] dbplyr_2.5.0 magick_2.8.5 XVector_0.47.0
#> [16] labeling_0.4.3 utf8_1.2.4 rmarkdown_2.29
#> [19] ggbeeswarm_0.7.2 UCSC.utils_1.3.0 purrr_1.0.2
#> [22] bit_4.5.0.1 xfun_0.49 zlibbioc_1.52.0
#> [25] cachem_1.1.0 beachmat_2.23.4 jsonlite_1.8.9
#> [28] blob_1.2.4 rhdf5filters_1.19.0 Rhdf5lib_1.29.0
#> [31] BiocParallel_1.41.0 irlba_2.3.5.1 parallel_4.4.2
#> [34] cluster_2.1.6 R6_2.5.1 bslib_0.8.0
#> [37] jquerylib_0.1.4 Rcpp_1.0.13-1 knitr_1.49
#> [40] FNN_1.1.4.1 igraph_2.1.1 tidyselect_1.2.1
#> [43] viridis_0.6.5 yaml_2.3.10 codetools_0.2-20
#> [46] curl_6.0.1 lattice_0.22-6 tibble_3.2.1
#> [49] withr_3.0.2 KEGGREST_1.47.0 evaluate_1.0.1
#> [52] isoband_0.2.7 BiocFileCache_2.15.0 ExperimentHub_2.15.0
#> [55] Biostrings_2.75.2 pillar_1.9.0 BiocManager_1.30.25
#> [58] filelock_1.0.3 BiocVersion_3.21.1 sparseMatrixStats_1.19.0
#> [61] munsell_0.5.1 scales_1.3.0 glue_1.8.0
#> [64] maketools_1.3.1 tools_4.4.2 AnnotationHub_3.15.0
#> [67] sys_3.4.3 ScaledMatrix_1.15.0 buildtools_1.0.0
#> [70] grid_4.4.2 AnnotationDbi_1.69.0 colorspace_2.1-1
#> [73] GenomeInfoDbData_1.2.13 beeswarm_0.4.0 BiocSingular_1.23.0
#> [76] vipor_0.4.7 rsvd_1.0.5 cli_3.6.3
#> [79] rappdirs_0.3.3 fansi_1.0.6 viridisLite_0.4.2
#> [82] dplyr_1.1.4 uwot_0.2.2 gtable_0.3.6
#> [85] sass_0.4.9 digest_0.6.37 ggrepel_0.9.6
#> [88] farver_2.1.2 rjson_0.2.23 memoise_2.0.1
#> [91] htmltools_0.5.8.1 lifecycle_1.0.4 httr_1.4.7
#> [94] mime_0.12 MASS_7.3-61 bit64_4.5.2