Vignette for Dune: merging clusters to improve replicability through ARI merging

Installation

if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}
BiocManager::install("Dune")

We use a subset of the Allen Smart-Seq nuclei dataset. Run ?Dune::nuclei for more details on pre-processing.

suppressPackageStartupMessages({
  library(RColorBrewer)
  library(dplyr)
  library(ggplot2)
  library(tidyr)
  library(knitr)
  library(purrr)
  library(Dune)
})
data("nuclei", package = "Dune")
theme_set(theme_classic())

Initial visualization

We have a dataset of 1744 cells, with the results from 3 clustering algorithms: Seurat3, Monocle3 and SC3. The Allen Institute also provides hand-picked cluster and subclass labels. Finally, we include the coordinates from a t-SNE representation, for visualization.

ggplot(nuclei, aes(x = x, y = y, col = subclass_label)) +
  geom_point()

We can also see how the three clustering algorithms partitioned the dataset initially:

walk(c("SC3", "Seurat", "Monocle"), function(clus_algo){
  df <- nuclei
  df$clus_algo <- nuclei[, clus_algo]
  p <- ggplot(df, aes(x = x, y = y, col = as.character(clus_algo))) +
    geom_point(size = 1.5) +
    # guides(color = FALSE) +
    labs(title = clus_algo, col = "clusters") +
    theme(legend.position = "bottom")
  print(p)
})

Merging with Dune

Initial ARI

The Adjusted Rand Index (ARI) between each pair of the three methods can be computed and plotted.

plotARIs(nuclei %>% select(SC3, Seurat, Monocle))
## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## ℹ The deprecated feature was likely used in the Dune package.
##   Please report the issue to the authors.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

As we can see, the ARI between the three methods is initially quite low.
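To make the metric concrete, here is a minimal base R sketch of the ARI between two label vectors, computed from their contingency table. The function name `ari` is hypothetical and this is not the code Dune uses internally; it only illustrates the standard definition.

```r
# Hypothetical helper: Adjusted Rand Index between two label vectors.
# Illustrative only; not Dune's internal implementation.
ari <- function(a, b) {
  stopifnot(length(a) == length(b))
  tab <- table(a, b)                            # contingency table of the two labelings
  sum_cells <- sum(choose(tab, 2))              # pairs grouped together in both labelings
  sum_rows  <- sum(choose(rowSums(tab), 2))     # pairs grouped together in a
  sum_cols  <- sum(choose(colSums(tab), 2))     # pairs grouped together in b
  expected  <- sum_rows * sum_cols / choose(sum(tab), 2)
  max_index <- (sum_rows + sum_cols) / 2
  # Undefined (NaN) when both labelings consist of a single cluster
  (sum_cells - expected) / (max_index - expected)
}

# Same partition under different label names: perfect agreement
ari_identical <- ari(c(1, 1, 2, 2), c("x", "x", "y", "y"))
# Partitions that cut across each other: agreement below chance
ari_crossed <- ari(c(1, 1, 2, 2), c(1, 2, 1, 2))
```

An ARI of 1 means the two partitions are identical up to relabeling, while values near 0 indicate agreement no better than chance.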

Actual merging

We can now try to merge clusters with the Dune function. At each step, the algorithm prints which clustering is merged (by its name, e.g. SC3), as well as the pair of clusters that get merged.

merger <- Dune(clusMat = nuclei %>% select(SC3, Seurat, Monocle), verbose = TRUE)
## [1] "SC3" "21"  "20" 
## [1] "Monocle" "20"      "4"      
## [1] "SC3" "11"  "12" 
## [1] "SC3" "30"  "28" 
## [1] "SC3" "11"  "24"
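Each line of the output above names one clustering and one pair of its clusters. The idea behind a single step can be sketched on toy data: try every pair of clusters in one labeling, and keep the merge that most increases the mean ARI with the other labelings. The helpers `ari` and `best_merge` below are hypothetical names used for illustration; this is a simplified sketch, not Dune's actual implementation.

```r
# Hypothetical helper: Adjusted Rand Index between two label vectors
ari <- function(a, b) {
  tab <- table(a, b)
  sum_cells <- sum(choose(tab, 2))
  sum_rows  <- sum(choose(rowSums(tab), 2))
  sum_cols  <- sum(choose(colSums(tab), 2))
  expected  <- sum_rows * sum_cols / choose(sum(tab), 2)
  (sum_cells - expected) / ((sum_rows + sum_cols) / 2 - expected)
}

# Hypothetical helper: find the pair of clusters in `target` whose merge
# gives the largest gain in mean ARI against the labelings in `others`.
# Assumes `target` has at least two clusters.
best_merge <- function(target, others) {
  mean_ari <- function(x) mean(vapply(others, function(o) ari(x, o), numeric(1)))
  base  <- mean_ari(target)
  pairs <- combn(sort(unique(target)), 2)      # one candidate pair per column
  gains <- apply(pairs, 2, function(p) {
    merged <- target
    merged[merged == p[2]] <- p[1]             # relabel the second cluster as the first
    mean_ari(merged) - base
  })
  best <- which.max(gains)
  list(pair = pairs[, best], gain = gains[best])
}

# Toy data: `fine` splits the second group of `coarse` into clusters 2 and 3
coarse <- c(1, 1, 1, 2, 2, 2, 2)
fine   <- c(1, 1, 1, 2, 2, 3, 3)
step   <- best_merge(fine, others = list(coarse))
step$pair   # the pair of clusters whose merge helps most
step$gain   # the corresponding gain in mean ARI
```

In this toy case, merging clusters 2 and 3 of `fine` makes it identical to `coarse`, so that pair is selected. Repeating such steps across all labelings until no merge yields a positive gain gives a greedy procedure in the spirit of the one described here.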

The output from Dune is a list with five components:

names(merger)
## [1] "initialMat" "currentMat" "merges"     "ImpMetric"  "metric"

initialMat is the initial matrix of cluster labels. currentMat is the final matrix of cluster labels. merges is a matrix that recapitulates what has been printed above, while ImpMetric lists the improvement in the merging metric (here, the ARI) over the merges, and metric stores the name of that metric.

ARI improvement

We can now see how much the ARI has improved:

plotARIs(clusMat = merger$currentMat)

The methods now look much more similar, as can be expected.

We can also see how the number of clusters got reduced.

plotPrePost(merger)
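The same before/after comparison can be made numerically by counting the distinct labels in each column. The sketch below uses small toy data frames standing in for merger$initialMat and merger$currentMat; the helper name `count_clusters` is hypothetical.

```r
# Hypothetical helper: number of distinct clusters in each column of a label matrix
count_clusters <- function(clusMat) {
  vapply(as.data.frame(clusMat),
         function(labels) length(unique(labels)),
         integer(1))
}

# Toy stand-ins for the label matrices before and after merging
initial <- data.frame(A = c(1, 2, 3, 4), B = c(1, 1, 2, 3))
current <- data.frame(A = c(1, 1, 3, 3), B = c(1, 1, 2, 2))

n_before <- count_clusters(initial)
n_after  <- count_clusters(current)
n_before - n_after   # number of clusters removed per method
```

With a real merger object, the same helper could be applied to merger$initialMat and merger$currentMat.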

For SC3, for example, we can visualize how the clusters got merged:

ConfusionPlot(merger$initialMat[, "SC3"], merger$currentMat[, "SC3"]) +
  labs(x = "Before merging", y = "After merging")
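For a tabular view of how clusters were merged, the two label vectors can be cross-tabulated with base R. The toy vectors below are assumptions standing in for merger$initialMat[, "SC3"] and merger$currentMat[, "SC3"].

```r
# Toy stand-ins for one method's labels before and after merging
before <- c(1, 1, 2, 2, 3, 3, 3)
after  <- c(1, 1, 2, 2, 2, 2, 2)

# Rows: initial clusters; columns: merged clusters.
# Each row has a single non-zero entry, because merging only combines clusters.
confusion <- table(before = before, after = after)
confusion
```

Here initial clusters 2 and 3 both fall entirely into merged cluster 2, which is the pattern a merge produces.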

Finally, the ARItrend function plots the mean ARI improvement as pairs of clusters get merged.

ARItrend(merger)

Session

sessionInfo()
## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Etc/UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] Dune_1.19.0        purrr_1.0.2        knitr_1.48         tidyr_1.3.1       
## [5] ggplot2_3.5.1      dplyr_1.1.4        RColorBrewer_1.1-3 rmarkdown_2.28    
## 
## loaded via a namespace (and not attached):
##  [1] SummarizedExperiment_1.35.5 gtable_0.3.6               
##  [3] xfun_0.48                   bslib_0.8.0                
##  [5] Biobase_2.67.0              lattice_0.22-6             
##  [7] vctrs_0.6.5                 tools_4.4.1                
##  [9] generics_0.1.3              stats4_4.4.1               
## [11] parallel_4.4.1              tibble_3.2.1               
## [13] fansi_1.0.6                 highr_0.11                 
## [15] pkgconfig_2.0.3             Matrix_1.7-1               
## [17] S4Vectors_0.43.2            lifecycle_1.0.4            
## [19] GenomeInfoDbData_1.2.13     farver_2.1.2               
## [21] compiler_4.4.1              progress_1.2.3             
## [23] munsell_0.5.1               aricode_1.0.3              
## [25] codetools_0.2-20            GenomeInfoDb_1.41.2        
## [27] htmltools_0.5.8.1           sys_3.4.3                  
## [29] buildtools_1.0.0            sass_0.4.9                 
## [31] yaml_2.3.10                 pillar_1.9.0               
## [33] crayon_1.5.3                jquerylib_0.1.4            
## [35] BiocParallel_1.39.0         cachem_1.1.0               
## [37] DelayedArray_0.31.14        abind_1.4-8                
## [39] tidyselect_1.2.1            digest_0.6.37              
## [41] stringi_1.8.4               labeling_0.4.3             
## [43] maketools_1.3.1             fastmap_1.2.0              
## [45] grid_4.4.1                  colorspace_2.1-1           
## [47] cli_3.6.3                   SparseArray_1.5.45         
## [49] magrittr_2.0.3              S4Arrays_1.5.11            
## [51] utf8_1.2.4                  withr_3.0.2                
## [53] prettyunits_1.2.0           scales_1.3.0               
## [55] UCSC.utils_1.1.0            XVector_0.45.0             
## [57] httr_1.4.7                  matrixStats_1.4.1          
## [59] hms_1.1.3                   evaluate_1.0.1             
## [61] GenomicRanges_1.57.2        IRanges_2.39.2             
## [63] viridisLite_0.4.2           gganimate_1.0.9            
## [65] rlang_1.1.4                 Rcpp_1.0.13                
## [67] glue_1.8.0                  tweenr_2.0.3               
## [69] BiocGenerics_0.53.0         jsonlite_1.8.9             
## [71] R6_2.5.1                    MatrixGenerics_1.17.1      
## [73] zlibbioc_1.51.2