ClustAll User’s Guide

Introduction

ClustAll is an R package designed for patient stratification in complex diseases. It addresses common challenges encountered in clinical data analysis and provides a versatile framework for identifying patient subgroups.

Patient stratification is essential in biomedical research for understanding disease heterogeneity, identifying prognostic factors, and guiding personalized treatment strategies. The underlying concept of ClustAll is that a robust stratification should be reproducible across different clustering methods. To this end, ClustAll combines two distance metrics (correlation-based distance and Gower distance) with three clustering methods (K-means, K-medoids, and hierarchical clustering).
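
As a rough illustration of this idea (not the ClustAll pipeline itself), the sketch below clusters a simulated toy dataset with two base R methods and cross-tabulates the resulting partitions. The data, the seed, and the choice of kmeans and hclust are assumptions made only for this example.

```r
# Toy illustration only: simulated data, not the ClustAll pipeline.
set.seed(1)

# Two well-separated groups of 20 simulated "patients", two features each
x <- rbind(
  matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(40, mean = 5, sd = 0.3), ncol = 2)
)

# Partition the same data with two different clustering methods
km <- kmeans(x, centers = 2, nstart = 10)$cluster
hc <- cutree(hclust(dist(x), method = "complete"), k = 2)

# Cluster labels are arbitrary, so compare the partitions with a cross-tabulation
agreement <- table(kmeans = km, hclust = hc)
print(agreement)
```

When both methods recover the same grouping, each row of the cross-tabulation has a single non-zero cell, whatever the cluster labels happen to be.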

ClustAll key features:

  • Handles Diverse Data Types, including missing values, mixed data, and correlated variables.
  • Provides Multiple Stratification Solutions, enabling exploration of different clustering algorithms and parameters.
  • Robustness Analysis, to identify stable and reproducible clusters.
  • Validation, for assessing the reliability of clustering results using clinical phenotypes (ground truth) if available.
  • Visualization functions for interpreting clustering results and comparing different stratifications.
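
To make the handling of mixed data and missing values more concrete, here is a minimal hand-written sketch of a Gower-style distance between two patients. The helper gower_pair and the toy data frame are hypothetical and are not part of ClustAll; in practice a package implementation such as cluster::daisy(metric = "gower") would normally be used.

```r
# Hypothetical helper (not part of ClustAll): Gower-style distance between
# rows i and j of a data frame with numeric and character columns.
gower_pair <- function(i, j, df) {
  num <- 0
  den <- 0
  for (col in names(df)) {
    xi <- df[[col]][i]
    xj <- df[[col]][j]
    if (is.na(xi) || is.na(xj)) next          # skip variables missing for this pair
    if (is.numeric(df[[col]])) {
      r <- diff(range(df[[col]], na.rm = TRUE))
      d <- if (r == 0) 0 else abs(xi - xj) / r  # range-scaled absolute difference
    } else {
      d <- as.numeric(xi != xj)                 # simple mismatch for categories
    }
    num <- num + d
    den <- den + 1
  }
  if (den == 0) NA_real_ else num / den         # average over usable variables
}

# Made-up toy data with one missing value
toy <- data.frame(
  age = c(30, 50, 70),
  sex = c("F", "M", "F"),
  stage = c("I", NA, "II"),
  stringsAsFactors = FALSE
)

d12 <- gower_pair(1, 2, toy)  # stage is missing for patient 2, so it is skipped
d13 <- gower_pair(1, 3, toy)
```

Because each variable contributes a value between 0 and 1 and missing entries are skipped pair by pair, numeric and categorical features can be combined without imputing first.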

Interpreting ClustAll Stratification Output

ClustAll stratifications are named with the prefix cuts_ followed by a letter and a number, such as cuts_a_9. The letter denotes the combination of distance metric and clustering method used to generate the stratification, while the number identifies the embedding, which is determined by the depth at which the dendrogram of grouped variables was cut.

ClustAll Stratification Output Interpretation

Nomenclature   Distance metric   Clustering method
a              Correlation       K-means
b              Correlation       Hierarchical clustering
c              Gower             K-medoids
d              Gower             Hierarchical clustering
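
The naming scheme above can be decoded programmatically. The helper below is a hypothetical convenience function, not part of ClustAll; it only encodes the lookup table shown in this section.

```r
# Hypothetical helper (not part of ClustAll): decode a stratification name
# such as "cuts_c_9" using the nomenclature table above.
decode_stratification <- function(name) {
  parts <- regmatches(name, regexec("^cuts_([a-d])_([0-9]+)$", name))[[1]]
  if (length(parts) != 3) stop("Unexpected stratification name: ", name)
  lookup <- data.frame(
    letter = c("a", "b", "c", "d"),
    distance = c("Correlation", "Correlation", "Gower", "Gower"),
    clustering = c("K-means", "Hierarchical clustering",
                   "K-medoids", "Hierarchical clustering"),
    stringsAsFactors = FALSE
  )
  row <- lookup[lookup$letter == parts[2], ]
  list(distance = row$distance,
       clustering = row$clustering,
       embedding = as.integer(parts[3]))
}

res <- decode_stratification("cuts_c_9")
str(res)
```

For cuts_c_9 this returns the Gower distance, K-medoids clustering, and embedding 9.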
Schematic representation of the ClustAll pipeline

Installation

ClustAll is developed using S4 object-oriented programming, and requires R (>=4.2.0). It utilizes other R packages that are currently available from CRAN and Bioconductor.

You can find the package repository on GitHub, ClustAll.

The ClustAll package can be downloaded and installed by running the following code within R:

if (!require("BiocManager", quietly = TRUE)) install.packages("BiocManager")

BiocManager::install("ClustAll")

After installation, load the ClustAll package:

library(ClustAll)

ClustAll Application Example

We will use the data provided in the ClustAll package to demonstrate how to stratify a population using clinical data.

Breast Cancer Wisconsin (Diagnostic). ClustAll includes a real breast cancer dataset, described at doi: 10.24432/C5DW2B. The dataset comprises two types of features, categorical and numerical, derived from a digitized image of a fine needle aspirate (FNA) of a breast mass from 569 patients. Each patient is characterized by 30 features (10x3) and belongs to one of two target classes: ‘malignant’ or ‘benign’. To showcase how ClustAll deals with missing data, random missing values were introduced into a modified version of the dataset. The breast cancer dataset includes the following features:

  1. radius: Mean of distances from the center to points on the perimeter.
  2. texture: Standard deviation of gray-scale values.
  3. perimeter: Perimeter of the breast mass affected by the cancer.
  4. area: Area of the breast mass affected by the cancer.
  5. smoothness: Local variation in radius lengths.
  6. compactness: (Perimeter^2 / Area) - 1.0.
  7. concavity: Severity of concave portions of the contour.
  8. concave points: Number of concave portions of the contour.
  9. symmetry: Degree of symmetry in the shape and structure of the breast mass, with higher values indicating greater symmetry and lower values indicating asymmetry.
  10. fractal dimension: “Coastline approximation” - 1.

The dataset also includes the patient ID and diagnosis (M = malignant, B = benign).

We denote the data set as BreastCancerWisconsin (wdbc).

Get data from example

The breast cancer dataset, which is available on Kaggle, can be loaded with the following code:

data("BreastCancerWisconsin", package = "ClustAll") 

data_use <- subset(wdbc, select=-ID)

An initial exploration reveals the absence of missing values. The dataset comprises 30 numerical features and one categorical feature (the ground truth). As the initial data do not contain missing values, we apply the ClustAll workflow without imputation.

sum(is.na(data_use)) 
#> [1] 0
dim(data_use)
#> [1] 569  31
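
When a dataset does contain missing values, it helps to see where they are before creating the object. The sketch below uses only base R on a made-up toy data frame; the column names and values are assumptions chosen for illustration.

```r
# Toy data frame standing in for a clinical dataset (values are made up)
toy <- data.frame(
  radius = c(17.9, NA, 19.7, 11.4),
  texture = c(10.4, 17.8, NA, NA),
  Diagnosis = c("M", "M", "B", "B"),
  stringsAsFactors = FALSE
)

# Number of missing values per feature
na_per_column <- colSums(is.na(toy))
na_per_column

# Storage type of each feature (numeric vs. character)
column_types <- vapply(toy, function(col) class(col)[1], character(1))
column_types

# Number of patients with no missing value in any feature
complete_rows <- sum(complete.cases(toy))
complete_rows
```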

Create the ClustAll object

First, the ClustAllObject is created and stored. In this step, the colValidation argument indicates which feature, if any, contains the ground truth (true labels). This feature is not considered when computing the stratification. In this specific case, it corresponds to “Diagnosis”.

obj_noNA <- createClustAll(data = data_use, nImputation = NULL, 
                           dataImputed = NULL, colValidation = "Diagnosis")
#> The dataset contains character values.
#> They will be transformed into categorical (more than one class) or binary (one class).
#> Before continuing, check that the transformation has been processed correctly.
#> 
#> ClustALL object was created successfully. You can run runClustAll.

Execute the ClustAll algorithm

Next, we apply the ClustAll algorithm. The output is stored in a ClustAllObject, which contains the clustering results.

obj_noNA1 <- runClustAll(Object = obj_noNA, threads = 2, simplify = FALSE)
#>       ______ __              __   ___     __     __
#>      / ____// /__  __ _____ / /_ /   |   / /    / /
#>     / /    / // / / // ___// __// /| |  / /    / /
#>    / /___ / // /_/ /(__  )/ /_ / ___ | / /___ / /___
#>   /_____//_/ |__,_//____/ |__//_/  |_|/_____//_____/
#> Running Data Complexity Reduction and Stratification Process.
#> This step may take some time...
#> 
#> 
#> Calculating the correlation distance matrix of the stratifications...
#> 
#> 
#> Filtering non-robust stratifications...
#> 
#> ClustAll pipeline finished successfully!

We show the object:

obj_noNA1
#> ClustAllObject
#> Data: Number of variables: 30. Number of patients: 569
#> Imputated: NO.
#> Number of imputations: 0
#> Processed: TRUE
#> Number of stratifications: 67

Represent the Jaccard Distance between population-robust stratifications

To display the population-robust stratifications (>85% bootstrapping stability), we call plotJACCARD with the ClustAllObject as input. The stratification_similarity argument sets the threshold above which a pair of stratifications is considered similar.

In this specific case, a similarity of 0.9 reveals three different groups of alternatives for stratifying the population, indicated by the red rectangles:


plotJACCARD(Object = obj_noNA1, stratification_similarity = 0.9, paint = TRUE)

Correlation matrix heatmap. It depicts the similarity between population-robust stratifications. The dashed red rectangles highlight alternative stratification solutions, formed by stratifications that exhibit a certain level of similarity. The heatmap row annotation describes the combination of parameters (distance metric, clustering method, and embedding number) from which each stratification is derived.
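
To give an intuition for how the similarity between two stratifications can be quantified, the sketch below computes a pair-counting Jaccard index between two partitions of the same patients. This is one common definition and is offered as an assumption; the exact computation performed inside ClustAll is not described in this guide, and jaccard_partitions is a hypothetical helper.

```r
# Hypothetical helper (not part of ClustAll): pair-counting Jaccard index.
# It is the number of patient pairs grouped together in both partitions,
# divided by the number of pairs grouped together in at least one of them.
jaccard_partitions <- function(a, b) {
  stopifnot(length(a) == length(b))
  same_a <- outer(a, a, "==")
  same_b <- outer(b, b, "==")
  pairs <- upper.tri(same_a)            # count each unordered pair once
  both <- sum(same_a[pairs] & same_b[pairs])
  either <- sum(same_a[pairs] | same_b[pairs])
  if (either == 0) return(1)            # no co-clustered pairs in either partition
  both / either
}

p1 <- c(1, 1, 1, 2, 2, 2)
p2 <- c(2, 2, 2, 1, 1, 1)   # same grouping as p1, labels swapped
p3 <- c(1, 1, 2, 2, 2, 2)   # one patient assigned differently

j12 <- jaccard_partitions(p1, p2)
j13 <- jaccard_partitions(p1, p3)
c(j12 = j12, j13 = j13)
```

The index is 1 for identical groupings regardless of label names (p1 and p2) and drops to 4/9 when a single patient out of six changes cluster (p1 and p3), which is why a threshold such as 0.9 only groups nearly identical stratifications.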

Retrieve stratification representatives

We can display the centroid (a representative) of each group of alternative stratification solutions (highlighted by the red rectangles in the previous step) using resStratification. Each representative stratification shows the number of clusters and the number of patients belonging to each cluster.

In this case, the alternative stratifications have been computed with the following specifications:

  • cuts_a_28: This stratification was generated using Embedding 28 with the correlation distance metric and the K-means clustering algorithm. It consists of two clusters, with 183 and 386 patients, respectively.
  • cuts_c_9: This stratification was generated using Embedding 9 with the Gower distance metric and the K-medoids clustering algorithm. It consists of two clusters, with 197 and 372 patients, respectively.
  • cuts_c_4: This stratification was generated using Embedding 4 with the Gower distance metric and the K-medoids clustering algorithm. It consists of two clusters, with 199 and 370 patients, respectively.

resStratification(Object = obj_noNA1, population = 0.05, 
                  stratification_similarity = 0.9, all = FALSE)
#> $cuts_a_28
#> $cuts_a_28[[1]]
#> 
#>   1   2 
#> 183 386 
#> 
#> 
#> $cuts_c_9
#> $cuts_c_9[[1]]
#> 
#>   1   2 
#> 197 372 
#> 
#> 
#> $cuts_c_4
#> $cuts_c_4[[1]]
#> 
#>   1   2 
#> 199 370

Generate Sankey diagrams comparing pairs of stratifications, or a stratification with the ground truth

The plotSANKEY function compares pairs of representative stratifications. It takes the ClustAllObject as input, and the stratifications to compare are specified in the clusters argument.

In this case, the first Sankey plot illustrates patient transitions between two representative stratifications (cuts_c_9 and cuts_a_28), revealing the flow and distribution of patients across the clusters. The second Sankey plot illustrates patient transitions between a representative stratification (cuts_a_28) and the ground truth.

Flow and distribution of patients across clusters. Patient transitions between representative stratifications cuts_c_9 and cuts_a_28.

Flow and distribution of patients across clusters. Patient transitions between representative stratification cuts_a_28 and the ground truth.
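
The information a Sankey diagram conveys is essentially a cross-tabulation of two cluster assignments. The base R sketch below shows this on made-up assignments for eight patients; the vectors strat_1 and strat_2 are assumptions for illustration and are not taken from the ClustAll results.

```r
# Toy cluster assignments for eight patients (made-up values)
strat_1 <- c(1, 1, 1, 2, 2, 2, 2, 1)
strat_2 <- c(1, 1, 2, 2, 2, 2, 1, 1)

# Each cell counts the patients moving from a cluster in strat_1 (rows)
# to a cluster in strat_2 (columns); these counts correspond to link widths
flows <- table(strat_1 = strat_1, strat_2 = strat_2)
flows

# Long format: one row per link (source cluster, target cluster, patients)
links <- as.data.frame(flows, responseName = "patients")
links
```

Large diagonal counts indicate that the two stratifications largely agree, while off-diagonal counts identify the patients who change cluster.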

Retrieve the original dataset with the selected ClustAll stratification(s)

The stratification representatives can be added to the initial dataset to facilitate further exploration.

In this case, we add the three stratification representatives to the dataset. For simplicity, we show only the first three rows of the dataset:

df <- cluster2data(Object = obj_noNA1,
                   stratificationName = c("cuts_c_9","cuts_a_28","cuts_c_4"))
head(df,3)
#>   radius1 texture1 perimeter1 area1 smoothness1 compactness1 concavity1
#> 1   17.99    10.38      122.8  1001     0.11840      0.27760     0.3001
#> 2   20.57    17.77      132.9  1326     0.08474      0.07864     0.0869
#> 3   19.69    21.25      130.0  1203     0.10960      0.15990     0.1974
#>   concave_points1 symmetry1 fractal_dimension1 radius2 texture2 perimeter2
#> 1         0.14710    0.2419            0.07871  1.0950   0.9053      8.589
#> 2         0.07017    0.1812            0.05667  0.5435   0.7339      3.398
#> 3         0.12790    0.2069            0.05999  0.7456   0.7869      4.585
#>    area2 smoothness2 compactness2 concavity2 concave_points2 symmetry2
#> 1 153.40    0.006399      0.04904    0.05373         0.01587   0.03003
#> 2  74.08    0.005225      0.01308    0.01860         0.01340   0.01389
#> 3  94.03    0.006150      0.04006    0.03832         0.02058   0.02250
#>   fractal_dimension2 radius3 texture3 perimeter3 area3 smoothness3 compactness3
#> 1           0.006193   25.38    17.33      184.6  2019      0.1622       0.6656
#> 2           0.003532   24.99    23.41      158.8  1956      0.1238       0.1866
#> 3           0.004571   23.57    25.53      152.5  1709      0.1444       0.4245
#>   concavity3 concave_points3 symmetry3 fractal_dimension3 cuts_c_9 cuts_a_28
#> 1     0.7119          0.2654    0.4601            0.11890        1         1
#> 2     0.2416          0.1860    0.2750            0.08902        1         1
#> 3     0.4504          0.2430    0.3613            0.08758        1         1
#>   cuts_c_4
#> 1        1
#> 2        1
#> 3        1
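
Once the cluster labels sit next to the features, standard base R tools can summarize each cluster. The sketch below uses a small made-up data frame that mimics the structure of the output above (two features and one stratification column); the values are assumptions, not rows of the real dataset.

```r
# Toy stand-in for the returned data frame (made-up values)
toy <- data.frame(
  radius1 = c(18.0, 20.6, 19.7, 11.4, 12.5, 13.1),
  texture1 = c(10.4, 17.8, 21.3, 20.4, 14.3, 15.7),
  cuts_c_9 = c(1, 1, 1, 2, 2, 2)
)

# Mean of each feature within each cluster of the chosen stratification
cluster_means <- aggregate(cbind(radius1, texture1) ~ cuts_c_9,
                           data = toy, FUN = mean)
cluster_means

# Number of patients per cluster
cluster_sizes <- table(toy$cuts_c_9)
cluster_sizes
```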

Assess the sensitivity and specificity of the selected ClustAll stratifications against ground truth (if available)

To evaluate the performance of the selected ClustAll stratifications against ground truth, sensitivity and specificity can be assessed using validateStratification. Higher values indicate closer agreement between the stratification and the ground truth.

In this particular case, our method retrieves three stratification representatives with a sensitivity and specificity exceeding 80% and 90%, respectively, despite being computed using different methods. These results underscore the notion that a robust stratification should be consistent across diverse clustering methods.

# STRATIFICATION 1
validateStratification(obj_noNA1, "cuts_a_28")
#> sensitivity specificity 
#>   0.8207547   0.9747899
# STRATIFICATION 2
validateStratification(obj_noNA1, "cuts_c_9")
#> sensitivity specificity 
#>   0.8584906   0.9579832
# STRATIFICATION 3
validateStratification(obj_noNA1, "cuts_c_4")
#> sensitivity specificity 
#>   0.8820755   0.9663866
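
As a worked example of how these two figures arise, the sketch below recomputes them for cuts_a_28 from a 2x2 table. The cell counts are an assumption: they are back-calculated from the reported cluster sizes (183 and 386) and the dataset's 212 malignant and 357 benign cases, not extracted from the package, and cluster 1 is assumed to be the one matched to the malignant class.

```r
# Assumed confusion table for cuts_a_28 (back-calculated, not package output)
confusion <- matrix(
  c(174,  38,    # malignant patients in cluster 1 / cluster 2
      9, 348),   # benign patients in cluster 1 / cluster 2
  nrow = 2, byrow = TRUE,
  dimnames = list(truth = c("M", "B"), cluster = c("1", "2"))
)

# Sensitivity: malignant patients placed in the "malignant" cluster
sensitivity <- confusion["M", "1"] / sum(confusion["M", ])
# Specificity: benign patients placed in the "benign" cluster
specificity <- confusion["B", "2"] / sum(confusion["B", ])

round(c(sensitivity = sensitivity, specificity = specificity), 7)
```

Under these assumptions, 174/212 and 348/357 reproduce the reported sensitivity of 0.8207547 and specificity of 0.9747899.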

Session Info

sessionInfo()
#> R version 4.4.2 (2024-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] ClustAll_1.3.0   BiocStyle_2.35.0
#> 
#> loaded via a namespace (and not attached):
#>   [1] pbapply_1.7-2         rlang_1.1.4           magrittr_2.0.3       
#>   [4] clue_0.3-66           GetoptLong_1.0.5      matrixStats_1.4.1    
#>   [7] compiler_4.4.2        flexmix_2.3-19        png_0.1-8            
#>  [10] vctrs_0.6.5           rmutil_1.1.10         pkgconfig_2.0.3      
#>  [13] shape_1.4.6.1         crayon_1.5.3          fastmap_1.2.0        
#>  [16] backports_1.5.0       utf8_1.2.4            modeest_2.4.0        
#>  [19] rmarkdown_2.29        ps_1.8.1              nloptr_2.1.1         
#>  [22] bit_4.5.0             purrr_1.0.2           jomo_2.7-6           
#>  [25] glmnet_4.1-8          xfun_0.49             modeltools_0.2-23    
#>  [28] cachem_1.1.0          rmio_0.4.0            jsonlite_1.8.9       
#>  [31] flashClust_1.01-2     pan_1.9               fpc_2.2-13           
#>  [34] broom_1.0.7           parallel_4.4.2        prabclus_2.3-4       
#>  [37] cluster_2.1.6         R6_2.5.1              bslib_0.8.0          
#>  [40] RColorBrewer_1.1-3    rpart_4.1.23          boot_1.3-31          
#>  [43] jquerylib_0.1.4       diptest_0.77-1        estimability_1.5.1   
#>  [46] Rcpp_1.0.13-1         iterators_1.0.14      knitr_1.49           
#>  [49] snow_0.4-4            IRanges_2.41.1        igraph_2.1.1         
#>  [52] splines_4.4.2         Matrix_1.7-1          nnet_7.3-19          
#>  [55] tidyselect_1.2.1      yaml_2.3.10           timeDate_4041.110    
#>  [58] doParallel_1.0.17     codetools_0.2-20      lattice_0.22-6       
#>  [61] tibble_3.2.1          stable_1.1.6          evaluate_1.0.1       
#>  [64] survival_3.7-0        circlize_0.4.16       mclust_6.1.1         
#>  [67] kernlab_0.9-33        pillar_1.9.0          BiocManager_1.30.25  
#>  [70] mice_3.16.0           DT_0.33               foreach_1.5.2        
#>  [73] stats4_4.4.2          bigassertr_0.1.6      generics_0.1.3       
#>  [76] S4Vectors_0.45.2      ggplot2_3.5.1         munsell_0.5.1        
#>  [79] scales_1.3.0          ff_4.5.0              timeSeries_4041.111  
#>  [82] minqa_1.2.8           leaps_3.2             class_7.3-22         
#>  [85] glue_1.8.0            statip_0.2.3          clValid_0.7          
#>  [88] emmeans_1.10.5        scatterplot3d_0.3-44  maketools_1.3.1      
#>  [91] tools_4.4.2           robustbase_0.99-4-1   sys_3.4.3            
#>  [94] spatial_7.3-17        lme4_1.1-35.5         fBasics_4041.97      
#>  [97] buildtools_1.0.0      mvtnorm_1.3-2         cowplot_1.1.3        
#> [100] grid_4.4.2            tidyr_1.3.1           colorspace_2.1-1     
#> [103] networkD3_0.4         nlme_3.1-166          flock_0.7            
#> [106] cli_3.6.3             bigparallelr_0.3.2    fansi_1.0.6          
#> [109] ComplexHeatmap_2.23.0 dplyr_1.1.4           doSNOW_1.0.20        
#> [112] gtable_0.3.6          DEoptimR_1.1-3        stabledist_0.7-2     
#> [115] sass_0.4.9            digest_0.6.37         BiocGenerics_0.53.3  
#> [118] ggrepel_0.9.6         FactoMineR_2.11       rjson_0.2.23         
#> [121] htmlwidgets_1.6.4     htmltools_0.5.8.1     lifecycle_1.0.4      
#> [124] multcompView_0.1-10   mitml_0.4-5           GlobalOptions_0.1.2  
#> [127] bigstatsr_1.6.1       MASS_7.3-61