Clustering is a method to identify common pattern in highly dimensional data. This can be for example genes or proteins with similar quantitative changes, thus providing insights into the affected biological pathways.
Despite of numerous clustering algorithms, they do not account for feature variance, i.e. the uncertainty in the measurements across the different experimental conditions. VSClust determines the characteristic patterns in high-dimensional data while accounting for feature variance that is given through replicated measurements.
Here, we present an example script to run the full clustering
analysis using the vsclust
library. The same can be done by
running the Shiny app (e.g. via its docker image or on ), or the
corresponding command line script. For the source code, see .
Use the common Bioconductor commands for installation:
# Uncomment in case you have not installed vsclust yet
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("vsclust")
library(vsclust)
The full functionality of this vignette can be obtained by
additionally installing and loading the packages
matrixStats
and clusterProfiler
Here, we define the different parameters for the example data set
protein_expressions
. In the command-line version of VSClust
(“runVSClust.R”), they can be given via yaml file.
Comments:
A. Data sets with different numbers of replicates per condition need to be adapted to contain the same number of columns per condition. These can be done by either removing excess replicates or adding empty columns.
B. We assume the input data to be of the following format: A1, B1, C1, …, A2, B2, C2, …, where letters denote sample type and numbers are the different replicates.
C. If you prefer to estimate feature variance different, use averages
and add an estimate for the standard deviation as last column. You will
need to set the last option of PreparedForVSClust
to
FALSE
.
D. If you don’t have replicates, use the same format as in C. and set the standard deviations to 1.
library(vsclust)
#### Input parameters, only read when now parameter file was provided
## All principal parameters for running VSClust can be defined as in the
## shinyapp at computproteomics.bmb.sdu.dk/Apps/VSClust
# name of study
Experiment <- "ProtExample"
# Number of replicates/sample per different experimental condition (sample
# type)
NumReps <- 3
# Number of different experimental conditions (e.g. time points or sample
# types)
NumCond <- 4
# Paired or unpaired statistical tests when carrying out LIMMA for
# statistical testing
isPaired <- FALSE
# Number of threads to accelerate the calculation (use 1 in doubt)
cores <- 1
# If 0 (default), then automatically estimate the cluster number for the
# vsclust
# run from the Minimum Centroid Distance
PreSetNumClustVSClust <- 0
# If 0 (default), then automatically estimate the cluster number for the
# original fuzzy c-means from the Minimum Centroid Distance
PreSetNumClustStand <- 0
# max. number of clusters when estimating the number of clusters. Higher
# numbers can drastically extend the computation time.
maxClust <- 10
At first, we load the example proteomics data set and carry out
statistical testing of all conditions version the first based on the
LIMMA moderated t-test. The data consists of mice fed with four
different diets (high fat, TTA, fish oil and TTA+fish oil). Understand more about the data
set with ?protein_expressions
This will calculate the false discovery rates for the differentially regulated features (pairwise comparisons versus the first “high fat” condition) and most importantly, their expected individual variances, to be used in the variance-sensitive clustering. These variances can also be uploaded separately via a last column containing them as individual standard deviations.
The PrepareForVSClust
function also creates a PCA plot
to assess variability and control whether the samples have been loaded
correctly (replicated samples should form groups).
After estimating the standard deviations, the matrix consists of the averaged quantitative feature values and a last column for the standard deviations of the features.
data(protein_expressions)
dat <- protein_expressions
#### running statistical analysis and estimation of individual variances
statOut <- PrepareForVSClust(dat, NumReps, NumCond, isPaired, TRUE)
dat <- statOut$dat
Sds <- dat[,ncol(dat)]
cat(paste("Features:",nrow(dat),"\nMissing values:",
sum(is.na(dat)),"\nMedian standard deviations:",
round(median(Sds,na.rm=TRUE),digits=3)))
## Features: 574
## Missing values: 0
## Median standard deviations: 0.221
There is no simple way to find the optimal number of clusters in a data set. For obtaining this number, we run the clustering for different cluster numbers and evaluate them via so-called validity indices, which provide information about suitable cluster numbers. VSClust uses mainly the “Maximum centroid distances” that denotes the shortest distance between any of the centroids. Alternatively, one can inspect the Xie Beni index.
The output of estimClustNum
contains the suggestion for
the number of clusters.
We further visualize the outcome.
#### Estimate number of clusters with maxClust as maximum number clusters
#### to run the estimation with
ClustInd <- estimClustNum(dat, maxClust=maxClust, scaling="standardize", cores=cores)
## Running cluster number 3
## Running cluster number 4
## Running cluster number 5
## Running cluster number 6
## Running cluster number 7
## Running cluster number 8
## Running cluster number 9
## Running cluster number 10
Now we run the clustering again with the optimal parameters from the estimation. One can take alternative numbers of clusters corresponding to large decays in the Minimum Centroid Distance or low values of the Xie Beni index.
First, we carry out the variance-sensitive method
#### Run clustering (VSClust and standard fcm clustering
ClustOut <- runClustWrapper(dat,
PreSetNumClustVSClust,
NULL,
VSClust=TRUE,
scaling="standardize",
cores=cores)
Bestcl <- ClustOut$Bestcl
VSClust_cl <- Bestcl
#ClustOut$p
## Write clustering results (VSClust)
write.csv(data.frame(cluster=Bestcl$cluster,
ClustOut$outFileClust,
isClusterMember=rowMaxs(Bestcl$membership)>0.5,
maxMembership=rowMaxs(Bestcl$membership),
Bestcl$membership),
paste(Experiment,
"FCMVarMResults",
Sys.Date(),
".csv",
sep=""))
## Write coordinates of cluster centroids
write.csv(Bestcl$centers,
paste(Experiment,
"FCMVarMResultsCentroids",
Sys.Date(),
".csv",
sep=""))
We see that most of the difference are between TTA diets and the rest. This shows that the TTA fatty acids have strong impact on the organisms. Cluster three shows the proteins that a commonly lower abundant in mice fed with fish oil and thus are related to biological processes affected this particular diet.
For comparison, this is the clustering using standard fuzzy c-means of the means over the replicates.
ClustOut <- runClustWrapper(dat, PreSetNumClustStand, NULL, VSClust=FALSE,
scaling="standardize", cores=cores)
Bestcl2 <- ClustOut$Bestcl
## Write clustering results (standard fcm)
write.csv(data.frame(cluster=Bestcl2$cluster,
ClustOut$outFileClust,
isClusterMember=rowMaxs(Bestcl2$membership)>0.5,
maxMembership=rowMaxs(Bestcl2$membership),
Bestcl2$membership),
paste(Experiment,
"FCMResults",
Sys.Date(),
".csv",
sep=""))
## Write coordinates of cluster centroids
write.csv(Bestcl2$centers, paste(Experiment,
"FCMResultsCentroids",
Sys.Date(),
".csv",
sep=""))
Here, the clusters look rather similar. VSClust best performs for larger numbers of different experimental conditions (one finds major improvements for D > 6). For a 4-dimensional data set, the algorithm mostly filters out features with very high variance levels, making them unsuitable for belonging to a particular cluster.
This analysis is then followed by evaluating the features (here
proteins) of each cluster for their biological relevance. This can be
done by functional analysis with e.g. the clusterProfiler
package.
You can use the runFuncEnrich
function to visualize the
results. It retrieves the functional enrichment from STRING. The infosource
argument can be anny category of of the ones given by the category
IDs described in the table of the STRING
API. The given feature IDs are converted in STRING without
specifying the species, which can lead to inconsistencies when using
generic gene names.
The output object is a compareCluster
object from
clusterProfiler
that can be further analyzed or visualized
as done in this vignette. Note that STRING does not provide a
geneRatio
. We use the gene Count
instead.
## got data from STRINGdb
## Reducing number of STRINGdb results
# Take the reduce version of the enrichment (redFuncs), not the full one
# (fullFuncs)
redClustEnriched <- ClustEnriched$redFuncs
# Load the clusterProfiler package
library(clusterProfiler)
# Plot the top 10 most enriched KEGG pathways
dotplot(redClustEnriched, showCategory=10, title="KEGG enrichment", size = "count")
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] clusterProfiler_4.15.1 MultiAssayExperiment_1.33.4
## [3] SummarizedExperiment_1.37.0 Biobase_2.67.0
## [5] GenomicRanges_1.59.1 GenomeInfoDb_1.43.2
## [7] IRanges_2.41.2 S4Vectors_0.45.2
## [9] BiocGenerics_0.53.3 generics_0.1.3
## [11] MatrixGenerics_1.19.1 matrixStats_1.5.0
## [13] vsclust_1.9.10 BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] DBI_1.2.3 gson_0.1.0 rlang_1.1.5
## [4] magrittr_2.0.3 DOSE_4.1.0 compiler_4.4.2
## [7] RSQLite_2.3.9 png_0.1-8 vctrs_0.6.5
## [10] reshape2_1.4.4 stringr_1.5.1 pkgconfig_2.0.3
## [13] crayon_1.5.3 fastmap_1.2.0 XVector_0.47.2
## [16] labeling_0.4.3 promises_1.3.2 rmarkdown_2.29
## [19] enrichplot_1.27.4 UCSC.utils_1.3.1 purrr_1.0.2
## [22] bit_4.5.0.1 xfun_0.50 cachem_1.1.0
## [25] aplot_0.2.4 jsonlite_1.8.9 blob_1.2.4
## [28] later_1.4.1 DelayedArray_0.33.4 BiocParallel_1.41.0
## [31] parallel_4.4.2 R6_2.5.1 RColorBrewer_1.1-3
## [34] bslib_0.8.0 stringi_1.8.4 limma_3.63.3
## [37] jquerylib_0.1.4 GOSemSim_2.33.0 Rcpp_1.0.14
## [40] knitr_1.49 ggtangle_0.0.6 R.utils_2.12.3
## [43] BiocBaseUtils_1.9.0 igraph_2.1.3 httpuv_1.6.15
## [46] Matrix_1.7-1 splines_4.4.2 tidyselect_1.2.1
## [49] qvalue_2.39.0 abind_1.4-8 yaml_2.3.10
## [52] codetools_0.2-20 curl_6.1.0 lattice_0.22-6
## [55] tibble_3.2.1 plyr_1.8.9 withr_3.0.2
## [58] treeio_1.31.0 shiny_1.10.0 KEGGREST_1.47.0
## [61] evaluate_1.0.3 gridGraphics_0.5-1 Biostrings_2.75.3
## [64] ggtree_3.15.0 pillar_1.10.1 BiocManager_1.30.25
## [67] ggfun_0.1.8 ggplot2_3.5.1 tidytree_0.4.6
## [70] munsell_0.5.1 scales_1.3.0 xtable_1.8-4
## [73] glue_1.8.0 lazyeval_0.2.2 maketools_1.3.1
## [76] tools_4.4.2 sys_3.4.3 data.table_1.16.4
## [79] fgsea_1.33.2 buildtools_1.0.0 fs_1.6.5
## [82] fastmatch_1.1-6 cowplot_1.1.3 grid_4.4.2
## [85] tidyr_1.3.1 ape_5.8-1 AnnotationDbi_1.69.0
## [88] colorspace_2.1-1 nlme_3.1-166 patchwork_1.3.0
## [91] GenomeInfoDbData_1.2.13 cli_3.6.3 S4Arrays_1.7.1
## [94] dplyr_1.1.4 gtable_0.3.6 R.methodsS3_1.8.2
## [97] yulab.utils_0.1.9 sass_0.4.9 digest_0.6.37
## [100] ggrepel_0.9.6 ggplotify_0.1.2 SparseArray_1.7.4
## [103] farver_2.1.2 memoise_2.0.1 htmltools_0.5.8.1
## [106] R.oo_1.27.0 lifecycle_1.0.4 httr_1.4.7
## [109] GO.db_3.20.0 statmod_1.5.0 mime_0.12
## [112] bit64_4.6.0-1