A correlation based multiview self organizing map for the
characterization of cell types (FuseSOM
) is a tool for
unsupervised clustering. FuseSOM
is robust and achieves
high accuracy by combining a Self Organizing Map
architecture and a Multiview
integration of correlation
based metrics to cluster highly multiplexed in situ imaging cytometry
assays. The FuseSOM
pipeline has been streamlined and
accepts currently used data structures including
SingleCellExperiment
and SpatialExperiment
objects as well as DataFrames
.
This is purely a tool generated for clustering and as such it does
not provide any means for QC and feature selection. It is advisable that
the user first use other tools for quality control and feature selection
before running FuseSOM
.
FuseSOM
Matrix InputIf you have a matrix containing expression data that was QCed and
normalised by some other tool, the next step is to run the
FuseSOM
algorithm.This can be done by calling the
runFuseSOM()
function which takes in the matrix of interest
where the columns are markers and the rows are observations, the makers
of interest (if this is not provided, it is assumed that all columns are
markers), and the number of clusters.
Next we will load in the Risom et al
dataset and run it through the FuseSOM pipeline. This dataset profiles
the spatial landscape of ductal carcinoma in situ (DCIS), which is a
pre-invasive lesion that is thought to be a precursor to invasive breast
cancer (IBC). The key conclusion of this manuscript (amongst others) is
that spatial information about cells can be used to predict disease
progression in patients.We will also be using the markers used in the
original study.
# load in the data
data("risom_dat")
# define the markers of interest
risomMarkers <- c('CD45','SMA','CK7','CK5','VIM','CD31','PanKRT','ECAD',
'Tryptase','MPO','CD20','CD3','CD8','CD4','CD14','CD68','FAP',
'CD36','CD11c','HLADRDPDQ','P63','CD44')
# we will be using the manual_gating_phenotype as the true cell type to gauge
# performance
names(risom_dat)[names(risom_dat) == 'manual_gating_phenotype'] <- 'CellType'
Now that we have loaded the data and define the markers of interest.
We can run the FuseSOM
algorithm. We have provided a
function runFuseSOM
that runs the pipeline from top to
bottom and returns the cluster labels as well as the
Self Organizing Map
model.
## You have provided a dataset of class data.frame
## Everything looks good. Now running the FuseSOM algorithm
## Now Generating the Self Organizing Map Grid
## Optimal Grid Size is: 8
## Now Running the Self Organizing Map Model
## Now Clustering the Prototypes
## Loading required namespace: fastcluster
## Now Mapping Clusters to the Original Data
## The Prototypes have been Clustered and Mapped Successfully
## The FuseSOM algorithm has completed successfully
Lets look at the distribution of the clusters.
##
## cluster_1 cluster_10 cluster_11 cluster_12 cluster_13 cluster_14
## 0.323602021 0.035968538 0.005439775 0.021443334 0.061100586 0.026596050
## cluster_15 cluster_16 cluster_17 cluster_18 cluster_19 cluster_2
## 0.020582156 0.032624297 0.024931106 0.076128143 0.015802618 0.014927087
## cluster_20 cluster_21 cluster_22 cluster_23 cluster_3 cluster_4
## 0.049962682 0.009185900 0.051771156 0.066913538 0.004923068 0.014108968
## cluster_5 cluster_6 cluster_7 cluster_8 cluster_9
## 0.040776783 0.064444827 0.020854863 0.010032725 0.007879780
Looks like cluster_1
has about 32% of the cells which is interesting. Next,
lets generate a heatmap of the marker expression for each cluster.
FuseSOM
to estimate the number of clustersFuseSOM
also provides functionality for estimating the
number of clusters in a dataset using three classes of methods
including:
We can estimate the number of clusters using the
estimateNumCluster
. Run
help(estimateNumCluster)
to see it’s complete
functionality.
# lets estimate the number of clusters using all the methods
# original clustering has 23 clusters so we will set kseq from 2:25
# we pass it the som model generated in the previous step
risomKest <- estimateNumCluster(data = risomRes$model, kSeq = 2:25,
method = c("Discriminant", "Distance"))
## Now Computing the Number of Clusters using Discriminant Analysis
## Now Computing The Number Of Clusters Using Distance Analysis
We can then use this result to determine the best number of clusters
for this dataset based on the different metrics. The
FuseSOM
package provides a plotting function
(optiPlot
) which generates an elbow plot with the optimal
value for the number of clusters for the distance based methods. See
below
# what is the best number of clusters determined by the discriminant method?
# optimal number of clusters according to the discriminant method is 7
risomKest$Discriminant
## [1] 7
# we can plot the results using the optiplot function
pSlope <- optiPlot(risomKest, method = 'slope')
pSlope
From the plots, we see that the Jump
statistics almost
perfectly capture the number of clusters. The Gap
method is
a close second with 15 clusters. All
the other methods significantly underestimates the number of
clusters.
FuseSOM
Sinlge Cell Epxeriment object as input.The FuseSOM
algorithm is also equipped to take in a
SingleCellExperiment
object as input. The results of the
pipeline will be written to either the metada or the colData fields. See
below.
First we create a SingleCellExperiment
object
library(SingleCellExperiment)
# create a singelcellexperiment object
colDat <- risom_dat[, setdiff(colnames(risom_dat), risomMarkers)]
sce <- SingleCellExperiment(assays = list(counts = t(risom_dat)),
colData = colDat)
sce
## class: SingleCellExperiment
## dim: 23 69672
## metadata(0):
## assays(1): counts
## rownames(23): CD45 SMA ... CD44 CellType
## rowData names(0):
## colnames: NULL
## colData names(1): X
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
Next we pass it to the runFuseSOM()
function. Here, we
can provide the assay in which the data is stored and what name to store
the clusters under in the colData section. Note that the
Self Organizing Map
that is generated will be stored in the
metadata field.
risomRessce <- runFuseSOM(sce, markers = risomMarkers, assay = 'counts',
numClusters = 23, verbose = FALSE)
## You have provided a dataset of class SingleCellExperiment
## Everything looks good. Now running the FuseSOM algorithm
## Now Generating the Self Organizing Map Grid
## Optimal Grid Size is: 8
## Now Running the Self Organizing Map Model
## Now Clustering the Prototypes
## Now Mapping Clusters to the Original Data
## The Prototypes have been Clustered and Mapped Successfully
## The FuseSOM algorithm has completed successfully
## [1] "X" "clusters"
## [1] "SOM"
Notice how the there is now a clusters column in the colData and SOM field in the metadata. You can run this function again with a new set of cluster number. If you provide a new name for the clusters, it will be stored under that new column, else, it will overwrite the current clusters column. Running it again on the same object will overwrite the SOM field in the metadata.
Just like before, lets plot the heatmap of the resulting clusters across all markers.
FuseSOM
to estimate the number of clusters for
single cell experiment objectsJust like before, we can estimate the number of clusters
# lets estimate the number of clusters using all the methods
# original clustering has 23 clusters so we will set kseq from 2:25
# now we pass it a singlecellexperiment object instead of the som model as before
# this will return a singelcellexperiment object where the metatdata contains the
# cluster estimation information
risomRessce <- estimateNumCluster(data = risomRessce, kSeq = 2:25,
method = c("Discriminant", "Distance"))
## You have provided a dataset of class: SingleCellExperiment
## Now Computing the Number of Clusters using Discriminant Analysis
## Now Computing The Number Of Clusters Using Distance Analysis
## [1] "SOM" "clusterEstimation"
Notice how the metadata now contains a clusterEstimation
field which holds the results from the estimateNumCluster()
function
We can assess the results in a similar fashion as before
# what is the best number of clusters determined by the discriminant method?
# optimal number of clusters according to the discriminant method is 8
metadata(risomRessce)$clusterEstimation$Discriminant
## [1] 10
# we can plot the results using the optiplot function
pSlope <- optiPlot(risomRessce, method = 'slope')
## You have provided a dataset of class: SingleCellExperiment
## You have provided a dataset of class: SingleCellExperiment
## You have provided a dataset of class: SingleCellExperiment
## You have provided a dataset of class: SingleCellExperiment
## You have provided a dataset of class: SingleCellExperiment
Again, we see that the Jump
statistics almost perfectly
capture the number of clusters. The Gap
method is a close
second with 15 clusters. All the other
methods significantly underestimates the number of clusters.
FuseSOM
Spatial Epxeriment object as input.The methodology for Spatial Epxeriment
is exactly the
same as that of Single Cell Epxeriment
## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] SingleCellExperiment_1.27.2 SummarizedExperiment_1.35.5
## [3] Biobase_2.67.0 GenomicRanges_1.57.2
## [5] GenomeInfoDb_1.41.2 IRanges_2.39.2
## [7] S4Vectors_0.43.2 BiocGenerics_0.53.0
## [9] MatrixGenerics_1.17.1 matrixStats_1.4.1
## [11] FuseSOM_1.9.0 knitr_1.48
## [13] BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] mnormt_2.1.1 permute_0.9-7 rlang_1.1.4
## [4] magrittr_2.0.3 compiler_4.4.1 flexmix_2.3-19
## [7] mgcv_1.9-1 analogue_0.17-7 vctrs_0.6.5
## [10] stringr_1.5.1 pkgconfig_2.0.3 crayon_1.5.3
## [13] fastmap_1.2.0 backports_1.5.0 XVector_0.45.0
## [16] labeling_0.4.3 utf8_1.2.4 rmarkdown_2.28
## [19] UCSC.utils_1.1.0 purrr_1.0.2 coop_0.6-3
## [22] modeltools_0.2-23 xfun_0.48 zlibbioc_1.51.2
## [25] cachem_1.1.0 jsonlite_1.8.9 highr_0.11
## [28] DelayedArray_0.31.14 fpc_2.2-13 psych_2.4.6.26
## [31] prabclus_2.3-4 broom_1.0.7 parallel_4.4.1
## [34] cluster_2.1.6 R6_2.5.1 profileModel_0.6.1
## [37] bslib_0.8.0 stringi_1.8.4 RColorBrewer_1.1-3
## [40] car_3.1-3 diptest_0.77-1 jquerylib_0.1.4
## [43] Rcpp_1.0.13 nnet_7.3-19 Matrix_1.7-1
## [46] splines_4.4.1 tidyselect_1.2.1 abind_1.4-8
## [49] yaml_2.3.10 vegan_2.6-8 brglm_0.7.2
## [52] lattice_0.22-6 tibble_3.2.1 withr_3.0.2
## [55] evaluate_1.0.1 gridGraphics_0.5-1 proxy_0.4-27
## [58] kernlab_0.9-33 mclust_6.1.1 pillar_1.9.0
## [61] BiocManager_1.30.25 ggpubr_0.6.0 carData_3.0-5
## [64] generics_0.1.3 sp_2.1-4 ggplot2_3.5.1
## [67] munsell_0.5.1 scales_1.3.0 princurve_2.1.6
## [70] class_7.3-22 glue_1.8.0 pheatmap_1.0.12
## [73] maketools_1.3.1 tools_4.4.1 robustbase_0.99-4-1
## [76] sys_3.4.3 ggsignif_0.6.4 fastcluster_1.2.6
## [79] buildtools_1.0.0 fs_1.6.4 grid_4.4.1
## [82] tidyr_1.3.1 colorspace_2.1-1 nlme_3.1-166
## [85] GenomeInfoDbData_1.2.13 Formula_1.2-5 cli_3.6.3
## [88] DataVisualizations_1.3.2 FCPS_1.3.4 fansi_1.0.6
## [91] S4Arrays_1.5.11 dplyr_1.1.4 DEoptimR_1.1-3
## [94] gtable_0.3.6 rstatix_0.7.2 yulab.utils_0.1.7
## [97] sass_0.4.9 digest_0.6.37 SparseArray_1.5.45
## [100] ggplotify_0.1.2 farver_2.1.2 htmltools_0.5.8.1
## [103] lifecycle_1.0.4 httr_1.4.7 MASS_7.3-61