This guide provides an overview of the PIUMA 1 package, a comprehensive R package for performing Topological Data Analysis on high-dimensional datasets, such as -omics data.
Phenotyping is a process of characterizing and classifying individuals based on observable traits or phenotypic characteristics. In the context of medicine and biology, phenotyping involves the systematic analysis and measurement of various physical, physiological, and behavioral features of individuals, such as height, weight, blood pressure, biochemical markers, imaging data, and more. Phenotyping plays a crucial role in precision medicine as it provides essential data for understanding individual health characteristics and disease manifestations, by combining data from different sources to gain comprehensive insights into an individual’s health status and disease risk. This integrated approach allows for more accurate disease diagnosis, prognosis, and treatment selection. The same considerations could be also be extended in omics research, in which the expression values of thousands of genes and proteins, or the incidence of somatic and germline polymorphic variants are usually assessed to link molecular activities with the onset or the progression of diseases. In this field, phenotyping is needed to identify patterns and associations between phenotypic traits and a huge amount of available features. These analyses can uncover novel disease subtypes, identify predictive markers, and facilitate the development of personalized treatment strategies. In this context, the application of unsupervised learning methodologies could help the identification of specific phenotypes in huge heterogeneous cohorts, such as clinical or -omics data. Among them, the Topological Data Analysis (TDA) is a rapidly growing field that combines concepts from algebraic topology and computational geometry to analyze and extract meaningful information from complex and high-dimensional data sets (Carlsson 2009). Moreover, TDA is a robust and effective methodology that preserves the intrinsic characteristics of data and the mutual relationships among observations, by presenting complex data in a graph-based representation. Indeed, building topological models as networks, TDA allows complex diseases to be inspected in a continuous space, where subjects can ‘fluctuate’ over the graph, sharing, at the same time, more than one adjacent node of the network (Dagliati et al. 2020). Overall, TDA offers a powerful set of tools to capture the underlying topological features of data, revealing essential patterns and relationships that might be hidden from traditional statistical techniques (Casaclang-Verzosa et al. 2019).
PIUMA can be installed by:
We tested PIUMA on a subset of the single-cell RNA Sequencing dataset (GSE:GSE193346 generated and published by Feng et al. (2022) to demonstrate that distinct transcriptional profiles are present in specific cell types of each heart chambers, which were attributed to have roles in cardiac development (Feng et al. 2022). In this tutorial, our aim will be to exploit PIUMA for identifying sub-population of vascular endothelial cells, which can be associated with specific heart developmental stages. The original dataset consisted of three layers of heterogeneity: cell type, stage and zone (i.e., heart chamber). Our test dataset was obtained by subsetting vascular endothelial cells (cell type) by Seurat object, extracting raw counts and metadata. Thus, we filtered low expressed genes and normalized data by DaMiRseq :
#############################################
############# NOT TO EXECUTE ################
########## please skip this chunk ###########
#############################################
dataset_seu <- readRDS("./GSE193346_CD1_seurat_object.rds")
# subset vascular endothelial cells
vascularEC_seuobj <- subset(x = dataset_seu,
subset = markFinal == "vascular_ec")
df_data_counts <- vascularEC_seuobj@assays$RNA@counts
df_cl <- as.data.frame(df_data_counts)
meta_cl <- vascularEC_seuobj@meta.data[, c(10,13,14,15)]
meta_cl[sapply(meta_cl, is.character)] <- lapply(meta_cl[sapply(meta_cl,
is.character)],
as.factor)
## Filtering and normalization
colnames(meta_cl)[4] <- "class"
SE <- DaMiR.makeSE(df_cl, meta_cl)
data_norm <- DaMiR.normalization(SE,
type = "vst",
minCounts = 3,
fSample = 0.4,
hyper = "no")
vascEC_norm <- round(t(assay(data_norm)), 2)
vascEC_meta <- meta_cl[, c(3,4), drop=FALSE]
df_TDA <- cbind(vascEC_meta, vascEC_norm)
At the end, the dataset was composed of 1180 cells (observations) and 838 expressed genes (features). Moreover, 2 additional features are present in the metadata: ‘stage’ and ‘zone’. The first one describes the stage of heart development, while the second one refers to the heart chamber.
Users can directly import the testing dataset by:
library(PIUMA)
#> Loading required package: ggplot2
library(ggplot2)
data(vascEC_norm)
data(vascEC_meta)
df_TDA <- cbind(vascEC_meta, vascEC_norm)
dim(df_TDA)
#> [1] 1180 840
head(df_TDA[1:5, 1:7])
#> stage zone Rpl7 Dst Cox5b Eif5b Rpl31
#> AACCAACGTGGTACAG-a5k1 E15.5 RV 5.08 2.95 2.51 2.20 3.42
#> AACCATGAGGAAGTCC-a5k1 P3 LV 4.84 2.40 3.40 1.68 3.40
#> AACGAAAAGACCATAA-a5k1 P0 RV 4.42 2.43 2.61 2.33 2.61
#> AAGCGAGGTAGGAGGG-a5k1 P0 LV 3.74 2.52 2.52 2.01 2.52
#> AAGCGTTGTCTCGGAC-a5k1 E16.5 LV 4.99 3.31 2.33 2.33 3.62
The PIUMA package comes with a dedicated data structure to easily
store the information gathered from all the steps performed by a
Topological Data Analysis. As shown in the following cartoon, this
object, called TDAobj
, is an S4 class containing 9
slots:
orig_data
: data.frame
with the original
data (without outcomes)scaled_data
: data.frame
with re-scaled
data (without outcomes)outcomeFact
: data.frame
with the original
outcomesoutcome
: data.frame
with original outcomes
converted as numericcomp
: data.frame
containing the components
of projected datadist_mat
: data.frame
containing the
computed distance matrixdfMapper
: data.frame
containing the nodes,
with their elements,jacc
: matrix
of Jaccard indexes between
each pair of dfMapper
nodesnode_data_mat
: data.frame
with the node
size and the average valueThe makeTDAobj
function allows users to 1) generate the
TDAobj from a data.frame
, 2) select one or more variables
to be considered as outcome, and 3) perform the 0-1 scaling on the
remaining dataset:
For genomic data, such as RNA-Seq or scRNA-Seq, we have also developed a custom function to import a SummarizedExperiment object into PIUMA:
To perform TDA, some preliminary preprocessing steps have to be
carried out; specifically, the scaled data stored in
TDA_obj@scaled_data
, called point-cloud in TDA
jargon, has to be projected in a low dimensional space and transformed
in distance matrix, exploiting the dfToProjection
and
dfToDistance
functions, respectively. In this example, we
will use the umap as projection strategy, to obtain the
first 2 reduced dimensions (nComp = 2
) and the Euclidean
distance (distMethod = "euclidean"
) as distance metrics.
PIUMA allows setting 6 different projection strategies with their
specific arguments: UMAP
, TSNE
,
PCA
, MDS
, KPCA
, and
ISOMAP
and 3 types of well-known distance metrics are
available: Euclidean, Pearson’s correlation and the Gower’s distance (to
be preferred in case of categorical features are present). Users can
also use standard external functions both to implement the
low-dimensional reduction (e.g., the built-in
princomp
function) and to calculate distances
(e.g., the built-in dist
function).
set.seed(1)
# calculate the distance matrix
TDA_obj <- dfToDistance(TDA_obj, distMethod = "euclidean")
# calculate the projections (lenses)
TDA_obj <- dfToProjection(TDA_obj,
"UMAP",
nComp = 2,
umapNNeigh = 25,
umapMinDist = 0.3,
showPlot = FALSE)
# plot point-cloud based on stage and zone
df_plot <- as.data.frame(cbind(getOutcomeFact(TDA_obj),
getComp(TDA_obj)),
stringAsFactor = TRUE)
ggplot(data= df_plot, aes(x=comp1, y=comp2, color=stage))+
geom_point(size=3)+
facet_wrap(~zone)
As shown in Figure @ref(fig:chunk-4), the most of vascular endothelial cells are located in ventricles where, in turn, it is possible to more easily appreciate cell groups based on developmental stages.
One of the core algorithms in TDA is the TDA Mapper, which is designed to provide a simplified representation of the data’s topological structure, making it easier to interpret and analyze. The fundamental idea behind TDA Mapper is to divide the data into overlapping subsets called ‘clusters’ and, then, build a simplicial complex that captures the relationships between these clusters. This simplicial complex can be thought of as a network of points, edges, triangles, and higher-dimensional shapes that approximate the underlying topology of the data. The TDA Mapper algorithm proceeds through several consecutive steps:
TDA Mapper has been successfully applied to various domains, including biology, neuroscience, materials science, and more. Its ability to capture the underlying topological structure of data while being robust to noise and dimensionality makes it a valuable tool for gaining insights from complex datasets. PIUMA is thought to implement a 2-dimensional lens function and then apply one of the 4 well-known clustering algorithm: ‘k-means’, ‘hierarchical clustering’, DBSCAN or OPTICS.
TDA_obj <- mapperCore(TDA_obj,
nBins = 15,
overlap = 0.3,
clustMeth = "kmeans")
# number of clusters (nodes)
dim(getDfMapper(TDA_obj))
#> [1] 367 1
# content of two overlapping clusters
getDfMapper(TDA_obj)["node_102_cl_1", 1]
#> [1] "GCGGAAACAGAGTTGG-a5k1 GTGCAGCAGCTGGAGT-a25k TCAGTGACAGCTTTGA-a25k TGTTCTATCAAGCCGC-a25k"
getDfMapper(TDA_obj)["node_117_cl_1", 1]
#> [1] "CCACCATGTTGAGTCT-a5k1 GCCCAGAGTTGCTCAA-a5k1 GCGGAAACAGAGTTGG-a5k1 TCCAGAAGTGTTCCTC-a5k1 GTCCTCATCGGCTGAC-a25k TGCTCGTAGCCTGTGC-a25k"
Here, we decided to generated 15 bins (for each
dimension), each one overlapping by 30% with the
adjacent ones. The k-means algorithm is, then, applied
on the sample belonging to each ‘squared’ bin. In this example, the
Mapper aggregated samples in 369 partially overlapping clusters. Indeed,
as shown in the previous code chunk, the nodes
node_102_cl_1
and node_117_cl_1
shared 2 out
of 4 cells.
The output of mapper is a data.frame
, stored in the
dfMapper
slot, in which each row represents a group of
samples (here, a group of cells), called ’node’ in
network theory jargon. PIUMA allows the users to also generate a matrix
that specifies the similarity between nodes ‘edge’
allowing to represent the data as a network. Since the similarity, in
this context, consists of the number of samples, shared by nodes, PIUMA
implements a function (jaccardMatrix
) to calculate the
Jaccard’s index between each pairs of nodes.
# Jaccard Matrix
TDA_obj <- jaccardMatrix(TDA_obj)
head(round(getJacc(TDA_obj)[1:5,1:5],3))
#> node_3_cl_1 node_4_cl_1 node_4_cl_2 node_5_cl_1 node_5_cl_2
#> node_3_cl_1 NA NA 0.167 NA NA
#> node_4_cl_1 NA NA NA NA NA
#> node_4_cl_2 0.167 NA NA NA 0.308
#> node_5_cl_1 NA NA NA NA NA
#> node_5_cl_2 NA NA 0.308 NA NA
round(getJacc(TDA_obj)["node_102_cl_1","node_117_cl_1"],3)
#> [1] 0.111
Regarding the similarity matrix, we obtained a Jaccard matrix where
each clusters’ pair was compared; looking, for example, at the Jaccard
Index for nodes node_102_cl_1
and
node_117_cl_1
, we correctly got 0.5 (2/4 cells). Moreover,
the tdaDfEnrichment
function allows inferring the features
values for the generated nodes, by returning the averaged variables
values of samples belonging to specific nodes. Generally, this step is
called ‘Node Enrichment’. In addition the size of each node is also
appended to the output data.frame
(the last column name is
‘size’).
TDA_obj <- tdaDfEnrichment(TDA_obj,
cbind(getScaledData(TDA_obj),
getOutcome(TDA_obj)))
head(getNodeDataMat(TDA_obj)[1:5, tail(names(getNodeDataMat(TDA_obj)), 5)])
#> mt.Nd5 mt.Cytb stage zone size
#> node_3_cl_1 0.195 0.536 11.000 3.000 1
#> node_4_cl_1 0.240 0.720 13.000 1.000 1
#> node_4_cl_2 0.257 0.634 14.167 1.667 6
#> node_5_cl_1 0.988 0.999 13.000 1.000 1
#> node_5_cl_2 0.279 0.654 15.273 1.545 11
Printing the last 5 columns of the data.frame
returned
by tdaDfEnrichment
(node_data_mat
slot), we
can show the averaged expression values of each nodes for 4
mitochondrial genes as well as the number of samples belonging to the
nodes.
TDA requires several parameters to be set, such as the type of projection algorithm, the distance metrics, the number of overlapping bins, the percentage of overlap, and the clustering techniques. Moreover, often, specific algorithm parameters need to be chosen, such as the number of UMAP neighbors. This means that TDA analysis should be repeated several times, varying the hyperparameters to find the suitable combination (‘grid search’ approach) and an evaluation metrics is needed to assess each result. PIUMA implements two different strategies to ass
supervised approach
, usually called
‘anchoring’, in which the entropy of the network
generated by TDA is calculated averaging the entropies of each node
using one single outcome as class (i.e., ‘anchor’). The lower
the entropy, the better the network.unsupervised approach
that exploits a topological
measurement to force the network to be scale-free. Scale-free networks
are characterized by few highly connected nodes (hub nodes) and many
poorly connected nodes (leaf nodes). Scale-free networks follows a
power-law degree distribution in which the probability that a node has k
links follows P(k) ∼ k−γ,
where k is a node degree
(i.e., the number of its connections), γ is a degree exponent, and P(k) is the frequency of
nodes with a specific degree.Degree exponents between 2 < γ < 3 have been observed
in most biological and social networks. Forcing our network to be
scale-free ensures to unveil communities in our data. The higher the
correlation between P(k) and k, in log-log scale, the better the
network.# Anchoring (supervised)
entropy <- checkNetEntropy(getNodeDataMat(TDA_obj)[, "zone"])
entropy
#> [1] 1.316
# Scale free network (unsupervised)
netModel <- checkScaleFreeModel(TDA_obj, showPlot = TRUE)
#> Scale for x is already present.
#> Adding another scale for x, which will replace the existing scale.
#> `geom_smooth()` using formula = 'y ~ x'
#> `geom_smooth()` using formula = 'y ~ x'
netModel
#> $gamma
#> [1] 1.804252
#>
#> $corkpk
#> [1] -0.7102301
#>
#> $pValkpk
#> [1] 0.009647595
#>
#> $corlogklogpk
#> [1] -0.5955781
#>
#> $pVallogklogpk
#> [1] 0.04101903
In this example, we tested both the approaches even if the
unsupervised one is preferable as no prior knowledge (i.e.,
outcome) is needed to assess the network. We got a global entropy of 1.3
and correlations between P(k) and k equal to -0.75 and -0.58, with data
in linear scale or log-log scale, respectively. PIUMA provides users
also with the γ value, so that
it is easy to assess if the network is scale-free (2 < γ < 3) or not. In this
case, γ is equal to 2.09,
meaning that the network can be considered scale-free. Overall, the
entropy, the correlation between P(k) and k (linear or log-log scale),
and/or the γ value, provided
by checkNetEntropy
and checkScaleFreeModel
,
could be used to compare different sets of hyper-parameters, such as
different lenses, space reduction algorithms and Mapper arguments.
Cytoscape is a well-known tool to handle, process and analyze
networks (Shannon et al. 2003). Two files
are needed to generate and enrich network in Cytoscape: the jaccard
Matrix (TDA_obj@jacc
), to generate the structure of the
network (nodes and edges) and a data.frame
with additional
nodes information to enrich the network
(TDA_obj@node_data_mat
):
write.table(x = round(getJacc(TDA_obj),3),
file = "./jaccard.matrix.txt",
sep = "\t",
quote = FALSE,
na = "",
col.names = NA)
write.table(x = getNodeDataMat(TDA_obj),
file = "./nodeEnrichment.txt",
sep = "\t",
quote = FALSE,
col.names = NA)
To explore the network resulted following the PIUMA framework, we
imported jaccard.matrix.txt
in Cytoscape by the
aMatReader plugin (Settle et al.
2018) (PlugIn -> aMatReader -> Import Matrix file)
while nodeEnrichment.txt
by File -> Import ->
Table from File. Then, we identified network communities by the
GLay cluster function from the ‘clustermaker2’ plugin (Utriainen and Morris 2023).
As shown in Figure 3, using the transcriptome of vascular endothelial cells, it is possible to identify 11 communities of cells (top-right). Interestingly, some of them are in the same developmental stage (top-left). Moreover, there are clusters showing similar expression for some genes but different expression for other genes, suggesting that the sub-population could have a different biological function.For example, orange and yellow clusters have a similar average expression of Igfpb7 (bottom-right) but different expression level of Aprt (bottom-left).
sessionInfo()
#> R version 4.4.2 (2024-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] PIUMA_1.3.0 ggplot2_3.5.1 BiocStyle_2.35.0
#>
#> loaded via a namespace (and not attached):
#> [1] farver_2.1.2 fastmap_1.2.0
#> [3] digest_0.6.37 rpart_4.1.23
#> [5] lifecycle_1.0.4 cluster_2.1.8
#> [7] magrittr_2.0.3 kernlab_0.9-33
#> [9] dbscan_1.2-0 compiler_4.4.2
#> [11] rlang_1.1.4 Hmisc_5.2-1
#> [13] sass_0.4.9 tools_4.4.2
#> [15] igraph_2.1.2 yaml_2.3.10
#> [17] data.table_1.16.4 knitr_1.49
#> [19] labeling_0.4.3 S4Arrays_1.7.1
#> [21] askpass_1.2.1 htmlwidgets_1.6.4
#> [23] DelayedArray_0.33.3 reticulate_1.40.0
#> [25] abind_1.4-8 withr_3.0.2
#> [27] foreign_0.8-87 BiocGenerics_0.53.3
#> [29] sys_3.4.3 nnet_7.3-19
#> [31] grid_4.4.2 stats4_4.4.2
#> [33] colorspace_2.1-1 scales_1.3.0
#> [35] MASS_7.3-61 SummarizedExperiment_1.37.0
#> [37] cli_3.6.3 crayon_1.5.3
#> [39] rmarkdown_2.29 vegan_2.6-8
#> [41] generics_0.1.3 umap_0.2.10.0
#> [43] rstudioapi_0.17.1 RSpectra_0.16-2
#> [45] httr_1.4.7 cachem_1.1.0
#> [47] stringr_1.5.1 zlibbioc_1.52.0
#> [49] splines_4.4.2 parallel_4.4.2
#> [51] XVector_0.47.0 BiocManager_1.30.25
#> [53] matrixStats_1.4.1 base64enc_0.1-3
#> [55] vctrs_0.6.5 Matrix_1.7-1
#> [57] jsonlite_1.8.9 IRanges_2.41.2
#> [59] patchwork_1.3.0 S4Vectors_0.45.2
#> [61] Formula_1.2-5 htmlTable_2.4.3
#> [63] maketools_1.3.1 jquerylib_0.1.4
#> [65] glue_1.8.0 tsne_0.1-3.1
#> [67] stringi_1.8.4 gtable_0.3.6
#> [69] GenomeInfoDb_1.43.2 GenomicRanges_1.59.1
#> [71] UCSC.utils_1.3.0 munsell_0.5.1
#> [73] tibble_3.2.1 pillar_1.10.0
#> [75] htmltools_0.5.8.1 openssl_2.3.0
#> [77] GenomeInfoDbData_1.2.13 R6_2.5.1
#> [79] Biobase_2.67.0 evaluate_1.0.1
#> [81] lattice_0.22-6 png_0.1-8
#> [83] backports_1.5.0 bslib_0.8.0
#> [85] Rcpp_1.0.13-1 SparseArray_1.7.2
#> [87] gridExtra_2.3 nlme_3.1-166
#> [89] permute_0.9-7 checkmate_2.3.2
#> [91] mgcv_1.9-1 xfun_0.49
#> [93] MatrixGenerics_1.19.0 buildtools_1.0.0
#> [95] pkgconfig_2.0.3
PIUMA is the italian word for feather↩︎