If you use ANF in published research, please cite:
Tianle Ma, Aidong Zhang, Integrate Multi-omic Data Using Affinity Network Fusion (ANF) for Cancer Patient Clustering, https://arxiv.org/abs/1708.07136
In the following, let’s first generate a synthetic dataset and use it for demonstrating the basic usage of ANF.
For complex objects (e.g., patients) with multi-view data, we can use a feature matrix representing each view. For example, gene expression matrix and miRNA expression matrix can represent two views of patients.
In the following, rows of a feature matrix represent objects, and columns represent features. Note each feature matrix contains a number of features from a feature space (corresponding one view). We can concatenate all features together as we will experiment later. However, since features from different feature spaces are usually heterogeneous, it may be a good idea to analyze them in its own feature space first, and then combine the results later. This is basically how ANF works.
For simplicity, let’s generate the first view (matrix
feature1
) of 200 samples: 100 samples for class 1 (the
first 100 rows of matrix feature1
), and 100 samples for
class 2 (the last 100 rows of matrix feature1
), using
multi-variate Gaussian distribution.
Let’s perform KMeans clustering. The Normalized Mutual Information (NMI) is only 0.26.
## [1] 0.2296436
Let’s perform spectral clustering using functions in ANF package. The NMI is 0.29, slightly higher than KMeans.
library(ANF)
d = dist(feature1)
d = as.matrix(d)
A1 = affinity_matrix(d, 10)
labels = spectral_clustering(A1, 2)
compare(labels, true.class, method='nmi')
## [1] 0.2614912
Similar to the first view, we can generate a second view (matrix
feature2
). The rows of feature1
and
feature2
have one-to-one correspondence.
Similarly, the NMI of KMeans clustering and spectral clustering are 0.14 (can be different because of random initialization) and 0.19 respectively.
## [1] 0.1084515
d = dist(feature2)
d = as.matrix(d)
A2 = affinity_matrix(d, 10)
labels = spectral_clustering(A2, 2)
compare(labels, true.class, method='nmi')
## [1] 0.3045581
HarmonizedTCGAData
package (https://github.com/BeautyOfWeb/HarmonizedTCGAData)
contains three R objects: Wall
, project_ids
and surv.plot
:
Wall
contains a complex list affinity matrices. In fact,
Wall
a list (five cancer type) of list (six feature
normalization types: raw.all
, raw.sel
,
log.all
, log.sel
, vst.sel
,
normalized
) of list (three feature spaces or views:
fpkm
, mirna
, and methy450
) of
matrices. The rownames of each matrix are case IDs (i.e., patient IDs),
and the column names of each matrix are aliquot IDs (which contains case
IDs as prefix).
project_ids
is a named character vector that maps the
case_id (represent a patient) to project_id (one-to-one corresponds to
disease type). This is used for evaluating clustering results, such as
calculating Normalized Mutual Information (NMI) and Adjusted Rand Index
(ARI).
surv.plot
is a data.frame containing patient survival
data for survival analysis, providing an “indirect” way to evaluate
clustering results.
HarmonizedTCGAData
package contains more details about
the above three data objects and simple examples of how to use them. We
suggest users to read the vignettes of HarmonizedTCGAData
first since it covers easier examples of using ANF
and
HarmonizedTCGAData
packages.
In the following, we are majorly focusing on reproducing the results
of the companion paper https://arxiv.org/abs/1708.07136 The code below may be a
little harder to follow than simply using ANF
package.
We can perform spectral clustering on a patient affinity matrix. Take adrend_gland cancer for example. We can cluseter patients using affinity matrix derived from log2 transformation of raw counts of differentially expressed genes.
affinity.mat <- Wall[["adrenal_gland"]][["log.sel"]][["fpkm"]]
labels <- spectral_clustering(affinity.mat, k = 2)
Since we know true disease types, which correspond to project ids in
project_ids
, we can calculate NMI and ARI.
true.disease.types <- as.factor(project_ids[rownames(affinity.mat)])
table(labels, true.disease.types)
## true.disease.types
## labels TCGA-ACC TCGA-PCPG
## 1 0 176
## 2 76 1
nmi <- igraph::compare(true.disease.types, labels, method = "nmi")
adjusted.rand = igraph::compare(true.disease.types, labels, method = "adjusted.rand")
# we can also calculate p-value using `surv.plot` data
surv.plot <- surv.plot[rownames(affinity.mat), ]
f <- survival::Surv(surv.plot$time, !surv.plot$censored)
fit <- survival::survdiff(f ~ labels)
pval <- stats::pchisq(fit$chisq, df = length(fit$n) - 1, lower.tail = FALSE)
message(paste("NMI =", nmi, ", ARI =", adjusted.rand, ", p-val =", pval))
In this package, We have provided a function eval_clu
that streamlines the above process from spectral clustering to
calculating NMI, ARI and p-value. Here is an example of how to use
eval_clu
:
## labels
## true_class 1 2
## TCGA-ACC 0 76
## TCGA-PCPG 176 1
For adrenal_gland cancer, we only misclassify one out of 253 patients using this affinity matrix. That is a pretty good result (In fact, this the best result we can achieve. Users can try using other matrices and compare the results). However, for many cases, using a single affinity matrix does a “terrible” job in clustering patients into correct disease types. Take uterus cancer for example (the NMI is near 0).
## labels
## true_class 1 2
## TCGA-UCEC 153 268
## TCGA-UCS 3 51
Instead of using one affinity matrix, we can “fuse” multiple affinity matrices using ANF, and then perform spectral clustering on the fused affinity matrix.
Let’s take uterus cancer for example.
# fuse three matrices: "fpkm" (gene expression), "mirnas" (miRNA expression) and "methy450" (DNA methylation)
fused.mat <- ANF(Wall = Wall$uterus$raw.all)
# Spectral clustering on fused patient affinity matrix
labels <- spectral_clustering(A = fused.mat, k = 2)
# Or we can directly evaluate clustering results using function `eval_clu`, which calls `spectral_clustering` and calculate NMI and ARI (and p-value if patient survival data is available. `surv.plot` does not contain information for uterus cancer patients)
res <- eval_clu(true_class = project_ids[rownames(fused.mat)], w = fused.mat)
## labels
## true_class 1 2
## TCGA-UCEC 410 11
## TCGA-UCS 14 40
Now NMI is 0.485. The clusering results is significantly better than using a single pateint affinity matrix. This demonstrates the power of ANF.
We have majorly used ANF to produce results in this paper: https://arxiv.org/abs/1708.07136 To reproduce the results, please refer to https://github.com/BeautyOfWeb/Clustering-TCGAFiveCancerTypes/blob/master/vignettes/ANF%20for%20Cancer%20Patient%20Clustering.Rmd (the last section).