Convex Analysis of Mixtures (CAM) is a fully unsupervised computational method to analyze tissue samples composed of unknown numbers and varying proportions of distinct subpopulations (Wang et al. 2016). CAM assumes that the measured expression level is the weighted sum of each subpopulation’s expression, where the contribution from a single subpopulation is proportional to the abundance and specific expression of that subpopulation. This linear mixing model can be formulated as X′ = AS′. CAM can identify molecular markers directly from the original mixed expression matrix, X, and further estimate the constituent proportion matrix, A, as well as subpopulation-specific expression profile matrix, S.
debCAM
is an R package developed for tissue
heterogeneity characterization by CAM algorithm. It provides basic
functions to perform unsupervised deconvolution on mixture expression
profiles by CAM and some auxiliary functions to help understand the
subpopulation-specific results. debCAM
also implements
functions to perform supervised deconvolution based on prior knowledge
of molecular markers, S matrix or A matrix. Semi-supervised
deconvolution can also be achieved by combining molecular markers from
CAM and from prior knowledge to analyze mixture expressions.
The function CAM()
includes all necessary steps to
decompose a matrix of mixture expression profiles. There are some
optional steps upstream of CAM()
that downsample the matrix
for reducing running time. Each step in CAM()
can also be
performed separately if you prefer a more flexible workflow. More
details will be introduced in sections below.
Starting your analysis by CAM()
, you need to specify the
range of possible subpopulation numbers and the percentage of
low/high-expressed molecules to be removed. Typically, 30% ~ 50%
low-expressed genes can be removed from gene expression data. Much less
low-expressed proteins are removed, e.g. 0% ~ 10%, due to a limited
number of proteins in proteomics data. The removal of high-expressed
molecules has much less impact on results, usually set to be 0% ~
10%.
Theoretically, debCAM
accepts any molecular expression
data types as long as the expressions follow the linear mixing model. We
have validated the feasibility of CAM in gene expression data
(microarray, RNAseq), proteomics data and DNA methylation data.
Requirements for the input expression data:
The input expression data should be stored in a matrix. Data frame, SummarizedExperiment or ExpressionSet object is also accepted and will be internally coerced into a matrix format before analysis. Each column in the matrix should be a tissue sample. Each row should be a probe/gene/protein/etc. Row names should be provided so that CAM can return the names of detected markers. Otherwise, rows will be automatically named by 1,2,3,…
We use a data set downsampled from GSE19830 as an example to show CAM workflow. This data set provides gene expression profiles of pure tissues (brain, liver, lung) and their biological mixtures with different proportions.
library(debCAM)
data(ratMix3)
#ratMix3$X: X matrix containing mixture expression profiles to be analyzed
#ratMix3$A: ground truth A matrix containing proportions
#ratMix3$S: ground truth S matrix containing subpopulation-specific expression profiles
data <- ratMix3$X
#10000 genes * 21 tissues
#meet the input data requirements
Unsupervised deconvolution can be achieved by the function
CAM()
with a simple setting as Section @ref(quick-start)
introduced. Other critical parameters are dim.rdc
(reduced
data dimension) and cluster.num
(number of clusters).
Increasing them will bring more time complexity. We can also specify
cores
for parallel computing configured by BiocParallel.
cores = 0
will disable parallel computing. No
cores
argument will invoke one core for each element in
K
. Setting the seed of the random number generator prior to
CAM can generate reproducible results.
set.seed(111) # set seed for internal clustering to generate reproducible results
rCAM <- CAM(data, K = 2:5, thres.low = 0.30, thres.high = 0.95)
#CAM return three sub results:
#rCAM@PrepResult contains details corresponding to data preprocessing.
#rCAM@MGResult contains details corresponding to marker gene clusters detection.
#rCAM@ASestResult contains details corresponding to A and S matrix estimation.
We used MDL, a widely-adopted and consistent information theoretic criterion, to guide model selection. The underlying subpopulation number can be decided by minimizing the total description code length:
The A and S matrix estimated by CAM with a fixed subpopulation number, e.g. K = 3, can be obtained by
The marker genes detected by CAM and used for A matrix estimation can be obtained by
Data preprocessing has filtered many genes, among which there are also some biologically-meaningful marker genes. So we need to check each gene again to find all the possible markers. Two statistics based on the subpopulation-specific expressions are exploited to identify marker genes with certain thresholds. The first is OVE-FC (one versus everyone - fold change) (Yu et al. 2010). The second is the lower confidence bound of bootstrapped OVE-FC at α level.
MGstat <- MGstatistic(data, Aest, boot.alpha = 0.05, nboot = 1000)
MGlist.FC <- lapply(seq_len(3), function(x)
rownames(MGstat)[MGstat$idx == x & MGstat$OVE.FC > 10])
MGlist.FCboot <- lapply(seq_len(3), function(x)
rownames(MGstat)[MGstat$idx == x & MGstat$OVE.FC.alpha > 10])
The thresholds above are set arbitrary and will impact the number of resulted markers significantly. Each subpopulation can also have different threshold. To make threshold setting more easier, we can also set thresholds to be quantile of fold changes and margin-of-errors of one subpopulation’s input markers. Margin-of-error is the distance between one gene and the simplex defined by column vectors of A matrix (Wang et al. 2016). Margin-of-error threshold can be relaxed to a value larger than the maximum.
It is optional to re-estimate A and S matrix based on the new marker list and/or apply Alternating Least Square (ALS) method to further reduce mean squared error. Note that allowing too many iterations of ALS may bring the risk of a significant deviation from initial values. The constraint for methylation data, S ∈ [0, 1], will be imposed during re-estimation.
Fundamental to the success of CAM is that the scatter simplex of
mixed expressions is a rotated and compressed version of the scatter
simplex of pure expressions, where the marker genes are located at each
vertex. simplexplot()
function can show the scatter simplex
and detected marker genes in a 2D plot. The vertices in the
high-dimensional simplex will still locate at extreme points of the
low-dimensional simplex.
layout(matrix(c(1,2,3,4), 2, 2, byrow = TRUE))
simplexplot(data, Aest, MGlist, main=c('Initially detected markers'))
simplexplot(data, Aest, MGlist.FC, main=c('fc > 10'))
simplexplot(data, Aest, MGlist.FCboot,
main=c(expression(bold(paste('fc(bootstrap,',alpha,'=0.05) > 10')))))
simplexplot(data, Aest, MGlist.re, main=c("fc >= 'q0.2', error <= 'q1.2'"))
The colors and the vertex order displayed in 2D plot can be changed as
simplexplot(data, Aest, MGlist.FCboot,
data.extra=rbind(t(ratMix3$A),t(Aest)),
corner.order = c(2,1,3), col = "blue",
mg.col = c("red","orange","green"),
ex.col = "black", ex.pch = c(19,19,19,17,17,17))
legend("bottomright", cex=1.2, inset=.01, c("Ground Truth","CAM Estimation"),
pch=c(19,17), col="black")
We can also observe the convex cone and simplex of mixture expressions in 3D space by PCA. Note that PCA cannot guarantee that vertices are still preserved as extreme points of the dimension-reduced simplex.
Code to show convex cone:
library(rgl)
Xp <- data %*% t(PCAmat(rCAM))
plot3d(Xp[, 1:3], col='gray', size=3,
xlim=range(Xp[,1]), ylim=range(Xp[,2:3]), zlim=range(Xp[,2:3]))
abclines3d(0,0,0, a=diag(3), col="black")
for(i in seq_along(MGlist)){
points3d(Xp[MGlist[[i]], 1:3], col= rainbow(3)[i], size = 8)
}
Code to show simplex:
library(rgl)
clear3d()
Xproj <- XWProj(data, PCAmat(rCAM))
Xp <- Xproj[,-1]
plot3d(Xp[, 1:3], col='gray', size=3,
xlim=range(Xp[,1:3]), ylim=range(Xp[,1:3]), zlim=range(Xp[,1:3]))
abclines3d(0,0,0, a=diag(3), col="black")
for(i in seq_along(MGlist)){
points3d(Xp[MGlist[[i]], 1:3], col= rainbow(3)[i], size = 8)
}
If the ground truth A and S matrix are available, the estimation from CAM can be evaluated:
cor(Aest, ratMix3$A)
#> Liver Brain Lung
#> [1,] -0.3963187 0.9839210 -0.4898373
#> [2,] 0.9869012 -0.5991347 -0.4865204
#> [3,] -0.4769864 -0.5245692 0.9856413
cor(Sest, ratMix3$S)
#> Liver Brain Lung
#> [1,] 0.5047310 0.9752287 0.5648873
#> [2,] 0.9732946 0.2578350 0.3488423
#> [3,] 0.4680238 0.5741921 0.9935207
Considering the presence of many co-expressed genes (housekeeping genes) may dominate the correlation coefficients between ground truth and estimated expression profiles, it is better to assess correlation coefficients over marker genes only.
There major steps in CAM (data preprocessing, marker gene cluster detection, and matrix decomposition) can also be performed separately as a more flexible choice.
set.seed(111)
#Data preprocession
rPrep <- CAMPrep(data, thres.low = 0.30, thres.high = 0.95)
#> outlier cluster number: 0
#> convex hull cluster number: 48
#Marker gene cluster detection with a fixed K
rMGC <- CAMMGCluster(3, rPrep)
#A and S matrix estimation
rASest <- CAMASest(rMGC, rPrep, data)
#Obtain A and S matrix
Aest <- Amat(rASest)
Sest <- Smat(rASest)
#obtain marker gene list detected by CAM and used for A estimation
MGlist <- MGsforA(PrepResult = rPrep, MGResult = rMGC)
#obtain a full list of marker genes
MGstat <- MGstatistic(data, Aest, boot.alpha = 0.05, nboot = 1000)
MGlist.FC <- lapply(seq_len(3), function(x)
rownames(MGstat)[MGstat$idx == x & MGstat$OVE.FC > 10])
MGlist.FCboot <- lapply(seq_len(3), function(x)
rownames(MGstat)[MGstat$idx == x & MGstat$OVE.FC.alpha > 10])
MGlist.re <- reselectMG(data, MGlist, fc.thres='q0.2', err.thres='q1.2')
We have implemented PCA in CAM()
/CAMPrep()
to reduce data dimensions. Sample clustering, as another data dimension
reduction method, is optional prior to
CAM()
/CAMPrep()
.
#clustering
library(apcluster)
apres <- apclusterK(negDistMat(r=2), t(data), K = 10)
#> Trying p = -2427050
#> Number of clusters: 9
#>
#> Number of clusters: 9 for p = -2427050
#You can also use apcluster(), but need to make sure the number of clusters is large than potential subpopulation number.
data.clusterMean <- lapply(slot(apres,"clusters"),
function(x) rowMeans(data[, x, drop = FALSE]))
data.clusterMean <- do.call(cbind, data.clusterMean)
set.seed(111)
rCAM <- CAM(data.clusterMean, K = 2:5, thres.low = 0.30, thres.high = 0.95)
# or rPrep <- CAMPrep(data.clusterMean, thres.low = 0.30, thres.high = 0.95)
We can still follow the workflow in @ref(cam-workflow) or @ref(alternative-cam-workflow) to obtain marker gene list and estimated S matrix. However, the estimated A matrix is for the new data composed of cluster centers. The A matrix for the original data can be obtained by
GSE11058 run four immune cell lines on microarrays as well as their mixtures of various relative proportions. The code chunk below shows how to use CAM to blindly separate the four mixtures into four pure cell lines. Note that this data set contains 54613 probe/probesets. We can reduce running time by downsampling probe/probesets or remove more low-expressed genes (e.g. 70%).
#download data and phenotypes
library(GEOquery)
gsm<- getGEO('GSE11058')
pheno <- pData(phenoData(gsm[[1]]))$characteristics_ch1
mat <- exprs(gsm[[1]])
mat <- mat[-grep("^AFFX", rownames(mat)),]
mat.aggre <- sapply(unique(pheno), function(x) rowMeans(mat[,pheno == x]))
data <- mat.aggre[,5:8]
#running CAM
set.seed(111)
rCAM <- CAM(data, K = 4, thres.low = 0.70, thres.high = 0.95)
Aest <- Amat(rCAM, 4)
Aest
#> [,1] [,2] [,3] [,4]
#> [1,] 0.3883538 0.2389400 0.1039654 0.26874070
#> [2,] 0.2171408 0.4881505 0.2075359 0.08717281
#> [3,] 0.3480010 0.1810097 0.4342633 0.03672609
#> [4,] 0.3795312 0.3204192 0.2752069 0.02484273
#Use ground truth A to validate CAM-estimated A matrix
Atrue <- matrix(c(2.50, 0.50, 0.10, 0.02,
1.25, 3.17, 4.95, 3.33,
2.50, 4.75, 1.65, 3.33,
3.75, 1.58, 3.30, 3.33), 4,4,
dimnames = list(c("MixA", "MixB", "MixC","MixD"),
c("Jurkat", "IM-9", "Raji", "THP-1")))
Atrue <- Atrue / rowSums(Atrue)
Atrue
#> Jurkat IM-9 Raji THP-1
#> MixA 0.250000000 0.1250000 0.2500000 0.3750000
#> MixB 0.050000000 0.3170000 0.4750000 0.1580000
#> MixC 0.010000000 0.4950000 0.1650000 0.3300000
#> MixD 0.001998002 0.3326673 0.3326673 0.3326673
cor(Aest, Atrue)
#> Jurkat IM-9 Raji THP-1
#> [1,] 0.2958875 -0.2006075 -0.7497038 0.98630126
#> [2,] -0.1976556 -0.1508022 0.9935143 -0.88675357
#> [3,] -0.7918492 0.9725408 -0.4427615 0.03629895
#> [4,] 0.9981488 -0.8746620 -0.1050113 0.31145418
GSE41826 conducted a reconstitution experiment by mixing neuron and glial derived DNA from a single individual from 10% ~ 90%. The code chunk below shows how to use CAM to blindly separate 9 mixtures into neuron and glial specific CpG methylation quantification. Additional requirements of running CAM on methylation data:
#download data
library(GEOquery)
gsm <- getGEO('GSE41826')
mixtureId <- unlist(lapply(paste0('Mix',seq_len(9)),
function(x) gsm[[1]]$geo_accession[gsm[[1]]$title==x]))
data <- gsm[[1]][,mixtureId]
#Remove CpG sites in sex chromosomes if tissues are from both males and females
#Not necessary in this example as mixtures are from the same subject
#gpl<- getGEO('GPL13534')
#annot<-dataTable(gpl)@table[,c('Name','CHR')]
#rownames(annot) <- annot$Name
#annot <- annot[rownames(data),]
#data <- data[annot$CHR != 'X' & annot$CHR != 'Y',]
#downsample CpG sites
featureId <- sample(seq_len(nrow(data)), 50000)
#Running CAM
#When input data has too many probes, lof has to be disabled due to space limit.
#When cluster.num is large, quick.select can be enable to increase the speed
rCAM <- CAM(data[featureId,], K = 2, thres.low = 0.10, thres.high = 0.60,
cluster.num = 100, MG.num.thres = 10,
lof.thres = 0, quick.select = 20)
MGlist <- MGsforA(rCAM, K = 2)
#Identify markers from all CpG sites
MGlist.re <- reselectMG(data, MGlist, fc.thres='q0.2', err.thres='q1.2')
#re-esitmation with methylation constraint
rre <- redoASest(data, MGlist.re, maxIter = 20, methy = TRUE)
#Validation using ground truth A matrix
Atrue <- cbind(seq(0.1, 0.9, 0.1), seq(0.9, 0.1, -0.1))
cor(rre$Aest, Atrue)
We can also run CAM on unmethylated quantification, 1 - beta, to obtain quite similar results since the underlying linear mixing model is also applicable for unmethylated probe intensity.
CAM algorithm can estimate A and S matrix based on blindly detected markers. Thus, we can also use part of CAM algorithm to estimate A and S matrix based on known markers.
The package provides AfromMarkers()
to estimate A matrix
from marker list.
Aest <- AfromMarkers(data, MGlist)
#MGlist is a list of vectors, each of which contains known markers for one subpopulation
Or we can use redoASest()
to estimate A and S matrix
from marker list and alternatively re-estimate two matrices by ALS to
further reduce mean squared error. Note that allowing too many
iterations of ALS may bring the risk of a significant deviation from
initial values. The constraint for methylation data, S ∈ [0, 1],
can be imposed during re-estimation.
Many datasets provide expression profiles for purified cell lines or
even every single cell, which can be treated as references of S matrix.
Some methods use least squares techniques or support vector regression
(Newman et al. 2015) to estimate A matrix
based on known S matrix. debCAM
will estimate A matrix by
identified markers from known S matrix, which has better performance in
terms of correlation coefficient with ground truth A matrix.
data <- ratMix3$X
S <- ratMix3$S #known S matrix
#Identify markers
pMGstat <- MGstatistic(S, c("Liver","Brain","Lung"))
pMGlist.FC <- lapply(c("Liver","Brain","Lung"), function(x)
rownames(pMGstat)[pMGstat$idx == x & pMGstat$OVE.FC > 10])
#Estimate A matrix
Aest <- AfromMarkers(data, pMGlist.FC)
#(Optional) alternative re-estimation
rre <- redoASest(data, pMGlist.FC, maxIter = 10)
debCAM
also supports estimating A matrix directly from
known S matrix using least squares method. Marker identification from S
matrix is still needed because only markers’ squared errors will be
counted which facilitates (1) faster computational running time and (2)
a greater signal-to-noise ratio owing to markers’ discriminatory power
(Newman et al. 2015).
With known A matrix, debCAM
estimates S matrix using
non-negative least squares (NNLS) from NMF, and further
identify markers.
data <- ratMix3$X
A <- ratMix3$A #known A matrix
#Estimate S matrix
Sest <- t(NMF::.fcnnls(A, t(data))$coef)
#Identify markers
pMGstat <- MGstatistic(data, A)
pMGlist.FC <- lapply(unique(pMGstat$idx), function(x)
rownames(pMGstat)[pMGstat$idx == x & pMGstat$OVE.FC > 10])
#(Optional) alternative re-estimation
rre <- redoASest(data, pMGlist.FC, A=A, maxIter = 10)
When prior information of markers, S matrix and/or A matrix is available, semi-supervised deconvolution can also be performed by combining markers from prior information and markers identified by CAM. While supervised deconvolution cannot handle the underlying subpopulations without prior information, unsupervised deconvolution may miss the subpopulation without enough discrimination power. Therefore, semi-supervised deconvolution can take advantage of both methods.
Aest <- AfromMarkers(data, MGlist)
#MGlist is a list of vectors, each of which contains known markers and/or CAM-detected markers for one subpopulation
Or