Title: | Deconvolution by Convex Analysis of Mixtures |
---|---|
Description: | An R package for fully unsupervised deconvolution of complex tissues. It provides basic functions to perform unsupervised deconvolution on mixture expression profiles by Convex Analysis of Mixtures (CAM) and some auxiliary functions to help understand the subpopulation-specific results. It also implements functions to perform supervised deconvolution based on prior knowledge of molecular markers, S matrix or A matrix. Combining molecular markers from CAM and from prior knowledge can achieve semi-supervised deconvolution of mixtures. |
Authors: | Lulu Chen <[email protected]> |
Maintainer: | Lulu Chen <[email protected]> |
License: | GPL-2 |
Version: | 1.25.0 |
Built: | 2024-12-14 03:58:30 UTC |
Source: | https://github.com/bioc/debCAM |
The core function in this package is CAM
which achieves fully
unsupervised deconvolution on mixture expression profiles.
Each step in CAM
can also be performed separately by
CAMPrep
, CAMMGCluster
and CAMASest
in a more flexible workflow.
MGstatistic
can help extract a complete marker list from CAM
results. MDL
can help decide the underlying subpopulation
number. With other functions, e.g. AfromMarkers
and
MGstatistic
, this package can also perform supervised
deconvolution based on prior knowledge of molecular markers,
subpopulation-specific expression matrix (S) or proportion matrix (A).
Semi-supervised deconvolution can be achieved by combining molecular markers
from CAM and from prior knowledge to analyze mixture expressions.
Wang, N., Hoffman, E. P., Chen, L., Chen, L., Zhang, Z., Liu, C., … Wang, Y. (2016). Mathematical modelling of transcriptional heterogeneity identifies novel markers and subpopulations in complex tissues. Scientific Reports, 6, 18909. http://doi.org/10.1038/srep18909
This function estimates proportion matrix (A matrix) from observed mixture expression data based on marker genes.
AfromMarkers(data, MGlist, scaleRecover = TRUE)
AfromMarkers(data, MGlist, scaleRecover = TRUE)
data |
A data set that will be internally coerced into a matrix. Each row is a gene and each column is a sample. data should be in non-log linear space with non-negative numerical values (i.e. >= 0). Missing values are not supported. All-zero rows will be removed internally. |
MGlist |
A list of vectors, each of which contains known markers and/or CAM-detected markers for one subpopulation. |
scaleRecover |
If TRUE, scale ambiguity of each column vector in A matrix is removed based on sum-to-one constraint on each row. |
With the expression levels of subpopulation-specific
marker genes, the relative proportions of constituent subpopulations are
estimated by spatial median using l1median
.
Marker genes could be from unsupervised/supervised detection or
from literatures.
Scale ambiguity is optionally removed based on sum-to-one constraint of rows.
Return the estimated proportion matrix (A matrix).
#obtain data and marker genes data(ratMix3) S <- ratMix3$S pMGstat <- MGstatistic(S, c("Liver","Brain","Lung")) pMGlist.FC <- lapply(c("Liver","Brain","Lung"), function(x) rownames(pMGstat)[pMGstat$idx == x & pMGstat$OVE.FC > 10]) #estimate A matrix from markers Aest <- AfromMarkers(ratMix3$X, pMGlist.FC)
#obtain data and marker genes data(ratMix3) S <- ratMix3$S pMGstat <- MGstatistic(S, c("Liver","Brain","Lung")) pMGlist.FC <- lapply(c("Liver","Brain","Lung"), function(x) rownames(pMGstat)[pMGstat$idx == x & pMGstat$OVE.FC > 10]) #estimate A matrix from markers Aest <- AfromMarkers(ratMix3$X, pMGlist.FC)
Accessors to proportion matrix and subpopulation-specific expression matrix estimated by CAM.
Amat(x, ...) Smat(x, ...) ## S4 method for signature 'CAMObj' Amat(x, k, usingPCA = TRUE) ## S4 method for signature 'CAMASObj' Amat(x, usingPCA = TRUE) ## S4 method for signature 'CAMObj' Smat(x, k, usingPCA = TRUE) ## S4 method for signature 'CAMASObj' Smat(x, usingPCA = TRUE)
Amat(x, ...) Smat(x, ...) ## S4 method for signature 'CAMObj' Amat(x, k, usingPCA = TRUE) ## S4 method for signature 'CAMASObj' Amat(x, usingPCA = TRUE) ## S4 method for signature 'CAMObj' Smat(x, k, usingPCA = TRUE) ## S4 method for signature 'CAMASObj' Smat(x, usingPCA = TRUE)
x |
|
... |
additional argument list. |
k |
subpopulation number |
usingPCA |
If TRUE, A matrix is estimated by transforming dimension-reduced A matrix back to original space. Otherwise, A matrix is directly estimated in original data space. The default is TRUE. |
Estimated A matrix or S matrix.
#obtain data data(ratMix3) data <- ratMix3$X rCAM <- CAM(data, K = 3, dim.rdc = 3, thres.low = 0.30, thres.high = 0.95) Aest <- Amat(rCAM, 3) Sest <- Smat(rCAM, 3) Aest <- Amat(slot(rCAM, "ASestResult")[[1]]) Sest <- Smat(slot(rCAM, "ASestResult")[[1]])
#obtain data data(ratMix3) data <- ratMix3$X rCAM <- CAM(data, K = 3, dim.rdc = 3, thres.low = 0.30, thres.high = 0.95) Aest <- Amat(rCAM, 3) Sest <- Smat(rCAM, 3) Aest <- Amat(slot(rCAM, "ASestResult")[[1]]) Sest <- Smat(slot(rCAM, "ASestResult")[[1]])
This function performs a fully unsupervised computational deconvolution to identify marker genes that define each of the multiple subpopulations, and estimate the proportions of these subpopulations in the mixture tissues as well as their respective expression profiles.
CAM(data, K = NULL, corner.strategy = 2, dim.rdc = 10, thres.low = 0.05, thres.high = 0.95, cluster.method = c("K-Means", "apcluster"), cluster.num = 50, MG.num.thres = 20, lof.thres = 0.02, quickhull = TRUE, quick.select = NULL, sample.weight = NULL, appro3 = TRUE, generalNMF = FALSE, cores = NULL)
CAM(data, K = NULL, corner.strategy = 2, dim.rdc = 10, thres.low = 0.05, thres.high = 0.95, cluster.method = c("K-Means", "apcluster"), cluster.num = 50, MG.num.thres = 20, lof.thres = 0.02, quickhull = TRUE, quick.select = NULL, sample.weight = NULL, appro3 = TRUE, generalNMF = FALSE, cores = NULL)
data |
Matrix of mixture expression profiles. Data frame, SummarizedExperiment or ExpressionSet object will be internally coerced into a matrix. Each row is a gene and each column is a sample. Data should be in non-log linear space with non-negative numerical values (i.e. >= 0). Missing values are not supported. All-zero rows will be removed internally. |
K |
The candidate subpopulation number(s), e.g. K = 2:8. |
corner.strategy |
The method to find corners of convex hull. 1: minimum sum of margin-of-errors; 2: minimum sum of reconstruction errors. The default is 2. |
dim.rdc |
Reduced data dimension; should be not less than maximum candidate K. |
thres.low |
The lower bound of percentage of genes to keep for CAM with ranked norm. The value should be between 0 and 1. The default is 0.05. |
thres.high |
The higher bound of percentage of genes to keep for CAM with ranked norm. The value should be between 0 and 1. The default is 0.95. |
cluster.method |
The method to do clustering.
The default "K-Means" will use |
cluster.num |
The number of clusters; should be much larger than K. The default is 50. |
MG.num.thres |
The clusters with the gene number smaller than MG.num.thres will be treated as outliers. The default is 20. |
lof.thres |
Remove local outlier using |
quickhull |
Perform quickhull to select clusters or not. The default is True. |
quick.select |
The number of candidate corners kept after quickhull and SFFS greedy search. If Null, only quickhull is applied. The default is 20. If this value is larger than the number of candidate corners after quickhull, greedy search will also not be applied. |
sample.weight |
Vector of sample weights. If NULL, all samples have the same weights. The length should be the same as sample numbers. All values should be positive. |
appro3 |
Estimate A and S matrix by approach 3 or not. Please see
|
generalNMF |
If TRUE, the decomposed proportion matrix has no sum-to-one constraint for each row. The default is FALSE. TRUE value brings two changes: (1) Without assuming samples are normalized, the first principal component will not forced to be along c(1,1,..,1) but a standard PCA will be applied during preprocessing. (2) Without sum-to-one constraint for each row, the scale ambiguity of each column vector in proportion matrix will not be removed. |
cores |
The number of system cores for parallel computing. If not provided, one core for each element in K will be invoked. Zero value will disable parallel computing. |
This function includes three necessary steps to decompose a matrix
of mixture expression profiles: data preprocessing, marker gene cluster
search, and matrix decomposition. They are implemented in
CAMPrep
, CAMMGCluster
and
CAMASest
, separately.
More details can be found in the help document of each function.
For this function, you needs to specify the range of possible subpopulation numbers and the percentage of low/high-expressed genes to be removed. Typically, 30% ~ 50% low-expressed genes can be removed from gene expression data. The removal of high-expressed genes has much less impact on results, and usually set to be 0% ~ 10%.
This function can also analyze other molecular expression data, such as proteomics data. Much less low-expressed proteins need to be removed, e.g. 0% ~ 10%, due to a limited number of proteins without missing values.
An object of class "CAMObj
" containing the following
components:
PrepResult |
An object of class " |
MGResult |
A list of " |
ASestResult |
A list of " |
#obtain data data(ratMix3) data <- ratMix3$X #set seed to generate reproducible results set.seed(111) #CAM with known subpopulation number rCAM <- CAM(data, K = 3, dim.rdc = 3, thres.low = 0.30, thres.high = 0.95) #Larger dim.rdc can improve performance but increase time complexity ## Not run: #CAM with a range of subpopulation number rCAM <- CAM(data, K = 2:5, dim.rdc = 10, thres.low = 0.30, thres.high = 0.95) #Use "apcluster" to aggregate gene vectors in CAM rCAM <- CAM(data, K = 2:5, dim.rdc = 10, thres.low = 0.30, thres.high = 0.95, cluster.method = 'apcluster') #CAM with quick selection to reduce time complexity rCAM <- CAM(data, K = 3, dim.rdc = 10, thres.low = 0.30, thres.high = 0.95, quick.select = 20) #CAM with different sample weights (e.g. adjusted based on sample quality) rCAM <- CAM(data, K = 3, dim.rdc = 5, thres.low = 0.30, thres.high = 0.95, sample.weight = c(rep(10,11),rep(1,10))) #CAM for general NMF (no sum-to-one contraint for proportion matrix) rCAM <- CAM(data, K = 3, dim.rdc = 5, thres.low = 0.30, thres.high = 0.95, generalNMF = TRUE) ## End(Not run)
#obtain data data(ratMix3) data <- ratMix3$X #set seed to generate reproducible results set.seed(111) #CAM with known subpopulation number rCAM <- CAM(data, K = 3, dim.rdc = 3, thres.low = 0.30, thres.high = 0.95) #Larger dim.rdc can improve performance but increase time complexity ## Not run: #CAM with a range of subpopulation number rCAM <- CAM(data, K = 2:5, dim.rdc = 10, thres.low = 0.30, thres.high = 0.95) #Use "apcluster" to aggregate gene vectors in CAM rCAM <- CAM(data, K = 2:5, dim.rdc = 10, thres.low = 0.30, thres.high = 0.95, cluster.method = 'apcluster') #CAM with quick selection to reduce time complexity rCAM <- CAM(data, K = 3, dim.rdc = 10, thres.low = 0.30, thres.high = 0.95, quick.select = 20) #CAM with different sample weights (e.g. adjusted based on sample quality) rCAM <- CAM(data, K = 3, dim.rdc = 5, thres.low = 0.30, thres.high = 0.95, sample.weight = c(rep(10,11),rep(1,10))) #CAM for general NMF (no sum-to-one contraint for proportion matrix) rCAM <- CAM(data, K = 3, dim.rdc = 5, thres.low = 0.30, thres.high = 0.95, generalNMF = TRUE) ## End(Not run)
This function estimates A and S matrix based on marker gene clusters detected by CAM.
CAMASest(MGResult, PrepResult, data, corner.strategy = 2, appro3 = TRUE, generalNMF = FALSE)
CAMASest(MGResult, PrepResult, data, corner.strategy = 2, appro3 = TRUE, generalNMF = FALSE)
MGResult |
An object of class " |
PrepResult |
An object of class " |
data |
Matrix of mixture expression profiles which need to be the same
as the input of |
corner.strategy |
The method to detect corner clusters. 1: minimum sum of margin-of-errors; 2: minimum sum of reconstruction errors. The default is 2. |
appro3 |
Estimate A and S matrix by approach 3 or not. Please see details for further information. The default is TRUE. |
generalNMF |
If TRUE, the decomposed proportion matrix has no sum-to-one constraint for each row. Without this constraint, the scale ambiguity of each column vector in proportion matrix will not be removed. The default is FALSE. |
This function is used internally by CAM
function to
estimate proportion matrix (A), subpopulation-specific expression matrix (S)
and mdl values. It can also be used when you want to perform CAM step by
step.
The mdl values are calculated in three approaches: (1) based on data and A matrix in dimension-reduced space; (2) based on original data with A matrix estimated by transforming dimension-reduced A matrix back to original space; (3) based on original data with A directly estimated in original space. A and S matrix in original space estimated from the latter two approaches are returned. mdl is the sum of two terms: code length of data under the model and code length of model. Both mdl value and the first term (code length of data) will be returned.
An object of class "CAMASObj
" containing the
following components:
Aest |
Estimated proportion matrix from Approach 2. |
Sest |
Estimated subpopulation-specific expression matrix from Approach 2. |
Aest.proj |
Estimated proportion matrix from Approach 2, before removing scale ambiguity. |
Ascale |
The estimated scales to remove scale ambiguity of each column vector in Aest. Sum-to-one constraint on each row of Aest is used for scale estimation. |
AestO |
Estimated proportion matrix from Approach 3. |
SestO |
Estimated subpopulation-specific expression matrix from Approach 3. |
AestO.proj |
Estimated proportion matrix from Approach 3, before removing scale ambiguity. |
AscaleO |
The estimated scales to remove scale ambiguity of each column vector in AestO. Sum-to-one constraint on each row of AestO is used for scale estimation. |
datalength |
Three values for code length of data. The first is calculated based on dimension-reduced data. The second and third are based on the original data. |
mdl |
Three mdl values. The first is calculated based on dimension-reduced data. The second and third are based on the original data. |
#obtain data data(ratMix3) data <- ratMix3$X #preprocess data rPrep <- CAMPrep(data, dim.rdc = 3, thres.low = 0.30, thres.high = 0.95) #Marker gene cluster detection with a fixed K rMGC <- CAMMGCluster(3, rPrep) #A and S matrix estimation rASest <- CAMASest(rMGC, rPrep, data)
#obtain data data(ratMix3) data <- ratMix3$X #preprocess data rPrep <- CAMPrep(data, dim.rdc = 3, thres.low = 0.30, thres.high = 0.95) #Marker gene cluster detection with a fixed K rMGC <- CAMMGCluster(3, rPrep) #A and S matrix estimation rASest <- CAMASest(rMGC, rPrep, data)
An S4 class for storing estimated proportions, subpopulation-specific expressions and mdl values. The mdl values are calculated in three approaches: (1) based on data and A matrix in dimension-reduced space; (2) based on original data with A matrix estimated by transforming dimension-reduced A matrix back to original space; (3) based on original data with A directly estimated in original space. A and S matrix in original space estimated from the latter two approaches are returned. mdl is the sum of two terms: code length of data under the model and code length of model. Both mdl value and the first term (code length of data) will be returned.
Aest
Estimated proportion matrix from Approach 2.
Sest
Estimated subpopulation-specific expression matrix from Approach 2.
Aest.proj
Estimated proportion matrix from Approach 2, before removing scale ambiguity.
Ascale
The estimated scales to remove scale ambiguity of each column vector in Aest. Sum-to-one constraint on each row of Aest is used for scale estimation.
AestO
Estimated proportion matrix from Approach 3.
SestO
Estimated subpopulation-specific expression matrix from Approach 3.
AestO.proj
Estimated proportion matrix from Approach 3, before removing scale ambiguity.
AscaleO
The estimated scales to remove scale ambiguity of each column vector in AestO. Sum-to-one constraint on each row of AestO is used for scale estimation.
datalength
Three values for code length of data. The first is calculated based on dimension-reduced data. The second and third are based on the original data.
mdl
Three mdl values. The first is calculated based on dimension-reduced data. The second and third are based on the original data.
This function finds corner clusters as MG clusters (clusters containing marker genes).
CAMMGCluster(K, PrepResult, generalNMF = FALSE, nComb = 200)
CAMMGCluster(K, PrepResult, generalNMF = FALSE, nComb = 200)
K |
The candidate subpopulation number. |
PrepResult |
An object of class " |
generalNMF |
If TRUE, the decomposed proportion matrix has no sum-to-one constraint for each row. Without this constraint, the scale ambiguity of corner cluster centers will not be removed when computing reconstruction errors. The default is FALSE. |
nComb |
The number of possible combinations of clusters as corner clusters. Within these possible combinations ranked by margin errors, we can further select the best one based on reconstruction errors. The default is 200. |
This function is used internally by CAM
function to
detect clusters containing marker genes,
or used when you want to perform CAM step by step.
This function provides two solutions. The first is the combination of clusters yielding the minimum sum of margin-of-errors for cluster centers. In the second, nComb possible combinations are selected by ranking sum of margin-of-errors for cluster centers. Then the best one is selected based on reconstruction errors of all data points in original space.
An object of class "CAMMGObj
" containing the
following components:
idx |
Two numbers which are two solutions' ranks by sum of margin-of-error. |
corner |
The indexes of clusters as detected corners. Each row is a solution. |
error |
Two rows. The first row is sum of margin-of-errors for nComb possible combinations. The second row is reconstruction errors for nComb possible combinations. |
#obtain data data(ratMix3) data <- ratMix3$X #preprocess data rPrep <- CAMPrep(data, dim.rdc = 3, thres.low = 0.30, thres.high = 0.95) #Marker gene cluster detection with a fixed K = 3 rMGC <- CAMMGCluster(3, rPrep)
#obtain data data(ratMix3) data <- ratMix3$X #preprocess data rPrep <- CAMPrep(data, dim.rdc = 3, thres.low = 0.30, thres.high = 0.95) #Marker gene cluster detection with a fixed K = 3 rMGC <- CAMMGCluster(3, rPrep)
An S4 class for storing marker gene detection results.
idx
Two numbers which are two solutions' ranks by sum of margin-of-error.
corner
The indexes of clusters as detected corners. Each row is a solution.
error
Two rows. The first row is sum of margin-of-errors for nComb possible combinations. The second row is reconstruction errors for nComb possible combinations.
An S4 class for storing results of CAM.
PrepResult
An object of class "CAMPrepObj
" storing data
preprocessing results from CAMPrep
function.
MGResult
A list of "CAMMGObj
" objects
storing marker gene detection
results from CAMMGCluster
function for each candidate
subpopulation number.
ASestResult
A list of "CAMASObj
" objects storing
estimated proportions, subpopulation-specific expressions and mdl values
from CAMASest
function for each candidate
subpopulation number.
This function perform preprocessing for CAM, including norm-based filtering, dimension deduction, perspective projection, local outlier removal and aggregation of gene expression vectors by clustering.
CAMPrep(data, dim.rdc = 10, thres.low = 0.05, thres.high = 0.95, cluster.method = c("K-Means", "apcluster"), cluster.num = 50, MG.num.thres = 20, lof.thres = 0.02, quickhull = TRUE, quick.select = NULL, sample.weight = NULL, generalNMF = FALSE)
CAMPrep(data, dim.rdc = 10, thres.low = 0.05, thres.high = 0.95, cluster.method = c("K-Means", "apcluster"), cluster.num = 50, MG.num.thres = 20, lof.thres = 0.02, quickhull = TRUE, quick.select = NULL, sample.weight = NULL, generalNMF = FALSE)
data |
Matrix of mixture expression profiles. Data frame, SummarizedExperiment or ExpressionSet object will be internally coerced into a matrix. Each row is a gene and each column is a sample. Data should be in non-log linear space with non-negative numerical values (i.e. >= 0). Missing values are not supported. All-zero rows will be removed internally. |
dim.rdc |
Reduced data dimension; should be not less than maximum candidate K. |
thres.low |
The lower bound of percentage of genes to keep for CAM with ranked norm. The value should be between 0 and 1. The default is 0.05. |
thres.high |
The higher bound of percentage of genes to keep for CAM with ranked norm. The value should be between 0 and 1. The default is 0.95. |
cluster.method |
The method to do clustering. The default "K-Means" will
use |
cluster.num |
The number of clusters; should be much larger than K. The default is 50. |
MG.num.thres |
The clusters with the gene number smaller than MG.num.thres will be treated as outliers. The default is 20. |
lof.thres |
Remove local outlier using |
quickhull |
Perform quickhull to select clusters or not. The default is True. |
quick.select |
The number of candidate corners kept after quickhull and SFFS greedy search. If Null, only quickhull is applied. The default is 20. If this value is larger than the number of candidate corners after quickhull, greedy search will also not be applied. |
sample.weight |
Vector of sample weights. If NULL, all samples have the same weights. The length should be the same as sample numbers. All values should be positive. |
generalNMF |
If TRUE, the decomposed proportion matrix has no sum-to-one constraint for each row. Without assuming samples are normalized, the first principal component will not forced to be along c(1,1,..,1) but a standard PCA will be applied during preprocessing. |
This function is used internally by CAM
function to
preprocess data, or used when you want to perform CAM step by step.
Low/high-expressed genes are filtered by their L2-norm ranks. Dimension reduction is slightly different from PCA. The first loading vector is forced to be c(1,1,...,1) with unit norm normalization. The remaining are eigenvectors from PCA in the space orthogonal to the first vector. Perspective projection is to project dimension-reduced gene expression vectors to the hyperplane orthogonal to c(1,0,...,0), i.e., the first axis in the new coordinate system. local outlier removal is optional to exclude outliers in simplex formed after perspective projection. Finally, gene expression vectors are aggregated by clustering to further reduce the impact of noise/outlier and help improve the efficiency of simplex corner detection.
An object of class "CAMPrepObj
" containing the
following components:
Valid |
logical vector to indicate the genes left after filtering. |
Xprep |
Preprocessed data matrix. |
Xproj |
Preprocessed data matrix after perspective projection. |
W |
The matrix whose rows are loading vectors. |
SW |
Sample weights. |
cluster |
cluster results including two vectors. The first indicates the cluster to which each gene is allocated. The second is the number of genes in each cluster. |
c.outlier |
The clusters with the gene number smaller than MG.num.thres. |
centers |
The centers of candidate corner clusters (candidate clusters containing marker genes). |
#obtain data data(ratMix3) data <- ratMix3$X #set seed to generate reproducible results set.seed(111) #preprocess data rPrep <- CAMPrep(data, dim.rdc = 3, thres.low = 0.30, thres.high = 0.95)
#obtain data data(ratMix3) data <- ratMix3$X #set seed to generate reproducible results set.seed(111) #preprocess data rPrep <- CAMPrep(data, dim.rdc = 3, thres.low = 0.30, thres.high = 0.95)
An S4 class for storing data preprocessing results.
Valid
logical vector to indicate the genes left after filtering.
Xprep
Preprocessed data matrix.
Xproj
Preprocessed data matrix after perspective projection.
W
The matrix whose rows are loading vectors.
cluster
cluster results including two vectors. The first indicates the cluster to which each gene is allocated. The second is the number of genes in each cluster.
c.outlier
The clusters with the gene number smaller than MG.num.thres.
centers
The centers of candidate corner clusters (candidate clusters containing marker genes).
Given a set of data points, return possible combinations of data points as corners. These combinations are selected by ranking the sum of margin-of-errors.
cornerSort(X, K, nComb)
cornerSort(X, K, nComb)
X |
Data in a matrix. Each column is a data point. |
K |
The number of corner points. |
nComb |
The number of returned combinations of data points as corners. All combinations will be returned if the number of all combinations is less than nComb. |
This function is to detect corner points from
data
points by conducting an exhaustive combinatorial search (with total
combinations), based on a convex-hull-to-data fitting criterion:
sum of margin-of-errors.
nComb
combinations are returned for further
selection based on reconstruction errors of all data points in original
space.
The function is implemented in Java with R-to-Java interface provided by
rJava
package. It relies on NonNegativeLeastSquares class in Parallel
Java Library (https://www.cs.rit.edu/~ark/pj.shtml).
A list containing the following components:
idx |
A matrix to show the indexes of data points in combinations to construct a convex hull. Each column is one combination. |
error |
A vector of margin-of-error sums for each combination. |
data <- matrix(c(0.1,0.2,1.0,0.0,0.0,0.5,0.3, 0.1,0.7,0.0,1.0,0.0,0.5,0.3, 0.8,0.1,0.0,0.0,1.0,0.0,0.4), nrow =3, byrow = TRUE) topconv <- cornerSort(data, 3, 10)
data <- matrix(c(0.1,0.2,1.0,0.0,0.0,0.5,0.3, 0.1,0.7,0.0,1.0,0.0,0.5,0.3, 0.8,0.1,0.0,0.0,1.0,0.0,0.4), nrow =3, byrow = TRUE) topconv <- cornerSort(data, 3, 10)
This function obtains minimum description length (mdl) values for each candidate subpopulation number.
MDL(CAMResult, mdl.method = 3) ## S4 method for signature 'MDLObj,missing' plot(x, data.term = FALSE, ...)
MDL(CAMResult, mdl.method = 3) ## S4 method for signature 'MDLObj,missing' plot(x, data.term = FALSE, ...)
CAMResult |
Result from |
mdl.method |
Approach to calculate mdl values; should be 1, 2, or 3. The default is 3. |
x |
|
data.term |
If true, plot data term (code length of data under model). |
... |
All other arguments are passed to the plotting command. |
This function extracts minimum description length (mdl) values from
the result of CAM
function, which contains mdl values form
three approaches for each candidate subpopulation number.
For more details about three approaches, refer to CAMASest
.
mdl is code length of data under the model plus code length of model. Both mdl value and the first term about data are returned.
An object of class "MDLObj
" containing the
following components:
K |
The candidate subpopulation numbers. |
datalengths |
For each model with a certain subpopulation number, code length of data under the model. |
mdls |
mdl value for each model with a certain subpopulation number. |
## Not run: #obtain data data(ratMix3) data <- ratMix3$X #Analysis by CAM rCAM <- CAM(data, K = 2:5, thres.low = 0.30, thres.high = 0.95) #extract mdl values MDL(rCAM) MDL(rCAM, 1) MDL(rCAM, 2) #plot MDL curves plot(MDL(rCAM)) plot(MDL(rCAM), data.term = TRUE) #with data length curve ## End(Not run)
## Not run: #obtain data data(ratMix3) data <- ratMix3$X #Analysis by CAM rCAM <- CAM(data, K = 2:5, thres.low = 0.30, thres.high = 0.95) #extract mdl values MDL(rCAM) MDL(rCAM, 1) MDL(rCAM, 2) #plot MDL curves plot(MDL(rCAM)) plot(MDL(rCAM), data.term = TRUE) #with data length curve ## End(Not run)
An S4 class for storing mdl values.
K
The candidate subpopulation numbers.
datalengths
For each model with a certain subpopulation number, code length of data under the model.
mdls
mdl value for each model with a certain subpopulation number.
This function returns marker genes detected by CAM for estimating A.
MGsforA(CAMResult = NULL, K = NULL, PrepResult = NULL, MGResult = NULL, corner.strategy = 2)
MGsforA(CAMResult = NULL, K = NULL, PrepResult = NULL, MGResult = NULL, corner.strategy = 2)
CAMResult |
Result from |
K |
The candidate subpopulation number. |
PrepResult |
An object of class "CAMPrepObj" from |
MGResult |
An object of class "CAMMGObj" from
|
corner.strategy |
The method to detect corner clusters. 1: minimum sum of margin-of-errors; 2: minimum sum of reconstruction errors. The default is 2. |
This function needs to specify CAMResult and K, or PrepResult and
MGResult. The returned marker genes are those used by CAM for estimating A.
To obtain a more complete marker gene list, please refer to
MGstatistic
.
A list of vectors, each of which contains marker genes for one subpopulation.
#obtain data and run CAM data(ratMix3) data <- ratMix3$X rCAM <- CAM(data, K = 3, dim.rdc= 3, thres.low = 0.30, thres.high = 0.95) #obtain marker genes detected by CAM for estimating A MGlist <- MGsforA(rCAM, K = 3) #obtain data and run CAM step by step rPrep <- CAMPrep(data, dim.rdc= 3, thres.low = 0.30, thres.high = 0.95) rMGC <- CAMMGCluster(3, rPrep) #obtain marker genes detected by CAM for estimating A MGlist <- MGsforA(PrepResult = rPrep, MGResult = rMGC)
#obtain data and run CAM data(ratMix3) data <- ratMix3$X rCAM <- CAM(data, K = 3, dim.rdc= 3, thres.low = 0.30, thres.high = 0.95) #obtain marker genes detected by CAM for estimating A MGlist <- MGsforA(rCAM, K = 3) #obtain data and run CAM step by step rPrep <- CAMPrep(data, dim.rdc= 3, thres.low = 0.30, thres.high = 0.95) rMGC <- CAMMGCluster(3, rPrep) #obtain marker genes detected by CAM for estimating A MGlist <- MGsforA(PrepResult = rPrep, MGResult = rMGC)
This function computes One-Versus-Everyone Fold Change (OVE-FC) from subpopulation-specific expression profiles. Bootstrapping is optional.
MGstatistic(data, A = NULL, boot.alpha = NULL, nboot = 1000, cores = NULL)
MGstatistic(data, A = NULL, boot.alpha = NULL, nboot = 1000, cores = NULL)
data |
A data set that will be internally coerced into a matrix. Each row is a gene and each column is a sample. Data should be in non-log linear space with non-negative numerical values (i.e. >= 0). Missing values are not supported. All-zero rows will be removed internally. |
A |
When data are mixture expression profiles, A is estimated proportion matrix or prior proportion matrix. When data are pure expression profiles, A is a phenotype vector to indicate which subpopulation each sample belongs to. |
boot.alpha |
Alpha for bootstrapped OVE-FC confidence interval. The default is 0.05. |
nboot |
The number of boots. |
cores |
The number of system cores for parallel computing. If not provided, the default back-end is used. |
This function calculates OVE-FC and bootstrapped OVE-FC which can be used to identify markers from all genes.
A data frame containing the following components:
idx |
Numbers or phenotypes indicating which subpopulation each gene could be a marker for. If A is a proportion matrix without column name, numbers are returned. Otherwise, phenotypes. |
OVE.FC |
One-versus-Everyone fold change (OVE-FC) |
OVE.FC.alpha |
lower confidence bound of bootstrapped OVE-FC at alpha level. |
#data are mixture expression profiles, A is proportion matrix data(ratMix3) MGstat <- MGstatistic(ratMix3$X, ratMix3$A) ## Not run: MGstat <- MGstatistic(ratMix3$X, ratMix3$A, boot.alpha = 0.05) #enable boot ## End(Not run) #data are pure expression profiles without replicates MGstat <- MGstatistic(ratMix3$S) #boot is not applicable ## Not run: #data are pure expression profiles with phenotypes S <- matrix(rgamma(3000,0.1,0.1), 1000, 3) S <- S[, c(1,1,1,2,2,2,3,3,3,3)] + rnorm(1000*10, 0, 0.5) MGstat <- MGstatistic(S, c(1,1,1,2,2,2,3,3,3,3), boot.alpha = 0.05) ## End(Not run)
#data are mixture expression profiles, A is proportion matrix data(ratMix3) MGstat <- MGstatistic(ratMix3$X, ratMix3$A) ## Not run: MGstat <- MGstatistic(ratMix3$X, ratMix3$A, boot.alpha = 0.05) #enable boot ## End(Not run) #data are pure expression profiles without replicates MGstat <- MGstatistic(ratMix3$S) #boot is not applicable ## Not run: #data are pure expression profiles with phenotypes S <- matrix(rgamma(3000,0.1,0.1), 1000, 3) S <- S[, c(1,1,1,2,2,2,3,3,3,3)] + rnorm(1000*10, 0, 0.5) MGstat <- MGstatistic(S, c(1,1,1,2,2,2,3,3,3,3), boot.alpha = 0.05) ## End(Not run)
Accessor to Dimension-reduction loading matrix.
PCAmat(x, ...) ## S4 method for signature 'CAMObj' PCAmat(x) ## S4 method for signature 'CAMPrepObj' PCAmat(x)
PCAmat(x, ...) ## S4 method for signature 'CAMObj' PCAmat(x) ## S4 method for signature 'CAMPrepObj' PCAmat(x)
x |
a |
... |
additional argument list. |
The matrix whose rows are loading vectors for dimension reduction.
#obtain data data(ratMix3) data <- ratMix3$X rCAM <- CAM(data, K = 3, dim.rdc = 3, thres.low = 0.30, thres.high = 0.95) W <- PCAmat(rCAM) W <- PCAmat(slot(rCAM, "PrepResult"))
#obtain data data(ratMix3) data <- ratMix3$X rCAM <- CAM(data, K = 3, dim.rdc = 3, thres.low = 0.30, thres.high = 0.95) W <- PCAmat(rCAM) W <- PCAmat(slot(rCAM, "PrepResult"))
Rat brain, liver and lung biospecimens derived from one animal at the cRNA homogenate level in different proportions. 3 technical replicates each. We downsample the original data to 10000 probes/probesets and 7 mixtures. Proportions used in experiments and pure expression profiles are also included.
data(ratMix3)
data(ratMix3)
A list with three matrices: mixture profiles (X), mixing proportions (A) and pure profiles (S).
Shen-Orr et al. (2010) Nat Methods 2010 Apr;7(4):287-9. PMID: 20208531
This function re-estimates proportion and expression matrix iteratively by Alternating Least Square (ALS) method. The initial values are from markers or known proportion matrix or known expression matrix,
redoASest(data, MGlist, A = NULL, S = NULL, generalNMF = FALSE, maxIter = 2, methy = FALSE)
redoASest(data, MGlist, A = NULL, S = NULL, generalNMF = FALSE, maxIter = 2, methy = FALSE)
data |
A data set that will be internally coerced into a matrix. Each row is a gene and each column is a sample. data should be in non-log linear space with non-negative numerical values (i.e. >= 0). Missing values are not supported. All-zero rows will be removed internally. |
MGlist |
A list of vectors, each of which contains CAM-detected markers and/or prior markers for one subpopulation. |
A |
Initial proportion matrix. If NULL, it will be estimated from
initial expression matrix. If initial expression matrix is also NULL, it
will be estimated from |
S |
Initial expression matrix. If NULL, it will be estimated from initial proportion matrix. |
generalNMF |
If TRUE, the decomposed proportion matrix has no sum-to-one constraint for each row. Without this constraint, the scale ambiguity of each column vector in proportion matrix will not be removed. The default is FALSE. |
maxIter |
maximum number of iterations for Alternating Least Square (ALS) method. The default is 2. If zero, ALS is not applied. |
methy |
Should be TRUE when dealing with methylation data, whose expression levels are confined between 0 and 1. |
If only markers are provided, they are used to estimate initial proportion matrix and then expression matrix. If proportion matrix or expression matrix is provided, it will be treated as initial matrix to estimate the other one. Then Alternating Least Square (ALS) method is applied to estimate two matrix alternatively. Note only markers' squared errors will be counted in ALS, which facilitates (1) faster computational running time and (2) a greater signal-to-noise ratio owing to markers' discriminatory power.
This function can be used to refine CAM estimation or perform supervised deconvolution. Note that allowing too many iterations may bring the risk of a significant deviation from initial values.
A list containing the following components:
Aest |
Proportion matrix after re-estimation and possible refinement. |
Sest |
expression matrix after re-estimation and possible refinement. |
mse |
Mean squared error (i.e. mean of reconstruction errors) for input markers |
#obtain data and run CAM data(ratMix3) data <- ratMix3$X rCAM <- CAM(data, K = 3, dim.rdc= 3, thres.low = 0.30, thres.high = 0.95) #obtain marker genes detected by CAM for estimating A MGlist <- MGsforA(rCAM, K = 3) #Re-estimation based on marker list rre <- redoASest(data, MGlist, maxIter = 10) Aest <- rre$Aest #re-estimated A matrix Sest <- rre$Sest #re-estimated S matrix #Re-estimation with initial A matrix rre <- redoASest(data, MGlist, A=ratMix3$A, maxIter = 10) #Re-estimation with initial S matrix rre <- redoASest(data, MGlist, S=ratMix3$S, maxIter = 10)
#obtain data and run CAM data(ratMix3) data <- ratMix3$X rCAM <- CAM(data, K = 3, dim.rdc= 3, thres.low = 0.30, thres.high = 0.95) #obtain marker genes detected by CAM for estimating A MGlist <- MGsforA(rCAM, K = 3) #Re-estimation based on marker list rre <- redoASest(data, MGlist, maxIter = 10) Aest <- rre$Aest #re-estimated A matrix Sest <- rre$Sest #re-estimated S matrix #Re-estimation with initial A matrix rre <- redoASest(data, MGlist, A=ratMix3$A, maxIter = 10) #Re-estimation with initial S matrix rre <- redoASest(data, MGlist, S=ratMix3$S, maxIter = 10)
This function generates a new list of markers based on initially detected markers by CAM and/or prior markers.
reselectMG(data, MGlist, fc.thres = "q0.5", err.thres = NULL)
reselectMG(data, MGlist, fc.thres = "q0.5", err.thres = NULL)
data |
A data set that will be internally coerced into a matrix. Each row is a gene and each column is a sample. data should be in non-log linear space with non-negative numerical values (i.e. >= 0). Missing values are not supported. All-zero rows will be removed internally. |
MGlist |
A list of vectors, each of which contains CAM-detected markers and/or prior markers for one subpopulation. |
fc.thres |
The lower threshold of fold change to select markers, an absolute value (e.g. 10) or a quantile value after 'q' (e.g. 'q0.5', the median of fold changes of one subpopulation's input markers). Each subpopulation can have its own threshold if a vector is provided. The default is 'q0.5'. If NULL, still use the input markers. |
err.thres |
The upper threshold of reconstruction error to select markers, an absolute value or a quantile value after 'q' (e.g. 'q0.5', the median of reconstruction errors of one subpopulation's input markers; 'q1.2', the maximum error times 1.2 ). Each subpopulation can have its own threshold if a vector is provided. The default is NULL, which means no such a threshold is applied. |
Considering some meaningful markers may be mistakenly filtered by
preprocessing and thus missed by CAM
, this function use the
input marker gene list to estimate proportions by AfromMarkers
and then estimate expression levels. Next, a new list of markers are
generated by fold change threshold and reconstruction error threshold.
The input marker genes could also be from other supervised detection and/or from literatures. This function reselects a list of marker genes based on the input.
A list of vectors, each of which contains new selected markers for one subpopulation.
#obtain data and run CAM data(ratMix3) data <- ratMix3$X rCAM <- CAM(data, K = 3, dim.rdc= 3, thres.low = 0.30, thres.high = 0.95) #obtain marker genes detected by CAM with a fixed K MGlist <- MGsforA(rCAM, K = 3) #Reselect markers from all genes MGlist.re <- reselectMG(data, MGlist, fc.thres='q0.5') MGlist.re <- reselectMG(data, MGlist, fc.thres='q0.5', err.thres='q0.95')
#obtain data and run CAM data(ratMix3) data <- ratMix3$X rCAM <- CAM(data, K = 3, dim.rdc= 3, thres.low = 0.30, thres.high = 0.95) #obtain marker genes detected by CAM with a fixed K MGlist <- MGsforA(rCAM, K = 3) #Reselect markers from all genes MGlist.re <- reselectMG(data, MGlist, fc.thres='q0.5') MGlist.re <- reselectMG(data, MGlist, fc.thres='q0.5', err.thres='q0.95')
This function detects the corners of convex hull by greedy search. It will
be used to reduce the number of candidate corners and thus reduce the time
complexity in the further exhaustive search by cornerSort
.
sffsHull(Xall, Aall, Kmax, deltaK = 8)
sffsHull(Xall, Aall, Kmax, deltaK = 8)
Xall |
Data in a matrix. Each column is a data point. The cost is computed based on all data points. |
Aall |
Candiatie corners in a matrix. Each column is a candidate corner. |
Kmax |
The target number of corners to be selected. |
deltaK |
The extra number of corners that need to be searched.
The default is 8 and will be truncated based on all the available
corners.
SFFS runs until a corner set of cardinality ( |
The Sequential Floating Forward selection (SFFS) is one of greedy
search methods for feature selection. With sum of margin-of-errors
as cost function and candidate corners as features, SFFS is used to select
best Kmax
corners to form an approximated hull. The best subset of
candidate corners is initialized as the empty set and at each step a new
corner is added. After that, the algorithm searches for corner that can be
removed from the best subset until the cost function does not decrease.
A list with length (Kmax
+ deltak
). Each component is
a vector of the corner indices of the SFFS-selected subset with
certain cardinality.
data <- matrix(c(0.1,0.2,1.0,0.0,0.0,0.5,0.3, 0.1,0.7,0.0,1.0,0.0,0.5,0.3, 0.8,0.1,0.0,0.0,1.0,0.0,0.4), nrow =3, byrow = TRUE) rsffs <- sffsHull(data, data, 3) rsffs <- sffsHull(data, data[,1:5], 3)
data <- matrix(c(0.1,0.2,1.0,0.0,0.0,0.5,0.3, 0.1,0.7,0.0,1.0,0.0,0.5,0.3, 0.8,0.1,0.0,0.0,1.0,0.0,0.4), nrow =3, byrow = TRUE) rsffs <- sffsHull(data, data, 3) rsffs <- sffsHull(data, data[,1:5], 3)
This function shows scatter simplex of mixture expressions.
simplexplot(data, A, MGlist = NULL, data.extra = NULL, corner.order = NULL, col = "gray", pch = 1, cex = 0.8, mg.col = "red", mg.pch = 1, mg.cex = 1.2, ex.col = "black", ex.pch = 19, ex.cex = 1.5, ...)
simplexplot(data, A, MGlist = NULL, data.extra = NULL, corner.order = NULL, col = "gray", pch = 1, cex = 0.8, mg.col = "red", mg.pch = 1, mg.cex = 1.2, ex.col = "black", ex.pch = 19, ex.cex = 1.5, ...)
data |
A data set that will be internally coerced into a matrix. Each row is a gene and each column is a sample. Data should be in non-log linear space with non-negative numerical values (i.e. >= 0). Missing values are not supported. All-zero rows will be removed internally. |
A |
Prior/Estimated proportion matrix. |
MGlist |
A list of vectors, each of which contains known markers and/or CAM-detected markers for one subpopulation. |
data.extra |
Extra data points to be shown in the simplex plot, e.g.
the points associated with prior/estimated proportion vectors. The format
should be consistent with |
corner.order |
The order to show simplex corners counterclockwise. |
col |
The color for data points. The default is "gray". |
pch |
The symbol/character for data points. The default is 1. |
cex |
The symbol/character expansion for data points. The default is 0.8. |
mg.col |
The colors for marker genes. Marker genes of one subpopulation could have their own color if a vector is provided. The default is "red". |
mg.pch |
The symbols/characters for marker genes. Marker genes of one subpopulation could have their own symbol/character if a vector is provided. The default is 1. |
mg.cex |
The symbol/character expansion for marker genes. The default is 1.2. |
ex.col |
The colors for extra data points. Each data point could have its own color if a vector is provided. The default is "black". |
ex.pch |
The symbols/characters for extra data points. Each data point could have its own symbol/character if a vector is provided. The default is 19. |
ex.cex |
The symbol/character expansion for extra data points. The default is 1.5. |
... |
All other arguments are passed to the plotting command. |
This function can show the scatter simplex and detected marker genes
in a 2D plot. The corners in the high-dimensional simplex will still locate
at extreme points of low-dimensional simplex. These corners will follow the
order set by corner.order
to display in the plot counterclockwise.
A plot to the current device.
#obtain data, A matrix, marker genes data(ratMix3) data <- ratMix3$X A <- ratMix3$A pMGstat <- MGstatistic(ratMix3$S, c("Liver","Brain","Lung")) pMGlist.FC <- lapply(c("Liver","Brain","Lung"), function(x) rownames(pMGstat)[pMGstat$idx == x & pMGstat$OVE.FC > 10]) #plot simplex for data simplexplot(data, A) simplexplot(data, A, MGlist = pMGlist.FC) #Color marker genes in the plot simplexplot(data, A, MGlist = pMGlist.FC, data.extra = t(A)) #show the location of proportion vectors in the plot #set different corner order and colors simplexplot(data, A, MGlist = pMGlist.FC, corner.order = c(2,1,3), col = "blue", mg.col = c("red","orange","green"))
#obtain data, A matrix, marker genes data(ratMix3) data <- ratMix3$X A <- ratMix3$A pMGstat <- MGstatistic(ratMix3$S, c("Liver","Brain","Lung")) pMGlist.FC <- lapply(c("Liver","Brain","Lung"), function(x) rownames(pMGstat)[pMGstat$idx == x & pMGstat$OVE.FC > 10]) #plot simplex for data simplexplot(data, A) simplexplot(data, A, MGlist = pMGlist.FC) #Color marker genes in the plot simplexplot(data, A, MGlist = pMGlist.FC, data.extra = t(A)) #show the location of proportion vectors in the plot #set different corner order and colors simplexplot(data, A, MGlist = pMGlist.FC, corner.order = c(2,1,3), col = "blue", mg.col = c("red","orange","green"))
This function reduces data dimension by loading matrix and then project dimension-reduced data to the hyperplane orthogonal to c(1,0,...,0), i.e., the first axis in the new coordinate system..
XWProj(data, W)
XWProj(data, W)
data |
A data set that will be internally coerced into a matrix. Each row is a gene and each column is a sample. Missing values are not supported. All-zero rows will be removed internally. |
W |
The matrix whose rows are loading vectors;
should be obtained from |
This function can project gene expression vectors to simplex plot
generated by CAM
/CAMPrep
. Using slot Xproj
in "CAMPrepObj
" can only show the simplex of genes after
filtering. This function helps observe all genes in simplex plot.
The data after perspective projection.
#obtain data data(ratMix3) data <- ratMix3$X #preprocess data rPrep <- CAMPrep(data, dim.rdc = 3, thres.low = 0.50, thres.high = 0.90) #obtain simplex Xproj <- XWProj(data, PCAmat(rPrep)) #plot simplex in 3d space #plot3d(Xproj[,-1]) #The first dimension is constant after projection
#obtain data data(ratMix3) data <- ratMix3$X #preprocess data rPrep <- CAMPrep(data, dim.rdc = 3, thres.low = 0.50, thres.high = 0.90) #obtain simplex Xproj <- XWProj(data, PCAmat(rPrep)) #plot simplex in 3d space #plot3d(Xproj[,-1]) #The first dimension is constant after projection