Title: | Adaptive Signature Selection and InteGratioN (ASSIGN) |
---|---|
Description: | ASSIGN is a computational tool to evaluate the pathway deregulation/activation status in individual patient samples. ASSIGN employs a flexible Bayesian factor analysis approach that adapts predetermined pathway signatures derived either from knowledge-based literature or from perturbation experiments to the cell-/tissue-specific pathway signatures. The deregulation/activation level of each context-specific pathway is quantified to a score, which represents the extent to which a patient sample encompasses the pathway deregulation/activation signature. |
Authors: | Ying Shen, Andrea H. Bild, W. Evan Johnson, and Mumtehena Rahman |
Maintainer: | Ying Shen <[email protected]>, W. Evan Johnson <[email protected]>, David Jenkins <[email protected]>, Mumtehena Rahman <[email protected]> |
License: | MIT + file LICENSE |
Version: | 1.43.0 |
Built: | 2024-10-30 03:32:21 UTC |
Source: | https://github.com/bioc/ASSIGN |
The assign.convergence function checks the convergence of the MCMC chains of the model parameters generated by the Gibbs sampling algorithm.
assign.convergence( test, burn_in = 0, iter = 2000, parameter = c("B", "S", "Delta", "beta", "kappa", "gamma", "sigma"), whichGene, whichSample, whichPath )
test |
The list object returned from the assign.mcmc function. The list components are the MCMC chains of B, S, Delta, beta, kappa, gamma, and sigma. |
burn_in |
The number of burn-in iterations. These iterations are discarded when computing the posterior means of the model parameters. The default is 0. |
iter |
The number of total iterations. The default is 2000. |
parameter |
A character string indicating which model parameter is to be checked for convergence. This must be one of "B", "S", "Delta", "beta", "kappa", "gamma", or "sigma". |
whichGene |
A numerical value indicating which gene is to be checked for convergence. The value has to be in the range between 1 and G. |
whichSample |
A numerical value indicating which test sample is to be checked for convergence. The value has to be in the range between 1 and N. |
whichPath |
A numerical value indicating which pathway is to be checked for convergence. The value has to be in the range between 1 and K. |
To compute the convergence of the gth gene in B, set whichGene=g, whichSample=NA, whichPath=NA.
To compute the convergence of the gth gene in the kth pathway within the signature matrix (S), set whichGene=g, whichSample=NA, whichPath=k.
To compute the convergence of the kth pathway in the nth test sample within the pathway activation matrix (A), set whichGene=NA, whichSample=n, whichPath=k.
The assign.convergence function returns a vector of the estimated values of the checked model parameter from each Gibbs sampling iteration, and a trace plot of this parameter.
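To illustrate what such a trace looks like, here is a minimal base-R sketch. It does not call ASSIGN: the array `S_mcmc` is simulated, and `extract_trace` is a hypothetical helper written for this example. It only assumes the iter x G x K layout that this manual gives for the signature chain.

```r
# Simulated stand-in for a signature chain (iter x G x K); not real ASSIGN output.
set.seed(1)
n_iter <- 200; G <- 5; K <- 2
S_mcmc <- array(rnorm(n_iter * G * K), dim = c(n_iter, G, K))

# Hypothetical helper: take the chain of one gene in one pathway
# and drop the burn-in iterations.
extract_trace <- function(chain, burn_in, iter, whichGene, whichPath) {
  stopifnot(burn_in >= 0, burn_in < iter, iter <= dim(chain)[1])
  chain[(burn_in + 1):iter, whichGene, whichPath]
}

trace_s <- extract_trace(S_mcmc, burn_in = 50, iter = 200,
                         whichGene = 3, whichPath = 1)

# A chain that has reached a stationary phase fluctuates around a stable level.
if (interactive()) {
  plot(trace_s, type = "l",
       xlab = "Iteration after burn-in", ylab = "S[gene 3, pathway 1]")
}
```

With real output, the same visual check applies to the vector returned by assign.convergence.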
Ying Shen
## Not run:
# check the 10th gene in the 1st pathway for the convergence
trace.plot <- assign.convergence(test=mcmc.chain, burn_in=0, iter=2000, parameter="S", whichGene=10, whichSample=NA, whichPath=1)
## End(Not run)
The assign.cv.output function outputs the summary results and plots for the cross validation done on the training dataset.
assign.cv.output( processed.data, mcmc.pos.mean.trainingData, trainingData, trainingLabel, adaptive_B = FALSE, adaptive_S = FALSE, mixture_beta = TRUE, outputDir )
processed.data |
The list object returned from the assign.preprocess function. |
mcmc.pos.mean.trainingData |
The list object of posterior means returned from the assign.summary function, computed from the chains produced by assign.mcmc. Note that for cross validation, the Y argument in the assign.mcmc function should be set as the training dataset. |
trainingData |
The genomic measure matrix of training samples (e.g., gene expression matrix). The dimension of this matrix is probe number x sample number. |
trainingLabel |
The list linking the index of each training sample to a specific group it belongs to. |
adaptive_B |
Logical. If TRUE, the model adapts the baseline/background (B) of genomic measures for the test samples. The default is FALSE. |
adaptive_S |
Logical. If TRUE, the model adapts the signatures (S) of genomic measures for the test samples. The default is FALSE. |
mixture_beta |
Logical. If TRUE, elements of the pathway activation matrix are modeled by a spike-and-slab mixture distribution. The default is TRUE. |
outputDir |
The path to the directory to save the output files. The path needs to be quoted in double quotation marks. |
The assign.cv.output function should be run after the assign.preprocess, assign.mcmc, and assign.summary functions. For cross validation, the Y argument in the assign.mcmc function is the output value "trainingData_sub" from the assign.preprocess function.
The assign.cv.output function returns one .csv file containing the activity of one or more pathways for each individual training sample, scatter plots of pathway activity for each individual pathway across all the training samples, and heatmap plots of the gene expression signatures for each individual pathway.
Ying Shen
assign.cv.output(processed.data=processed.data, mcmc.pos.mean.trainingData=mcmc.pos.mean, trainingData=trainingData1, trainingLabel=trainingLabel1, adaptive_B=FALSE, adaptive_S=FALSE, mixture_beta=TRUE, outputDir=tempdir())
The assign.mcmc function uses a Bayesian sparse factor analysis model to estimate the adaptive baseline/background, adaptive pathway signature, and pathway activation status of individual test (disease) samples.
assign.mcmc( Y, Bg, X, Delta_prior_p, iter = 2000, adaptive_B = TRUE, adaptive_S = FALSE, mixture_beta = TRUE, sigma_sZero = 0.01, sigma_sNonZero = 1, p_beta = 0.01, sigma_bZero = 0.01, sigma_bNonZero = 1, alpha_tau = 1, beta_tau = 0.01, Bg_zeroPrior = TRUE, S_zeroPrior = FALSE, ECM = FALSE, progress_bar = TRUE )
Y |
The G x J matrix of genomic measures (e.g., gene expression) of test samples. Y is the testData_sub variable returned from the assign.preprocess function. Genes/probes present in at least one pathway signature are retained. |
Bg |
The G x 1 vector of genomic measures of the baseline/background (B). Bg is the B_vector variable returned from the assign.preprocess function. Bg is the starting value of the baseline/background level in the MCMC chain. |
X |
The G x K matrix of genomic measures of the signature. X is the S_matrix variable returned from the assign.preprocess function. X is the starting value of the pathway signatures in the MCMC chain. |
Delta_prior_p |
The G x K matrix of prior probability of a gene being "significant" in its associated pathway. Delta_prior_p is the Pi_matrix variable returned from the assign.preprocess function. |
iter |
The number of iterations in the MCMC. The default is 2000. |
adaptive_B |
Logical. If TRUE, the model adapts the baseline/background (B) of genomic measures for the test samples. The default is TRUE. |
adaptive_S |
Logical. If TRUE, the model adapts the signatures (S) of genomic measures for the test samples. The default is FALSE. |
mixture_beta |
Logical. If TRUE, elements of the pathway activation matrix are modeled by a spike-and-slab mixture distribution. The default is TRUE. |
sigma_sZero |
Each element of the signature matrix (S) is modeled by a spike-and-slab mixture distribution. sigma_sZero is the variance of the spike normal distribution. The default is 0.01. |
sigma_sNonZero |
Each element of the signature matrix (S) is modeled by a spike-and-slab mixture distribution. sigma_sNonZero is the variance of the slab normal distribution. The default is 1. |
p_beta |
p_beta is the prior probability of a pathway being activated in individual test samples. The default is 0.01. |
sigma_bZero |
Each element of the pathway activation matrix (A) is modeled by a spike-and-slab mixture distribution. sigma_bZero is the variance of the spike normal distribution. The default is 0.01. |
sigma_bNonZero |
Each element of the pathway activation matrix (A) is modeled by a spike-and-slab mixture distribution. sigma_bNonZero is the variance of the slab normal distribution. The default is 1. |
alpha_tau |
The shape parameter of the precision (inverse of the variance) of a gene. The default is 1. |
beta_tau |
The rate parameter of the precision (inverse of the variance) of a gene. The default is 0.01. |
Bg_zeroPrior |
Logical. If TRUE, the prior distribution of the baseline/background level follows a normal distribution with mean zero. The default is TRUE. |
S_zeroPrior |
Logical. If TRUE, the prior distribution of the signature follows a normal distribution with mean zero. The default is FALSE. |
ECM |
Logical. If TRUE, the ECM algorithm, rather than Gibbs sampling, is applied to approximate the model parameters. The default is FALSE. |
progress_bar |
Display a progress bar for MCMC. Default is TRUE. |
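The spike-and-slab priors mentioned in the arguments above can be pictured with a short simulation. The following base-R sketch is illustrative only and is not ASSIGN's sampler: `r_spike_slab` is a hypothetical helper, both components are assumed here to be zero-mean normals, and the mixing probability of 0.5 is chosen just to make the two components easy to compare. The variances 0.01 and 1 are the documented defaults.

```r
set.seed(42)

# Hypothetical helper: draw n values from a two-component normal mixture.
# p_slab is the probability of the wide "slab" component; var_spike and
# var_slab are variances (zero means are an assumption of this sketch).
r_spike_slab <- function(n, p_slab, var_spike = 0.01, var_slab = 1) {
  in_slab <- rbinom(n, size = 1, prob = p_slab) == 1
  sds <- ifelse(in_slab, sqrt(var_slab), sqrt(var_spike))
  list(value = rnorm(n, mean = 0, sd = sds), in_slab = in_slab)
}

draws <- r_spike_slab(10000, p_slab = 0.5)

# Empirical variance of each component: the spike stays close to zero,
# while the slab allows large values.
spread <- tapply(draws$value, draws$in_slab, var)
print(spread)
```

The narrow spike shrinks unimportant elements toward zero, which is the shrinkage behaviour referred to by mixture_beta.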
The assign.mcmc function can be run in the following major modes, which are determined by the combination of the logical values of adaptive_B, adaptive_S, and mixture_beta.
Mode A: adaptive_B = FALSE, adaptive_S = FALSE, mixture_beta = FALSE. This is a regression mode without adaptation of the baseline/background or the signature, and without shrinkage of the pathway activation level.
Mode B: adaptive_B = TRUE, adaptive_S = FALSE, mixture_beta = FALSE. This is a regression mode with adaptation of the baseline/background, but without adaptation of the signature and without shrinkage of the pathway activation level.
Mode C: adaptive_B = TRUE, adaptive_S = FALSE, mixture_beta = TRUE. This is a regression mode with adaptation of the baseline/background, without adaptation of the signature, and with shrinkage of the pathway activation level when it is not significantly activated.
Mode D: adaptive_B = TRUE, adaptive_S = TRUE, mixture_beta = TRUE. This is a Bayesian factor analysis mode with adaptation of the baseline/background, adaptation of the signature, and shrinkage of the pathway activation level.
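The mapping from the three flags to the modes listed above can be written down as a small lookup. `assign_mode` is a hypothetical helper for illustration and is not part of ASSIGN; combinations that are not among the four listed modes return NA.

```r
# Hypothetical helper: name the mode implied by the three logical flags.
assign_mode <- function(adaptive_B, adaptive_S, mixture_beta) {
  key <- paste(adaptive_B, adaptive_S, mixture_beta)
  switch(key,
         "FALSE FALSE FALSE" = "A",
         "TRUE FALSE FALSE"  = "B",
         "TRUE FALSE TRUE"   = "C",
         "TRUE TRUE TRUE"    = "D",
         NA_character_)  # combination not among the listed modes
}

# The documented defaults of assign.mcmc (TRUE, FALSE, TRUE) correspond to mode C.
default_mode <- assign_mode(adaptive_B = TRUE, adaptive_S = FALSE, mixture_beta = TRUE)
print(default_mode)
```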
beta_mcmc |
The iter x K x J array of the pathway activation level estimated in every iteration of MCMC. |
tau2_mcmc |
The iter x G matrix of the precision of genes estimated in every iteration of MCMC. |
gamma_mcmc |
The iter x K x J array of probability of pathway being activated estimated in every iteration of MCMC. |
kappa_mcmc |
The iter x K x J array of pathway activation level (adjusted beta, scaled between 0 and 1) estimated in every iteration of MCMC. |
S_mcmc |
The iter x G x K array of signature estimated in every iteration of MCMC. |
Delta_mcmc |
The iter x G x K array of binary indicator of a gene being significant estimated in every iteration of MCMC. |
Ying Shen
mcmc.chain <- assign.mcmc(Y=processed.data$testData_sub, Bg = processed.data$B_vector, X=processed.data$S_matrix, Delta_prior_p = processed.data$Pi_matrix, iter = 20, adaptive_B=TRUE, adaptive_S=FALSE, mixture_beta=TRUE)
The assign.output function outputs the summary results and plots for prediction/validation for the test dataset.
assign.output( processed.data, mcmc.pos.mean.testData, trainingData, testData, trainingLabel, testLabel, geneList, adaptive_B = TRUE, adaptive_S = FALSE, mixture_beta = TRUE, outputDir )
processed.data |
The list object returned from the assign.preprocess function. |
mcmc.pos.mean.testData |
The list object of posterior means returned from the assign.summary function, computed from the chains produced by assign.mcmc. Note that for prediction/validation in the test dataset, the Y argument in the assign.mcmc function should be set as the test dataset. |
trainingData |
The genomic measure matrix of training samples (e.g., gene expression matrix). The dimension of this matrix is probe number x sample number. |
testData |
The genomic measure matrix of test samples (e.g., gene expression matrix). The dimension of this matrix is probe number x sample number. |
trainingLabel |
The list linking the index of each training sample to a specific group it belongs to. |
testLabel |
The vector of the phenotypes/labels of the test samples. |
geneList |
The list that collects the signature genes of one/multiple pathways. Every component of this list contains the signature genes associated with one pathway. |
adaptive_B |
Logical. If TRUE, the model adapts the baseline/background (B) of genomic measures for the test samples. The default is TRUE. |
adaptive_S |
Logical. If TRUE, the model adapts the signatures (S) of genomic measures for the test samples. The default is FALSE. |
mixture_beta |
Logical. If TRUE, elements of the pathway activation matrix are modeled by a spike-and-slab mixture distribution. The default is TRUE. |
outputDir |
The path to the directory to save the output files. The path needs to be quoted in double quotation marks. |
The assign.output function should be run after the assign.preprocess, assign.mcmc, and assign.summary functions. For prediction/validation in the test dataset, the Y argument in the assign.mcmc function is the output value "testData_sub" from the assign.preprocess function.
The assign.output function returns one .csv file containing the activity of one or more pathways for each individual test sample, scatter plots of pathway activity for each individual pathway across all the test samples, and heatmap plots of the gene expression of the prior signature and posterior signatures (if adaptive_S equals TRUE) of each individual pathway in the test samples.
Ying Shen
assign.output(processed.data = processed.data, mcmc.pos.mean.testData = mcmc.pos.mean, trainingData = trainingData1, testData = testData1, trainingLabel = trainingLabel1, testLabel = testLabel1, geneList = NULL, adaptive_B = TRUE, adaptive_S = FALSE, mixture_beta = TRUE, outputDir = tempdir())
The assign.preprocess function performs quality control on the user-provided input data and generates starting values and/or prior values for the model parameters. The assign.preprocess function is optional. Users whose input is already in the format expected by the assign.mcmc function can skip this step and go directly to assign.mcmc.
assign.preprocess( trainingData = NULL, testData, anchorGenes = NULL, excludeGenes = NULL, trainingLabel, geneList = NULL, n_sigGene = NA, theta0 = 0.05, theta1 = 0.9, pctUp = 0.5, geneselect_iter = 500, geneselect_burn_in = 100, progress_bar = TRUE )
trainingData |
The genomic measure matrix of training samples (e.g., gene expression matrix). The dimension of this matrix is probe number x sample number. The default is NULL. |
testData |
The genomic measure matrix of test samples (e.g., gene expression matrix). The dimension of this matrix is probe number x sample number. |
anchorGenes |
A list of genes that will be included in the signature even if they are not chosen during gene selection. |
excludeGenes |
A list of genes that will be excluded from the signature even if they are chosen during gene selection. |
trainingLabel |
The list linking the index of each training sample to a specific group it belongs to. See details and examples for more information. |
geneList |
The list that collects the signature genes of one/multiple pathways. Every component of this list contains the signature genes associated with one pathway. The default is NULL. |
n_sigGene |
The vector of the number of signature genes to be identified for each pathway. n_sigGene needs to be specified when geneList is set to NULL. The default is NA. See examples for more information. |
theta0 |
The prior probability for a gene to be significant, given that the gene is NOT defined as "significant" in the signature gene lists provided by the user. The default is 0.05. |
theta1 |
The prior probability for a gene to be significant, given that the gene is defined as "significant" in the signature gene lists provided by the user. The default is 0.9. |
pctUp |
By default, ASSIGN Bayesian gene selection chooses the signature genes with an equal fraction of genes that increase with pathway activity and genes that decrease with pathway activity. Use the pctUp parameter to modify this fraction. Set pctUp to NULL to select the most significant genes, regardless of direction. The default is 0.5. |
geneselect_iter |
The number of iterations for Bayesian gene selection. The default is 500. |
geneselect_burn_in |
The number of burn-in iterations for Bayesian gene selection. The default is 100. |
progress_bar |
Display a progress bar for gene selection. Default is TRUE. |
The assign.preprocess function performs quality control on the user-provided genomic data and metadata, re-formats the data so that it can be used in the subsequent analysis, and generates starting/prior values for the pathway signature matrix. The output values of the assign.preprocess function are used as input values for the assign.mcmc function.
For training data with 1 control group and 3 experimental groups (10 samples/group; all 3 experimental groups share 1 control group), the trainingLabel can be specified as: trainingLabel <- list(control = list(expr1=1:10, expr2=1:10, expr3=1:10), expr1 = 11:20, expr2 = 21:30, expr3 = 31:40)
For training data with 3 control groups and 3 experimental groups (10 samples/group; each experimental group has its own control group), the trainingLabel can be specified as: trainingLabel <- list(control = list(expr1=1:10, expr2=21:30, expr3=41:50), expr1 = 11:20, expr2 = 31:40, expr3 = 51:60)
It is highly recommended that the user use the same experiment name when specifying control indices and experimental indices.
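The naming recommendation can be checked before preprocessing. The sketch below is plain base R: `check_training_label` is a hypothetical helper (not an ASSIGN function) that tests whether the control entries use the same names as the experimental groups, whether all indices fall within the number of training samples, and whether any sample is listed as both control and experiment for the same pathway.

```r
# Shared-control layout from the first example above.
trainingLabel <- list(control = list(expr1 = 1:10, expr2 = 1:10, expr3 = 1:10),
                      expr1 = 11:20, expr2 = 21:30, expr3 = 31:40)

# Hypothetical helper: basic consistency checks on a trainingLabel list.
check_training_label <- function(label, n_samples) {
  experiments <- setdiff(names(label), "control")
  # Control entries and experimental groups should carry the same names.
  same_names <- setequal(names(label$control), experiments)
  # Every index must point at an existing training sample.
  idx <- unlist(label, use.names = FALSE)
  in_range <- all(idx >= 1 & idx <= n_samples)
  # A sample should not be both control and experiment for one pathway.
  no_overlap <- same_names && all(vapply(experiments, function(e) {
    length(intersect(label$control[[e]], label[[e]])) == 0
  }, logical(1)))
  same_names && in_range && no_overlap
}

label_ok <- check_training_label(trainingLabel, n_samples = 40)
print(label_ok)
```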
trainingData_sub |
The G x N matrix of G genomic measures (e.g., gene expression) of N training samples. Genes/probes present in at least one pathway signature are retained. Only returned when the training dataset is available. |
testData_sub |
The G x N matrix of G genomic measures (e.g., gene expression) of N test samples. Genes/probes present in at least one pathway signature are retained. |
B_vector |
The G x 1 vector of genomic measures of the baseline/background. Each element of the B_vector is calculated as the mean of the genomic measures of the control samples in training data. |
S_matrix |
The G x K matrix of genomic measures of the signature. Each column of the S_matrix represents a pathway. Each element of the S_matrix is calculated as the mean of genomic measures of the experimental samples minus the mean of the control samples in the training data. |
Delta_matrix |
The G x K matrix of binary indicators. Each column of the Delta_matrix represents a pathway. The elements in Delta_matrix are binary (0, insignificant gene; 1, significant gene). |
Pi_matrix |
The G x K matrix of probability p of a Bernoulli distribution. Each column of the Pi_matrix represents a pathway. Each element in the Pi_matrix is the probability of a gene to be significant in its associated pathway. |
diffGeneList |
The list that collects the signature genes of one/multiple pathways generated from the training samples or from the user provided gene list. Every component of this list contains the signature genes associated with one pathway. |
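The descriptions of B_vector and S_matrix above amount to simple row means. The following base-R sketch reproduces that arithmetic on simulated data; it is not ASSIGN's implementation. The toy matrix, the pathway names `p1` and `p2`, and the choice to pool all control samples for the baseline are assumptions made for this example.

```r
set.seed(7)

# Toy training matrix: 6 genes x 30 samples (simulated values).
trainingData <- matrix(rnorm(6 * 30), nrow = 6,
                       dimnames = list(paste0("gene", 1:6), paste0("s", 1:30)))
trainingLabel <- list(control = list(p1 = 1:10, p2 = 1:10),
                      p1 = 11:20, p2 = 21:30)
pathways <- setdiff(names(trainingLabel), "control")

# Baseline: mean of the control samples (all controls pooled; an assumption).
control_idx <- unique(unlist(trainingLabel$control))
B_vector <- rowMeans(trainingData[, control_idx, drop = FALSE])

# Signature: mean of the experimental samples minus the mean of the
# matching control samples, one column per pathway.
S_matrix <- sapply(pathways, function(p) {
  rowMeans(trainingData[, trainingLabel[[p]], drop = FALSE]) -
    rowMeans(trainingData[, trainingLabel$control[[p]], drop = FALSE])
})

print(dim(S_matrix))  # genes x pathways
```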
Ying Shen
processed.data <- assign.preprocess(trainingData=trainingData1, testData=testData1, trainingLabel=trainingLabel1, geneList=geneList1)
The assign.summary function computes the posterior mean of the model parameters estimated in every iteration during the Gibbs sampling.
assign.summary( test, burn_in = 1000, iter = 2000, adaptive_B = TRUE, adaptive_S = FALSE, mixture_beta = TRUE )
test |
The list object returned from the assign.mcmc function. The list components are the MCMC chains of B, S, Delta, beta, kappa, gamma, and sigma. |
burn_in |
The number of burn-in iterations. These iterations are discarded when computing the posterior means of the model parameters. The default is 1000. |
iter |
The number of total iterations. The default is 2000. |
adaptive_B |
Logical. If TRUE, the model adapts the baseline/background (B) of genomic measures for the test samples. The default is TRUE. |
adaptive_S |
Logical. If TRUE, the model adapts the signatures (S) of genomic measures for the test samples. The default is FALSE. |
mixture_beta |
Logical. If TRUE, elements of the pathway activation matrix are modeled by a spike-and-slab mixture distribution. The default is TRUE. |
The assign.summary function should be run after the assign.convergence function, which is used to check the convergence of the MCMC chain. If the MCMC chain does not converge to a stationary phase, more iterations are required in the assign.mcmc function. The number of burn-in iterations is usually set to half of the total number of iterations, meaning that the first half of the MCMC chain is discarded when computing the posterior means.
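The burn-in rule can be made concrete with a few lines of base R. This sketch does not call ASSIGN: `beta_mcmc` is a simulated array using the iter x K x J layout documented for assign.mcmc, and `posterior_mean` is a hypothetical helper that discards the burn-in, averages over the remaining iterations, and transposes the result to samples x pathways.

```r
set.seed(3)
n_iter <- 100; K <- 2; J <- 4

# Simulated stand-in for a pathway activation chain (iter x K x J).
beta_mcmc <- array(runif(n_iter * K * J), dim = c(n_iter, K, J))

# Hypothetical helper: posterior mean after discarding the burn-in.
posterior_mean <- function(chain, burn_in, iter) {
  kept <- chain[(burn_in + 1):iter, , , drop = FALSE]
  # Average over iterations (K x J), then transpose to J samples x K pathways.
  t(apply(kept, c(2, 3), mean))
}

# Burn-in set to half of the total number of iterations.
beta_pos <- posterior_mean(beta_mcmc, burn_in = 50, iter = 100)
print(dim(beta_pos))
```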
beta_pos |
The N x K matrix of the posterior mean of the pathway activation level in test samples (transposed matrix A). Columns: K pathways; rows: N test samples. |
sigma_pos |
The G x 1 vector of the posterior mean of the variance of each gene. |
kappa_pos |
The N x K matrix of the posterior mean of the pathway activation level in test samples (transposed matrix A), with beta_pos adjusted to scale between 0 and 1. Columns: K pathways; rows: N test samples. |
gamma_pos |
The N x K matrix of the posterior probability of pathways being activated in test samples. |
S_pos |
The G x K matrix of the posterior mean of pathway signature genes. |
Delta_pos |
The G x K matrix of the posterior probability of genes being significant in the associated pathways. |
Ying Shen
data(trainingData1)
data(testData1)
data(geneList1)
trainingLabel1 <- list(control = list(bcat=1:10, e2f3=1:10, myc=1:10, ras=1:10, src=1:10), bcat = 11:19, e2f3 = 20:28, myc= 29:38, ras = 39:48, src = 49:55)
processed.data <- assign.preprocess(trainingData=trainingData1, testData=testData1, trainingLabel=trainingLabel1, geneList=geneList1)
mcmc.chain <- assign.mcmc(Y=processed.data$testData_sub, Bg = processed.data$B_vector, X=processed.data$S_matrix, Delta_prior_p = processed.data$Pi_matrix, iter = 20, adaptive_B=TRUE, adaptive_S=FALSE, mixture_beta=TRUE)
mcmc.pos.mean <- assign.summary(test=mcmc.chain, burn_in=10, iter=20, adaptive_B=TRUE, adaptive_S=FALSE, mixture_beta = TRUE)
The assign.wrapper function integrates the assign.preprocess, assign.mcmc, assign.summary, assign.output, and assign.cv.output functions into one wrapper function.
assign.wrapper( trainingData = NULL, testData, trainingLabel, testLabel = NULL, geneList = NULL, anchorGenes = NULL, excludeGenes = NULL, n_sigGene = NA, adaptive_B = TRUE, adaptive_S = FALSE, mixture_beta = TRUE, outputDir, p_beta = 0.01, theta0 = 0.05, theta1 = 0.9, iter = 2000, burn_in = 1000, sigma_sZero = 0.01, sigma_sNonZero = 1, S_zeroPrior = FALSE, pctUp = 0.5, geneselect_iter = 500, geneselect_burn_in = 100, outputSignature_convergence = FALSE, ECM = FALSE, progress_bar = TRUE, override_S_matrix = NULL )
trainingData |
The genomic measure matrix of training samples (e.g., gene expression matrix). The dimension of this matrix is probe number x sample number. The default is NULL. |
testData |
The genomic measure matrix of test samples (e.g., gene expression matrix). The dimension of this matrix is probe number x sample number. |
trainingLabel |
The list linking the index of each training sample to a specific group it belongs to. See examples for more information. |
testLabel |
The vector of the phenotypes/labels of the test samples. The default is NULL. |
geneList |
The list that collects the signature genes of one/multiple pathways. Every component of this list contains the signature genes associated with one pathway. The default is NULL. |
anchorGenes |
A list of genes that will be included in the signature even if they are not chosen during gene selection. |
excludeGenes |
A list of genes that will be excluded from the signature even if they are chosen during gene selection. |
n_sigGene |
The vector of the number of signature genes to be identified for each pathway. n_sigGene needs to be specified when geneList is set to NULL. The default is NA. See examples for more information. |
adaptive_B |
Logical. If TRUE, the model adapts the baseline/background (B) of genomic measures for the test samples. The default is TRUE. |
adaptive_S |
Logical. If TRUE, the model adapts the signatures (S) of genomic measures for the test samples. The default is FALSE. |
mixture_beta |
Logical. If TRUE, elements of the pathway activation matrix are modeled by a spike-and-slab mixture distribution. The default is TRUE. |
outputDir |
The path to the directory to save the output files. The path needs to be quoted in double quotation marks. |
p_beta |
p_beta is the prior probability of a pathway being activated in individual test samples. The default is 0.01. |
theta0 |
The prior probability for a gene to be significant, given that the gene is NOT defined as "significant" in the signature gene lists provided by the user. The default is 0.05. |
theta1 |
The prior probability for a gene to be significant, given that the gene is defined as "significant" in the signature gene lists provided by the user. The default is 0.9. |
iter |
The number of iterations in the MCMC. The default is 2000. |
burn_in |
The number of burn-in iterations. These iterations are discarded when computing the posterior means of the model parameters. The default is 1000. |
sigma_sZero |
Each element of the signature matrix (S) is modeled by a spike-and-slab mixture distribution. sigma_sZero is the variance of the spike normal distribution. The default is 0.01. |
sigma_sNonZero |
Each element of the signature matrix (S) is modeled by a spike-and-slab mixture distribution. sigma_sNonZero is the variance of the slab normal distribution. The default is 1. |
S_zeroPrior |
Logical. If TRUE, the prior distribution of the signature follows a normal distribution with mean zero. The default is FALSE. |
pctUp |
By default, ASSIGN Bayesian gene selection chooses the signature genes with an equal fraction of genes that increase with pathway activity and genes that decrease with pathway activity. Use the pctUp parameter to modify this fraction. Set pctUp to NULL to select the most significant genes, regardless of direction. The default is 0.5. |
geneselect_iter |
The number of iterations for Bayesian gene selection. The default is 500. |
geneselect_burn_in |
The number of burn-in iterations for Bayesian gene selection. The default is 100. |
outputSignature_convergence |
Create a PDF of the MCMC chain. The default is FALSE. |
ECM |
Logical. If TRUE, the ECM algorithm, rather than Gibbs sampling, is applied to approximate the model parameters. The default is FALSE. |
progress_bar |
Display a progress bar for MCMC and gene selection. Default is TRUE. |
override_S_matrix |
Replace the S_matrix created by assign.preprocess with the matrix provided in override_S_matrix. This can be used to indicate the expected directions of genes in a signature if training data is not provided. |
The assign.wrapper function is an all-in-one function which outputs the necessary results for basic users. For users who need more intermediate results for model diagnosis, it is better to run the assign.preprocess, assign.mcmc, assign.convergence, assign.summary functions separately and extract the output values from the returned list objects of those functions.
The assign.wrapper function returns the activity of one or more pathways for each individual training sample and test sample, scatter plots of pathway activity for each individual pathway in the training and test data, heatmap plots of the gene expression signatures for each individual pathway, and heatmap plots of the gene expression of the prior and posterior signatures (if adaptive_S equals TRUE) of each individual pathway in the test data.
Ying Shen and W. Evan Johnson
data(trainingData1)
data(testData1)
data(geneList1)
trainingLabel1 <- list(control = list(bcat=1:10, e2f3=1:10, myc=1:10, ras=1:10, src=1:10), bcat = 11:19, e2f3 = 20:28, myc= 29:38, ras = 39:48, src = 49:55)
testLabel1 <- rep(c("subtypeA","subtypeB"), c(53,58))
assign.wrapper(trainingData=trainingData1, testData=testData1, trainingLabel=trainingLabel1, testLabel=testLabel1, geneList=geneList1, adaptive_B=TRUE, adaptive_S=FALSE, mixture_beta=TRUE, outputDir=tempdir(), p_beta=0.01, theta0=0.05, theta1=0.9, iter=20, burn_in=10)
The first ComBat step (on the signatures only) has already been performed. This step performs batch correction on the test data, using reference batch ComBat, to prepare the test data for ASSIGN analysis.
ComBat.step2( testData, pcaPlots = FALSE, combat_train = NULL, plots_to_console = FALSE )
testData |
The input test data to batch correct |
pcaPlots |
a logical value indicating whether or not the function should create PCA plots. The default is FALSE. |
combat_train |
The ComBat training data frame. If you do not have this, the function will attempt to download it from the internet. Please contact the developers if you have any issues accessing the file. |
plots_to_console |
By default this function will write PDF versions of the plots. Set this to TRUE to send the plots to the command line. The default is FALSE. |
If combat_train is not supplied, this function downloads the training data from the internet, so an internet connection is necessary.
A list of data frames is returned, including control (GFP) and signature data, as well as the batch-corrected test data. This list can go directly into the runassign.single and runassign.multi functions, or it can be subset for direct use with ASSIGN.
Overexpression signatures may contain genes that are consistently differentially expressed. This list was compiled based on the GFRN gene list. These genes appear in at least 60
Character vector of commonly differentially expressed genes.
Bild et al.
Gather the ASSIGN results from a specific directory.
gather_assign_results(path = ".")
path |
The path to the directory containing ASSIGN results. The default is the current working directory. |
A data frame of ASSIGN predictions from each run in the directory.
Signature genes for 5 oncogenic pathways.
List with 5 components representing each pathway. 200 signature genes are selected for each pathway.
Bild et al. (2006) Oncogenic pathway signatures in human cancers as a guide to targeted therapies. Nature, 439, 353-357.
Pathway signature gene lists have been optimized based on correlations between pathway activity data and protein data. The gene lists can be used to skip the Bayesian gene selection step of ASSIGN, which reduces the time it takes to run ASSIGN.
List of gene lists for akt, bad, egfr, her2, igf1r, krasgv, krasqh, and raf.
Bild et al.
Combine two data frames by row names.
merge_drop(x, y, by = 0, ...)
x |
The first data frame, or an object to be coerced to one. |
y |
The second data frame, or an object to be coerced to one. |
by |
Specification of the columns used for merging. The default (0) merges by row names. |
... |
Arguments to be passed to or from methods. |
The returned data frame is the combination of x and y, with the row names properly assigned.
## Not run:
merged.df <- merge_drop(df1,df2)
## End(Not run)
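The effect of merging by row names can be illustrated with base R alone. The sketch below is not the package's implementation. It assumes that the described behavior corresponds to calling merge() with by = 0 and then moving the generated Row.names column back into the row names. The helper name merge_by_rownames and the toy data frames are hypothetical.

```r
# Hypothetical helper: merge two data frames on their row names and
# restore the row names on the result.
merge_by_rownames <- function(x, y, ...) {
  merged <- merge(x, y, by = 0, ...)       # by = 0 merges on row names
  rownames(merged) <- as.character(merged$Row.names)
  merged$Row.names <- NULL                 # drop the helper column
  merged
}

# Toy data: only g1 and g3 are present in both data frames.
df1 <- data.frame(a = 1:3, row.names = c("g1", "g2", "g3"))
df2 <- data.frame(b = c(10, 20, 30), row.names = c("g3", "g1", "g4"))

merged.df <- merge_by_rownames(df1, df2)
print(merged.df)
```

With the default arguments of merge(), only rows present in both inputs are kept, so genes missing from either data frame are dropped from the result.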
This function runs ASSIGN pathway prediction on gene list lengths from 5 to 500 to find the optimum gene list length for the GFRN pathways by correlating the ASSIGN predictions to a matrix of correlation data that you provide. It takes a long time to run because ASSIGN is run many times on many pathways, so it is recommended to parallelize by pathway, or to run the ASSIGN predictions first (long and parallelizable) and then run the correlation step (quick) separately.
optimizeGFRN( indata, correlation, correlationList, run = c("akt", "bad", "egfr", "her2", "igf1r", "krasgv", "krasqh", "raf"), run_ASSIGN_only = FALSE, correlation_only = FALSE, keep_optimized_only = FALSE, pathway_lengths = c(seq(5, 20, 5), seq(25, 275, 25), seq(300, 500, 50)), iter = 1e+05, burn_in = 50000 )
indata |
The list of data frames from ComBat.step2 |
correlation |
A matrix of data to correlate ASSIGN predictions to. The number of rows should be the same and in the same order as indata |
correlationList |
A list that shows which columns of correlation should be used for each pathway. See below for more details |
run |
specifies the pathways to predict. The default list will cause all eight pathways to be run in serial. Specify a pathway ("akt", "bad", "egfr", etc.) or list of pathways to run those pathways only. |
run_ASSIGN_only |
a logical value indicating whether to run the ASSIGN predictions only. Use this to parallelize ASSIGN runs across a compute cluster or across compute threads. |
correlation_only |
a logical value indicating whether to run the correlation step only. The function will find the ASSIGN runs in the current working directory and optimize them based on the correlation data matrix. |
keep_optimized_only |
a logical value indicating whether to keep only the runs that provided the optimum ASSIGN correlations rather than all of the ASSIGN run results. If TRUE, all directories in the current working directory that match the pattern "_gene_list" will be deleted. The default is FALSE. |
pathway_lengths |
The gene list lengths that should be run. The default is the 20 pathway lengths that were used in the paper, but this list can be customized to the pathway lengths you are willing to accept. |
iter |
The number of iterations in the MCMC. |
burn_in |
The number of burn-in iterations. These iterations are discarded when computing the posterior means of the model parameters. |
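The default pathway_lengths grid can be inspected in base R before committing to a long run. The sketch below only evaluates the default expressions from the usage shown above. The total run count assumes one ASSIGN run per pathway and gene list length, which is an assumption and is not stated explicitly in this documentation. The coarse_lengths grid is a hypothetical alternative.

```r
# Default values copied from the optimizeGFRN usage.
run <- c("akt", "bad", "egfr", "her2", "igf1r", "krasgv", "krasqh", "raf")
pathway_lengths <- c(seq(5, 20, 5), seq(25, 275, 25), seq(300, 500, 50))

# The default grid: 20 gene list lengths ranging from 5 to 500.
n_lengths <- length(pathway_lengths)
length_range <- range(pathway_lengths)

# Assumption: one ASSIGN run per pathway and gene list length.
n_runs <- length(run) * n_lengths

# A coarser, hypothetical grid that would reduce the number of runs.
coarse_lengths <- seq(50, 250, 50)
n_runs_coarse <- length(run) * length(coarse_lengths)

print(pathway_lengths)
print(c(default = n_runs, coarse = n_runs_coarse))
```

Under that assumption the default settings imply 160 runs, each with 100000 MCMC iterations, which is why splitting the work by pathway with the run argument is worthwhile.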
ASSIGN runs are output to the current working directory. This function returns the correlation data and the optimized gene lists, which you can use with runassignGFRN to try these lists on other data.
## Not run:
testData <- read.table(paste0("https://drive.google.com/uc?authuser=0", "&id=1mJICN4z_aCeh4JuPzNfm8GR_lkJOhWFr", "&export=download"), sep='\t', row.names=1, header=1)
corData <- read.table(paste0("https://drive.google.com/uc?authuser=0", "&id=1MDWVP2jBsAAcMNcNFKE74vYl-orpo7WH", "&export=download"), sep='\t', row.names=1, header=1)
corData$negAkt <- -1 * corData$Akt
corData$negPDK1 <- -1 * corData$PDK1
corData$negPDK1p241 <- -1 * corData$PDK1p241
corList <- list(akt=c("Akt","PDK1","PDK1p241"), bad=c("negAkt","negPDK1","negPDK1p241"), egfr=c("EGFR","EGFRp1068"), her2=c("HER2","HER2p1248"), igf1r=c("IGFR1","PDK1","PDK1p241"), krasgv=c("EGFR","EGFRp1068"), krasqh=c("EGFR","EGFRp1068"), raf=c("MEK1","PKCalphap657","PKCalpha"))
combat.data <- ComBat.step2(testData, pcaPlots = TRUE)
optimization_results <- optimizeGFRN(combat.data, corData, corList)
## End(Not run)
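The example above adds negated copies of some protein columns (negAkt, negPDK1, negPDK1p241) and assigns them to the bad pathway. A plausible reading is that negating a column flips the sign of its correlation, so a pathway expected to be inversely related to a protein can still be scored by looking for a high positive correlation. The base R sketch below illustrates only that arithmetic with simulated values. The data and the pred_bad vector are hypothetical, and the sketch makes no claim about how optimizeGFRN computes its correlations internally.

```r
set.seed(1)

# Hypothetical protein measurements for 12 samples.
corData <- data.frame(Akt = rnorm(12), row.names = paste0("sample", 1:12))

# Negated copy, as in the example above.
corData$negAkt <- -1 * corData$Akt

# Hypothetical pathway predictions that move opposite to Akt.
pred_bad <- -corData$Akt + rnorm(12, sd = 0.5)

# Correlation with the original column and with its negated copy.
cor_original <- cor(pred_bad, corData$Akt)
cor_negated <- cor(pred_bad, corData$negAkt)

print(c(original = cor_original, negated = cor_negated))
```

The two correlations have the same magnitude and opposite signs, so listing the negated column in correlationList lets the same "higher is better" rule apply to every pathway.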
Display a PCA plot of the data.
pcaplot(mat, sub, center = TRUE, scale = TRUE, plottitle = "PCA")
mat |
The data frame on which to perform PCA. |
sub |
The number of samples in this batch, from left to right in the data frame |
center |
a logical value indicating whether the variables should be shifted to be zero centered. The default is TRUE. |
scale |
a logical value indicating whether the variables should be scaled to have unit variance before the analysis takes place. The default is TRUE. |
plottitle |
The title to display above your PCA plot. The default is "PCA". |
A PCA plot is displayed.
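The kind of plot described here can be approximated with base R. The sketch below is not the pcaplot implementation. It assumes that samples are columns of mat, that sub is a vector of batch sizes read from left to right, and that the center and scale arguments map onto the corresponding arguments of prcomp(). The simulated matrix and the two batch sizes are hypothetical.

```r
set.seed(42)

# Hypothetical expression data: 20 genes (rows) by 6 samples (columns).
mat <- as.data.frame(matrix(
  rnorm(20 * 6), nrow = 20,
  dimnames = list(paste0("gene", 1:20), paste0("s", 1:6))
))

# Assumed meaning of `sub`: batch sizes from left to right (2 then 4 samples).
sub <- c(2, 4)
stopifnot(sum(sub) == ncol(mat))

# One batch label per sample, in column order.
batch <- rep(seq_along(sub), times = sub)

# PCA on samples: transpose so that samples are rows.
pca <- prcomp(t(mat), center = TRUE, scale. = TRUE)
scores <- pca$x[, 1:2]

# Scatter plot of the first two components, colored by batch.
plot(scores, col = batch, pch = 19, main = "PCA", xlab = "PC1", ylab = "PC2")
```

If samples from different batches form separate clusters in such a plot, that usually points to a batch effect that should be corrected before running ASSIGN.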
This function runs eight ASSIGN runs based on the pathway optimizations from the paper. You can run all eight pathways in serial, or call this function and specify the run parameter to run a specific pathway. Some ASSIGN parameters can be customized using this function. The default values were used in the analysis for the paper.
runassignGFRN( indata, run = c("akt", "bad", "egfr", "her2", "igf1r", "krasgv", "raf"), optimized_geneList = NULL, use_seed = 1234, sigma_sZero = 0.05, sigma_sNonZero = 0.5, S_zeroPrior = FALSE, iter = 1e+05, burn_in = 50000, exclude_common_genes = FALSE, adaptive_S = TRUE, ECM = FALSE )
indata |
The list of data frames from ComBat.step2 |
run |
specifies the pathways to predict. The default list will cause all eight pathways to be run in serial. Specify a pathway ("akt", "bad", "egfr", etc.) or list of pathways to run those pathways only. |
optimized_geneList |
a list of custom optimized gene lists for the GFRN pathways, either created manually or output by optimizeGFRN |
use_seed |
Set the seed before running ASSIGN. This will make the result consistent between runs. The default is 1234. Set use_seed as FALSE to not set a seed. |
sigma_sZero |
Each element of the signature matrix (S) is modeled by a spike-and-slab mixture distribution. Sigma_sZero is the variance of the spike normal distribution. The default is 0.05. |
sigma_sNonZero |
Each element of the signature matrix (S) is modeled by a spike-and-slab mixture distribution. Sigma_sNonZero is the variance of the slab normal distribution. The default is 0.5. |
S_zeroPrior |
Logical. If TRUE, the prior distribution of the signature follows a normal distribution with mean zero. The default is FALSE. |
iter |
The number of iterations in the MCMC. The default is 100000. |
burn_in |
The number of burn-in iterations. These iterations are discarded when computing the posterior means of the model parameters. The default is 50000. |
exclude_common_genes |
Remove commonly differentially expressed genes for overexpression signatures. The default is FALSE. |
adaptive_S |
Logical. If TRUE, the model adapts the signatures (S) of genomic measures for the test samples. The default for GFRN analysis is TRUE. |
ECM |
Logical. If TRUE, the ECM algorithm, rather than Gibbs sampling, is applied to approximate the model parameters. The default is FALSE. |
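The roles of sigma_sZero and sigma_sNonZero can be pictured by simulating a spike-and-slab mixture in base R. The sketch below is illustrative only and is not the ASSIGN sampler. It assumes both normal components are centered at zero, and the mixing probability p_slab is a hypothetical value chosen for the demonstration. Because the documentation describes both parameters as variances, the standard deviations passed to rnorm() are their square roots.

```r
set.seed(1234)

sigma_sZero <- 0.05     # variance of the narrow "spike" component (default)
sigma_sNonZero <- 0.5   # variance of the wide "slab" component (default)
p_slab <- 0.1           # hypothetical probability that an element is in the slab
n <- 10000

# Assign each simulated element to the spike or the slab.
is_slab <- rbinom(n, size = 1, prob = p_slab) == 1

# Draw from the matching normal component (both centered at zero here).
draws <- ifelse(is_slab,
                rnorm(n, mean = 0, sd = sqrt(sigma_sNonZero)),
                rnorm(n, mean = 0, sd = sqrt(sigma_sZero)))

# Empirical variances of the two components.
spike_var <- var(draws[!is_slab])
slab_var <- var(draws[is_slab])

print(c(spike = spike_var, slab = slab_var))
```

Elements drawn from the spike stay close to zero, while elements drawn from the slab can take much larger values, which is how the two variances separate genes with negligible signature weight from genes with substantial weight.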
Data is output to the current working directory in a results directory.
## Not run:
testData <- read.table(paste0("https://drive.google.com/uc?authuser=0&", "id=1mJICN4z_aCeh4JuPzNfm8GR_lkJOhWFr", "&export=download"), sep='\t', row.names=1, header=1)
combat.data <- ComBat.step2(testData, pcaPlots = TRUE)
runassignGFRN(combat.data)
## End(Not run)
Gene expression datasets for 111 lung cancer patient samples, including 53 cases of lung adenocarcinoma and 58 cases of lung squamous carcinoma.
Data frame with 1000 genes/probes (rows) and 111 samples (columns).
Bild et al. (2006) Oncogenic pathway signatures in human cancers as a guide to targeted therapies. Nature, 439, 353-357.
Gene expression datasets for 5 oncogenic pathway perturbation experiments, including B-Catenin, E2F3, MYC, RAS, and SRC pathways.
Data frame with 1000 genes/probes (rows) and 55 samples (columns).
Bild et al. (2006) Oncogenic pathway signatures in human cancers as a guide to targeted therapies. Nature, 439, 353-357.