This vignette provides an overview of the Bioconductor package ASSIGN (Adaptive Signature Selection and InteGratioN) for signature-based profiling of heterogeneous biological pathways. ASSIGN is a computational tool used to evaluate the pathway deregulation/activation status in individual patient samples. ASSIGN employs a flexible Bayesian factor analysis approach that adapts predetermined pathway signatures derived either from a literature search or from perturbation experiments to create cell-/tissue-specific pathway signatures. The deregulation/activation level of each context-specific pathway is quantified to a score, which represents the extent to which a patient sample matches thepathway deregulation/activation signature.
Some distinctive features of ASSIGN are:
In the following examples, we will illustrate how to run ASSIGN
using either the easy to use assign.wrapper
function for
simple analysis or each individual ASSIGN
step for more detailed intermediate results.
For either analysis, we will first load ASSIGN and
create a temporary directory 'tempdir'
under the user’s
current working directory. All output generated in this vignette will be
saved in 'tempdir'
.
Next, load the training data, test data, training labels, and test
labels. The training dataset is a G (number of genomic measurements) x N
(number of samples in pathway perturbation experiments) matrix,
including five oncogenic pathways: B-Catenin, E2F3, MYC, RAS, and SRC
pathways in this example. The training data labels denote the column
indices of control and experimental samples for each perturbation
experiment. For example, we specify the column indices of the 10 RAS
control samples to be 1:10, and column indices of 10 RAS activated
samples to be 39:48. The test dataset is a G (number of genomic
measurements) x N (number of patient samples) matrix. The test data
labels denote the classes of the N test samples. In our example, test
samples 1-53 are adenocarcinoma and samples 54-111 are squamous cell
carcinoma. We specify 'Adeno'
and 'Squamous'
in the vector of test data labels. Note that the test data labels are
optional. ASSIGN
outputs additional validation plots to evaluate classification accuracy
when test data labels are provided.
assign.wrapper
We developed an all-in-one assign.wrapper
function to
run ASSIGN
with one command. For most users, assign.wrapper
will be
sufficient. The assign.wrapper
function outputs the
following files:
If training data is provided, assign.wrapper
also
outputs the following files:
When Adaptive_S
is TRUE, assign.wrapper
also outputs the following files:
Finally, if the testLabel
argument is not NULL,
assign.wrapper
also outputs the following files:
Here we illustrate how to run assign.wrapper
function
with three examples. To start, create a temporary directory
'tempdir'
and load training and test datasets. The
individual parameters are described in detail in the sections below and
the ASSIGN
reference manual.
dir.create(file.path(tempdir,"wrapper_example1"))
assign.wrapper(trainingData=trainingData1, testData=testData1,
trainingLabel=trainingLabel1, testLabel=testLabel1,
geneList=NULL, n_sigGene=rep(200,5), adaptive_B=TRUE,
adaptive_S=FALSE, mixture_beta=TRUE,
outputDir=file.path(tempdir,"wrapper_example1"),
iter=2000, burn_in=1000)
dir.create(file.path(tempdir,"wrapper_example2"))
assign.wrapper(trainingData=trainingData1, testData=testData1,
trainingLabel=trainingLabel1, testLabel=NULL,
geneList=geneList1, n_sigGene=NULL, adaptive_B=TRUE,
adaptive_S=FALSE, mixture_beta=TRUE,
outputDir=file.path(tempdir,"wrapper_example2"),
iter=2000, burn_in=1000)
dir.create(file.path(tempdir,"wrapper_example3"))
assign.wrapper(trainingData=NULL, testData=testData1,
trainingLabel=NULL, testLabel=NULL,
geneList=geneList1, n_sigGene=NULL, adaptive_B=TRUE,
adaptive_S=TRUE, mixture_beta=TRUE,
outputDir=file.path(tempdir,"wrapper_example3"),
iter=2000, burn_in=1000)
We developed a series of functions: assign.preprocess
,
assign.mcmc
, assign.convergence
,
assign.summary
, assign.cv.output
, and
assign.output
that work in concert to produce detailed
results.
assign.preprocess
We first run the assign.preprocess
function on the input
datasets. When the genomic measurements (e.g., gene expression profiles)
of training samples are provided, but predetermined pathway signature
gene lists are not provided, the assign.preprocess
function
utilizes a Bayesian univariate regression module to select a gene set
(usually 50-200 genes, but this can be specified by the user) based on
the absolute value of the regression coefficient (fold change) and the
posterior probability of the variable to be selected (statistical
significance). Since we have no predetermined gene lists to provide, we
leave the geneList
option as default NULL. Here we specify
200 signature genes for each of the five pathways.
# training dataset is available;
# the gene list of pathway signature is NOT available
processed.data <- assign.preprocess(trainingData=trainingData1,
testData=testData1,
trainingLabel=trainingLabel1,
geneList=NULL, n_sigGene=rep(200,5))
Alternatively, the users can have both the training data and the
curated/predetermined pathway signatures. Some genes in the curated
pathway signatures, although not significantly differentially expressed,
need to be included for the purpose of prediction. In this case, we
specify the trainingData
and geneList
parameters when both the training dataset and predetermined signature
gene list are available.
# training dataset is available;
# the gene list of pathway signature is available
processed.data <- assign.preprocess(trainingData=trainingData1,
testData=testData1,
trainingLabel=trainingLabel1,
geneList=geneList1)
In some cases, the expression profiles (training dataset) is
unavailable. Only the knowledge-based gene list or gene list from the
joint knowledge of some prior profiling experiments is available. In
this case, we specify geneList
and leave the
trainingData
and trainingLabel
as default
NULL.
# training dataset is NOT available;
# the gene list of pathway signature is available
processed.data <- assign.preprocess(trainingData=NULL,
testData=testData1,
trainingLabel=NULL,
geneList=geneList1)
The assign.preprocess
function returns the processed
training dataset (trainingData_sub
) and test dataset
(testData_sub
) as well as the prior parameters for the
background vector (B_vector
), signature matrix
(S_matrix
) and the probability signature matrix
(Pi_matrix
) and differentially expressed gene lists of each
pathway (diffGeneList
). The details of the
assign.preprocess
output are described in the
'value'
section of the manual page of
assign.preprocess
function. The output data of
assign.preprocess
function are used as the input data of
the assign.mcmc
function.
assign.mcmc
For the assign.mcmc
function, Y, Bg, and X are specified
as the output of the assign.preprocess
function. The
adaptive_B
(adaptive background), Adaptive_S
(adaptive signature) and mixture_beta
(regularization of
signature strength) can be specified TRUE or FALSE based on the analysis
context. When training and test samples are from the different cell or
tissue types, we recommend the adaptive background option to be TRUE.
Notice that when the training dataset is not available, the adaptive
signature option must be set TRUE, meaning that the magnitude of the
signature should be estimated from the test dataset. The default
iter
(iteration) is 2000. Particularly, when training
datasets are unavailable, it is better to specify the X
option in the assign.mcmc
using a more informative X
(specify up- or down- regulated genes) to initiate the model.
mcmc.chain <- assign.mcmc(Y=processed.data$testData_sub,
Bg = processed.data$B_vector,
X=processed.data$S_matrix,
Delta_prior_p = processed.data$Pi_matrix,
iter = 2000, adaptive_B=TRUE,
adaptive_S=FALSE, mixture_beta=TRUE)
The assign.mcmc
function returns the MCMC chain
recording default 2000 iterations for each parameter. The details of
assign.mcmc
output are described in the
'value'
section of the manual page of
assign.mcmc
function.
assign.convergence
We can make a trace plot to check the convergence of the model
parameters through assign.convergence
. The
burn_in
default is 0, so that the trace plot starts from
the first iteration. Additional iterations can be specified if the MCMC
chain does not converge in 2000 iterations.
trace.plot <- assign.convergence(test=mcmc.chain, burn_in=0, iter=2000,
parameter="B", whichGene=1,
whichSample=NA, whichPath=NA)
The assign.convergence
function returns a vector of the
estimated values from each Gibbs sampling iteration of the model
parameter to be checked and a trace plot of the parameter.
assign.summary
We then apply the assign.summary
function to compute the
posterior mean of each parameter. Typically we use the second half of
the MCMC chain to compute the posterior mean. We specify the default
burn-in period to be the first 1000 iteration and the default total
iteration to be 2000. The 1000 burn-in iterations are discarded when we
compute the posterior mean. The adaptive_B
,
Adaptive_S
and mixture_beta
options have to
set the same as those in the assign.mcmc
function.
mcmc.pos.mean <- assign.summary(test=mcmc.chain, burn_in=1000,
iter=2000, adaptive_B=TRUE,
adaptive_S=FALSE, mixture_beta=TRUE)
The assign.summary
function returns the posterior mean
of each parameter. The details of the assign.summary
output
are described in the 'value'
section of the manual page of
assign.summary
function.
assign.cv.output
The assign.cv.output
generates the cross-validation
results in the training samples. Output files from
assign.cv.output
are:
# For cross-validation, Y in the assign.mcmc function
# should be specified as processed.data$trainingData_sub.
assign.cv.output(processed.data=processed.data,
mcmc.pos.mean.trainingData=mcmc.pos.mean,
trainingData=trainingData1,
trainingLabel=trainingLabel1, adaptive_B=FALSE,
adaptive_S=FALSE, mixture_beta=TRUE,
outputDir=tempdir)
assign.output
The assign.output
generates the prediction results in
the test samples. Output files from assign.output
are:
Adaptive_S
is
specified TRUE.testLabel
argument is not NULL.The user needs to specify the output directory in the
outputDir
option, when running
assign.cv.output
and assign.output
.
The ASSIGN package allows a signature to be adapted to fit other biological contexts, reducing the contribution of specific genes in the signature to better match heterogeneity observed in the test dataset. Occasionally, adapting a signature may reduce the importance of key signature genes. For example, if a signature is created by overexpressing an oncogenic gene in a cell line, but during the adaptation step, ASSIGN reduces the importance of that key gene, the quality of the ASSIGN predictions may be reduced. Alternatively, if a gene in the signature is associated with some other heterogeneity in the data, such as smoking status, ASSIGN may adapt to differences in that gene, rather than the actual desired signature activity predictions. To this end, we have added the ability to provide a list of key genes to anchor in the signature, and genes to exclude from the signature. ASSIGN accomplishes this by setting the probability of inclusion into the signature to one for anchor genes, and zero for exclude genes. The change in expression values can still adapt, increasing or reducing the fold change associated with each gene in the signature, but the anchor genes will always contribute to the final signature, and the exclude genes will not.
dir.create(file.path(tempdir, "anchor_exclude_example"))
anchorList = list(bcat="224321_at",
e2f3="202589_at",
myc="221891_x_at",
ras="201820_at",
src="224567_x_at")
excludeList = list(bcat="1555340_x_at",
e2f3="1555340_x_at",
myc="1555340_x_at",
ras="204748_at",
src="1555339_at")
assign.wrapper(trainingData=trainingData1, testData=testData1,
trainingLabel=trainingLabel1, testLabel=NULL,
geneList=geneList1, n_sigGene=NULL, adaptive_B=TRUE,
adaptive_S=TRUE, mixture_beta=TRUE,
outputDir=file.path(tempdir, "anchor_exclude_example"),
anchorGenes=anchorList, excludeGenes=excludeList,
iter=2000, burn_in=1000)
By default, ASSIGN Bayesian gene selection chooses the signature genes with an equal fraction of genes that increase with pathway activity and genes that decrease with pathway activity. Use the pctUp parameter to modify this fraction. Set pctUP to NULL to select the most significant genes, regardless of direction.
When running ASSIGN, the number of genes in the gene list can affect the predictions that ASSIGN produces, but it is not always clear how long the gene list should be. Included within ASSIGN is the optimization procedure used in the publication Activity of distinct growth factor receptor network components in breast tumors uncovers two biologically relevant subtypes. The function allows you to optimize the gene list lengths for the pathways included in the paper using your own correlation data and gene list lengths. This function runs ASSIGN pathway prediction on various gene list lengths to find the optimum gene list length for the GFRN pathways by correlating the ASSIGN predictions to a matrix of correlation data that you provide. This function takes a long time to run because you are running ASSIGN many times on many pathways, so I recommend parallelizing by pathway or running the ASSIGN predictions first (long and parallelizable) and then running the correlation step (quick) separately.
The following example optimizes the pathway length for the AKT pathway based on correlating ASSIGN predictions with proteomics data. First, read in the test data that you want to predict using ASSIGN and the data (e.g. proteomics data) that will be used for correlation:
dir.create(file.path(tempdir, "optimization_example"))
setwd(file.path(tempdir, "optimization_example"))
testData <- read.table("https://drive.google.com/uc?authuser=0&id=1mJICN4z_aCeh4JuPzNfm8GR_lkJOhWFr&export=download",
sep='\t', row.names=1, header=1)
corData1 <- read.table("https://drive.google.com/uc?authuser=0&id=1MDWVP2jBsAAcMNcNFKE74vYl-orpo7WH&export=download",
sep='\t', row.names=1, header=1)
Next, create a list of data used for correlation. The list should contain a vector of column names from the correlation data for each of the pathways that are being optimized. The gene list length that has the largest average correlation for the columns in the correlation list will be the optimized gene list.
#this is a list of pathways and columns in the correlation data that will
#be used for correlation
corList <- list(akt=c("Akt","PDK1","PDK1p241"))
Finally, run the ComBat batch correction procedure and run the
optimizeGFRN
function:
#run the batch correction procedure between the test and training data
combat.data <- ComBat.step2(testData, pcaPlots = TRUE)
#run the default optimization procedure
optimization_results <- optimizeGFRN(combat.data, corData,
corList, run="akt")
ASSIGN
will output the results for each gene list length in the current working
directory. The optimizeGFRN
function returns a list of
optimized gene lists which can be used on other datasets and correlation
results. Additional options and documentation is available in the
optimizeGFRN
function documentation.
If you use ASSIGN in your publication, please cite:
Shen, Y. et al. ASSIGN: context-specific genomic profiling of multiple heterogeneous biological pathways. Bioinformatics 31 (11), 1745-1753 (2015). doi:10.1093/bioinformatics/btv031
Please see the ASSIGN reference manual for full descriptions of functions and the various options they support.
## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] ASSIGN_1.43.0 BiocStyle_2.33.1
##
## loaded via a namespace (and not attached):
## [1] digest_0.6.37 R6_2.5.1 fastmap_1.2.0
## [4] xfun_0.48 maketools_1.3.1 cachem_1.1.0
## [7] knitr_1.48 htmltools_0.5.8.1 rmarkdown_2.28
## [10] buildtools_1.0.0 lifecycle_1.0.4 cli_3.6.3
## [13] sass_0.4.9 jquerylib_0.1.4 compiler_4.4.1
## [16] sys_3.4.3 tools_4.4.1 evaluate_1.0.1
## [19] bslib_0.8.0 yaml_2.3.10 BiocManager_1.30.25
## [22] jsonlite_1.8.9 rlang_1.1.4