Title: | Regulatory Network Inference and Driver Gene Evaluation using Integrative Multi-Omics Analysis and Penalized Regression |
---|---|
Description: | Integrating the growing volume of available multi-omics cancer data remains a major challenge in improving our understanding of cancer. One key difficulty is using multi-omics data to identify novel cancer driver genes. We have developed an algorithm, called AMARETTO, that integrates copy number, DNA methylation and gene expression data from cancer samples to identify a set of driver genes and connects them to clusters of co-expressed genes, which we define as modules. We applied AMARETTO in a pan-cancer setting to identify cancer driver genes and their modules across multiple cancer sites. AMARETTO captures modules enriched in angiogenesis, cell cycle and EMT, and modules that accurately predict survival and molecular subtypes. This allows AMARETTO to identify novel cancer driver genes directing canonical cancer pathways. |
Authors: | Jayendra Shinde, Celine Everaert, Shaimaa Bakr, Mohsen Nabian, Jishu Xu, Vincent Carey, Nathalie Pochet and Olivier Gevaert |
Maintainer: | Olivier Gevaert <[email protected]> |
License: | Apache License (== 2.0) + file LICENSE |
Version: | 1.23.0 |
Built: | 2024-12-30 05:07:34 UTC |
Source: | https://github.com/bioc/AMARETTO |
AMARETTO_CreateModuleData
AMARETTO_CreateModuleData(AMARETTOinit, AMARETTOresults)
AMARETTOinit |
List output from AMARETTO_Initialize(). |
AMARETTOresults |
List output from AMARETTO_Run() |
result
data('ProcessedDataLIHC')
AMARETTOinit <- AMARETTO_Initialize(ProcessedData = ProcessedDataLIHC, NrModules = 2, VarPercentage = 50)
AMARETTOresults <- AMARETTO_Run(AMARETTOinit)
AMARETTO_MD <- AMARETTO_CreateModuleData(AMARETTOinit, AMARETTOresults)
AMARETTO_CreateRegulatorPrograms
AMARETTO_CreateRegulatorPrograms(AMARETTOinit, AMARETTOresults)
AMARETTOinit |
List output from AMARETTO_Initialize(). |
AMARETTOresults |
List output from AMARETTO_Run() |
result
data('ProcessedDataLIHC')
AMARETTOinit <- AMARETTO_Initialize(ProcessedData = ProcessedDataLIHC, NrModules = 2, VarPercentage = 50)
AMARETTOresults <- AMARETTO_Run(AMARETTOinit)
AMARETTO_RP <- AMARETTO_CreateRegulatorPrograms(AMARETTOinit, AMARETTOresults)
Downloading TCGA dataset for AMARETTO analysis
AMARETTO_Download(CancerSite = "CHOL", TargetDirectory = TargetDirectory)
CancerSite |
TCGA cancer code for data download |
TargetDirectory |
Directory path to download data |
result
TargetDirectory <- file.path(getwd(), "Downloads/")
dir.create(TargetDirectory)
CancerSite <- 'CHOL'
DataSetDirectories <- AMARETTO_Download(CancerSite, TargetDirectory = TargetDirectory)
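The example above calls dir.create() directly, which emits a warning when the directory already exists. A minimal sketch of a guarded version follows. It uses only base R and writes under tempdir() purely for illustration; the AMARETTO_Download() call is left commented out because it needs network access.

```r
# Create the download directory only if it is missing.
# tempdir() is used here for illustration; any writable path works.
TargetDirectory <- file.path(tempdir(), "Downloads")
if (!dir.exists(TargetDirectory)) {
  dir.create(TargetDirectory, recursive = TRUE)
}

# With the directory in place, the download call from the example
# above would be (commented out: requires network access):
# DataSetDirectories <- AMARETTO_Download("CHOL", TargetDirectory = TargetDirectory)
```

Running the snippet a second time does not warn, since the directory check skips the creation step.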
Code to evaluate AMARETTO on a new gene expression test set. Uses output from AMARETTO_Run() and CreateRegulatorData().
AMARETTO_EvaluateTestSet(AMARETTOresults = AMARETTOresults, MA_Data_TestSet = MA_Data_TestSet, RegulatorData_TestSet = RegulatorData_TestSet)
AMARETTOresults |
AMARETTO output from AMARETTO_Run(). |
MA_Data_TestSet |
Gene expression matrix from a test set (that was not used in AMARETTO_Run()). |
RegulatorData_TestSet |
Test regulator data from CreateRegulatorData(). |
result
data('ProcessedDataLIHC')
AMARETTOinit <- AMARETTO_Initialize(ProcessedData = ProcessedDataLIHC, NrModules = 2, VarPercentage = 50)
AMARETTOresults <- AMARETTO_Run(AMARETTOinit)
AMARETTOtestReport <- AMARETTO_EvaluateTestSet(AMARETTOresults = AMARETTOresults, MA_Data_TestSet = AMARETTOinit$MA_matrix_Var, RegulatorData_TestSet = AMARETTOinit$RegulatorData)
Retrieve a download of all the data linked with the run (including heatmaps)
AMARETTO_ExportResults(AMARETTOinit, AMARETTOresults, data_address, Heatmaps = TRUE, CNV_matrix = NULL, MET_matrix = NULL)
AMARETTOinit |
AMARETTO initialize output |
AMARETTOresults |
AMARETTO results output |
data_address |
Directory to save data folder |
Heatmaps |
Output heatmaps as pdf |
CNV_matrix |
CNV_matrix |
MET_matrix |
MET_matrix |
result
data('ProcessedDataLIHC')
TargetDirectory <- file.path(getwd(), "Downloads/")
dir.create(TargetDirectory)
AMARETTOinit <- AMARETTO_Initialize(ProcessedData = ProcessedDataLIHC, NrModules = 2, VarPercentage = 50)
AMARETTOresults <- AMARETTO_Run(AMARETTOinit)
AMARETTO_ExportResults(AMARETTOinit, AMARETTOresults, TargetDirectory, Heatmaps = FALSE)
Retrieve an interactive html report, including gene set enrichment analysis if asked for.
AMARETTO_HTMLreport(AMARETTOinit, AMARETTOresults, ProcessedData, show_row_names = FALSE, SAMPLE_annotation = NULL, ID = NULL, hyper_geo_test_bool = FALSE, hyper_geo_reference = NULL, output_address = "./", MSIGDB = TRUE, driverGSEA = TRUE, phenotype_association_table = NULL)
AMARETTOinit |
AMARETTO initialize output |
AMARETTOresults |
AMARETTO results output |
ProcessedData |
List of processed input data |
show_row_names |
If TRUE, sample names will appear in the heatmap. |
SAMPLE_annotation |
SAMPLE annotation will be added to heatmap |
ID |
ID column of the SAMPLE annotation data frame |
hyper_geo_test_bool |
Boolean indicating whether a hypergeometric test should be performed. If TRUE, provide a GMT file via the hyper_geo_reference parameter. |
hyper_geo_reference |
GMT file with gene sets to compare with. |
output_address |
Output directory for the html files. |
MSIGDB |
TRUE if gene sets were retrieved from MSIGDB. Links will be created in the report. |
driverGSEA |
If TRUE, module drivers will also be included in the hypergeometric test. |
phenotype_association_table |
Optional data frame containing the phenotype association data for all modules. |
result
## Not run:
data('ProcessedDataLIHC')
AMARETTOinit <- AMARETTO_Initialize(ProcessedData = ProcessedDataLIHC, NrModules = 2, VarPercentage = 50)
AMARETTOresults <- AMARETTO_Run(AMARETTOinit)
AMARETTO_HTMLreport(AMARETTOinit = AMARETTOinit, AMARETTOresults = AMARETTOresults, ProcessedData = ProcessedDataLIHC, hyper_geo_test_bool = FALSE, output_address = './')
## End(Not run)
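The hyper_geo_test_bool and driverGSEA arguments refer to a hypergeometric test of module genes against reference gene sets. The sketch below shows the underlying calculation with base R's phyper(). The gene universe, module and gene set are made up for illustration, and the code is not taken from the package, so AMARETTO's own implementation may differ in detail.

```r
# Toy gene universe, module, and reference gene set (all hypothetical).
universe <- paste0("gene", 1:100)
module   <- universe[1:20]            # genes assigned to one module
gene_set <- universe[c(1:8, 50:61)]   # genes in one reference gene set

N <- length(universe)                     # population size
K <- length(gene_set)                     # gene-set members in the population
n <- length(module)                       # number of draws (module size)
k <- length(intersect(module, gene_set))  # observed overlap

# P(overlap >= k) when drawing n genes without replacement.
p_value <- phyper(k - 1, K, N - K, n, lower.tail = FALSE)
```

The `k - 1` is needed because phyper() with `lower.tail = FALSE` returns the probability of strictly more than its first argument.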
Code used to initialize the seed clusters for an AMARETTO run. Requires processed gene expression (RNA-seq or microarray), CNV (usually from a GISTIC run), and methylation (from MethylMix, provided in this package) data. Uses the function CreateRegulatorData(); the results are fed into the function AMARETTO_Run().
AMARETTO_Initialize(ProcessedData = ProcessedData, Driver_list = NULL, NrModules, VarPercentage, PvalueThreshold = 0.001, RsquareThreshold = 0.1, pmax = 10, NrCores = 1, OneRunStop = 0, method = "union", random_seeds = NULL, convergence_cutoff = 0.01)
ProcessedData |
List of Expression, CNV and MethylMix data matrices, with genes in rows and samples in columns. |
Driver_list |
Custom list of driver genes to be considered in analysis |
NrModules |
How many gene co-expression modules should AMARETTO search for? Usually around 100 is acceptable, given the large number of possible driver-passenger gene combinations. |
VarPercentage |
Minimum percentage by variance for filtering of genes; for example, 75% would indicate that the CreateRegulatorData() function only analyses genes that have a variance above the 75th percentile across all samples. |
PvalueThreshold |
Threshold used to find relevant driver genes with CNV alterations: maximal p-value. |
RsquareThreshold |
Threshold used to find relevant driver genes with CNV alterations: minimal R-square value between CNV and gene expression data. |
pmax |
'pmax' variable for the glmnet function from the glmnet package; the maximum number of variables ever to be nonzero. Should not be changed by users unless they fully understand the AMARETTO algorithm and how its parameter choices affect model output. |
NrCores |
A numeric variable indicating the number of computer/server cores to use for parallelization. Default is 1, i.e. no parallelization. Please check your computer or server's computing capacities before increasing this number. Parallelization is done via the RParallel package. Mac vs. Windows environments may behave differently when using parallelization. |
OneRunStop |
OneRunStop |
method |
Perform union or intersection of the driver genes evaluated from the input data matrices and custom driver gene list provided. |
random_seeds |
A numeric vector of length 2 containing two seed numbers for randomization: the first for kmeans and the second for glmnet. |
convergence_cutoff |
A numeric value (e.g. 0.01) giving a fraction of the total number of genes. The algorithm is considered to have converged, and stops, when the number of gene reassignments in an iteration falls below this threshold times the total number of genes. |
result
data('ProcessedDataLIHC')
data('Driver_Genes')
AMARETTOinit <- AMARETTO_Initialize(ProcessedData = ProcessedDataLIHC, NrModules = 2, VarPercentage = 50)
## Not run:
AMARETTOinit <- AMARETTO_Initialize(ProcessedData = ProcessedDataLIHC, Driver_list = Driver_Genes[['MSigDB']], NrModules = 2, VarPercentage = 50)
## End(Not run)
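To make the VarPercentage, PvalueThreshold and RsquareThreshold arguments more concrete, here is a minimal base-R sketch on simulated data. It assumes that variance filtering is a percentile cutoff on per-gene variance and that the CNV thresholds apply to a simple linear model of expression on copy number. Both are illustrative assumptions, not code from the package, and all variable names other than the three parameters are hypothetical.

```r
set.seed(1)

# Toy expression matrix: 8 genes (rows) x 20 samples (columns).
expr <- matrix(rnorm(8 * 20), nrow = 8,
               dimnames = list(paste0("gene", 1:8), paste0("s", 1:20)))
expr <- expr * (1:8)  # row i is scaled by i, so variances differ by gene

# Part 1: keep genes whose variance exceeds the VarPercentage-th percentile.
VarPercentage <- 50
gene_var <- apply(expr, 1, var)
cutoff <- quantile(gene_var, probs = VarPercentage / 100)
expr_filtered <- expr[gene_var > cutoff, , drop = FALSE]

# Part 2: check one candidate driver against the CNV thresholds.
cnv <- rnorm(20)                            # toy copy number values
gene_expr <- 2 * cnv + rnorm(20, sd = 0.5)  # expression driven by CNV
fit <- summary(lm(gene_expr ~ cnv))
r_squared <- fit$r.squared
p_value <- fit$coefficients["cnv", "Pr(>|t|)"]

PvalueThreshold <- 0.001
RsquareThreshold <- 0.1
is_candidate <- (p_value <= PvalueThreshold) && (r_squared >= RsquareThreshold)
```

With VarPercentage = 50, half of the toy genes survive the filter; raising the value keeps fewer, more variable genes.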
Wrapper code that preprocesses TCGA GISTIC (CNV) and gene expression (RNA-seq or microarray) data in one call.
AMARETTO_Preprocess(DataSetDirectories = DataSetDirectories, BatchData = BatchData)
DataSetDirectories |
DataSetDirectories |
BatchData |
BatchData |
result
## Not run:
TargetDirectory <- "Downloads" # path to data download directory
CancerSite <- 'CHOL'
DataSetDirectories <- AMARETTO_Download(CancerSite, TargetDirectory)
ProcessedData <- AMARETTO_Preprocess(DataSetDirectories, BatchData)
## End(Not run)
AMARETTO_Run Function to run AMARETTO, a statistical algorithm to identify cancer drivers by integrating a variety of omics data from cancer and normal tissue.
AMARETTO_Run(AMARETTOinit)
AMARETTOinit |
List output from AMARETTO_Initialize(). |
result
data('ProcessedDataLIHC')
AMARETTOinit <- AMARETTO_Initialize(ProcessedData = ProcessedDataLIHC, NrModules = 2, VarPercentage = 50)
AMARETTOresults <- AMARETTO_Run(AMARETTOinit)
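The run stops according to the convergence_cutoff argument of AMARETTO_Initialize(): once the number of gene reassignments in an iteration falls below the cutoff times the total number of genes. The sketch below illustrates that rule with made-up numbers; the gene count and the per-iteration reassignment counts are hypothetical and not produced by the package.

```r
convergence_cutoff <- 0.01
n_genes <- 5000

# Hypothetical number of genes that changed module in each iteration.
reassigned <- c(1200, 400, 130, 60, 45, 20)

# Stop at the first iteration whose reassignment count is below
# convergence_cutoff * n_genes.
threshold <- convergence_cutoff * n_genes
converged_at <- which(reassigned < threshold)[1]
```

A larger cutoff stops the run earlier at the cost of a less settled module assignment.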
Function to visualize the gene modules
AMARETTO_VisualizeModule(AMARETTOinit, AMARETTOresults, ProcessedData, ModuleNr, show_row_names = FALSE, SAMPLE_annotation = NULL, ID = NULL, order_samples = NULL)
AMARETTOinit |
List output from AMARETTO_Initialize(). |
AMARETTOresults |
List output from AMARETTO_Run(). |
ProcessedData |
List of processed input data |
ModuleNr |
Module number to visualize |
show_row_names |
If TRUE, row names will be shown on the plot. |
SAMPLE_annotation |
Matrix or Dataframe with sample annotation |
ID |
Column used as sample name |
order_samples |
Order samples in heatmap by mean or by clustering |
result
data('ProcessedDataLIHC')
AMARETTOinit <- AMARETTO_Initialize(ProcessedData = ProcessedDataLIHC, NrModules = 2, VarPercentage = 50)
AMARETTOresults <- AMARETTO_Run(AMARETTOinit)
AMARETTO_VisualizeModule(AMARETTOinit = AMARETTOinit, AMARETTOresults = AMARETTOresults, ProcessedData = ProcessedDataLIHC, ModuleNr = 1)
A list of cancer driver genes described in literature.
Driver_Genes
List
A dataset containing all MSIGDB pathways and their descriptions.
MsigdbMapping
List
plot_run_history
plot_run_history(AMARETTOinit, AMARETTOresults)
AMARETTOinit |
AMARETTO initialize output |
AMARETTOresults |
AMARETTO results output |
plot
data('ProcessedDataLIHC')
AMARETTOinit <- AMARETTO_Initialize(ProcessedData = ProcessedDataLIHC, NrModules = 2, VarPercentage = 50)
AMARETTOresults <- AMARETTO_Run(AMARETTOinit)
plot_run_history(AMARETTOinit, AMARETTOresults)
A list of data frames containing a processed toy example dataset from TCGA-LIHC.
ProcessedDataLIHC
List
Function to read a .gct data file into matrix format.
read_gct(file_address)
file_address |
Address of the input gct file. |
result
data_matrix <- read_gct(file_address = "")
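For readers unfamiliar with the format, a GCT v1.2 file starts with a version line and a dimensions line, followed by a tab-separated table whose first two columns are Name and Description. The sketch below writes a tiny file and reads it back with base R. read_gct_sketch() is a hypothetical helper written for this illustration; it is not the package's read_gct() and may differ from it in behavior.

```r
# Write a tiny GCT v1.2 file to a temporary location.
gct_file <- tempfile(fileext = ".gct")
writeLines(c(
  "#1.2",                           # version line
  "2\t3",                           # number of rows, number of samples
  "Name\tDescription\ts1\ts2\ts3",  # column header
  "geneA\tdescA\t1\t2\t3",
  "geneB\tdescB\t4\t5\t6"
), gct_file)

# Hypothetical base-R reader: skip the two leading lines, drop the
# Name and Description columns, and use Name as row names.
read_gct_sketch <- function(path) {
  tab <- read.delim(path, skip = 2, check.names = FALSE,
                    stringsAsFactors = FALSE)
  mat <- as.matrix(tab[, -(1:2), drop = FALSE])
  rownames(mat) <- tab[[1]]
  mat
}

data_matrix <- read_gct_sketch(gct_file)
```

The result has genes in rows and samples in columns, which matches the orientation described for the ProcessedData matrices.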