In order to make light of cancer development, it is crucial to understand which genes play a role in the mechanisms linked to this disease and moreover which role that is. Commonly biological processes such as proliferation and apoptosis have been linked to cancer progression. We have developed the Moonlight framework that allows for prediction of cancer driver genes through multi-omics data integration. Based on expression data we perform functional enrichment analysis, infer gene regulatory networks and upstream regulator analysis to score the importance of well-known biological processes with respect to the studied cancer. We then use these scores to predict oncogenic mediators with two specific roles: genes that potentially act as tumor suppressor genes (TSGs) and genes that potentially act as oncogenes (OCGs). This constitutes Moonlight’s primary layer. As gene expression data alone does not explain the cancer phenotypes, a second layer of evidence is needed. We have integrated three secondary layers, one based on mutations, one based on abnormal DNA methylation patterns, for prediction of the driver genes, and one baed on the effect of mutations on transcription factors. The mutational layer predicts driver mutations in the oncogenic mediators and thereby allows for the prediction of cancer driver genes using the driver mutation prediction tool CScape-somatic. The methylation layer investigates abnormal DNA methylation patterns and differentially methylated CpG sites in the oncogenic mediators using the tool EpiMix and uses patterns of hypo- and hypermethylation for prediction of the driver genes. The transcription factor layer checks which transcription factors are important for determining the transcription of possible driver genes, and identifies mutations on such transcription factors from a cancer mutation dataset that are able to destabilize them and therefore likely affect transcription. Overall, this methodology not only allows us to identify genes with dual role (TSG in one cancer type and OCG in another) but also to elucidate the underlying biological processes.
Cancer development is influenced by (epi)genetic alterations in two distinctly different categories of genes, known as tumor suppressor genes (TSG) and oncogenes (OCG). The occurrence of these alterations in genes of the first category leads to faster cell proliferation while alterations in genes of second category increases or changes their function. In 2020, we developed the Moonlight framework that allows for prediction of cancer driver genes (Colaprico, Antonio and Olsen, Catharina and Bailey, Matthew H. and Odom, Gabriel J. and Terkelsen, Thilde and Silva, Tiago C. and Olsen, André V. and Cantini, Laura and Zinovyev, Andrei and Barillot, Emmanuel and Noushmehr, Houtan and Bertoli, Gloria and Castiglioni, Isabella and Cava, Claudia and Bontempi, Gianluca and Chen, Xi Steven and Papaleo, Elena 2020). Here, gene expression data are integrated together with biological processes and gene regulatory networks to score the importance of well-known biological processes with respect to the studied cancer. These scores are used to predict oncogenic mediators: putative TSGs and putative OCGs. As gene expression data alone is not enough to explain the deregulation of the genes, a second layer of evidence is needed. For this reason, we integrated a secondary mutational layer which predicts driver mutations and passenger mutations in the oncogenic mediators. The prediction of the driver mutations are carried out using the CScape-somatic (Rogers, Mark F and Gaunt, Tom R and Campbell, Colin 2020) driver mutation prediction tool. Those oncogenic mediators with at least one driver mutation are retained as the final set of driver genes (Nourbakhsh, Mona and Saksager, Astrid and Tom, Nikola and Chen, Xi Steven and Colaprico, Antonio and Olsen, Catharina and Tiberti, Matteo and Papaleo, Elena 2023). Moreover, we have integrated a secondary methylation layer that investigates abnormal DNA methylation patterns in the oncogenic mediators using the tool EpiMix (Zheng, Yuanning and Jun, John and Brennan, Kevin and Gevaert, Olivier 2023). Those oncogenic mediators demonstrating hypo- and hypermethylation patterns are predicted as OCGs and TSGs, respectively. These new functionalities are released in the updated version of Moonlight to produce Moonlight2R. A third transcription factor layer checks which transcription factors are important for determining the transcription of possible driver genes, and if any mutation available in a cancer mutation dataset is able to destabilize them and therefore likely affect transcription. The mutations can be provided as MAF-formatted files, while predictions on whether any mutation on a transcription factor can be destabilizing are available as the MAVISp database. MAVISp is our computational framework for the effect of predictions of mutations using structural methods; you can read more about it on our preprint.
Moonlight’s pipeline is shown below:
The proposed pipeline consists of following eight steps:
To install Moonlight2R use the code below.
First, install devtools
or if you already have it
installed, load it.
Install Moonlight2R from GitHub:
First, install the BiocStyle Bioconductor package.
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("BiocStyle")
Then install Moonlight2R with its accompanying vignette.
You can view the vignette in the following way.
## Loading required package: doParallel
## Loading required package: foreach
## Loading required package: iterators
## Loading required package: parallel
##
##
## Setting options('download.file.method.GEOquery'='auto')
## Setting options('GEOquery.inmemory.gpl'=FALSE)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Obtain Input
The input to Moonlight is a set of differentially expressed genes and gene expression data. Mutation and/or methylation data are also needed. These data should be available for the same sample sets. Gene expression data, mutation data, methylation data and differentially expressed genes can for example be obtained from TCGA using the R package TCGAbiolinks. Help documents on how to use TCGAbiolinks are available here. To find other examples of usage of TCGAbiolinks on TCGA cancer types see our GitHub repository. Example data of the input (differentially expressed genes, gene expression data, mutation data, and methylation data) are stored in the Moonlight2R package:
data(DEGsmatrix)
data(dataFilt)
data(dataMAF)
data(GEO_TCGAtab)
data(LOC_transcription)
data(LOC_translation)
data(LOC_protein)
data(Oncogenic_mediators_mutation_summary)
data(DEG_Mutations_Annotations)
data(dataMethyl)
data(DEG_Methylation_Annotations)
data(Oncogenic_mediators_methylation_summary)
data(MetEvidenceDriver)
data(LUAD_sample_anno)
Download
: Get GEO dataYou can search GEO data using the getDataGEO
function.
GEO_TCGAtab: a 18x12 matrix that provides the GEO data set we matched to one of the 18 given TCGA cancer types
knitr::kable(GEO_TCGAtab, digits = 2,
caption = "Table with GEO data set matched to one
of the 18 given TCGA cancer types ",
row.names = TRUE)
Cancer | TP | NT | DEG. | Dataset | TP.1 | NT.1 | Platform | DEG.. | Common | GEO_Normal | GEO_Tumor | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | BLCA | 408 | 19 | 2937 | GSE13507 | 165 | 10 | GPL65000 | 2099 | 896 | control | cancer |
2 | BRCA | 1097 | 114 | 3390 | GSE39004 | 61 | 47 | GPL6244 | 2449 | 1248 | normal | Tumor |
3 | CHOL | 36 | 9 | 5015 | GSE26566 | 104 | 59 | GPL6104 | 3983 | 2587 | Surrounding | Tumor |
4 | COAD | 286 | 41 | 3788 | GSE41657 | 25 | 12 | GPL6480 | 3523 | 1367 | N | A |
5 | ESCA | 184 | 11 | 2525 | GSE20347 | 17 | 17 | GPL571 | 1316 | 406 | normal | carcinoma |
6 | GBM | 156 | 5 | 4828 | GSE50161 | 34 | 13 | GPL570 | 4504 | 2660 | normal | GBM |
7 | HNSC | 520 | 44 | 2973 | GSE6631 | 22 | 22 | GPL8300 | 142 | 129 | normal | cancer |
8 | KICH | 66 | 25 | 4355 | GSE15641 | 6 | 23 | GPL96 | 1789 | 680 | normal | chromophobe |
9 | KIRC | 533 | 72 | 3618 | GSE15641 | 32 | 23 | GPL96 | 2911 | 939 | normal | clear cell RCC |
10 | KIRP | 290 | 32 | 3748 | GSE15641 | 11 | 23 | GPL96 | 2020 | 756 | normal | papillary RCC |
11 | LIHC | 371 | 50 | 3043 | GSE45267 | 46 | 41 | GPL570 | 1583 | 860 | normal liver | HCC sample |
12 | LUAD | 515 | 59 | 3498 | GSE10072 | 58 | 49 | GPL96 | 666 | 555 | normal | tumor |
13 | LUSC | 503 | 51 | 4984 | GSE33479 | 14 | 27 | GPL6480 | 3729 | 1706 | normal | squamous cell carcinoma |
14 | PRAD | 497 | 52 | 1860 | GSE6919 | 81 | 90 | GPL8300 | 246 | 149 | normal prostate | tumor samples |
15 | READ | 94 | 10 | 3628 | GSE20842 | 65 | 65 | GPL4133 | 2172 | 1261 | M | T |
16 | STAD | 415 | 35 | 2622 | GSE2685 | 10 | 10 | GPL80 | 487 | 164 | N | T |
17 | THCA | 505 | 59 | 1994 | GSE33630 | 60 | 45 | GPL570 | 1451 | 781 | N | T |
18 | UCEC | 176 | 24 | 4183 | GSE17025 | GPL570 | tp | lcm |
getDataGEO
: Search by cancer type and data type [Gene
Expression]The user can query and download the cancer types supported by GEO,
using the function getDataGEO
:
FEA
: Functional Enrichment AnalysisThe user can perform a functional enrichment analysis using the
function FEA
. For each DEG in the gene set a z-score is
calculated. This score indicates how the genes act in the gene set.
The output can be visualized with a FEA plot.
FEAplot
: Functional Enrichment Analysis PlotThe user can plot the result of a functional enrichment analysis
using the function plotFEA
. A negative z-score indicates
that the process’ activity is decreased. A positive z-score indicates
that the process’ activity is increased.
The figure generated by the above code is shown below:
GRN
: Gene Regulatory NetworkThe user can perform a gene regulatory network analysis using the
function GRN
which infers the network using the parmigene
package. For illustrative purposes and to decrease runtime, we have set
nGenesPerm = 5
and nBoot = 5
in the example
below, however, we recommend setting these parameters to
nGenesPerm = 2000
and nBoot = 400
to achieve
optimal results, as they are set by default in the function
arguments.
URA
: Upstream Regulator AnalysisThe user can perform upstream regulator analysis using the function
URA
. This function is applied to each DEG in the enriched
gene set and its neighbors in the GRN.
data(dataGRN)
data(DEGsmatrix)
data(DiseaseList)
data(EAGenes)
dataFEA <- FEA(DEGsmatrix = DEGsmatrix)
BPselected <- dataFEA$Diseases.or.Functions.Annotation[1:5]
dataURA <- URA(dataGRN = dataGRN,
DEGsmatrix = DEGsmatrix,
BPname = BPselected,
nCores=1)
## Warning: executing %dopar% sequentially: no parallel backend registered
PRA
: Pattern Regognition AnalysisThe user can retrieve putative TSG/OCG using either selected
biological processes in the expert-based approach or a random forest
classifier trained on known COSMIC OCGs/TSGs in the machine learning
approach. The user must enter the chosen biological processes in the
BPname
argument to use the expert-based approach or set
BPname = NULL
to use the machine learning approach.
DMA
: Driver Mutation AnalysisThe user can identify driver mutations with DMA
in the
oncogenic mediators established by PRA
. The passenger or
driver status is estimated with CScape-somatic (Rogers, Mark F and Gaunt, Tom R and Campbell, Colin
2020).
This function will further generate three files:
DEG_Mutations_Annotations.rda, Oncogenic_mediators_mutation_summary.rda
and cscape_somatic_output.rda. These will be placed in the specified
results-folder. The user needs to download two CScape-somatic files in
order to run DMA named css_coding.vcf.gz and css_noncoding.vcf.gz,
respectively. These two files can be downloaded at http://cscape-somatic.biocompute.org.uk/#download. The
corresponding .tbi files (css_coding.vcf.gz.tbi and
css_noncoding.vcf.gz.tbi) must also be downloaded and be placed in the
same folder.
data(dataPRA)
data(dataMAF)
data(DEGsmatrix)
data(LOC_transcription)
data(LOC_translation)
data(LOC_protein)
data(NCG)
data(EncodePromoters)
dataDMA <- DMA(dataMAF = dataMAF,
dataDEGs = DEGsmatrix,
dataPRA = dataPRA,
results_folder = "DMAresults",
coding_file = "css_coding.vcf.gz",
noncoding_file = "css_noncoding.vcf.gz")
GMA
: Gene Methylation AnalysisThe user can predict driver genes with GMA
following
prediction of the oncogenic mediators established by PRA
.
GMA
predicts driver genes by investigating abnormal DNA
methylation patterns in the oncogenic mediators using the tool EpiMix
(Zheng, Yuanning and Jun, John and Brennan, Kevin
and Gevaert, Olivier 2023). Oncogenic mediators with hypo- and
hypermethylation patterns are predicted as OCGs and TSGs, respectively.
This function will generate these files:
DEG_Methylation_Annotations.rda,
Oncogenic_mediators_methylation_summary.rda,
EpiMix_Results_Enhancer.rds, EpiMix_Results_Regular.rds,
FunctionalPairs_Enhancer.csv, FunctionalPairs_Regular.csv, and
FunctionalProbes_Regular.rds. These will be placed in the specified
results-folder.
data("dataMethyl")
data("dataFilt")
data("dataPRA")
data("DEGsmatrix")
data("LUAD_sample_anno")
data("NCG")
data("EncodePromoters")
data("MetEvidenceDriver")
# Subset column names (sample names) in expression data to patient level
pattern <- "^(.{4}-.{2}-.{4}-.{2}).*"
colnames(dataFilt) <- sub(pattern, "\\1", colnames(dataFilt))
dataGMA <- GMA(dataMET = dataMethyl, dataEXP = dataFilt,
dataPRA = dataPRA, dataDEGs = DEGsmatrix,
sample_info = LUAD_sample_anno, met_platform = "HM450",
prevalence_filter = NULL,
output_dir = "./GMAresults", cores = 1, roadmap.epigenome.ids = "E096",
roadmap.epigenome.groups = NULL)
## Running Regular mode...
## Fetching probe annotation...
## sesameData not installed.
## Full functionality, documentation, and loading of data might not be possible without installing
## downloading 1 resources
## retrieving 1 resource
## loading from cache
## require("GenomicRanges")
## Found 13 samples with both methylation and gene expression data.
## Modeling gene expression...
## | | | 0% | |= | 1% | |== | 3% | |=== | 4% | |==== | 5% | |===== | 7% | |====== | 8% | |======= | 10% | |======== | 11% | |========= | 12% | |========== | 14% | |=========== | 15% | |============ | 16% | |============ | 18% | |============= | 19% | |============== | 21% | |=============== | 22% | |================ | 23% | |================= | 25% | |================== | 26% | |=================== | 27% | |==================== | 29% | |===================== | 30% | |====================== | 32% | |======================= | 33% | |======================== | 34% | |========================= | 36% | |========================== | 37% | |=========================== | 38% | |============================ | 40% | |============================= | 41% | |============================== | 42% | |=============================== | 44% | |================================ | 45% | |================================= | 47% | |================================== | 48% | |=================================== | 49% | |=================================== | 51% | |==================================== | 52% | |===================================== | 53% | |====================================== | 55% | |======================================= | 56% | |======================================== | 58% | |========================================= | 59% | |========================================== | 60% | |=========================================== | 62% | |============================================ | 63% | |============================================= | 64% | |============================================== | 66% | |=============================================== | 67% | |================================================ | 68% | |================================================= | 70% | |================================================== | 71% | |=================================================== | 73% | |==================================================== | 74% | |===================================================== | 75% | |====================================================== | 77% | |======================================================= | 78% | |======================================================== | 79% | |========================================================= | 81% | |========================================================== | 82% | |========================================================== | 84% | |=========================================================== | 85% | |============================================================ | 86% | |============================================================= | 88% | |============================================================== | 89% | |=============================================================== | 90% | |================================================================ | 92% | |================================================================= | 93% | |================================================================== | 95% | |=================================================================== | 96% | |==================================================================== | 97% | |===================================================================== | 99% | |======================================================================| 100%
## Found 73 transcriptionally predictive probes.
## Found 17 samples in group.1 and 6 samples in group.2
##
## Starting Beta mixture modeling.
## Running Beta mixture model on 73 probes and on 17 samples.
## | | | 0%
## Found 73 differentially methylated CpGs
## Identifying functional CpG-gene pairs...
## | | | 0% | |== | 3% | |==== | 6% | |====== | 9% | |======== | 12% | |=========== | 15% | |============= | 18% | |=============== | 21% | |================= | 24% | |=================== | 27% | |===================== | 30% | |======================= | 33% | |========================= | 36% | |============================ | 39% | |============================== | 42% | |================================ | 45% | |================================== | 48% | |==================================== | 52% | |====================================== | 55% | |======================================== | 58% | |========================================== | 61% | |============================================= | 64% | |=============================================== | 67% | |================================================= | 70% | |=================================================== | 73% | |===================================================== | 76% | |======================================================= | 79% | |========================================================= | 82% | |=========================================================== | 85% | |============================================================== | 88% | |================================================================ | 91% | |================================================================== | 94% | |==================================================================== | 97% | |======================================================================| 100%
## Found 84 functional probe-gene pairs.
## Saving the EpiMix results to the output directory...
## Running Enhancer mode...
## Fetching probe annotation...
## sesameData not installed.
## Full functionality, documentation, and loading of data might not be possible without installing
## loading from cache
## Found 17 samples in group.1 and 6 samples in group.2
## Fetching enhancer CpGs from Roadmap Epigenomics...
## Downloading chromatin states from the Roadmap Epigenomics...
## Identifed 65057 enhancer CpGs from the epigenome E096
## sesameData not installed.
## Full functionality, documentation, and loading of data might not be possible without installing
## loading from cache
## Returning distal probes: 160862
## Found 7 CpGs associated with distal enhancers in the methylation dataset
##
## Starting Beta mixture modeling.
## Running Beta mixture model on 7 probes and on 17 samples.
## | | | 0%
## Found 7 differentially methylated CpGs
## Modeling the gene expression for enhancers...
## Searching for the 20 near genes
## Identifying gene position for each probe
## Looking for differentially methylated enhancers associated with gene expression
## | | | 0% | |========== | 14% | |==================== | 29% | |============================== | 43% | |======================================== | 57% | |================================================== | 71% | |============================================================ | 86% | |======================================================================| 100%
## Found 2 functional probe-gene pairs.
## Saving the EpiMix results to the output directory...
## sesameData not installed.
## Full functionality, documentation, and loading of data might not be possible without installing
## loading from cache
## [1] "3 oncogenic mediator(s) out of 3 were found in the None evidence category"
GLS
: Gene Literature SearchThe user can perform a literature search on driver genes
using the GLS
function. This function takes as input driver
genes, a query and maximum number of records to retrieve from PubMed.
Standard PubMed syntax can be used in the query. For example, Boolean
operators AND, OR, NOT can be applied and tags such as [AU],
[TITLE/ABSTRACT], [Affiliation] can be used. GLS
fetches
data of PubMed records matching the specified query and outputs PubMed
IDs matching the query along with doi, title, abstract, year of
publication, keywords, and total number of PubMed publications. This is
done for each of the genes supplied in the input.
genes_query <- "BRCA1"
dataGLS <- GLS(genes = genes_query,
query_string = "AND cancer AND driver",
max_records = 20)
## Processing PubMed data .................... done!
## # A tibble: 6 × 8
## pmid gene doi title abstract year keywords pubmed_count
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 39673888 BRCA1 10.1016/j.prp.2024.… Enha… Advance… 2024 Biomark… 309
## 2 39645402 BRCA1 10.48095/ccrvch2024… Mole… In addi… 2024 Gastroi… 309
## 3 39522613 BRCA1 10.1016/j.annonc.20… Estr… Estroge… 2024 breast … 309
## 4 39484366 BRCA1 10.1101/2024.10.17.… Dele… Optimiz… 2024 <NA> 309
## 5 39474537 BRCA1 10.1159/000541160 Succ… Basal c… 2024 BRCA1 L… 309
## 6 39430730 BRCA1 10.3892/ol.2024.147… Whol… The inc… 2024 BRCA1; … 309
We now introduce the third layer of secondary evidence. It is based on the assumption that stability changes due to mutations on transcription factors can influence transcription. Here, we use data on change of stability upon mutation available in the MAVISp database, as well as data from the TRRUST database and MAF files from TCGA to i) identify transcription factors that have an effect on the transcription of genes of interest ii) identify potentially damaging mutations on them from MAF files and iii) use MAVISp data to assess the affect of such mutations, whether they might destabilizing or not. We release the TRRUST-derived data within the Moonlight2R package with a Creative Commons Attribution-ShareAlike 4.0 International license.
loadMAVISp
: Helper function for loading the MAVISp
databaseThe first step to perform this is loading the data form the MAVISp database.
Notice that we have included a very small portion of the MAVISp database with Moonlight2R for testing purposes; we recommend using the full database available at the MAVISp OSF repository. Also see the MAVISp preprint for specifics on what datasets are available in MAVISp.
The loadMAVISp
function is used to load the files
available from the MAVISp database. The user will need to decide on
whether they want to import the simple or ensemble mode and in the case
of ensemble mode, choose the ensemble they want to import. Furthermore,
the user can specify specific proteins to load. The output is a list of
tibbles, each one correponding to a file in the MAVISp database. The
output is used directly in the TFinfluence
function.
mavisp_db_location <- system.file('extdata', 'mavisp_db', package='Moonlight2R')
specific_protein <- loadMAVISp(mavispDB = mavisp_db_location,
mode = 'simple',
proteins_of_interest = c('RUNX1'))
all_proteins <- loadMAVISp(mavispDB = mavisp_db_location,
mode = 'simple')
ensemble <- loadMAVISp(mavispDB = mavisp_db_location,
mode = 'ensemble',
ensemble = 'cabsflex')
TFinfluence
: Effect of mutations on transcription
factorsAs previosuly introduceed, the TFinfluence
function can
be used as after the PRA function, to identify if there are mutations in
the transcription factors of the DEGs. The function uses data from
TRRUST to couple the DEGs with their transcription factor (TF). The
mutations in the TFs come from a MAF file, and the effect of the
mutation is obtained from the MAVISp database, available at an OSF repository.
Level of consequence
: Effect of mutations on three
different levelsThe user can investigate the predicted effect of different mutation types on the transcriptional level through the table LOC_transcription:
Variant_Classification | SNP | INS | DEL | DNP | TNP | ONP |
---|---|---|---|---|---|---|
5’Flank | 1 | 1 | 1 | 1 | 1 | 1 |
5’UTR | 1 | 1 | 1 | 1 | 1 | 1 |
Translation_Start_Site | 0 | 0 | 0 | 0 | 0 | 0 |
Missense_Mutation | 0 | NA | NA | 0 | 0 | 0 |
Nonsense_Mutation | 1 | 1 | 1 | 1 | 1 | 1 |
Nonstop_Mutation | 0 | 0 | 0 | 0 | 0 | 0 |
Splice_Site | 1 | 1 | 1 | 1 | 1 | 1 |
Splice_Region | 1 | 1 | 1 | 1 | 1 | 1 |
Silent | 0 | NA | NA | 0 | 0 | 0 |
In_Frame_Del | NA | 0 | 0 | NA | NA | NA |
In_Frame_Ins | NA | 0 | 0 | NA | NA | NA |
Frame_Shift_Del | NA | 0 | 0 | NA | NA | NA |
Frame_Shift_Ins | NA | 0 | 0 | NA | NA | NA |
Intron | 1 | 1 | 1 | 1 | 1 | 1 |
3’UTR | 1 | 1 | 1 | 1 | 1 | 1 |
3’Flank | 1 | 1 | 1 | 1 | 1 | 1 |
RNA | 1 | 1 | 1 | 1 | 1 | 1 |
IGR | 1 | 1 | 1 | 1 | 1 | 1 |
The user can investigate the predicted effect of different mutation types on the translational level through the table LOC_translation:
Variant_Classification | SNP | INS | DEL | DNP | TNP | ONP |
---|---|---|---|---|---|---|
5’Flank | 1 | 1 | 1 | 1 | 1 | 1 |
5’UTR | 1 | 1 | 1 | 1 | 1 | 1 |
Translation_Start_Site | 1 | 1 | 1 | 1 | 1 | 1 |
Missense_Mutation | 0 | NA | NA | 0 | 0 | 0 |
Nonsense_Mutation | 1 | 1 | 1 | 1 | 1 | 1 |
Nonstop_Mutation | 1 | 1 | 1 | 1 | 1 | 1 |
Splice_Site | 1 | 1 | 1 | 1 | 1 | 1 |
Splice_Region | 1 | 1 | 1 | 1 | 1 | 1 |
Silent | 1 | NA | NA | 1 | 1 | 1 |
In_Frame_Del | NA | 0 | 0 | NA | NA | NA |
In_Frame_Ins | NA | 0 | 0 | NA | NA | NA |
Frame_Shift_Del | NA | 0 | 0 | NA | NA | NA |
Frame_shift_Ins | NA | 0 | 0 | NA | NA | NA |
Intron | 1 | 1 | 1 | 1 | 1 | 1 |
3’UTR | 1 | 1 | 1 | 1 | 1 | 1 |
3’Flank | 1 | 1 | 1 | 1 | 1 | 1 |
IGR | 1 | 1 | 1 | 1 | 1 | 1 |
RNA | 1 | 1 | 1 | 1 | 1 | 1 |
The user can investigate the predicted effect of different mutation types on the protein level through the table LOC_protein:
Variant_Classification | SNP | INS | DEL | DNP | TNP | ONP |
---|---|---|---|---|---|---|
5’Flank | 0 | 0 | 0 | 0 | 0 | 0 |
5’UTR | 0 | 0 | 0 | 0 | 0 | 0 |
Translation_Start_Site | 1 | 1 | 1 | 1 | 1 | 1 |
Missense_Mutation | 1 | NA | NA | 1 | 1 | 1 |
Nonsense_Mutation | 1 | 1 | 1 | 1 | 1 | 1 |
Nonstop_Mutation | 1 | 1 | 1 | 1 | 1 | 1 |
Splice_Site | 1 | 1 | 1 | 1 | 1 | 1 |
Splice_Region | 1 | 1 | 1 | 1 | 1 | 1 |
Silent | 0 | NA | NA | 0 | 0 | 0 |
In_Frame_Del | NA | 1 | 1 | NA | NA | NA |
In_Frame_Ins | NA | 1 | 1 | NA | NA | NA |
Frame_Shift_Del | NA | 1 | 1 | NA | NA | NA |
Frame_Shift_Ins | NA | 1 | 1 | NA | NA | NA |
Intron | 1 | 1 | 1 | 1 | 1 | 1 |
3’UTR | 0 | 0 | 0 | 0 | 0 | 0 |
3’Flank | 0 | 0 | 0 | 0 | 0 | 0 |
RNA | 0 | 0 | 0 | 0 | 0 | 0 |
IGR | 0 | 0 | 0 | 0 | 0 | 0 |
plotNetworkHive
: GRN hive visualization taking into
account COSMIC cancer genesIn the following plot the nodes are separated into three groups: known tumor suppressor genes (yellow), known oncogenes (green) and the rest (gray).
plotDMA
: Heatmap of the driver/passenger status of
mutations in TSGs/OCGsIn the following plot the driver genes with driver mutations are shown.
plotMoonlight
: Heatmap of Moonlight Gene Z-scores for
mutation-driven TSGs/OCGsIn the following plot the top 50 genes with the most driver mutations are visualised. The values are the moonlight gene z-score for the two biological processes
plotGMA
: Heatmap of hypo/hyper/dual methylated CpG
sites in TSGs/OCGsThis function plots results of the GMA
. It visualizes
the number of hypo/hyper/dual methylated CpG sites in oncogenic
mediators or in a user-supplied gene list. The results are visualized
either in a single heatmap or split into different ones which is
specified in the function’s three modes: split, complete and
genelist.
data("DEG_Methylation_Annotations")
data("Oncogenic_mediators_methylation_summary")
genes <- c("ACAN", "ACE2", "ADAM19", "AFAP1L1")
plotGMA(DEG_Methylation_Annotations = DEG_Methylation_Annotations,
Oncogenic_mediators_methylation_summary = Oncogenic_mediators_methylation_summary,
type = "genelist", genelist = genes,
additionalFilename = "./GMAresults/")
plotMoonlightMet
: Heatmap of Moonlight Gene Z-scores
for methylation-driven TSGs/OCGsThis function visualizes the effect of genes on biological processes and total number of hypo/hyper/dual methylated CpG sites in genes.
data("DEG_Methylation_Annotations")
data("Oncogenic_mediators_methylation_summary")
data("dataURA_plot")
genes <- c("ACAN", "ACE2", "ADAM19", "AFAP1L1")
plotMoonlightMet(DEG_Methylation_Annotations = DEG_Methylation_Annotations,
Oncogenic_mediators_methylation_summary = Oncogenic_mediators_methylation_summary,
dataURA = dataURA_plot,
genes = genes,
additionalFilename = "./GMAresults/")
plotMetExp
: Visualize results from EpiMix of expression
and methylation in genesThis function visualizes results of EpiMix (Zheng, Yuanning and Jun, John and Brennan, Kevin and Gevaert, Olivier 2023) in three plots: one that visualizes the distribution of DNA methylation data (MixtureModelPlot), one that visualizes gene expression levels (ViolinPlot) and one that visualizes relationship between DNA methylation and gene expression (CorrelationPlot).
data("EpiMix_Results_Regular")
data("dataMethyl")
data("dataFilt")
# Subset column names (sample names) in expression data to patient level
pattern <- "^(.{4}-.{2}-.{4}-.{2}).*"
colnames(dataFilt) <- sub(pattern, "\\1", colnames(dataFilt))
plotMetExp(EpiMix_results = EpiMix_Results_Regular,
probe_name = "cg03035704",
dataMET = dataMethyl,
dataEXP = dataFilt,
gene_of_interest = "ACVRL1",
additionalFilename = "./GMAresults/")
## p value = 0.009484347 R-squared = 0.641
This vignette shows a complete workflow of the ‘Moonlight2R’ package. The code is divided into the following case studies:
For illustrative purposes and to decrease runtime, we have set
nGenesPerm = 5
and nBoot = 5
in the call of
GRN
in the following code block, however, we recommend
setting these parameters to nGenesPerm = 2000
and
nBoot = 400
to achieve optimal results, as they are set by
default in the function arguments.
data(DEGsmatrix)
data(dataFilt)
data(DiseaseList)
data(EAGenes)
data(tabGrowBlock)
data(knownDriverGenes)
dataFEA <- FEA(DEGsmatrix = DEGsmatrix)
dataGRN <- GRN(TFs = sample(rownames(DEGsmatrix), 100),
DEGsmatrix = DEGsmatrix,
DiffGenes = TRUE,
normCounts = dataFilt,
nGenesPerm = 5,
nBoot = 5,
kNearest = 3)
dataURA <- URA(dataGRN = dataGRN,
DEGsmatrix = DEGsmatrix,
BPname = c("apoptosis",
"proliferation of cells"))
dataDual <- PRA(dataURA = dataURA,
BPname = c("apoptosis",
"proliferation of cells"),
thres.role = 0)
oncogenic_mediators <- list("TSG"=names(dataDual$TSG), "OCG"=names(dataDual$OCG))
plotURA
: Upstream regulatory analysis plotThe user can plot the result of the upstream regulatory analysis
using the function plotURA
.
The figure resulted from the code above is shown below:
For illustrative purposes and to decrease runtime, we have set
nGenesPerm = 5
and nBoot = 5
in the following
code block, however, we recommend setting these parameters to
nGenesPerm = 2000
and nBoot = 400
to achieve
optimal results, as they are set by default in the function
arguments.
data(dataFilt)
data(DEGsmatrix)
data(dataMAF)
data(DiseaseList)
data(EAGenes)
data(tabGrowBlock)
data(knownDriverGenes)
data(LOC_transcription)
data(LOC_translation)
data(LOC_protein)
data(NCG)
data(EncodePromoters)
listMoonlight <- moonlight(dataDEGs = DEGsmatrix,
dataFilt = dataFilt,
nTF = 100,
DiffGenes = TRUE,
nGenesPerm = 5,
nBoot = 5,
BPname = c("apoptosis","proliferation of cells"),
dataMAF = dataMAF,
path_cscape_coding = "css_coding.vcf.gz",
path_cscape_noncoding = "css_noncoding.vcf.gz")
save(listMoonlight, file = paste0("listMoonlight_ncancer4.Rdata"))
plotCircos
: Moonlight Circos PlotAn example of running Moonlight on five cancer types is visualized below in a circos plot. Outer ring: color by cancer type, Inner ring: OCGs and TSGs, Inner connections: green: common OCGs yellow: common TSGs red: possible dual role
The figure generated by the code above is shown below:
For illustrative purposes and to decrease runtime, we have set
nGenesPerm = 5
and nBoot = 5
in the call of
GRN
in the following code block, however, we recommend
setting these parameters to nGenesPerm = 2000
and
nBoot = 400
to achieve optimal results, as they are set by
default in the function arguments.
data(DEGsmatrix)
data(dataFilt)
data(dataMAF)
data(DiseaseList)
data(EAGenes)
data(tabGrowBlock)
data(knownDriverGenes)
data(LOC_transcription)
data(LOC_translation)
data(LOC_protein)
data(NCG)
data(EncodePromoters)
# Perform gene regulatory network analysis
dataGRN <- GRN(TFs = rownames(DEGsmatrix),
DEGsmatrix = DEGsmatrix,
DiffGenes = TRUE,
normCounts = dataFilt,
nGenesPerm = 5,
kNearest = 3,
nBoot = 5)
# Perform upstream regulatory analysis
# As example, we use apoptosis and proliferation of cells as the biological processes
dataURA <- URA(dataGRN = dataGRN,
DEGsmatrix = DEGsmatrix,
BPname = c("apoptosis",
"proliferation of cells"),
nCores = 1)
# Perform pattern recognition analysis
dataPRA <- PRA(dataURA = dataURA,
BPname = c("apoptosis",
"proliferation of cells"),
thres.role = 0)
# Perform driver mutation analysis
dataDMA <- DMA(dataMAF = dataMAF,
dataDEGs = DEGsmatrix,
dataPRA = dataPRA,
results_folder = "DMAresults",
coding_file = "css_coding.vcf.gz",
noncoding_file = "css_noncoding.vcf.gz")
Next, we analyze the predicted driver genes and their mutations.
data(Oncogenic_mediators_mutation_summary)
data(DEG_Mutations_Annotations)
# Extract oncogenic mediators that contain at least one driver mutation
# These are the driver genes
knitr::kable(Oncogenic_mediators_mutation_summary %>%
filter(!is.na(CScape_Driver)))
Hugo_Symbol | Moonlight_Oncogenic_Mediator | CScape_Passenger | CScape_Driver | CScape_Unclassified | Transcription_mut_sum | Translation_mut_sum | Protein_mut_sum | Total_Mutations | NCG_driver | NCG_cgc_annotation | NCG_vogelstein_annotation | NCG_saito_annotation | NCG_pubmed_id | NCG_cancer_type |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
ABCG2 | OCG | NA | 1 | NA | 0 | 0 | 1 | 1 | Candidate | NA | NA | NA | 30718927 | esophageal_adenocarcinoma |
# Extract mutation annotations of the predicted driver genes
driver_mut <- DEG_Mutations_Annotations %>%
filter(!is.na(Moonlight_Oncogenic_Mediator),
CScape_Mut_Class == "Driver")
# Extract driver genes with a predicted effect on the transcriptional level
transcription_mut <- Oncogenic_mediators_mutation_summary %>%
filter(!is.na(CScape_Driver)) %>%
filter(Transcription_mut_sum > 0)
# Extract mutation annotations of predicted driver genes that have a driver mutation
# in its promoter region with a potential effect on the transcriptional level
promoters <- DEG_Mutations_Annotations %>%
filter(!is.na(Moonlight_Oncogenic_Mediator),
CScape_Mut_Class == "Driver",
Potential_Effect_on_Transcription == 1,
Annotation == 'Promoter')
For illustrative purposes and to decrease runtime, we have set
nGenesPerm = 5
and nBoot = 5
in the call of
GRN
in the following code block, however, we recommend
setting these parameters to nGenesPerm = 2000
and
nBoot = 400
to achieve optimal results, as they are set by
default in the function arguments.
data(DEGsmatrix)
data(dataFilt)
data(dataMAF)
data(DiseaseList)
data(EAGenes)
data(tabGrowBlock)
data(knownDriverGenes)
data(LOC_transcription)
data(LOC_translation)
data(LOC_protein)
data(NCG)
data(EncodePromoters)
data(dataMethyl)
data(LUAD_sample_anno)
data(MetEvidenceDriver)
# Perform gene regulatory network analysis
dataGRN <- GRN(TFs = rownames(DEGsmatrix),
DEGsmatrix = DEGsmatrix,
DiffGenes = TRUE,
normCounts = dataFilt,
nGenesPerm = 5,
kNearest = 3,
nBoot = 5)
# Perform upstream regulatory analysis
# As example, we use apoptosis and proliferation of cells as the biological processes
dataURA <- URA(dataGRN = dataGRN,
DEGsmatrix = DEGsmatrix,
BPname = c("apoptosis",
"proliferation of cells"),
nCores = 1)
# Perform pattern recognition analysis
dataPRA <- PRA(dataURA = dataURA,
BPname = c("apoptosis",
"proliferation of cells"),
thres.role = 0)
# Subset column names (sample names) in expression data to patient level
pattern <- "^(.{4}-.{2}-.{4}-.{2}).*"
colnames(dataFilt) <- sub(pattern, "\\1", colnames(dataFilt))
# Perform gene methylation analysis
dataGMA <- GMA(dataMET = dataMethyl, dataEXP = dataFilt,
dataPRA = dataPRA, dataDEGs = DEGsmatrix,
sample_info = LUAD_sample_anno, met_platform = "HM450",
prevalence_filter = NULL,
output_dir = "./GMAresults", cores = 1,
roadmap.epigenome.ids = "E096",
roadmap.epigenome.groups = NULL)
Please cite the MoonlightR and Moonlight2R packages:
“Interpreting pathways to discover cancer driver genes with Moonlight.” Nature Communications (2020): 10.1038/s41467-019-13803-0. (Colaprico, Antonio and Olsen, Catharina and Bailey, Matthew H. and Odom, Gabriel J. and Terkelsen, Thilde and Silva, Tiago C. and Olsen, André V. and Cantini, Laura and Zinovyev, Andrei and Barillot, Emmanuel and Noushmehr, Houtan and Bertoli, Gloria and Castiglioni, Isabella and Cava, Claudia and Bontempi, Gianluca and Chen, Xi Steven and Papaleo, Elena 2020)
“A workflow to study mechanistic indicators for driver gene prediction with Moonlight.” Briefings in Bioinformatics (2023): 10.1093/bib/bbad274. (Nourbakhsh, Mona and Saksager, Astrid and Tom, Nikola and Chen, Xi Steven and Colaprico, Antonio and Olsen, Catharina and Tiberti, Matteo and Papaleo, Elena 2023)
Related publications to this vignette:
“TCGAbiolinks: an R/Bioconductor package for integrative analysis of TCGA data.” Nucleic acids research (2015): gkv1507. (Colaprico, Antonio and Silva, Tiago C. and Olsen, Catharina and Garofano, Luciano and Cava, Claudia and Garolini, Davide and Sabedot, Thais S. and Malta, Tathiane M. and Pagnotta, Stefano M. and Castiglioni, Isabella and Ceccarelli, Michele and Bontempi, Gianluca and Noushmehr, Houtan 2016)
“TCGA Workflow: Analyze cancer genomics and epigenomics data using Bioconductor packages”. F1000Research 10.12688/f1000research.8923.1 (Silva, TC and Colaprico, A and Olsen, C and D’Angelo, F and Bontempi, G and Ceccarelli, M and Noushmehr, H 2016)
Session Information ******
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 parallel stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] GenomicRanges_1.59.1 GenomeInfoDb_1.43.2 IRanges_2.41.2
## [4] S4Vectors_0.45.2 BiocGenerics_0.53.3 generics_0.1.3
## [7] dplyr_1.1.4 magrittr_2.0.3 Moonlight2R_1.5.2
## [10] doParallel_1.0.17 iterators_1.0.14 foreach_1.5.2
##
## loaded via a namespace (and not attached):
## [1] RISmed_2.3.0 splines_4.4.2
## [3] BiocIO_1.17.1 ggplotify_0.1.2
## [5] bitops_1.0-9 filelock_1.0.3
## [7] tibble_3.2.1 R.oo_1.27.0
## [9] XML_3.99-0.17 tidyHeatmap_1.8.1
## [11] lifecycle_1.0.4 httr2_1.0.7
## [13] tcltk_4.4.2 vroom_1.6.5
## [15] lattice_0.22-6 dendextend_1.19.0
## [17] limma_3.63.2 EpiMix.data_1.8.0
## [19] sass_0.4.9 rmarkdown_2.29
## [21] jquerylib_0.1.4 yaml_2.3.10
## [23] ggtangle_0.0.6 askpass_1.2.1
## [25] easyPubMed_2.13 cowplot_1.1.3
## [27] DBI_1.2.3 buildtools_1.0.0
## [29] RColorBrewer_1.1-3 abind_1.4-8
## [31] zlibbioc_1.52.0 purrr_1.0.2
## [33] R.utils_2.12.3 RCurl_1.98-1.16
## [35] rgl_1.3.14 yulab.utils_0.1.8
## [37] rappdirs_0.3.3 circlize_0.4.16
## [39] GenomeInfoDbData_1.2.13 enrichplot_1.27.3
## [41] ggrepel_0.9.6 parmigene_1.1.1
## [43] tidytree_0.4.6 maketools_1.3.1
## [45] rentrez_1.2.3 codetools_0.2-20
## [47] DelayedArray_0.33.3 DOSE_4.1.0
## [49] xml2_1.3.6 tidyselect_1.2.1
## [51] shape_1.4.6.1 aplot_0.2.4
## [53] UCSC.utils_1.3.0 farver_2.1.2
## [55] viridis_0.6.5 matrixStats_1.4.1
## [57] BiocFileCache_2.15.0 base64enc_0.1-3
## [59] GenomicAlignments_1.43.0 jsonlite_1.8.9
## [61] GetoptLong_1.0.5 systemfonts_1.1.0
## [63] tools_4.4.2 progress_1.2.3
## [65] ragg_1.3.3 seqminer_9.7
## [67] treeio_1.31.0 snow_0.4-4
## [69] Rcpp_1.0.13-1 glue_1.8.0
## [71] gridExtra_2.3 SparseArray_1.7.2
## [73] mgcv_1.9-1 xfun_0.49
## [75] qvalue_2.39.0 MatrixGenerics_1.19.0
## [77] withr_3.0.2 BiocManager_1.30.25
## [79] fastmap_1.2.0 caTools_1.18.3
## [81] digest_0.6.37 mime_0.12
## [83] gridGraphics_0.5-1 R6_2.5.1
## [85] textshaping_0.4.1 qpdf_1.3.4
## [87] colorspace_2.1-1 GO.db_3.20.0
## [89] RPMM_1.25 gtools_3.9.5
## [91] fuzzyjoin_0.1.6 HiveR_0.4.0
## [93] jpeg_0.1-10 biomaRt_2.63.0
## [95] RSQLite_2.3.9 R.methodsS3_1.8.2
## [97] utf8_1.2.4 tidyr_1.3.1
## [99] data.table_1.16.4 rtracklayer_1.67.0
## [101] prettyunits_1.2.0 httr_1.4.7
## [103] htmlwidgets_1.6.4 S4Arrays_1.7.1
## [105] pkgconfig_2.0.3 gtable_0.3.6
## [107] blob_1.2.4 ComplexHeatmap_2.23.0
## [109] XVector_0.47.0 sys_3.4.3
## [111] clusterProfiler_4.15.1 htmltools_0.5.8.1
## [113] fgsea_1.33.0 clue_0.3-66
## [115] scales_1.3.0 Biobase_2.67.0
## [117] png_0.1-8 doSNOW_1.0.20
## [119] ggfun_0.1.8 knitr_1.49
## [121] tzdb_0.4.0 reshape2_1.4.4
## [123] rjson_0.2.23 nlme_3.1-166
## [125] curl_6.0.1 org.Hs.eg.db_3.20.0
## [127] cachem_1.1.0 GlobalOptions_0.1.2
## [129] stringr_1.5.1 KernSmooth_2.23-24
## [131] BiocVersion_3.21.1 AnnotationDbi_1.69.0
## [133] restfulr_0.0.15 GEOquery_2.75.0
## [135] pillar_1.10.0 grid_4.4.2
## [137] vctrs_0.6.5 gplots_3.2.0
## [139] randomForest_4.7-1.2 dbplyr_2.5.0
## [141] cluster_2.1.8 evaluate_1.0.1
## [143] readr_2.1.5 GenomicFeatures_1.59.1
## [145] cli_3.6.3 compiler_4.4.2
## [147] Rsamtools_2.23.1 rlang_1.1.4
## [149] crayon_1.5.3 labeling_0.4.3
## [151] plyr_1.8.9 fs_1.6.5
## [153] stringi_1.8.4 viridisLite_0.4.2
## [155] BiocParallel_1.41.0 munsell_0.5.1
## [157] Biostrings_2.75.3 lazyeval_0.2.2
## [159] EpiMix_1.9.0 GOSemSim_2.33.0
## [161] Matrix_1.7-1 ExperimentHub_2.15.0
## [163] hms_1.1.3 patchwork_1.3.0
## [165] bit64_4.5.2 ggplot2_3.5.1
## [167] KEGGREST_1.47.0 statmod_1.5.0
## [169] SummarizedExperiment_1.37.0 AnnotationHub_3.15.0
## [171] igraph_2.1.2 memoise_2.0.1
## [173] bslib_0.8.0 ggtree_3.15.0
## [175] fastmatch_1.1-4 bit_4.5.0.1
## [177] ELMER.data_2.30.0 downloader_0.4
## [179] gson_0.1.0 ape_5.8-1
Obtain Input
Download
: Get GEO data
getDataGEO
:
Search by cancer type and data type [Gene Expression]FEA
: Functional
Enrichment AnalysisFEAplot
:
Functional Enrichment Analysis PlotGRN
: Gene Regulatory
NetworkURA
: Upstream
Regulator AnalysisPRA
: Pattern
Regognition AnalysisDMA
: Driver Mutation
AnalysisGMA
: Gene
Methylation AnalysisGLS
: Gene Literature
SearchloadMAVISp
:
Helper function for loading the MAVISp databaseTFinfluence
:
Effect of mutations on transcription factorsLevel of consequence
:
Effect of mutations on three different levelsplotNetworkHive
:
GRN hive visualization taking into account COSMIC cancer genesplotDMA
:
Heatmap of the driver/passenger status of mutations in
TSGs/OCGsplotMoonlight
:
Heatmap of Moonlight Gene Z-scores for mutation-driven
TSGs/OCGsplotGMA
:
Heatmap of hypo/hyper/dual methylated CpG sites in TSGs/OCGsplotMoonlightMet
:
Heatmap of Moonlight Gene Z-scores for methylation-driven
TSGs/OCGsplotMetExp
:
Visualize results from EpiMix of expression and methylation in
genesplotURA
:
Upstream regulatory analysis plotplotCircos
:
Moonlight Circos Plot