In order to make light of cancer development, it is crucial to understand which genes play a role in the mechanisms linked to this disease and moreover which role that is. Commonly biological processes such as proliferation and apoptosis have been linked to cancer progression. Based on expression data we perform functional enrichment analysis, infer gene regulatory networks and upstream regulator analysis to score the importance of well-known biological processes with respect to the studied cancer. We then use these scores to predict two specific roles: genes that act as tumor suppressor genes (TSGs) and genes that act as oncogenes (OCGs). This methodology not only allows us to identify genes with dual role (TSG in one cancer type and OCG in another) but also to elucidate the underlying biological processes.
Cancer development is influenced by mutations in two distinctly
different categories of genes, known as tumor suppressor genes (TSG) and
oncogenes (OCG). The occurrence of mutations in genes of the first
category leads to faster cell proliferation while mutations in genes of
second category increases or changes their function. We propose
MoonlightR
a new approach to define TSGs and OCGs based on
functional enrichment analysis, infer gene regulatory networks and
upstream regulator analysis to score the importance of well-known
biological processes with respect to the studied cancer.
Moonlight’s pipeline is shown below:
The proposed pipeline consists of following eight steps:
1 For the devel version of MoonlightR we use a short extract of ten biological functions from QIAGEN’S Ingenuity Pathway Analysis (IPA). We are still working to integrate the package.
To install use the code below.
if (!requireNamespace("BiocManager", quietly=TRUE))
install.packages("BiocManager")
BiocManager::install("MoonlightR")
Please cite TCGAbiolinks package:
Related publications to this package:
Download
: Get TCGA dataYou can search TCGA data using the getDataTCGA
function.
getDataTCGA
: Search by cancer type and data type [Gene
Expression]The user can query and download the cancer types supported by TCGA,
using the function getDataTCGA
: In this example we used
LUAD gene expression data with only 4 samples to reduce time
downloading.
Download
: Get GEO dataYou can search GEO data using the getDataGEO
function.
GEO_TCGAtab: a 18x12 matrix that provides the GEO data set we matched to one of the 18 given TCGA cancer types
knitr::kable(GEO_TCGAtab, digits = 2,
caption = "Table with GEO data set matched to one
of the 18 given TCGA cancer types ",
row.names = TRUE)
Cancer | TP | NT | DEG. | Dataset | TP.1 | NT.1 | Platform | DEG.. | Common | GEO_Normal | GEO_Tumor | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | BLCA | 408 | 19 | 2937 | GSE13507 | 165 | 10 | GPL65000 | 2099 | 896 | control | cancer |
2 | BRCA | 1097 | 114 | 3390 | GSE39004 | 61 | 47 | GPL6244 | 2449 | 1248 | normal | Tumor |
3 | CHOL | 36 | 9 | 5015 | GSE26566 | 104 | 59 | GPL6104 | 3983 | 2587 | Surrounding | Tumor |
4 | COAD | 286 | 41 | 3788 | GSE41657 | 25 | 12 | GPL6480 | 3523 | 1367 | N | A |
5 | ESCA | 184 | 11 | 2525 | GSE20347 | 17 | 17 | GPL571 | 1316 | 406 | normal | carcinoma |
6 | GBM | 156 | 5 | 4828 | GSE50161 | 34 | 13 | GPL570 | 4504 | 2660 | normal | GBM |
7 | HNSC | 520 | 44 | 2973 | GSE6631 | 22 | 22 | GPL8300 | 142 | 129 | normal | cancer |
8 | KICH | 66 | 25 | 4355 | GSE15641 | 6 | 23 | GPL96 | 1789 | 680 | normal | chromophobe |
9 | KIRC | 533 | 72 | 3618 | GSE15641 | 32 | 23 | GPL96 | 2911 | 939 | normal | clear cell RCC |
10 | KIRP | 290 | 32 | 3748 | GSE15641 | 11 | 23 | GPL96 | 2020 | 756 | normal | papillary RCC |
11 | LIHC | 371 | 50 | 3043 | GSE45267 | 46 | 41 | GPL570 | 1583 | 860 | normal liver | HCC sample |
12 | LUAD | 515 | 59 | 3498 | GSE10072 | 58 | 49 | GPL96 | 666 | 555 | normal | tumor |
13 | LUSC | 503 | 51 | 4984 | GSE33479 | 14 | 27 | GPL6480 | 3729 | 1706 | normal | squamous cell carcinoma |
14 | PRAD | 497 | 52 | 1860 | GSE6919 | 81 | 90 | GPL8300 | 246 | 149 | normal prostate | tumor samples |
15 | READ | 94 | 10 | 3628 | GSE20842 | 65 | 65 | GPL4133 | 2172 | 1261 | M | T |
16 | STAD | 415 | 35 | 2622 | GSE2685 | 10 | 10 | GPL80 | 487 | 164 | N | T |
17 | THCA | 505 | 59 | 1994 | GSE33630 | 60 | 45 | GPL570 | 1451 | 781 | N | T |
18 | UCEC | 176 | 24 | 4183 | GSE17025 | GPL570 | tp | lcm |
Analysis
: To analyze TCGA dataDPA
: Differential Phenotype AnalysisDifferential phenotype analysis is able to identify genes or probes that are significantly different between two phenotypes such as normal vs. tumor, or normal vs. stageI, normal vs. molecular subtype.
For gene expression data, DPA is running a differential expression
analysis (DEA) to identify differentially expressed genes (DEGs) using
the TCGAanalyze_DEA
function from .
For methylation data, DPA is running a differentially methylated
regions analysis (DMR) to identify differentially methylated CpG sites
using the TCGAanalyze_DMR
the TCGAanalyze_DMR
function from .
For gene expression data, DPA dealing with GEO data is running a
differential expression analysis (DEA) to identify differentially
expressed genes (DEGs) using to the eBayes
and
topTable
functions from .
data(GEO_TCGAtab)
DataAnalysisGEO<- "../GEO_dataset/"
i<-5
cancer <- GEO_TCGAtab$Cancer[i]
cancerGEO <- GEO_TCGAtab$Dataset[i]
cancerPLT <-GEO_TCGAtab$Platform[i]
fileCancerGEO <- paste0(cancer,"_GEO_",cancerGEO,"_",cancerPLT, ".RData")
dataFilt <- getDataGEO(TCGAtumor = cancer)
xContrast <- c("G1-G0")
GEOdegs <- DPA(dataConsortium = "GEO",
gset = dataFilt ,
colDescription = "title",
samplesType = c(GEO_TCGAtab$GEO_Normal[i],
GEO_TCGAtab$GEO_Tumor[i]),
fdr.cut = 0.01,
logFC.cut = 1,
gsetFile = paste0(DataAnalysisGEO,fileCancerGEO))
We can visualize those differentially expressed genes (DEGs) with a
volcano plot using the TCGAVisualize_volcano
function from
.
library(TCGAbiolinks)
TCGAVisualize_volcano(DEGsmatrix$logFC, DEGsmatrix$FDR,
filename = "DEGs_volcano.png",
x.cut = 1,
y.cut = 0.05,
names = rownames(DEGsmatrix),
color = c("black","red","dodgerblue3"),
names.size = 2,
show.names = "highlighted",
highlight = c("gene1","gene2"),
xlab = " Gene expression fold change (Log2)",
legend = "State",
title = "Volcano plot (Normal NT vs Tumor TP)",
width = 10)
## Saving file as: DEGs_volcano.png
The figure generated by the code above is shown below:
FEA
: Functional Enrichment AnalysisThe user can perform a functional enrichment analysis using the
function FEAcomplete
. For each DEG in the gene set a
z-score is calculated. This score indicates how the genes act in the
gene set.
The output can be visualized with a FEA plot.
FEAplot
: Functional Enrichment Analysis PlotThe user can plot the result of a functional enrichment analysis
using the function plotFEA
. A negative z-score indicates
that the process’ activity is decreased. A positive z-score indicates
that the process’ activity is increased.
The figure generated by the above code is shown below:
GRN
: Gene Regulatory NetworkThe user can perform a gene regulatory network analysis using the
function GRN
which infers the network using the parmigene
package.
URA
: Upstream Regulator AnalysisThe user can perform upstream regulator analysis using the function
URA
. This function is applied to each DEG in the enriched
gene set and its neighbors in the GRN.
PRA
: Pattern Regognition AnalysisThe user can retrieve TSG/OCG candidates using either selected biological processes or a random forest classifier trained on known COSMIC OCGs/TSGs.
plotNetworkHive
: GRN hive visualization taking into
account Cosmic cancer genesIn the following plot the nodes are separated into three groups: known tumor suppressor genes (yellow), known oncogenes (green) and the rest (gray).
This vignette shows a complete workflow of the ‘MoonlightR’ package except for the data download. The code is divided into three case study:
dataDEGs <- DPA(dataFilt = dataFilt,
dataType = "Gene expression")
dataFEA <- FEA(DEGsmatrix = dataDEGs)
dataGRN <- GRN(TFs = rownames(dataDEGs)[1:100],
DEGsmatrix = dataDEGs,
DiffGenes = TRUE,
normCounts = dataFilt)
dataURA <- URA(dataGRN = dataGRN,
DEGsmatrix = dataDEGs,
BPname = c("apoptosis",
"proliferation of cells"))
dataDual <- PRA(dataURA = dataURA,
BPname = c("apoptosis",
"proliferation of cells"),
thres.role = 0)
CancerGenes <- list("TSG"=names(dataDual$TSG), "OCG"=names(dataDual$OCG))
plotURA
: Upstream regulatory analysis plotThe user can plot the result of the upstream regulatory analysis
using the function plotURA
.
plotURA(dataURA = dataURA[c(names(dataDual$TSG), names(dataDual$OCG)),, drop = FALSE], additionalFilename = "_exampleVignette")
The figure resulted from the code above is shown below:
cancerList <- c("BLCA","COAD","ESCA","HNSC","STAD")
listMoonlight <- moonlight(cancerType = cancerList,
dataType = "Gene expression",
directory = "data",
nSample = 10,
nTF = 100,
DiffGenes = TRUE,
BPname = c("apoptosis","proliferation of cells"))
save(listMoonlight, file = paste0("listMoonlight_ncancer4.Rdata"))
plotCircos
: Moonlight Circos PlotThe results of the moonlight pipeline can be visualized with a circos plot. Outer ring: color by cancer type, Inner ring: OCGs and TSGs, Inner connections: green: common OCGs yellow: common TSGs red: possible dual role
The figure generated by the code above is shown below:
listMoonlight <- NULL
for (i in 1:4){
dataDual <- moonlight(cancerType = "BRCA",
dataType = "Gene expression",
directory = "data",
nSample = 10,
nTF = 5,
DiffGenes = TRUE,
BPname = c("apoptosis","proliferation of cells"),
stage = i)
listMoonlight <- c(listMoonlight, list(dataDual))
save(dataDual, file = paste0("dataDual_stage",as.roman(i), ".Rdata"))
}
names(listMoonlight) <- c("stage1", "stage2", "stage3", "stage4")
# Prepare mutation data for stages
mutation <- GDCquery_Maf(tumor = "BRCA")
res.mutation <- NULL
for(stage in 1:4){
curStage <- paste0("Stage ", as.roman(stage))
dataClin$tumor_stage <- toupper(dataClin$tumor_stage)
dataClin$tumor_stage <- gsub("[ABCDEFGH]","",dataClin$tumor_stage)
dataClin$tumor_stage <- gsub("ST","Stage",dataClin$tumor_stage)
dataStg <- dataClin[dataClin$tumor_stage %in% curStage,]
message(paste(curStage, "with", nrow(dataStg), "samples"))
dataSmTP <- mutation$Tumor_Sample_Barcode
dataStgC <- dataSmTP[substr(dataSmTP,1,12) %in% dataStg$bcr_patient_barcode]
dataSmTP <- dataStgC
info.mutation <- mutation[mutation$Tumor_Sample_Barcode %in% dataSmTP,]
ind <- which(info.mutation[,"Consequence"]=="inframe_deletion")
ind2 <- which(info.mutation[,"Consequence"]=="inframe_insertion")
ind3 <- which(info.mutation[,"Consequence"]=="missense_variant")
res.mutation <- c(res.mutation, list(info.mutation[c(ind, ind2, ind3),c(1,51)]))
}
names(res.mutation) <- c("stage1", "stage2", "stage3", "stage4")
tmp <- NULL
tmp <- c(tmp, list(listMoonlight[[1]][[1]]))
tmp <- c(tmp, list(listMoonlight[[2]][[1]]))
tmp <- c(tmp, list(listMoonlight[[3]][[1]]))
tmp <- c(tmp, list(listMoonlight[[4]][[1]]))
names(tmp) <- names(listMoonlight)
mutation <- GDCquery_Maf(tumor = "BRCA")
plotCircos(listMoonlight=listMoonlight,listMutation=res.mutation, additionalFilename="proc2_wmutation", intensityColDual=0.2,fontSize = 2)
The results of the moonlight pipeline can be visualized with a circos plot. Outer ring: color by cancer type, Inner ring: OCGs and TSGs, Inner connections: green: common OCGs yellow: common TSGs red: possible dual role
The figure generated by the code above is shown below:
Session Information ******
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] grid parallel stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] TCGAbiolinks_2.35.0 png_0.1-8 MoonlightR_1.33.0
## [4] doParallel_1.0.17 iterators_1.0.14 foreach_1.5.2
## [7] dbplyr_2.5.0
##
## loaded via a namespace (and not attached):
## [1] RISmed_2.3.0 splines_4.4.2
## [3] later_1.3.2 filelock_1.0.3
## [5] bitops_1.0-9 ggplotify_0.1.2
## [7] tibble_3.2.1 R.oo_1.27.0
## [9] XML_3.99-0.17 httr2_1.0.6
## [11] lifecycle_1.0.4 tcltk_4.4.2
## [13] rprojroot_2.0.4 lattice_0.22-6
## [15] magrittr_2.0.3 limma_3.63.2
## [17] sass_0.4.9 rmarkdown_2.29
## [19] jquerylib_0.1.4 yaml_2.3.10
## [21] remotes_2.5.0 httpuv_1.6.15
## [23] ggtangle_0.0.4 sessioninfo_1.2.2
## [25] pkgbuild_1.4.5 cowplot_1.1.3
## [27] DBI_1.2.3 buildtools_1.0.0
## [29] RColorBrewer_1.1-3 abind_1.4-8
## [31] pkgload_1.4.0 zlibbioc_1.52.0
## [33] rvest_1.0.4 GenomicRanges_1.59.0
## [35] purrr_1.0.2 R.utils_2.12.3
## [37] BiocGenerics_0.53.3 rgl_1.3.14
## [39] yulab.utils_0.1.8 rappdirs_0.3.3
## [41] circlize_0.4.16 GenomeInfoDbData_1.2.13
## [43] IRanges_2.41.1 S4Vectors_0.45.2
## [45] enrichplot_1.27.1 ggrepel_0.9.6
## [47] parmigene_1.1.1 rentrez_1.2.3
## [49] tidytree_0.4.6 maketools_1.3.1
## [51] codetools_0.2-20 DelayedArray_0.33.2
## [53] xml2_1.3.6 DOSE_4.1.0
## [55] tidyselect_1.2.1 shape_1.4.6.1
## [57] aplot_0.2.3 UCSC.utils_1.3.0
## [59] farver_2.1.2 TCGAbiolinksGUI.data_1.26.0
## [61] BiocFileCache_2.15.0 matrixStats_1.4.1
## [63] stats4_4.4.2 base64enc_0.1-3
## [65] jsonlite_1.8.9 ellipsis_0.3.2
## [67] systemfonts_1.1.0 progress_1.2.3
## [69] tools_4.4.2 ragg_1.3.3
## [71] treeio_1.31.0 Rcpp_1.0.13-1
## [73] glue_1.8.0 SparseArray_1.7.2
## [75] xfun_0.49 qvalue_2.39.0
## [77] MatrixGenerics_1.19.0 usethis_3.0.0
## [79] GenomeInfoDb_1.43.1 dplyr_1.1.4
## [81] withr_3.0.2 fastmap_1.2.0
## [83] fansi_1.0.6 caTools_1.18.3
## [85] digest_0.6.37 R6_2.5.1
## [87] mime_0.12 gridGraphics_0.5-1
## [89] textshaping_0.4.0 colorspace_2.1-1
## [91] GO.db_3.20.0 gtools_3.9.5
## [93] HiveR_0.4.0 biomaRt_2.63.0
## [95] jpeg_0.1-10 RSQLite_2.3.8
## [97] R.methodsS3_1.8.2 utf8_1.2.4
## [99] tidyr_1.3.1 generics_0.1.3
## [101] data.table_1.16.2 prettyunits_1.2.0
## [103] httr_1.4.7 htmlwidgets_1.6.4
## [105] S4Arrays_1.7.1 pkgconfig_2.0.3
## [107] gtable_0.3.6 blob_1.2.4
## [109] XVector_0.47.0 sys_3.4.3
## [111] clusterProfiler_4.15.0 htmltools_0.5.8.1
## [113] profvis_0.4.0 fgsea_1.33.0
## [115] scales_1.3.0 Biobase_2.67.0
## [117] ggfun_0.1.7 knitr_1.49
## [119] rstudioapi_0.17.1 tzdb_0.4.0
## [121] reshape2_1.4.4 curl_6.0.1
## [123] nlme_3.1-166 cachem_1.1.0
## [125] GlobalOptions_0.1.2 stringr_1.5.1
## [127] KernSmooth_2.23-24 miniUI_0.1.1.1
## [129] AnnotationDbi_1.69.0 desc_1.4.3
## [131] GEOquery_2.75.0 pillar_1.9.0
## [133] vctrs_0.6.5 gplots_3.2.0
## [135] urlchecker_1.0.1 promises_1.3.0
## [137] randomForest_4.7-1.2 xtable_1.8-4
## [139] evaluate_1.0.1 readr_2.1.5
## [141] cli_3.6.3 compiler_4.4.2
## [143] rlang_1.1.4 crayon_1.5.3
## [145] labeling_0.4.3 plyr_1.8.9
## [147] fs_1.6.5 stringi_1.8.4
## [149] BiocParallel_1.41.0 munsell_0.5.1
## [151] Biostrings_2.75.1 lazyeval_0.2.2
## [153] devtools_2.4.5 GOSemSim_2.33.0
## [155] Matrix_1.7-1 hms_1.1.3
## [157] patchwork_1.3.0 bit64_4.5.2
## [159] ggplot2_3.5.1 KEGGREST_1.47.0
## [161] statmod_1.5.0 shiny_1.9.1
## [163] SummarizedExperiment_1.37.0 igraph_2.1.1
## [165] memoise_2.0.1 bslib_0.8.0
## [167] ggtree_3.15.0 fastmatch_1.1-4
## [169] bit_4.5.0 downloader_0.4
## [171] ape_5.8 gson_0.1.0