3.6 - TCGA.pipe: Running ELMER for TCGA data in a compact way



TCGA.pipe is a function that downloads TCGA data from GDC using the TCGAbiolinks package (Colaprico et al. 2016) and performs all the analyses in ELMER. For illustration purposes, we skip the downloading step. The user can download TCGA data with the getTCGA function, or have TCGA.pipe do it by including “download” in the analysis argument.

The following command performs the distal DNA methylation analysis, predicts putative target genes, runs the motif analysis, and identifies regulatory transcription factors.

TCGA.pipe("LUSC",
          wd = "./ELMER.example",
          cores = parallel::detectCores()/2,
          mode = "unsupervised",
          permu.size = 300,
          Pe = 0.01,
          analysis = c("distal.probes","diffMeth","pair","motif","TF.search"),
          diff.dir = "hypo",
          rm.chr = paste0("chr",c("X","Y")))

TCGA.pipe: Mode argument

This version adds the mode argument to the TCGA.pipe function. It automatically sets minSubgroupFrac to the following values:

Modes available:

  • unsupervised:
    • Use 20% of each group to identify differentially methylated regions (minSubgroupFrac = 0.2 in get.diff.meth)
    • Use 40% of all samples to create the Unmethylated (U) and Methylated (M) groups in the other steps (the lowest quintile of samples is the U group and the highest quintile is the M group) (minSubgroupFrac = 0.4 in the get.pair and get.TFs functions)
  • supervised:
    • Use all samples in all functions, with each of the Unmethylated (U) and Methylated (M) groups set to one of the groups selected in the analysis.

The unsupervised mode should be used when you want to detect a specific (possibly unknown) molecular subtype among tumors; these subtypes often make up only a minority of samples, and 20% was chosen as a lower bound for the purposes of statistical power. If you are using pre-defined group labels, such as treated replicates vs. untreated replicates, use the supervised mode (all samples).
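
To make the unsupervised grouping more concrete, here is a minimal base R sketch of how a minSubgroupFrac of 0.4 could split the samples for a single probe into U and M groups. The helper select_um_groups and the simulated beta values are hypothetical and only illustrate the idea; this is not ELMER's internal code. In supervised mode the groups would come from the group labels instead.

```r
# Illustrative sketch only (not ELMER's implementation).
# select_um_groups() is a hypothetical helper: it takes a named vector of
# beta values for one probe and returns the sample names in each tail.
select_um_groups <- function(beta, minSubgroupFrac = 0.4) {
  stopifnot(is.numeric(beta), !is.null(names(beta)),
            minSubgroupFrac > 0, minSubgroupFrac <= 1)
  # Half of the requested fraction is taken from each tail.
  n_each <- max(1L, as.integer(round(length(beta) * minSubgroupFrac / 2)))
  ord <- order(beta)
  list(
    U = names(beta)[head(ord, n_each)],  # lowest methylation
    M = names(beta)[tail(ord, n_each)]   # highest methylation
  )
}

# Simulated beta values for 20 samples (made-up data).
set.seed(1)
beta <- setNames(runif(20), paste0("sample", 1:20))

groups <- select_um_groups(beta, minSubgroupFrac = 0.4)
lengths(groups)  # number of samples in U and in M
```

With 20 samples and minSubgroupFrac = 0.4, each group holds 4 samples, that is, one quintile per tail.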

For more information please read the analysis section of the vignette.

Using mutation data to identify groups

The download step of TCGA.pipe has an option to identify mutant samples in order to perform a WT vs Mutant analysis. It downloads the open-access MAF file from the GDC database (Grossman et al. 2016), selects a gene, and identifies which samples are mutant based on the following classification (which can be changed using the argument mutant_variant_classification).

Mutations classification
Variant classification      Group
Frame_Shift_Del             Mutant
Frame_Shift_Ins             Mutant
Missense_Mutation           Mutant
Nonsense_Mutation           Mutant
Splice_Site                 Mutant
In_Frame_Del                Mutant
In_Frame_Ins                Mutant
Translation_Start_Site      Mutant
Nonstop_Mutation            Mutant
Silent                      WT
3'UTR                       WT
5'UTR                       WT
3'Flank                     WT
5'Flank                     WT
IGR (intergenic region)     WT
Intron                      WT
RNA                         WT
Target_region               WT
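
The classification above can be sketched in base R as follows. This is a hedged illustration on a made-up MAF-like table, not the code TCGA.pipe runs: the helper label_samples, the sample IDs, and the "Mutant"/"WT" labels are assumptions chosen for the example. Only the column names Hugo_Symbol, Tumor_Sample_Barcode and Variant_Classification follow the MAF format.

```r
# Toy MAF-like table (made-up data); real GDC MAF files have many more columns.
maf <- data.frame(
  Hugo_Symbol            = c("TP53", "TP53", "TP53", "KRAS"),
  Tumor_Sample_Barcode   = c("S1", "S2", "S3", "S1"),
  Variant_Classification = c("Missense_Mutation", "Silent",
                             "Intron", "Missense_Mutation"),
  stringsAsFactors = FALSE
)
tumor_samples <- c("S1", "S2", "S3", "S4")  # S4 has no record in the table

# Variant classes treated as Mutant in the table above.
mutant_classes <- c("Frame_Shift_Del", "Frame_Shift_Ins", "Missense_Mutation",
                    "Nonsense_Mutation", "Splice_Site", "In_Frame_Del",
                    "In_Frame_Ins", "Translation_Start_Site",
                    "Nonstop_Mutation")

# Hypothetical helper: label each tumor sample for one gene.
label_samples <- function(maf, gene, samples, mutant_classes) {
  hits <- maf$Hugo_Symbol == gene &
    maf$Variant_Classification %in% mutant_classes
  mutated <- unique(maf$Tumor_Sample_Barcode[hits])
  setNames(ifelse(samples %in% mutated, "Mutant", "WT"), samples)
}

tp53_status <- label_samples(maf, "TP53", tumor_samples, mutant_classes)
tp53_status
```

In this toy table only S1 carries a TP53 variant from the Mutant list; S2 and S3 carry variants classified as WT, and S4 has no TP53 record at all.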

The arguments to be used are below:

TCGA.pipe mutation arguments
Argument                        Description
genes                           List of genes for which mutations will be verified. A column in the MAE with the name of the gene will be created with the values WT (tumor samples without the mutation), MUT (tumor samples with the mutation), and NA (non-tumor samples).
mutant_variant_classification   List of GDC variant classifications from MAF files used to consider a sample mutant. Only used when the argument genes is set.
group.col                       A column defining the groups of the samples. You can view the available columns using: colnames(MultiAssayExperiment::colData(data)).
group1                          A group from group.col. ELMER will run group1 vs group2. That means, if the direction is hyper, it gets probes hypermethylated in group 1 compared to group 2.
group2                          A group from group.col. ELMER will run group1 vs group2. That means, if the direction is hyper, it gets probes hypermethylated in group 1 compared to group 2.
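
As a rough illustration of what "group1 vs group2" means for the direction, the base R sketch below runs a one-sided t-test on simulated beta values for a single probe. The helper test_direction and the data are hypothetical, and the test is a stand-in chosen for the example rather than a description of ELMER's internal procedure.

```r
# Simulated beta values for one probe (made-up data): the "Mutant" group is
# less methylated than the "WT" group.
set.seed(42)
group <- rep(c("Mutant", "WT"), each = 10)
beta  <- c(rnorm(10, mean = 0.3, sd = 0.05),
           rnorm(10, mean = 0.7, sd = 0.05))

# Hypothetical helper: one-sided test of group1 against group2.
# "hyper" asks whether group1 is more methylated than group2,
# "hypo" asks whether group1 is less methylated than group2.
test_direction <- function(beta, group, group1, group2,
                           diff.dir = c("hypo", "hyper")) {
  diff.dir <- match.arg(diff.dir)
  alt <- if (diff.dir == "hyper") "greater" else "less"
  t.test(beta[group == group1], beta[group == group2],
         alternative = alt)$p.value
}

p_hypo  <- test_direction(beta, group, "Mutant", "WT", diff.dir = "hypo")
p_hyper <- test_direction(beta, group, "Mutant", "WT", diff.dir = "hyper")
c(hypo = p_hypo, hyper = p_hyper)
```

Swapping group1 and group2 flips which direction is significant, which is why the order of the two arguments matters when diff.dir is set.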

Here is an example in which TCGA-LUSC data is downloaded and TP53 Mutant samples are compared with TP53 WT samples.

TCGA.pipe("LUSC",
          wd = "./ELMER.example",
          cores = parallel::detectCores()/2,
          mode = "supervised",
          genes = "TP53",
          group.col = "TP53",
          group1 = "Mutant",
          group2 = "WT",
          permu.size = 300,
          Pe = 0.01,
          analysis = c("download","diffMeth","pair","motif","TF.search"),
          diff.dir = "hypo",
          rm.chr = paste0("chr",c("X","Y")))

Session Info

sessionInfo()
## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Etc/UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] MultiAssayExperiment_1.31.5 SummarizedExperiment_1.35.5
##  [3] Biobase_2.67.0              MatrixGenerics_1.17.1      
##  [5] matrixStats_1.4.1           GenomicRanges_1.57.2       
##  [7] GenomeInfoDb_1.41.2         IRanges_2.39.2             
##  [9] S4Vectors_0.43.2            sesameData_1.23.0          
## [11] ExperimentHub_2.13.1        AnnotationHub_3.15.0       
## [13] BiocFileCache_2.15.0        dbplyr_2.5.0               
## [15] BiocGenerics_0.53.0         BiocStyle_2.35.0           
## [17] dplyr_1.1.4                 DT_0.33                    
## [19] ELMER_2.31.0                ELMER.data_2.29.0          
## 
## loaded via a namespace (and not attached):
##   [1] BiocIO_1.17.0               bitops_1.0-9               
##   [3] filelock_1.0.3              tibble_3.2.1               
##   [5] XML_3.99-0.17               rpart_4.1.23               
##   [7] lifecycle_1.0.4             httr2_1.0.5                
##   [9] rstatix_0.7.2               doParallel_1.0.17          
##  [11] vroom_1.6.5                 lattice_0.22-6             
##  [13] ensembldb_2.29.1            crosstalk_1.2.1            
##  [15] backports_1.5.0             magrittr_2.0.3             
##  [17] Hmisc_5.2-0                 plotly_4.10.4              
##  [19] sass_0.4.9                  rmarkdown_2.28             
##  [21] jquerylib_0.1.4             yaml_2.3.10                
##  [23] Gviz_1.49.0                 DBI_1.2.3                  
##  [25] buildtools_1.0.0            RColorBrewer_1.1-3         
##  [27] abind_1.4-8                 zlibbioc_1.51.2            
##  [29] rvest_1.0.4                 purrr_1.0.2                
##  [31] AnnotationFilter_1.31.0     biovizBase_1.55.0          
##  [33] RCurl_1.98-1.16             nnet_7.3-19                
##  [35] VariantAnnotation_1.51.2    rappdirs_0.3.3             
##  [37] circlize_0.4.16             GenomeInfoDbData_1.2.13    
##  [39] ggrepel_0.9.6               maketools_1.3.1            
##  [41] codetools_0.2-20            DelayedArray_0.31.14       
##  [43] xml2_1.3.6                  tidyselect_1.2.1           
##  [45] shape_1.4.6.1               farver_2.1.2               
##  [47] UCSC.utils_1.1.0            TCGAbiolinksGUI.data_1.25.0
##  [49] base64enc_0.1-3             GenomicAlignments_1.41.0   
##  [51] jsonlite_1.8.9              GetoptLong_1.0.5           
##  [53] Formula_1.2-5               iterators_1.0.14           
##  [55] foreach_1.5.2               tools_4.4.1                
##  [57] progress_1.2.3              Rcpp_1.0.13                
##  [59] glue_1.8.0                  BiocBaseUtils_1.9.0        
##  [61] gridExtra_2.3               SparseArray_1.5.45         
##  [63] xfun_0.48                   withr_3.0.2                
##  [65] BiocManager_1.30.25         fastmap_1.2.0              
##  [67] latticeExtra_0.6-30         fansi_1.0.6                
##  [69] digest_0.6.37               mime_0.12                  
##  [71] R6_2.5.1                    colorspace_2.1-1           
##  [73] jpeg_0.1-10                 dichromat_2.0-0.1          
##  [75] biomaRt_2.63.0              RSQLite_2.3.7              
##  [77] utf8_1.2.4                  tidyr_1.3.1                
##  [79] generics_0.1.3              data.table_1.16.2          
##  [81] rtracklayer_1.65.0          prettyunits_1.2.0          
##  [83] httr_1.4.7                  htmlwidgets_1.6.4          
##  [85] S4Arrays_1.5.11             pkgconfig_2.0.3            
##  [87] gtable_0.3.6                blob_1.2.4                 
##  [89] ComplexHeatmap_2.21.1       XVector_0.45.0             
##  [91] sys_3.4.3                   htmltools_0.5.8.1          
##  [93] carData_3.0-5               ProtGenerics_1.37.1        
##  [95] clue_0.3-65                 scales_1.3.0               
##  [97] png_0.1-8                   knitr_1.48                 
##  [99] rstudioapi_0.17.1           tzdb_0.4.0                 
## [101] reshape2_1.4.4              rjson_0.2.23               
## [103] checkmate_2.3.2             curl_5.2.3                 
## [105] cachem_1.1.0                GlobalOptions_0.1.2        
## [107] stringr_1.5.1               BiocVersion_3.21.1         
## [109] parallel_4.4.1              foreign_0.8-87             
## [111] AnnotationDbi_1.69.0        restfulr_0.0.15            
## [113] pillar_1.9.0                grid_4.4.1                 
## [115] reshape_0.8.9               vctrs_0.6.5                
## [117] ggpubr_0.6.0                car_3.1-3                  
## [119] cluster_2.1.6               htmlTable_2.4.3            
## [121] evaluate_1.0.1              TCGAbiolinks_2.33.0        
## [123] readr_2.1.5                 GenomicFeatures_1.57.1     
## [125] cli_3.6.3                   compiler_4.4.1             
## [127] Rsamtools_2.21.2            rlang_1.1.4                
## [129] crayon_1.5.3                ggsignif_0.6.4             
## [131] labeling_0.4.3              interp_1.1-6               
## [133] plyr_1.8.9                  stringi_1.8.4              
## [135] viridisLite_0.4.2           deldir_2.0-4               
## [137] BiocParallel_1.39.0         munsell_0.5.1              
## [139] Biostrings_2.75.0           lazyeval_0.2.2             
## [141] Matrix_1.7-1                BSgenome_1.73.1            
## [143] hms_1.1.3                   bit64_4.5.2                
## [145] ggplot2_3.5.1               KEGGREST_1.45.1            
## [147] highr_0.11                  broom_1.0.7                
## [149] memoise_2.0.1               bslib_0.8.0                
## [151] bit_4.5.0                   downloader_0.4

Bibliography

Colaprico, Antonio, Tiago C. Silva, Catharina Olsen, Luciano Garofano, Claudia Cava, Davide Garolini, Thais S. Sabedot, et al. 2016. “TCGAbiolinks: An R/Bioconductor Package for Integrative Analysis of TCGA Data.” Nucleic Acids Research 44 (8): e71. https://doi.org/10.1093/nar/gkv1507.
Grossman, Robert L., et al. 2016. “Toward a Shared Vision for Cancer Genomic Data.” New England Journal of Medicine 375 (12): 1109–12.