terraTCGAdata Introduction

terraTCGAdata

Installation

if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("terraTCGAdata")

Overview

The terraTCGAdata R package imports TCGA datasets available on the Terra platform as MultiAssayExperiment objects. It also provides functions for discovering relevant datasets. There is one main function and two helper functions:

  1. terraTCGAdata creates the MultiAssayExperiment object from the specified resources.

  2. The getClinicalTable and getAssayTable functions allow for the discovery of datasets within the Terra data model. The column names from these tables can be provided as inputs to the terraTCGAdata function.

Data

Some public Terra workspaces come pre-packaged with TCGA data (i.e., cloud data resources are linked within the data model), particularly the workspaces labelled OpenAccess_V1-0. Datasets harmonized to the hg38 genome, such as those from the Genomic Data Commons data repository, use a different data model and workflow and are not compatible with the functions in this package. For the compatible workspaces, we make use of the Terra data model and represent the data as a MultiAssayExperiment.

For more information on MultiAssayExperiment, please see the vignette in that package.

Requirements

Loading packages

library(AnVIL)
library(terraTCGAdata)

gcloud SDK installation

A valid gcloud SDK installation is required to use the package. To get set up, see the Bioconductor tutorials for running RStudio on Terra. Use the gcloud_exists() function from the AnVIL package to check whether the SDK is installed on your system.

gcloud_exists()
## [1] FALSE

You can also use the gcloud_project function to set a project name by specifying the project argument. Called without arguments, it reports the currently active project:

gcloud_project()
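The check above can be folded into a guard so that a script skips Terra queries when the SDK is missing. This is a minimal sketch, assuming only that AnVIL exports gcloud_exists() as used in this vignette; the has_gcloud variable is a hypothetical name introduced for illustration.

```r
# Sketch: TRUE only when AnVIL is installed and the gcloud SDK is found.
# Any error raised by the check is treated as "not available".
has_gcloud <- tryCatch(
    requireNamespace("AnVIL", quietly = TRUE) && isTRUE(AnVIL::gcloud_exists()),
    error = function(e) FALSE
)

if (has_gcloud) {
    message("gcloud SDK found; Terra queries can proceed")
} else {
    message("gcloud SDK not available; skipping Terra queries")
}
```

Wrapping the check in tryCatch keeps the script running on machines where AnVIL is absent or the check itself fails.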

Default Data Workspace

To get a table of available TCGA workspaces, use the selectTCGAworkspace() function:

selectTCGAworkspace()

You can also set the package-wide option with the terraTCGAworkspace function. To check the current setting, use getOption("terraTCGAdata.workspace") or call terraTCGAworkspace.

terraTCGAworkspace("TCGA_COAD_OpenAccess_V1-0_DATA")
getOption("terraTCGAdata.workspace")
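Since the setting is stored as an ordinary R option, it can be inspected and restored with base R. The sketch below sets the option directly with options(); whether the package functions honour a value set this way, rather than through terraTCGAworkspace, is an assumption.

```r
# Set the option with base R and keep the previous value for later.
# Assumption: this is the same option that terraTCGAworkspace() manages.
old <- options(terraTCGAdata.workspace = "TCGA_COAD_OpenAccess_V1-0_DATA")

# Read the current setting back
ws <- getOption("terraTCGAdata.workspace")
ws

# Restore whatever was set before (including "unset")
options(old)
```

Saving and restoring the previous value is useful in scripts that switch between several workspaces.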

Clinical data resources

In order to determine what datasets to download, use the getClinicalTable function to list all of the columns that correspond to clinical data from the different collection centers.

ct <- getClinicalTable(workspace = "TCGA_COAD_OpenAccess_V1-0_DATA")
ct
names(ct)

Clinical data download

After picking the column in the getClinicalTable output, use the column name as input to the getClinical function to obtain the data:

column_name <- "clin__bio__nationwidechildrens_org__Level_1__biospecimen__clin"
clin <- getClinical(
    columnName = column_name,
    participants = TRUE,
    workspace = "TCGA_COAD_OpenAccess_V1-0_DATA"
)
clin[, 1:6]
dim(clin)

Assay data resources

We use the same approach for assay data. We first produce a list of assays with the getAssayTable function and then select one, along with any sample codes of interest.

at <- getAssayTable(workspace = "TCGA_COAD_OpenAccess_V1-0_DATA")
at
names(at)
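The column names used in this vignette are long strings whose parts are separated by double underscores, so splitting them can make the tables easier to scan. The following is a rough sketch using three names that appear in this vignette; the field labels (type, platform, center, level, protocol, kind) are hypothetical and are not defined by the package.

```r
# Column names taken from this vignette
cols <- c(
    "clin__bio__nationwidechildrens_org__Level_1__biospecimen__clin",
    "protein_exp__mda_rppa_core__mdanderson_org__Level_3__protein_normalization__data",
    "rnaseqv2__illuminahiseq_rnaseqv2__unc_edu__Level_3__RSEM_genes_normalized__data"
)

# Split each name on the literal "__" separator (six parts per name here)
parts <- strsplit(cols, "__", fixed = TRUE)
fields <- as.data.frame(do.call(rbind, parts), stringsAsFactors = FALSE)

# Hypothetical labels, chosen only for readability
colnames(fields) <- c("type", "platform", "center", "level", "protocol", "kind")
fields

# Keep only the names whose fourth part is "Level_3"
level3 <- cols[fields$level == "Level_3"]
level3
```

With a real table, cols would be replaced by names(at) or names(ct), assuming those names all split into the same number of parts.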

Summary of sample types in the data

You can get a summary table of all the samples in the data by using the sampleTypesTable function:

sampleTypesTable(workspace = "TCGA_COAD_OpenAccess_V1-0_DATA")
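To illustrate what a sample code refers to, the sketch below extracts the two-digit sample type code from TCGA barcodes, where it occupies characters 14 and 15. The barcodes are made up for illustration, and it is an assumption that the identifiers in a given workspace follow this standard barcode layout.

```r
# Hypothetical barcodes, invented for this example only
barcodes <- c(
    "TCGA-AA-0001-01A",
    "TCGA-AA-0001-10A",
    "TCGA-AA-0002-11A",
    "TCGA-AA-0003-01A"
)

# The sample type code sits at positions 14-15 of a TCGA barcode
sample_code <- substr(barcodes, 14, 15)

# A few standard TCGA sample type codes
code_labels <- c(
    "01" = "Primary Solid Tumor",
    "10" = "Blood Derived Normal",
    "11" = "Solid Tissue Normal"
)

# Tabulate the sample types by label
table(code_labels[sample_code])

# Keep primary tumors and blood-derived normals
keep <- barcodes[sample_code %in% c("01", "10")]
keep
```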

Intermediate function for obtaining only the data

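The getAssayData function retrieves the data for a single assay without building a MultiAssayExperiment. Note that if you have the package-wide option set, the workspace argument is not needed in the function call.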

prot <- getAssayData(
    assayName = "protein_exp__mda_rppa_core__mdanderson_org__Level_3__protein_normalization__data",
    sampleCode = c("01", "10"),
    workspace = "TCGA_COAD_OpenAccess_V1-0_DATA",
    sampleIdx = 1:4
)
head(prot)

MultiAssayExperiment

Finally, once you have collected all the relevant column names, these can be inputs to the main terraTCGAdata function:

mae <- terraTCGAdata(
    clinicalName = "clin__bio__nationwidechildrens_org__Level_1__biospecimen__clin",
    assays =
        c("protein_exp__mda_rppa_core__mdanderson_org__Level_3__protein_normalization__data",
        "rnaseqv2__illuminahiseq_rnaseqv2__unc_edu__Level_3__RSEM_genes_normalized__data"),
    sampleCode = NULL,
    split = FALSE,
    sampleIdx = 1:4,
    workspace = "TCGA_COAD_OpenAccess_V1-0_DATA"
)
mae
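Before making the call above, it can help to confirm that every requested column name actually appears in the discovery tables. This is a minimal sketch in base R: the available vector is a hypothetical stand-in for the combined output of names(ct) and names(at), and not_found is an illustrative variable name.

```r
# Hypothetical stand-in for c(names(ct), names(at))
available <- c(
    "clin__bio__nationwidechildrens_org__Level_1__biospecimen__clin",
    "protein_exp__mda_rppa_core__mdanderson_org__Level_3__protein_normalization__data",
    "rnaseqv2__illuminahiseq_rnaseqv2__unc_edu__Level_3__RSEM_genes_normalized__data"
)

# Names that would be passed as clinicalName and assays
clinical_name <- "clin__bio__nationwidechildrens_org__Level_1__biospecimen__clin"
assay_names <- c(
    "protein_exp__mda_rppa_core__mdanderson_org__Level_3__protein_normalization__data",
    "rnaseqv2__illuminahiseq_rnaseqv2__unc_edu__Level_3__RSEM_genes_normalized__data"
)

# Report any requested name that is not in the tables
not_found <- setdiff(c(clinical_name, assay_names), available)
if (length(not_found) > 0L) {
    stop("Unknown column name(s): ", paste(not_found, collapse = ", "))
}
```

Checking names locally catches typos in these long strings before any cloud resources are queried.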

We expect that most OpenAccess_V1-0 cancer datasets follow this data model. If you encounter any errors, please provide a minimal reproducible example at https://github.com/waldronlab/terraTCGAdata.

Session Info

sessionInfo()
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Etc/UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] terraTCGAdata_1.11.0        MultiAssayExperiment_1.33.4
##  [3] SummarizedExperiment_1.37.0 Biobase_2.67.0             
##  [5] GenomicRanges_1.59.1        GenomeInfoDb_1.43.2        
##  [7] IRanges_2.41.2              S4Vectors_0.45.2           
##  [9] BiocGenerics_0.53.3         generics_0.1.3             
## [11] MatrixGenerics_1.19.0       matrixStats_1.4.1          
## [13] AnVILGCP_1.1.1              AnVIL_1.19.4               
## [15] AnVILBase_1.1.0             dplyr_1.1.4                
## [17] BiocStyle_2.35.0           
## 
## loaded via a namespace (and not attached):
##  [1] xfun_0.49               bslib_0.8.0             httr2_1.0.7            
##  [4] htmlwidgets_1.6.4       lattice_0.22-6          vctrs_0.6.5            
##  [7] tools_4.4.2             parallel_4.4.2          tibble_3.2.1           
## [10] pkgconfig_2.0.3         BiocBaseUtils_1.9.0     Matrix_1.7-1           
## [13] rapiclient_0.1.8        lifecycle_1.0.4         GenomeInfoDbData_1.2.13
## [16] compiler_4.4.2          codetools_0.2-20        httpuv_1.6.15          
## [19] htmltools_0.5.8.1       sys_3.4.3               buildtools_1.0.0       
## [22] sass_0.4.9              yaml_2.3.10             later_1.4.1            
## [25] pillar_1.10.0           crayon_1.5.3            jquerylib_0.1.4        
## [28] tidyr_1.3.1             DT_0.33                 DelayedArray_0.33.3    
## [31] cachem_1.1.0            abind_1.4-8             mime_0.12              
## [34] tidyselect_1.2.1        digest_0.6.37           purrr_1.0.2            
## [37] maketools_1.3.1         grid_4.4.2              fastmap_1.2.0          
## [40] SparseArray_1.7.2       cli_3.6.3               magrittr_2.0.3         
## [43] S4Arrays_1.7.1          UCSC.utils_1.3.0        promises_1.3.2         
## [46] rappdirs_0.3.3          rmarkdown_2.29          lambda.r_1.2.4         
## [49] XVector_0.47.1          httr_1.4.7              futile.logger_1.4.3    
## [52] shiny_1.10.0            evaluate_1.0.1          knitr_1.49             
## [55] miniUI_0.1.1.1          rlang_1.1.4             futile.options_1.0.1   
## [58] Rcpp_1.0.13-1           xtable_1.8-4            glue_1.8.0             
## [61] BiocManager_1.30.25     formatR_1.14            jsonlite_1.8.9         
## [64] R6_2.5.1