The terraTCGAdata R package imports TCGA datasets available on the Terra platform as MultiAssayExperiment objects. It provides one main function and two helper functions for discovering relevant datasets:
terraTCGAdata creates the MultiAssayExperiment object from the indicated resources.
getClinicalTable and getAssayTable list the datasets available within the Terra data model. The column names from these tables can be provided as inputs to the terraTCGAdata function.
Some public Terra workspaces come pre-packaged with TCGA data (i.e., cloud data resources are linked within the data model), in particular the workspaces labelled OpenAccess_V1-0. Datasets harmonized to the hg38 genome, such as those from the Genomic Data Commons data repository, use a different data model / workflow and are not compatible with the functions in this package. For compatible workspaces, the package uses the Terra data model and represents the data as a MultiAssayExperiment.
For more information on MultiAssayExperiment, please see the vignette in that package.
A valid GCloud SDK installation is required to use the package. To
get set up, see the Bioconductor tutorials for running RStudio on Terra.
Use the gcloud_exists() function from the AnVIL package to check whether it is installed on your system. In the environment used to build this document, the call returned FALSE:
## [1] FALSE
You can also use the gcloud_project function to set a project name by specifying the project argument.
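The two gcloud steps above can be sketched as follows. This is a guarded illustration, not the original chunk: the project name my-billing-project is a hypothetical placeholder, and the tryCatch() wrapper is only there so that the snippet runs harmlessly on a machine without AnVIL or the gcloud SDK.

```r
# Guarded check: FALSE when AnVIL or the gcloud SDK is missing
has_gcloud <- isTRUE(
    tryCatch(AnVIL::gcloud_exists(), error = function(e) FALSE)
)

# Hypothetical billing project name; replace with your own
project_name <- "my-billing-project"

if (has_gcloud) {
    # Set the active gcloud project through the `project` argument
    AnVIL::gcloud_project(project = project_name)
} else {
    message("gcloud SDK not found; install it before using terraTCGAdata.")
}
```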
To get a table of available TCGA workspaces, use the selectTCGAworkspace() function.
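A minimal sketch of that call is shown here, assuming a working gcloud setup. The ready flag is a helper introduced only for this example so that nothing is sent to Terra when the prerequisites are missing. Depending on the session, the function may present the table interactively rather than simply returning it.

```r
# Query Terra only when the package and the gcloud SDK are both available
ready <- requireNamespace("terraTCGAdata", quietly = TRUE) &&
    isTRUE(tryCatch(AnVIL::gcloud_exists(), error = function(e) FALSE))

if (ready) {
    # Table of available TCGA workspaces
    workspaces <- terraTCGAdata::selectTCGAworkspace()
    print(workspaces)
}
```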
You can also set the package-wide option with the terraTCGAworkspace function and check the setting with getOption('terraTCGAdata.workspace') or by running the terraTCGAworkspace function.
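As an illustration, the option could be set and read back as below. The workspace name is the one used in the terraTCGAdata() call in this document. Passing it as the first positional argument is an assumption about the function's signature.

```r
# Query Terra only when the package and the gcloud SDK are both available
ready <- requireNamespace("terraTCGAdata", quietly = TRUE) &&
    isTRUE(tryCatch(AnVIL::gcloud_exists(), error = function(e) FALSE))

workspace <- "TCGA_COAD_OpenAccess_V1-0_DATA"

if (ready) {
    # Set the package-wide workspace (first positional argument assumed)
    terraTCGAdata::terraTCGAworkspace(workspace)
    # Read the setting back from the session options
    print(getOption("terraTCGAdata.workspace"))
}
```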
To determine which datasets to download, use the getClinicalTable function to list all of the columns that correspond to clinical data from the different collection centers.
After picking a column from the getClinicalTable output, pass its name to the getClinical function to obtain the data.
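The two-step clinical workflow might look like the sketch below. The column name is the one passed as clinicalName in the terraTCGAdata() call in this document. The argument name columnName is an assumption about the getClinical signature, and the ready guard is a helper added for this example.

```r
# Query Terra only when the package and the gcloud SDK are both available
ready <- requireNamespace("terraTCGAdata", quietly = TRUE) &&
    isTRUE(tryCatch(AnVIL::gcloud_exists(), error = function(e) FALSE))

workspace <- "TCGA_COAD_OpenAccess_V1-0_DATA"
clinical_column <-
    "clin__bio__nationwidechildrens_org__Level_1__biospecimen__clin"

if (ready) {
    # Step 1: list the clinical columns available in the workspace
    clinical_table <- terraTCGAdata::getClinicalTable(workspace = workspace)
    print(names(clinical_table))

    # Step 2: download the data behind one column
    # (argument name `columnName` is assumed here)
    clin <- terraTCGAdata::getClinical(
        columnName = clinical_column,
        workspace = workspace
    )
    print(dim(clin))
}
```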
We use the same approach for assay data. We first produce a list of assays with the getAssayTable function and then select one, along with any sample codes of interest.
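The assay workflow can be sketched in the same way. The text above does not name the download function. getAssayData and its argument assayName are assumptions here, while sampleCode, sampleIdx and workspace mirror the arguments of the terraTCGAdata() call in this document. The sample codes "01" (primary solid tumor) and "10" (blood derived normal) are chosen only for illustration.

```r
# Query Terra only when the package and the gcloud SDK are both available
ready <- requireNamespace("terraTCGAdata", quietly = TRUE) &&
    isTRUE(tryCatch(AnVIL::gcloud_exists(), error = function(e) FALSE))

workspace <- "TCGA_COAD_OpenAccess_V1-0_DATA"
assay_name <- paste0(
    "protein_exp__mda_rppa_core__mdanderson_org__",
    "Level_3__protein_normalization__data"
)
# TCGA sample type codes: 01 = primary solid tumor, 10 = blood derived normal
sample_codes <- c("01", "10")

if (ready) {
    # Step 1: list the assay columns available in the workspace
    assay_table <- terraTCGAdata::getAssayTable(workspace = workspace)
    print(names(assay_table))

    # Step 2: download one assay for the chosen sample codes
    # (function and argument names are assumed, see the note above)
    prot <- terraTCGAdata::getAssayData(
        assayName = assay_name,
        sampleCode = sample_codes,
        workspace = workspace,
        sampleIdx = 1:4
    )
    print(head(prot))
}
```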
You can get a summary table of all the samples in the data by using the sampleTypesTable function.
Note that if you have the package-wide option set, the workspace argument is not needed in the function call.
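A short sketch of the sample summary follows, with the workspace passed explicitly. As noted above, the argument can be dropped once the package-wide option is set. The ready guard is a helper added for this example.

```r
# Query Terra only when the package and the gcloud SDK are both available
ready <- requireNamespace("terraTCGAdata", quietly = TRUE) &&
    isTRUE(tryCatch(AnVIL::gcloud_exists(), error = function(e) FALSE))

workspace <- "TCGA_COAD_OpenAccess_V1-0_DATA"

if (ready) {
    # Summary of the samples in the workspace, by sample type
    sample_types <- terraTCGAdata::sampleTypesTable(workspace = workspace)
    print(sample_types)
}
```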
Finally, once you have collected all the relevant column names, these can be inputs to the main terraTCGAdata function:
library(terraTCGAdata)
# Build a MultiAssayExperiment from one clinical column and two assay columns
mae <- terraTCGAdata(
    clinicalName = "clin__bio__nationwidechildrens_org__Level_1__biospecimen__clin",
    assays = c(
        "protein_exp__mda_rppa_core__mdanderson_org__Level_3__protein_normalization__data",
        "rnaseqv2__illuminahiseq_rnaseqv2__unc_edu__Level_3__RSEM_genes_normalized__data"
    ),
    sampleCode = NULL,
    split = FALSE,
    sampleIdx = 1:4,
    workspace = "TCGA_COAD_OpenAccess_V1-0_DATA"
)
# Print a summary of the assembled object
mae
We expect that most OpenAccess_V1-0 cancer datasets follow this data model. If you encounter any errors, please provide a minimal reproducible example at https://github.com/waldronlab/terraTCGAdata.
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] terraTCGAdata_1.11.0 MultiAssayExperiment_1.33.1
## [3] SummarizedExperiment_1.37.0 Biobase_2.67.0
## [5] GenomicRanges_1.59.1 GenomeInfoDb_1.43.2
## [7] IRanges_2.41.1 S4Vectors_0.45.2
## [9] BiocGenerics_0.53.3 generics_0.1.3
## [11] MatrixGenerics_1.19.0 matrixStats_1.4.1
## [13] AnVILGCP_1.1.1 AnVIL_1.19.3
## [15] AnVILBase_1.1.0 dplyr_1.1.4
## [17] BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] xfun_0.49 bslib_0.8.0 httr2_1.0.7
## [4] htmlwidgets_1.6.4 lattice_0.22-6 vctrs_0.6.5
## [7] tools_4.4.2 parallel_4.4.2 tibble_3.2.1
## [10] fansi_1.0.6 pkgconfig_2.0.3 BiocBaseUtils_1.9.0
## [13] Matrix_1.7-1 rapiclient_0.1.8 lifecycle_1.0.4
## [16] GenomeInfoDbData_1.2.13 compiler_4.4.2 codetools_0.2-20
## [19] httpuv_1.6.15 htmltools_0.5.8.1 sys_3.4.3
## [22] buildtools_1.0.0 sass_0.4.9 yaml_2.3.10
## [25] crayon_1.5.3 later_1.4.1 pillar_1.9.0
## [28] jquerylib_0.1.4 tidyr_1.3.1 DT_0.33
## [31] DelayedArray_0.33.2 cachem_1.1.0 abind_1.4-8
## [34] mime_0.12 tidyselect_1.2.1 digest_0.6.37
## [37] purrr_1.0.2 maketools_1.3.1 grid_4.4.2
## [40] fastmap_1.2.0 SparseArray_1.7.2 cli_3.6.3
## [43] magrittr_2.0.3 S4Arrays_1.7.1 utf8_1.2.4
## [46] UCSC.utils_1.3.0 promises_1.3.2 rappdirs_0.3.3
## [49] rmarkdown_2.29 lambda.r_1.2.4 XVector_0.47.0
## [52] httr_1.4.7 futile.logger_1.4.3 shiny_1.9.1
## [55] evaluate_1.0.1 knitr_1.49 miniUI_0.1.1.1
## [58] rlang_1.1.4 futile.options_1.0.1 Rcpp_1.0.13-1
## [61] xtable_1.8-4 glue_1.8.0 BiocManager_1.30.25
## [64] formatR_1.14 jsonlite_1.8.9 R6_2.5.1
## [67] zlibbioc_1.52.0