Title: | Exposes and Makes Available Data from the cBioPortal Web Resources |
---|---|
Description: | The cBioPortalData R package accesses study datasets from the cBio Cancer Genomics Portal. It accesses the data either from the pre-packaged zip / tar files or from the API that was recently implemented by the cBioPortal Data Team. The package can provide data either in tabular format or as a MultiAssayExperiment object that uses familiar Bioconductor data representations. |
Authors: | Levi Waldron [aut],
Marcel Ramos [aut, cre] |
Maintainer: | Marcel Ramos <[email protected]> |
License: | AGPL-3 |
Version: | 2.19.13 |
Built: | 2025-02-12 09:26:06 UTC |
Source: | https://github.com/bioc/cBioPortalData |
Managing data downloads is important to save disk space and avoid re-downloading data files. This can be done via the integrated BiocFileCache system.
cBioCache(..., ask = interactive())

setCache(
    directory = tools::R_user_dir("cBioPortalData", "cache"),
    verbose = TRUE,
    ask = interactive()
)

removePackCache(cancer_study_id, dry.run = TRUE)
... |
For |
ask |
logical (default TRUE when interactive session) Confirm the file location of the cache directory |
directory |
The file location where the cache is located. Once set future downloads will go to this folder. |
verbose |
Whether to print descriptive messages |
cancer_study_id |
|
dry.run |
logical Whether to perform a dry run, i.e., report the cache files without removing them (default TRUE). |
cBioCache: The path to the cache location
Get the directory location of the cache. It will prompt the user to create a cache if not already created. A specific directory can be used via setCache.
Specify the directory location of the data cache. By default, it will go to the user directory as given by:
tools::R_user_dir("cBioPortalData", "cache")
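As a rough illustration of where this default resolves, the base R sketch below prints the path and shows that the R_USER_CACHE_DIR environment variable redirects it. It does not call any cBioPortalData function, and the temporary folder is only a stand-in chosen for the demonstration.

```r
# Default per-user cache directory for the package (requires R >= 4.0.0)
default_cache <- tools::R_user_dir("cBioPortalData", which = "cache")
print(default_cache)

# R_USER_CACHE_DIR overrides the base location; a temporary folder is used
# here purely for demonstration
old <- Sys.getenv("R_USER_CACHE_DIR", unset = NA)
Sys.setenv(R_USER_CACHE_DIR = file.path(tempdir(), "demo_cache"))
redirected_cache <- tools::R_user_dir("cBioPortalData", which = "cache")
print(redirected_cache)

# Restore the previous setting
if (is.na(old)) {
    Sys.unsetenv("R_USER_CACHE_DIR")
} else {
    Sys.setenv(R_USER_CACHE_DIR = old)
}
```

The exact default path differs by operating system, so no expected output is shown.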
Some files may become corrupt when downloading. This function allows the user to delete the tarball associated with a cancer_study_id in the cache. This only works for the cBioDataPack function. To remove the entire cBioPortalData cache at its default location, run unlink(tools::R_user_dir("cBioPortalData", "cache"), recursive = TRUE).
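Note that unlink() only deletes a directory when recursive = TRUE is supplied. The following base R sketch demonstrates this on a throwaway folder; the folder and file names are made up for the demonstration and it does not touch the real cache.

```r
# Build a throwaway directory that mimics a cache folder with one file in it
demo_cache <- file.path(tempdir(), "demo_cBioPortalData_cache")
dir.create(demo_cache, showWarnings = FALSE)
file.create(file.path(demo_cache, "acc_tcga.tar.gz"))

# Without recursive = TRUE, unlink() leaves directories in place
unlink(demo_cache)
still_there <- dir.exists(demo_cache)

# With recursive = TRUE, the directory and its contents are deleted
unlink(demo_cache, recursive = TRUE)
removed <- !dir.exists(demo_cache)
```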
cBioCache()

removePackCache("acc_tcga", dry.run = TRUE)
cBioPortalData no longer caches data from API responses; therefore, removeDataCache is no longer needed. It will be removed as early as the next release of Bioconductor.
removeDataCache(
    api,
    studyId = NA_character_,
    genePanelId = NA_character_,
    genes = NA_character_,
    molecularProfileIds = NULL,
    sampleListId = NULL,
    sampleIds = NULL,
    by = c("entrezGeneId", "hugoGeneSymbol"),
    dry.run = TRUE,
    ...
)
api |
An API object of class |
studyId |
|
genePanelId |
|
genes |
|
molecularProfileIds |
|
sampleListId |
|
sampleIds |
|
by |
|
dry.run |
logical Whether to perform a dry run, i.e., report the cache files without removing them (default TRUE). |
... |
Additional arguments to lower level API functions |
removeDataCache: The path to the cache location when dry.run = TRUE if the file exists. Otherwise, when dry.run = FALSE, the function returns the output of the file.remove operation.
Remove the computed cache location based on the function inputs to cBioPortalData(). To remove the cache, simply replace the cBioPortalData() function name with removeDataCache(); see the example. If the computed cache location is not found, it will return an empty vector.
cbio <- cBioPortal()

cBioPortalData(
    cbio, by = "hugoGeneSymbol",
    studyId = "acc_tcga",
    genePanelId = "AmpliSeq",
    molecularProfileIds =
        c("acc_tcga_rppa", "acc_tcga_linear_CNA", "acc_tcga_mutations")
)

removeDataCache(
    cbio, by = "hugoGeneSymbol",
    studyId = "acc_tcga",
    genePanelId = "AmpliSeq",
    molecularProfileIds =
        c("acc_tcga_rppa", "acc_tcga_linear_CNA", "acc_tcga_mutations"),
    dry.run = TRUE
)
The cBioDataPack function allows the user to download and process cancer study datasets found in MSKCC's cBioPortal. Output datasets use the MultiAssayExperiment data representation to facilitate analysis and data management operations.
cBioDataPack(
    cancer_study_id,
    use_cache = TRUE,
    names.field = c("Hugo_Symbol", "Entrez_Gene_Id", "Gene"),
    cleanup = TRUE,
    ask = interactive(),
    check_build = TRUE
)
cancer_study_id |
|
use_cache |
|
names.field |
|
cleanup |
|
ask |
|
check_build |
logical(1L) Whether to check the build status of the
|
The full list of study identifiers (studyIds) can be obtained from getStudies(). Currently, only about 72% of datasets can be represented as MultiAssayExperiment data objects from the data tarballs. Refer to getStudies(..., buildReport = TRUE) and its "pack_build" column to see which study identifiers are not building. Users who would like to prioritize particular datasets should open GitHub issues at the URL in the DESCRIPTION file. For a more fine-grained approach to downloading data from the cBioPortal API, refer to the cBioPortalData function.
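The "pack_build" check described above can be sketched with base R subsetting. The table below is a toy stand-in for the output of getStudies(api, buildReport = TRUE): the study identifiers and values are invented, and treating "pack_build" as a logical column is an assumption; only the column names come from this documentation.

```r
# Toy stand-in for getStudies(api, buildReport = TRUE); values are made up
# and pack_build is assumed (not confirmed) to be a logical column
studies <- data.frame(
    studyId = c("study_a", "study_b", "study_c"),
    pack_build = c(TRUE, FALSE, TRUE),
    stringsAsFactors = FALSE
)

# Study identifiers whose tarballs do not currently build
failing <- studies$studyId[!studies$pack_build]

# Proportion of studies that build
build_rate <- mean(studies$pack_build)
```

With a real table, the same two lines would be applied to the object returned by getStudies.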
A MultiAssayExperiment object
The cBioDataPack function accesses data from the location given by the cBio_URL option. By default, it points to an Amazon S3 bucket location. Previously, it pointed to 'http://download.cbioportal.org'. This recent change (> 2.1.17) should provide faster and more reliable downloads for all users. See the URL using cBioPortalData:::.url_location. If there are mirrors that host this data, the location can be changed by setting the cBio_URL option with options(cBio_URL = "https://some.url.com/") before running the function.
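In base R, options are set with options() and read back with getOption(), whose second argument is only a fallback used when the option is unset. A minimal sketch, using the placeholder mirror URL from the text above (not a real mirror):

```r
# Set the option and keep the previous value so it can be restored
old_opts <- options(cBio_URL = "https://some.url.com/")  # placeholder URL

# Read the value back; the fallback is ignored because the option is set
current_url <- getOption("cBio_URL", "fallback-not-used")
print(current_url)

# Restore the previous state of the option
options(old_opts)
```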
Levi Waldron, Marcel R., Ino dB.
https://www.cbioportal.org/datasets, cBioPortalData, removePackCache
cbio <- cBioPortal()

head(getStudies(cbio)[["studyId"]])

mae <- cBioDataPack("acc_tcga")
This section of the documentation lists the functions that allow users to access the cBioPortal API. The main representation of the API can be obtained from the cBioPortal function. The supporting functions listed here give access to specific parts of the API and allow the user to explore the API with individual calls. Many of the functions here are listed for documentation purposes and are recommended for advanced usage only. Users should only need to use the cBioPortalData main function to obtain data.
cBioPortal(
    hostname = "www.cbioportal.org",
    protocol = "https",
    api. = "/api/v2/api-docs",
    token = character()
)

getStudies(api, buildReport = FALSE)

clinicalData(api, studyId = NA_character_)

molecularProfiles(
    api,
    studyId = NA_character_,
    projection = c("SUMMARY", "ID", "DETAILED", "META")
)

fetchData(
    api,
    studyId,
    molecularProfileIds = NA_character_,
    entrezGeneIds = NULL,
    sampleIds = NULL
)

mutationData(
    api,
    molecularProfileIds = NA_character_,
    entrezGeneIds = NULL,
    sampleIds = NULL
)

molecularData(
    api,
    molecularProfileIds = NA_character_,
    entrezGeneIds = NULL,
    sampleIds = NULL
)

copyNumberData(
    api,
    molecularProfileIds = NA_character_,
    entrezGeneIds = NULL,
    sampleIds = NULL,
    sampleListId = NULL,
    discreteCopyNumberEventType = c("HOMDEL_AND_AMP", "HOMDEL", "AMP", "GAIN",
        "HETLOSS", "DIPLOID", "ALL"),
    projection = c("SUMMARY", "ID", "DETAILED", "META")
)

searchOps(api, keyword)

samplesInSampleLists(api, sampleListIds = NA_character_)

sampleLists(api, studyId = NA_character_)

allSamples(api, studyId = NA_character_)

getSampleInfo(
    api,
    studyId = NA_character_,
    sampleListIds = NULL,
    projection = c("SUMMARY", "ID", "DETAILED", "META")
)

genePanels(api)

getGenePanel(api, genePanelId = NA_character_)

genePanelMolecular(
    api,
    molecularProfileId = NA_character_,
    sampleListId = NULL,
    sampleIds = NULL
)

getGenePanelMolecular(api, molecularProfileIds = NA_character_, sampleIds)

geneTable(api, pageSize = 1000, pageNumber = 0, ...)

queryGeneTable(
    api,
    by = c("entrezGeneId", "hugoGeneSymbol"),
    genes = NA_character_,
    genePanelId = NA_character_
)

getDataByGenes(
    api,
    studyId = NA_character_,
    genes = NA_character_,
    genePanelId = NA_character_,
    by = c("entrezGeneId", "hugoGeneSymbol"),
    molecularProfileIds = NULL,
    sampleListId = NULL,
    sampleIds = NULL,
    ...
)
hostname |
|
protocol |
|
api. |
|
token |
|
api |
An API object of class |
buildReport |
|
studyId |
|
projection |
|
molecularProfileIds |
|
entrezGeneIds |
|
sampleIds |
|
sampleListId |
|
discreteCopyNumberEventType |
|
keyword |
|
sampleListIds |
|
genePanelId |
|
molecularProfileId |
|
pageSize |
|
pageNumber |
|
... |
Additional arguments to lower level API functions |
by |
|
genes |
|
cBioPortal: An API object of class 'cBioPortal'
cBioPortalData: A data object of class 'MultiAssayExperiment'
getStudies: Obtain a table of studies and associated metadata and optionally include a buildReport status (default FALSE) for each study. When enabled, the 'api_build' and 'pack_build' columns will be added to the table and will show if MultiAssayExperiment objects can be generated for that particular study identifier (studyId). The 'api_build' column corresponds to datasets obtained with cBioPortalData and the 'pack_build' column corresponds to datasets loaded via cBioDataPack.
searchOps - Search through API operations with a keyword
sampleLists - Obtain all sampleListIds for a particular studyId
allSamples - Obtain all samples within a particular studyId
genePanels - Show all available gene panels
geneTable - Get a table of all genes by 'entrezGeneId' and 'hugoGeneSymbol'
queryGeneTable - Get a table for only the genes or genePanelId of interest. Gene inputs are identified with the by argument
clinicalData - Obtain clinical data for a particular study identifier ('studyId')
molecularProfiles - Produce a molecular profiles dataset for a given study identifier ('studyId')
fetchData - A convenience function to download both mutation and molecular data with molecularProfileIds, entrezGeneIds, and sampleIds
mutationData - Produce a dataset of mutation data using molecularProfileIds, entrezGeneIds, and sampleIds
molecularData - Produce a dataset of molecular profile data based on molecularProfileIds, entrezGeneIds, and sampleIds
copyNumberData - Produce a dataset of copy number data based on molecularProfileIds, sampleListId, discreteCopyNumberEventType, and projection
samplesInSampleLists - Get all samples associated with one or more sampleListIds
getSampleInfo - Obtain sample metadata for a particular studyId or sampleListIds
getGenePanel - Obtain the gene panel for a particular 'genePanelId'
genePanelMolecular - Get gene panel data for a particular molecularProfileId and either a sampleListId or a vector of sampleIds
getGenePanelMolecular - Get gene panel data for multiple molecularProfileIds and a vector of sampleIds
getDataByGenes - Download data for a number of genes within the given molecularProfileIds; optionally, a sampleListId can be provided.
cbio <- cBioPortal()

getStudies(api = cbio)

clinicalData(cbio, "acc_tcga")

molecularProfiles(cbio, "acc_tcga")

fetchData(
    api = cbio,
    studyId = "acc_tcga",
    molecularProfileIds = c(
        "acc_tcga_mutations", "acc_tcga_gistic", "acc_tcga_rppa"
    ),
    entrezGeneIds = 1:1000,
    sampleIds = c("TCGA-OR-A5J1-01", "TCGA-OR-A5J2-01")
)

mutationData(
    api = cbio,
    molecularProfileIds = "acc_tcga_mutations",
    entrezGeneIds = 1:1000,
    sampleIds = c("TCGA-OR-A5J1-01", "TCGA-OR-A5J2-01")
)

molecularData(
    api = cbio,
    molecularProfileIds = c("acc_tcga_rna_seq_v2_mrna", "acc_tcga_rppa"),
    entrezGeneIds = 1:100,
    sampleIds = c("TCGA-OR-A5J1-01", "TCGA-OR-A5J2-01")
)

## obtain molecularProfileId for discrete copy number alteration data
molecularProfiles(cbio, "acc_tcga") |>
    dplyr::filter(
        molecularAlterationType == "COPY_NUMBER_ALTERATION" &
        datatype == "DISCRETE"
    )

copyNumberData(
    api = cbio,
    molecularProfileIds = "acc_tcga_gistic",
    entrezGeneIds = 25,
    sampleListId = "acc_tcga_all"
)

searchOps(api = cbio, keyword = "molecular")

samplesInSampleLists(
    api = cbio,
    sampleListIds = c("acc_tcga_rppa", "acc_tcga_cnaseq")
)

sampleLists(api = cbio, studyId = "acc_tcga")

genePanels(cbio)

getGenePanel(cbio, "AmpliSeq")

queryGeneTable(api = cbio, by = "entrezGeneId", genes = 7157)

getDataByGenes(
    api = cbio,
    studyId = "acc_tcga",
    genes = 1,
    by = "entrezGeneId",
    molecularProfileIds = "acc_tcga_rna_seq_v2_mrna",
    sampleListId = "acc_tcga_rna_seq_v2_mrna"
)
The cBioPortal class is a representation of the cBioPortal API protocol that directly inherits from the Service class in the AnVIL package. For more information, see the AnVIL package.
## S4 method for signature 'cBioPortal'
operations(x, ..., .deprecated = FALSE)
x |
An AnVIL instance or API representation as given by the cBioPortal function. |
... |
additional arguments passed to methods or, for
|
.deprecated |
optional logical(1) include deprecated operations? |
This class takes the static API specification provided at https://www.cbioportal.org/api/v2/api-docs and creates an R object with help from the underlying infrastructure (i.e., rapiclient and AnVIL) to give the user a unified representation of the API specification provided by the cBioPortal group. Users are not expected to interact with this class other than to use it as input to the functionality provided by the rest of the package.
A cBioPortal class instance
operations(cBioPortal): List all the operations available with the cBioPortal API object, e.g., api$operation
cBioPortal()
Obtain a MultiAssayExperiment object for a particular gene panel, studyId, molecularProfileIds, and sampleListId combination. By default, molecularProfileIds and sampleListId are set to NULL, which includes all data. This option is best for users who wish to obtain a section of the study data that pertains to a specific molecular profile and gene panel combination. For users looking to download the entire study data as provided at https://www.cbioportal.org/datasets, refer to cBioDataPack.
cBioPortalData(
    api,
    studyId = NA_character_,
    genePanelId = NA_character_,
    genes = NA_character_,
    molecularProfileIds = NULL,
    sampleListId = NULL,
    sampleIds = NULL,
    by = c("entrezGeneId", "hugoGeneSymbol"),
    check_build = TRUE,
    ask = interactive()
)
api |
An API object of class |
studyId |
|
genePanelId |
|
genes |
|
molecularProfileIds |
|
sampleListId |
|
sampleIds |
|
by |
|
check_build |
logical(1L) Whether to check the build status of the
|
ask |
|
We are able to successfully represent 98 percent of the study identifiers as MultiAssayExperiment objects as obtained via cBioPortalData with the IMPACT341 genePanelId as the example gene panel. Datasets that currently fail to import can be seen in the getStudies(..., buildReport = TRUE) dataset under the "api_build" column. Note that changes to the cBioPortal API may affect this rate at any time. If you encounter any issues, please open a GitHub issue at the https://github.com/waldronlab/cBioPortalData/issues/ page with a fully reproducible example.
A MultiAssayExperiment object
cbio <- cBioPortal()

samps <- samplesInSampleLists(cbio, "acc_tcga_rppa")[[1]]

getGenePanelMolecular(
    cbio,
    molecularProfileIds = c("acc_tcga_rppa", "acc_tcga_linear_CNA"),
    samps
)

acc_tcga <- cBioPortalData(
    cbio,
    by = "hugoGeneSymbol",
    studyId = "acc_tcga",
    genePanelId = "AmpliSeq",
    molecularProfileIds =
        c("acc_tcga_rppa", "acc_tcga_linear_CNA", "acc_tcga_mutations")
)
Note that these functions should be used when a particular study is not currently available as a MultiAssayExperiment representation. Otherwise, use cBioDataPack. Provide a cancer_study_id from getStudies and retrieve the study tarball from the cBio Cancer Genomics Portal. These functions are used by cBioDataPack under the hood to download, untar, and load the tarball datasets with caching. As stated in ?cBioDataPack, not all studies are currently working as MultiAssayExperiment objects. As of July 2020, about 80% of datasets can be successfully imported into the MultiAssayExperiment data class. Please open an issue if you would like the team to prioritize a study. You may also check getStudies(api, buildReport = TRUE)$pack_build for the current status.
downloadStudy(
    cancer_study_id,
    use_cache = TRUE,
    force = FALSE,
    url_location = getOption("cBio_URL", .url_location),
    ask = interactive()
)

untarStudy(cancer_study_file, exdir = tempdir())

loadStudy(
    filepath,
    names.field = c("Hugo_Symbol", "Entrez_Gene_Id", "Gene",
        "Composite.Element.REF"),
    cleanup = TRUE
)
cancer_study_id |
|
use_cache |
|
force |
|
url_location |
|
ask |
|
cancer_study_file |
|
exdir |
|
filepath |
|
names.field |
|
cleanup |
|
When attempting to load a dataset using loadStudy, note that the cleanup argument is set to TRUE by default. Change the argument to FALSE if you would like to keep the untarred data in the exdir location. downloadStudy and untarStudy are not affected by this argument. The tarball of the downloaded data is cached via BiocFileCache when use_cache is TRUE.
downloadStudy - The file location of the data tarball
untarStudy - The directory location of the contents
loadStudy - A MultiAssayExperiment-class object
cBioDataPack, MultiAssayExperiment
acc_file <- downloadStudy("acc_tcga")
acc_file

file_dir <- untarStudy(acc_file, tempdir())
file_dir

loadStudy(file_dir)