| Title: | A new tool for exporting TCGA Firehose data |
|---|---|
| Description: | Managing data from large-scale projects such as The Cancer Genome Atlas (TCGA) for further analysis is an important and time-consuming step for research projects. Several efforts, such as the Firehose project, make TCGA pre-processed data publicly available via web services and data portals, but the data must still be managed, downloaded and prepared for subsequent steps. We developed an open source and extensible R-based data client for Firehose pre-processed data and demonstrated its use with sample case studies. Results showed that RTCGAToolbox could improve data management for researchers who are interested in TCGA data. In addition, it can be integrated with other analysis pipelines for subsequent data analysis. |
| Authors: | Mehmet Samur [aut], Marcel Ramos [aut, cre] (ORCID: <https://orcid.org/0000-0002-3242-0582>), Ludwig Geistlinger [ctb] |
| Maintainer: | Marcel Ramos <[email protected]> |
| License: | GPL-2 |
| Version: | 2.43.0 |
| Built: | 2026-05-30 06:54:34 UTC |
| Source: | https://github.com/bioc/RTCGAToolbox |
See the acc_sample.R script to see how the data was generated.
This dataset contains real data from The Cancer Genome Atlas for
the pipeline run date and GISTIC analysis date of 2016-01-28.
data("accmini", package = "RTCGAToolbox")
A FirehoseData data object
Extract and convert data from a FirehoseData object to a
Bioconductor object
This function processes data from a
FirehoseData object. Raw data is
converted to a conventional Bioconductor object. The function returns either
SummarizedExperiment
or RaggedExperiment. In cases
where there are multiple platforms in a data type, an attempt to consolidate
datasets will be made based on matching dimension names. For ranged data,
this functionality is provided with more control as part of the
RaggedExperiment features. See the
RaggedExperiment-class for
more details.
biocExtract(
  object,
  type = c("clinical", "RNASeqGene", "RNASeq2Gene", "miRNASeqGene",
    "RNASeq2GeneNorm", "CNASNP", "CNVSNP", "CNASeq", "CNACGH", "Methylation",
    "Mutation", "mRNAArray", "miRNAArray", "RPPAArray", "GISTIC", "GISTICA",
    "GISTICT", "GISTICP"),
  ...
)
object |
A FirehoseData object |
type |
The type of data to extract from the "FirehoseData" object, see type section. |
... |
Additional arguments passed to lower-level functions that
convert tabular data into Bioconductor objects |
A typical additional argument for this function passed down to
lower level functions is the names.field which indicates the row names
in the data. By default, it is the "Hugo_Symbol" column in the internal
code that converts data.frames to SummarizedExperiment representations
(via the .makeSummarizedExperimentFromDataFrame internal function).
Either SummarizedExperiment or RaggedExperiment.
Choices include the following:
clinical: Get the clinical data slot
RNASeqGene: RNASeq v1
RNASeq2Gene: RNASeq v2
RNASeq2GeneNorm: RNASeq v2 Normalized
miRNASeqGene: micro RNA SeqGene
CNASNP: Copy Number Alteration
CNVSNP: Copy Number Variation
CNASeq: Copy Number Alteration
CNACGH: Copy Number Alteration
Methylation: Methylation
mRNAArray: Messenger RNA
miRNAArray: micro RNA
RPPAArray: Reverse Phase Protein Array
Mutation: Mutations
GISTICA: GISTIC v2 ('AllByGene' only)
GISTICT: GISTIC v2 ('ThresholdedByGene' only)
GISTICP: GISTIC v2 ('Peaks' only)
GISTIC: GISTIC v2 scores, probabilities, and peaks
Marcel Ramos [email protected]
data(accmini)
biocExtract(accmini, "RNASeq2Gene")
biocExtract(accmini, "miRNASeqGene")
biocExtract(accmini, "RNASeq2GeneNorm")
biocExtract(accmini, "CNASNP")
biocExtract(accmini, "CNVSNP")
biocExtract(accmini, "Methylation")
biocExtract(accmini, "Mutation")
biocExtract(accmini, "RPPAArray")
biocExtract(accmini, "GISTIC")
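Because biocExtract can return either of two container classes, downstream code often needs to check which one it received. The following is a minimal sketch under that assumption; `describe_extract` is a hypothetical helper written for this illustration and is not part of RTCGAToolbox. It only relies on the bundled accmini dataset, so no download should be needed.

```r
library(RTCGAToolbox)
data(accmini)

# Hypothetical helper (not part of RTCGAToolbox): report which
# Bioconductor container class biocExtract() handed back.
describe_extract <- function(x) {
    if (methods::is(x, "RaggedExperiment")) {
        "RaggedExperiment (ranged data)"
    } else if (methods::is(x, "SummarizedExperiment")) {
        "SummarizedExperiment (rectangular data)"
    } else {
        paste("other:", class(x)[1])
    }
}

# Two data types taken from the examples above
mirna <- biocExtract(accmini, "miRNASeqGene")
mut <- biocExtract(accmini, "Mutation")

describe_extract(mirna)
describe_extract(mut)
```

Checking the class before calling accessors keeps a pipeline from assuming a rectangular assay when the data type is actually ranged.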
An S4 class to store correlations between gene expression level and copy number data
Dataset: A cohort name
Correlations: Results data frame
An S4 class to store differential gene expression results
Dataset: Dataset name
Toptable: Results data frame
An S4 class to store data from CGH platforms
Filename: Platform name
DataMatrix: A data frame that stores the CGH data.
An S4 class to store the main data object from the client function.
## S4 method for signature 'FirehoseData'
show(object)

## S4 method for signature 'FirehoseData'
getData(object, type, platform)

## S4 method for signature 'FirehoseGISTIC'
getData(object, type, platform)

## S4 method for signature 'ANY'
getData(object, type, platform)

## S4 method for signature 'FirehoseData'
updateObject(object, ..., verbose = FALSE)

## S4 method for signature 'FirehoseData'
selectType(object, dataType)
object |
A FirehoseData object |
type |
A data type to be extracted |
platform |
An index for data types that may come from multiple platforms (such as mRNAArray), for GISTIC data, one of the options: 'AllByGene', 'ThresholdedByGene', or 'Peaks' |
... |
additional arguments for updateObject |
verbose |
logical (default FALSE) whether to print extra messages |
dataType |
An available data type, see object show method |
show(FirehoseData): show method
getData(FirehoseData): Get a matrix or data.frame from FirehoseData
getData(FirehoseGISTIC): Get GISTIC data from FirehoseData
getData(ANY): Default method for getting data from
FirehoseData
updateObject(FirehoseData): Update an old RTCGAToolbox FirehoseData object to
the most recent API
selectType(FirehoseData): Extract data type
Dataset: A cohort name
runDate: Standard data run date from getFirehoseRunningDates
gistic2Date: Analysis run date from getFirehoseAnalyzeDates
clinical: Clinical data frame
RNASeqGene: Gene-level expression data matrix from RNAseq
RNASeq2Gene: Gene-level expression data matrix from RNAseqV2
RNASeq2GeneNorm: Gene-level expression data matrix from RNAseqV2 (RSEM)
miRNASeqGene: miRNA expression data matrix from smallRNAseq
CNASNP: A data frame to store somatic copy number alterations from SNP array platform
CNVSNP: A data frame to store germline copy number variants from SNP array platform
CNASeq: A data frame to store somatic copy number alterations from sequencing platform
CNACGH: A list that stores FirehoseCGHArray objects for somatic copy number alterations from CGH platform
Methylation: A list that stores FirehoseMethylationArray objects for methylation data
mRNAArray: A list that stores FirehosemRNAArray objects for gene expression data from microarray
miRNAArray: A list that stores FirehosemRNAArray objects for miRNA expression data from microarray
RPPAArray: A list that stores FirehosemRNAArray objects for RPPA data
Mutation: A data frame for mutation information from sequencing data
GISTIC: A FirehoseGISTIC object to store processed copy number data
BarcodeUUID: A data frame that stores the barcodes, UUIDs and short sample identifiers
An S4 class to store processed copy number data. (Data processed by using GISTIC2 algorithm)
## S4 method for signature 'FirehoseGISTIC'
isEmpty(x)

## S4 method for signature 'FirehoseGISTIC'
updateObject(object, ..., verbose = FALSE)
x |
A FirehoseGISTIC class object |
object |
A FirehoseGISTIC object |
... |
additional arguments for updateObject |
verbose |
logical (default FALSE) whether to print extra messages |
isEmpty(FirehoseGISTIC): check whether the FirehoseGISTIC object has
data in it or not
updateObject(FirehoseGISTIC): Update an old FirehoseGISTIC object to the most
recent API
Dataset: Cohort name
AllByGene: A data frame that stores continuous copy number data
ThresholdedByGene: A data frame for discrete copy number data
Peaks: A data frame storing GISTIC peak data. See getGISTICPeaks.
An S4 class to store data from methylation platforms
Filename: Platform name
DataMatrix: A data frame that stores the methylation data.
An S4 class to store data from array (mRNA, miRNA etc.) platforms
Filename: Platform name
DataMatrix: A data matrix that stores the expression data.
Obtain the mRNA expression clustering results from the Broad Institute for a specific cancer code (see getFirehoseDatasets).
getBroadSubtypes(dataset, clust.alg = c("CNMF", "ConsensusPlus"))
dataset |
A TCGA cancer code, e.g. "OV" for ovarian cancer |
clust.alg |
The selected cluster algorithm, either "CNMF" or "ConsensusPlus" (default "CNMF") |
A data.frame of cluster and silhouette values
Ludwig Geistlinger
co <- getBroadSubtypes("COAD", "CNMF")
head(co)
A go-to function for getting top level information from a FirehoseData object. Available datatypes for a particular object can be seen by entering the object name in the console ('show' method).
getData(object, type, platform)
object |
A FirehoseData object |
type |
A data type to be extracted |
platform |
An index for data types that may come from multiple platforms (such as mRNAArray), for GISTIC data, one of the options: 'AllByGene' or 'ThresholdedByGene' |
Returns matrix or data.frame depending on data type
data(accmini)
getData(accmini, "clinical")
getData(accmini, "RNASeq2GeneNorm")
getData(accmini, "Methylation", 1)[1:4]
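For GISTIC data, the platform argument takes a table name rather than a numeric index. The sketch below illustrates this, assuming the bundled accmini object carries GISTIC results (the biocExtract examples in this manual suggest it does); the variable names are illustrative only.

```r
library(RTCGAToolbox)
data(accmini)

# For type "GISTIC", 'platform' is a character option, not an index
gistic_all <- getData(accmini, "GISTIC", "AllByGene")
gistic_thr <- getData(accmini, "GISTIC", "ThresholdedByGene")

# Compare dimensions of the continuous and the discrete tables
dim(gistic_all)
dim(gistic_thr)

# Peek at a small corner without assuming a minimum table size
rows <- seq_len(min(3, nrow(gistic_thr)))
cols <- seq_len(min(5, ncol(gistic_thr)))
gistic_thr[rows, cols]
```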
getFirehoseAnalyzeDates returns a character vector of analysis release dates.
getFirehoseAnalyzeDates(last = NULL)
last |
To list last n dates. (Default NULL) |
A character vector for dates.
getFirehoseAnalyzeDates(last=2)
getFirehoseData returns a FirehoseData object that stores TCGA data.
getFirehoseData(
  dataset,
  runDate = "20160128",
  gistic2Date = "20160128",
  RNASeqGene = FALSE,
  RNASeq2Gene = FALSE,
  clinical = TRUE,
  miRNASeqGene = FALSE,
  miRNASeqGeneType = c("read_count", "reads_per_million_miRNA_mapped",
    "cross-mapped"),
  RNASeq2GeneNorm = FALSE,
  CNASNP = FALSE,
  CNVSNP = FALSE,
  CNASeq = FALSE,
  CNACGH = FALSE,
  Methylation = FALSE,
  Mutation = FALSE,
  mRNAArray = FALSE,
  miRNAArray = FALSE,
  RPPAArray = FALSE,
  GISTIC = FALSE,
  RNAseqNorm = "raw_count",
  RNAseq2Norm = c("normalized_counts", "RSEM_normalized_log2", "raw_counts",
    "scaled_estimate"),
  forceDownload = FALSE,
  destdir = .setCache(),
  fileSizeLimit = 500,
  getUUIDs = FALSE,
  ...
)
dataset |
A cohort disease code. TCGA cancer codes can be obtained via getFirehoseDatasets |
runDate |
Standard data run dates. Date list can be accessible via getFirehoseRunningDates |
gistic2Date |
Analysis run date for GISTIC obtained via getFirehoseAnalyzeDates |
RNASeqGene |
Logical (default FALSE) RNAseq TPM data. |
RNASeq2Gene |
Logical (default FALSE) RNAseq v2 (RSEM processed) data;
see |
clinical |
Logical (default TRUE) clinical data. |
miRNASeqGene |
Logical (default FALSE) smallRNAseq data. |
miRNASeqGeneType |
Character (default "read_count") Indicate which type of data should be pulled from the miRNASeqGene data. Must be one of "reads_per_million_miRNA_mapped", "read_count", or "cross-mapped". |
RNASeq2GeneNorm |
Logical (default FALSE) RNAseq v2 (RSEM processed) data. |
CNASNP |
Logical (default FALSE) somatic copy number alterations data from SNP array. |
CNVSNP |
Logical (default FALSE) germline copy number variants data from SNP array. |
CNASeq |
Logical (default FALSE) somatic copy number alterations data from sequencing. |
CNACGH |
Logical (default FALSE) somatic copy number alterations data from CGH. |
Methylation |
Logical (default FALSE) methylation data. |
Mutation |
Logical (default FALSE) mutation data from sequencing. |
mRNAArray |
Logical (default FALSE) mRNA expression data from microarray. |
miRNAArray |
Logical (default FALSE) miRNA expression data from microarray. |
RPPAArray |
Logical (default FALSE) RPPA data |
GISTIC |
logical (default FALSE) processed copy number data |
RNAseqNorm |
RNAseq data normalization method. (Default raw_count) |
RNAseq2Norm |
RNAseq v2 data normalization method. (Default normalized_counts, or one of RSEM_normalized_log2, raw_counts, scaled_estimate) |
forceDownload |
Logical (default FALSE). If TRUE, RTCGAToolbox downloads the files every time. By default, files that have already been downloaded are reused from their local copy on subsequent calls. |
destdir |
Directory in which to store the resulting downloaded file.
Defaults to a cache directory given by |
fileSizeLimit |
Files that are larger than set value (megabyte) won't be downloaded (Default: 500) |
getUUIDs |
Logical key to get UUIDs from barcode (Default: FALSE) |
... |
Additional arguments to pass down. |
This is the main client function for downloading data from the Firehose TCGA portal.
To avoid unnecessary downloads, we use
tools::R_user_dir("RTCGAToolbox", "cache") to set the default destdir
parameter to the cached directory. To get the actual default directory,
one can run RTCGAToolbox:::.setCache().
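As a rough illustration of the caching note above, the following base R sketch locates the per-user cache directory and lists what is stored there. It assumes R 4.0 or later (for tools::R_user_dir) and that the package default matches the path named above; the variable names are illustrative.

```r
# Per-user cache location named in the text above (requires R >= 4.0)
cache_dir <- tools::R_user_dir("RTCGAToolbox", which = "cache")

if (dir.exists(cache_dir)) {
    # List cached downloads with their sizes in megabytes
    cached <- list.files(cache_dir, full.names = TRUE)
    sizes_mb <- file.size(cached) / 1024^2
    print(data.frame(file = basename(cached), size_mb = round(sizes_mb, 1)))
} else {
    message("No cache directory yet at: ", cache_dir)
}
```

Inspecting the cache this way can help decide whether forceDownload = TRUE or a custom destdir is worth using for a given run.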
A FirehoseData data object that stores data for selected data types.
getLinks, https://gdac.broadinstitute.org/
# Sample Dataset
data(accmini)
accmini
## Not run:
BRCAdata <- getFirehoseData(dataset="BRCA", runDate="20140416",
    gistic2Date="20140115", RNASeqGene=TRUE, clinical=TRUE,
    mRNAArray=TRUE, Mutation=TRUE)
## End(Not run)
getFirehoseDatasets returns a character vector of TCGA disease codes.
A reference table can be seen at https://gdac.broadinstitute.org/.
getFirehoseDatasets()
A character string
https://gdac.broadinstitute.org/
getFirehoseDatasets()
getFirehoseRunningDates returns a character vector of standard data release dates.
getFirehoseRunningDates(last = NULL)
last |
To list last n dates. (Default NULL) |
A character vector for dates.
getFirehoseRunningDates()
getFirehoseRunningDates(last=2)
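One possible use of this vector is to validate a run date before starting a long download. The sketch below is an assumption-laden illustration: `check_run_date` is a hypothetical helper, not part of RTCGAToolbox, and it presumes the dates come back as strings in the same "YYYYMMDD" form used by the runDate default.

```r
library(RTCGAToolbox)

# Hypothetical helper (not part of RTCGAToolbox): fail early when a
# requested run date is not among the published standard data runs.
check_run_date <- function(run_date, available = getFirehoseRunningDates()) {
    if (!run_date %in% available) {
        stop("Unknown run date '", run_date, "'; some available dates: ",
             paste(utils::head(available, 3), collapse = ", "))
    }
    invisible(run_date)
}

# The default runDate of getFirehoseData() is expected to pass
check_run_date("20160128")
```

The `available` argument can also be supplied explicitly, which avoids repeated lookups when checking several dates.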
Access GISTIC2 level 4 copy number data through
gdac.broadinstitute.org
getGISTICPeaks(object, peak = c("wide", "narrow", "full"), rm.chrX = TRUE)
object |
A FirehoseData GISTIC type object |
peak |
The peak type, select from "wide", "narrow", "full". |
rm.chrX |
(logical default TRUE) Whether to remove observations in the X chromosome |
A data.frame of peak values
Ludwig Geistlinger
co <- getFirehoseData("COAD", clinical = FALSE, GISTIC = TRUE)
peaks <- getGISTICPeaks(co, "wide")
class(peaks)
head(peaks)[1:6]
This function provides a reference to the resources downloaded from the GDAC Firehose pipeline. Based on the input, the function returns a URL location to the resource if there exists one.
getLinks(
  dataset,
  data_date = "20160128",
  RNASeqGene = FALSE,
  RNASeq2Gene = FALSE,
  clinical = FALSE,
  miRNASeqGene = FALSE,
  RNASeq2GeneNorm = FALSE,
  RNAseq2Norm = c("normalized_counts", "RSEM_normalized_log2", "raw_counts",
    "scaled_estimate"),
  CNASNP = FALSE,
  CNVSNP = FALSE,
  CNASeq = FALSE,
  CNACGH = FALSE,
  Methylation = FALSE,
  Mutation = FALSE,
  mRNAArray = FALSE,
  miRNAArray = FALSE,
  RPPAArray = FALSE,
  GISTIC = FALSE
)
dataset |
A cohort disease code. TCGA cancer codes can be obtained via getFirehoseDatasets |
data_date |
Either a runDate or analysisDate typically entered in
|
RNASeqGene |
Logical (default FALSE) RNAseq TPM data. |
RNASeq2Gene |
Logical (default FALSE) RNAseq v2 (RSEM processed) data;
see |
clinical |
Logical (default FALSE) clinical data. |
miRNASeqGene |
Logical (default FALSE) smallRNAseq data. |
RNASeq2GeneNorm |
Logical (default FALSE) RNAseq v2 (RSEM processed) data. |
RNAseq2Norm |
RNAseq v2 data normalization method. (Default normalized_counts, or one of RSEM_normalized_log2, raw_counts, scaled_estimate) |
CNASNP |
Logical (default FALSE) somatic copy number alterations data from SNP array. |
CNVSNP |
Logical (default FALSE) germline copy number variants data from SNP array. |
CNASeq |
Logical (default FALSE) somatic copy number alterations data from sequencing. |
CNACGH |
Logical (default FALSE) somatic copy number alterations data from CGH. |
Methylation |
Logical (default FALSE) methylation data. |
Mutation |
Logical (default FALSE) mutation data from sequencing. |
mRNAArray |
Logical (default FALSE) mRNA expression data from microarray. |
miRNAArray |
Logical (default FALSE) miRNA expression data from microarray. |
RPPAArray |
Logical (default FALSE) RPPA data |
GISTIC |
logical (default FALSE) processed copy number data |
A character URL to a dataset location
getLinks("BRCA", CNASeq = TRUE)
Make a table for mutation rate of each gene in the cohort
getMutationRate(dataObject)
dataObject |
This must be a FirehoseData object |
Returns a data table
data(accmini)
mutRate <- getMutationRate(dataObject=accmini)
mutRate <- mutRate[order(mutRate[,2],decreasing = TRUE),]
head(mutRate)
A dataset containing the gene coordinates. The variables are as follows:
A data frame with 28454 rows and 5 variables
GeneSymbol: Gene symbols
Chromosome: Chromosome name
Strand: Gene strand on chromosome
Start: Gene location on chromosome
End: Gene location on chromosome
Use the output of getFirehoseData to create a
SummarizedExperiment.
This can be done for three types of data: G-scores thresholded by gene, copy
number by gene, and copy number by peak regions.
makeSummarizedExperimentFromGISTIC(
  gistic,
  dataType = c("AllByGene", "ThresholdedByGene", "Peaks"),
  rownameCol = "Gene.Symbol",
  ...
)
gistic |
A FirehoseGISTIC object |
dataType |
character(1) One of "AllByGene" (default, as the first choice in the usage), "ThresholdedByGene", or "Peaks" |
rownameCol |
character(1) The name of the column in the data to use as rownames in the data matrix (default: 'Gene.Symbol'). The row names are only set when the column name is found in the data and all values are unique. |
... |
Additional arguments passed to 'getGISTICPeaks'. |
A SummarizedExperiment object
L. Geistlinger, M. Ramos
co <- getFirehoseData("COAD", clinical = FALSE, GISTIC = TRUE, destdir = tempdir())
makeSummarizedExperimentFromGISTIC(co, "AllByGene")
Managing data from large-scale projects such as The Cancer Genome Atlas (TCGA) for further analysis is an important and time-consuming step for research projects. Several efforts, such as the Firehose project, make TCGA pre-processed data publicly available via web services and data portals, but this information must be managed, downloaded and prepared for subsequent steps. We have developed an open source and extensible R-based data client for pre-processed data from the Firehose, and demonstrate its use with sample case studies. Results show that our RTCGAToolbox can facilitate data management for researchers interested in working with TCGA data. The RTCGAToolbox can also be integrated with other analysis pipelines for further data processing.
The main function you're likely to need from RTCGAToolbox is
getFirehoseData. Otherwise, refer to the vignettes to see
how to use the RTCGAToolbox.
Mehmet Kemal Samur
Useful links:
Report bugs at https://github.com/mksamur/RTCGAToolbox/issues
An accessor function for FirehoseData objects. The dataType argument specifies which data type to return. See FirehoseData for more details.
selectType(object, dataType)
object |
A FirehoseData object |
dataType |
A data type, see details. |
clinical: Get the clinical data slot
RNASeqGene: RNASeqGene
RNASeq2GeneNorm: Normalized
miRNASeqGene: micro RNA SeqGene
CNASNP: Copy Number Alteration
CNVSNP: Copy Number Variation
CNASeq: Copy Number Alteration
CNACGH: Copy Number Alteration
Methylation: Methylation
mRNAArray: Messenger RNA
miRNAArray: micro RNA
RPPAArray: Reverse Phase Protein Array
Mutation: Mutations
GISTIC: GISTIC v2 scores and probabilities
The data type element of the FirehoseData object
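A short sketch of the accessor on the bundled sample data follows. It assumes the clinical slot of accmini is populated and stored as a data frame, as described for the FirehoseData class; the variable name is illustrative.

```r
library(RTCGAToolbox)
data(accmini)

# Pull the clinical slot out of the FirehoseData object
clin <- selectType(accmini, "clinical")

# Inspect what came back before using it downstream
class(clin)
dim(clin)
```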
Export toptable or correlation data frame
showResults(object)
object |
Returns toptable or correlation data frame
data(accmini)
Export toptable or correlation data frame
## S4 method for signature 'CorResult'
showResults(object)
object |
Returns correlation results data frame
data(accmini)
Export toptable or correlation data frame
## S4 method for signature 'DGEResult'
showResults(object)
object |
Returns toptable for DGE results
data(accmini)