Title: | A new tool for exporting TCGA Firehose data |
---|---|
Description: | Managing data from large-scale projects such as The Cancer Genome Atlas (TCGA) for further analysis is an important and time-consuming step for research projects. Several efforts, such as the Firehose project, make TCGA pre-processed data publicly available via web services and data portals, but the data must still be managed, downloaded and prepared for subsequent steps. We developed an open source and extensible R based data client for Firehose pre-processed data and demonstrated its use with sample case studies. Results showed that RTCGAToolbox could improve data management for researchers who are interested in TCGA data. In addition, it can be integrated with other analysis pipelines for subsequent data analysis. |
Authors: | Mehmet Samur [aut], Marcel Ramos [aut, cre] , Ludwig Geistlinger [ctb] |
Maintainer: | Marcel Ramos <[email protected]> |
License: | GPL-2 |
Version: | 2.37.2 |
Built: | 2024-12-19 03:37:06 UTC |
Source: | https://github.com/bioc/RTCGAToolbox |
See the acc_sample.R script to see how the data was generated.
This dataset contains real data from The Cancer Genome Atlas for the pipeline run date and GISTIC analysis date of 2016-01-28.
data("accmini", package = "RTCGAToolbox")
A FirehoseData data object
FirehoseData object to a Bioconductor object
This function processes data from a FirehoseData object. Raw data is converted to a conventional Bioconductor object. The function returns either a SummarizedExperiment or a RaggedExperiment. In cases where there are multiple platforms in a data type, an attempt to consolidate datasets will be made based on matching dimension names. For ranged data, this functionality is provided with more control as part of the RaggedExperiment features. See the RaggedExperiment-class for more details.
biocExtract( object, type = c("clinical", "RNASeqGene", "RNASeq2Gene", "miRNASeqGene", "RNASeq2GeneNorm", "CNASNP", "CNVSNP", "CNASeq", "CNACGH", "Methylation", "Mutation", "mRNAArray", "miRNAArray", "RPPAArray", "GISTIC", "GISTICA", "GISTICT", "GISTICP"), ... )
object |
A FirehoseData object |
type |
The type of data to extract from the "FirehoseData" object, see type section. |
... |
Additional arguments passed to lower level functions that
convert tabular data into Bioconductor object such as
|
A typical additional argument for this function passed down to lower level functions is names.field, which indicates the row names in the data. By default, it is the "Hugo_Symbol" column in the internal code that converts data.frames to SummarizedExperiment representations (via the .makeSummarizedExperimentFromDataFrame internal function).
Either SummarizedExperiment or RaggedExperiment.
Choices include the following:
clinical: Get the clinical data slot
RNASeqGene: RNASeq v1
RNASeq2Gene: RNASeq v2
RNASeq2GeneNorm: RNASeq v2 Normalized
miRNASeqGene: micro RNA SeqGene
CNASNP: Copy Number Alteration
CNVSNP: Copy Number Variation
CNASeq: Copy Number Alteration
CNACGH: Copy Number Alteration
Methylation: Methylation
mRNAArray: Messenger RNA
miRNAArray: micro RNA
RPPAArray: Reverse Phase Protein Array
Mutation: Mutations
GISTICA: GISTIC v2 ('AllByGene' only)
GISTICT: GISTIC v2 ('ThresholdedByGene' only)
GISTICP: GISTIC v2 ('Peaks' only)
GISTIC: GISTIC v2 scores, probabilities, and peaks
Marcel Ramos
data(accmini)
biocExtract(accmini, "RNASeq2Gene")
biocExtract(accmini, "miRNASeqGene")
biocExtract(accmini, "RNASeq2GeneNorm")
biocExtract(accmini, "CNASNP")
biocExtract(accmini, "CNVSNP")
biocExtract(accmini, "Methylation")
biocExtract(accmini, "Mutation")
biocExtract(accmini, "RPPAArray")
biocExtract(accmini, "GISTIC")
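The conversion described for biocExtract can also be applied to several data types in one pass. The following is a minimal sketch, assuming RTCGAToolbox is installed and that the bundled accmini object holds the three types named here (they all appear in the examples for this function). The variable names `types` and `converted` are illustrative only.

```r
library(RTCGAToolbox)
data(accmini)

# Data types to convert; each one appears in the biocExtract examples
types <- c("RNASeq2GeneNorm", "Mutation", "RPPAArray")

# Convert each type and keep the results in a list named by data type
converted <- lapply(
    setNames(types, types),
    function(tp) biocExtract(accmini, tp)
)

# Report which Bioconductor class each conversion produced
vapply(converted, function(x) class(x)[1], character(1))
```

Printing the class for each element is a quick way to see which types come back as SummarizedExperiment and which as RaggedExperiment before passing them to downstream code.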
An S4 class to store correlations between gene expression level and copy number data
Dataset: A cohort name
Correlations: Results data frame
An S4 class to store differential gene expression results
Dataset: Dataset name
Toptable: Results data frame
An S4 class to store data from CGH platforms
Filename: Platform name
DataMatrix: A data frame that stores the CGH data.
An S4 class to store the main data object from the client function.
## S4 method for signature 'FirehoseData'
show(object)
## S4 method for signature 'FirehoseData'
getData(object, type, platform)
## S4 method for signature 'FirehoseGISTIC'
getData(object, type, platform)
## S4 method for signature 'ANY'
getData(object, type, platform)
## S4 method for signature 'FirehoseData'
updateObject(object, ..., verbose = FALSE)
## S4 method for signature 'FirehoseData'
selectType(object, dataType)
object |
A FirehoseData object |
type |
A data type to be extracted |
platform |
An index for data types that may come from multiple platforms (such as mRNAArray), for GISTIC data, one of the options: 'AllByGene', 'ThresholdedByGene', or 'Peaks' |
... |
additional arguments for updateObject |
verbose |
logical (default FALSE) whether to print extra messages |
dataType |
An available data type, see object show method |
show(FirehoseData): show method
getData(FirehoseData): Get a matrix or data.frame from FirehoseData
getData(FirehoseGISTIC): Get GISTIC data from FirehoseData
getData(ANY): Default method for getting data from FirehoseData
updateObject(FirehoseData): Update an old RTCGAToolbox FirehoseData object to the most recent API
selectType(FirehoseData): Extract data type
Dataset: A cohort name
runDate: Standard data run date from getFirehoseRunningDates
gistic2Date: Analysis run date from getFirehoseAnalyzeDates
clinical: Clinical data frame
RNASeqGene: Gene level expression data matrix from RNAseq
RNASeq2Gene: Gene level expression data matrix from RNAseqV2
RNASeq2GeneNorm: Gene level expression data matrix from RNAseqV2 (RSEM)
miRNASeqGene: miRNA expression data matrix from smallRNAseq
CNASNP: A data frame to store somatic copy number alterations from SNP array platform
CNVSNP: A data frame to store germline copy number variants from SNP array platform
CNASeq: A data frame to store somatic copy number alterations from sequencing platform
CNACGH: A list that stores FirehoseCGHArray objects for somatic copy number alterations from CGH platform
Methylation: A list that stores FirehoseMethylationArray objects for methylation data
mRNAArray: A list that stores FirehosemRNAArray objects for gene expression data from microarray
miRNAArray: A list that stores FirehosemRNAArray objects for miRNA expression data from microarray
RPPAArray: A list that stores FirehosemRNAArray objects for RPPA data
Mutation: A data frame for mutation information from sequencing data
GISTIC: A FirehoseGISTIC object to store processed copy number data
BarcodeUUID: A data frame that stores the barcodes, UUIDs and short sample identifiers
An S4 class to store processed copy number data. (Data processed by using GISTIC2 algorithm)
## S4 method for signature 'FirehoseGISTIC'
isEmpty(x)
## S4 method for signature 'FirehoseGISTIC'
updateObject(object, ..., verbose = FALSE)
x |
A FirehoseGISTIC class object |
object |
A FirehoseGISTIC object |
... |
additional arguments for updateObject |
verbose |
logical (default FALSE) whether to print extra messages |
isEmpty(FirehoseGISTIC): check whether the FirehoseGISTIC object has data in it or not
updateObject(FirehoseGISTIC): Update an old FirehoseGISTIC object to the most recent API
Dataset: Cohort name
AllByGene: A data frame that stores continuous copy number
ThresholdedByGene: A data frame for discrete copy number data
Peaks: A data frame storing GISTIC peak data. See getGISTICPeaks.
An S4 class to store data from methylation platforms
Filename: Platform name
DataMatrix: A data frame that stores the methylation data.
An S4 class to store data from array (mRNA, miRNA etc.) platforms
Filename: Platform name
DataMatrix: A data matrix that stores the expression data.
Obtain the mRNA expression clustering results from the Broad Institute for a specific cancer code (see getFirehoseDatasets).
getBroadSubtypes(dataset, clust.alg = c("CNMF", "ConsensusPlus"))
dataset |
A TCGA cancer code, e.g. "OV" for ovarian cancer |
clust.alg |
The selected cluster algorithm, either "CNMF" or "ConsensusPlus" (default "CNMF") |
A data.frame of cluster and silhouette values
Ludwig Geistlinger
co <- getBroadSubtypes("COAD", "CNMF")
head(co)
A go-to function for getting top level information from a FirehoseData object. Available datatypes for a particular object can be seen by entering the object name in the console ('show' method).
getData(object, type, platform)
object |
A FirehoseData object |
type |
A data type to be extracted |
platform |
An index for data types that may come from multiple platforms (such as mRNAArray), for GISTIC data, one of the options: 'AllByGene', 'ThresholdedByGene', or 'Peaks' |
Returns matrix or data.frame depending on data type
data(accmini)
getData(accmini, "clinical")
getData(accmini, "RNASeq2GeneNorm")
getData(accmini, "Methylation", 1)[1:4]
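Because getData returns either a matrix or a data.frame depending on the data type, it can help to check the class of the result before further processing. The sketch below assumes RTCGAToolbox is installed and uses only the bundled accmini object. The variable names `clin` and `meth` are illustrative.

```r
library(RTCGAToolbox)
data(accmini)

# Printing the object runs the 'show' method, which lists available data types
accmini

# Clinical data is stored as a data frame
clin <- getData(accmini, "clinical")
class(clin)
dim(clin)

# For types that may come from several platforms, 'platform' is an index
meth <- getData(accmini, "Methylation", platform = 1)
class(meth)
```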
getFirehoseAnalyzeDates returns a character vector of analysis release dates.
getFirehoseAnalyzeDates(last = NULL)
last |
To list last n dates. (Default NULL) |
A character vector for dates.
getFirehoseAnalyzeDates(last=2)
getFirehoseData returns a FirehoseData object that stores TCGA data.
getFirehoseData( dataset, runDate = "20160128", gistic2Date = "20160128", RNASeqGene = FALSE, RNASeq2Gene = FALSE, clinical = TRUE, miRNASeqGene = FALSE, miRNASeqGeneType = c("read_count", "reads_per_million_miRNA_mapped", "cross-mapped"), RNASeq2GeneNorm = FALSE, CNASNP = FALSE, CNVSNP = FALSE, CNASeq = FALSE, CNACGH = FALSE, Methylation = FALSE, Mutation = FALSE, mRNAArray = FALSE, miRNAArray = FALSE, RPPAArray = FALSE, GISTIC = FALSE, RNAseqNorm = "raw_count", RNAseq2Norm = c("normalized_counts", "RSEM_normalized_log2", "raw_counts", "scaled_estimate"), forceDownload = FALSE, destdir = .setCache(), fileSizeLimit = 500, getUUIDs = FALSE, ... )
dataset |
A cohort disease code. TCGA cancer codes can be obtained via getFirehoseDatasets |
runDate |
Standard data run date. The list of available dates is accessible via getFirehoseRunningDates |
gistic2Date |
Analysis run date for GISTIC obtained via getFirehoseAnalyzeDates |
RNASeqGene |
Logical (default FALSE) RNAseq TPM data. |
RNASeq2Gene |
Logical (default FALSE) RNAseq v2 (RSEM processed) data;
see |
clinical |
Logical (default TRUE) clinical data. |
miRNASeqGene |
Logical (default FALSE) smallRNAseq data. |
miRNASeqGeneType |
Character (default "read_count") Indicate which type of data should be pulled from the miRNASeqGene data. Must be one of "reads_per_million_miRNA_mapped", "read_count", or "cross-mapped". |
RNASeq2GeneNorm |
Logical (default FALSE) RNAseq v2 (RSEM processed) data. |
CNASNP |
Logical (default FALSE) somatic copy number alterations data from SNP array. |
CNVSNP |
Logical (default FALSE) germline copy number variants data from SNP array. |
CNASeq |
Logical (default FALSE) somatic copy number alterations data from sequencing. |
CNACGH |
Logical (default FALSE) somatic copy number alterations data from CGH. |
Methylation |
Logical (default FALSE) methylation data. |
Mutation |
Logical (default FALSE) mutation data from sequencing. |
mRNAArray |
Logical (default FALSE) mRNA expression data from microarray. |
miRNAArray |
Logical (default FALSE) miRNA expression data from microarray. |
RPPAArray |
Logical (default FALSE) RPPA data |
GISTIC |
logical (default FALSE) processed copy number data |
RNAseqNorm |
RNAseq data normalization method. (Default raw_count) |
RNAseq2Norm |
RNAseq v2 data normalization method (default "normalized_counts"; otherwise one of "RSEM_normalized_log2", "raw_counts", or "scaled_estimate") |
forceDownload |
Logical (default FALSE) flag to force RTCGAToolbox to download the files every time. By default, if files were downloaded once, RTCGAToolbox uses the local files the next time. |
destdir |
Directory in which to store the resulting downloaded file. Defaults to a cache directory given by .setCache() |
fileSizeLimit |
Files larger than this value (in megabytes) will not be downloaded (default: 500) |
getUUIDs |
Logical key to get UUIDs from barcode (Default: FALSE) |
... |
Additional arguments to pass down. |
This is the main client function to download data from the Firehose TCGA portal. To avoid unnecessary downloads, we use tools::R_user_dir("RTCGAToolbox", "cache") to set the default destdir parameter to the cached directory. To get the actual default directory, one can run RTCGAToolbox:::.setCache().
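The cache location mentioned above can be inspected with base R alone. This is a minimal sketch that only calls tools::R_user_dir (available in R 4.0.0 and later); it does not call the package-internal .setCache(), whose exact behaviour may differ, for example in whether it creates the directory. The variable names are illustrative.

```r
# Path that base R reports as the user cache directory for this package name
cache_dir <- tools::R_user_dir("RTCGAToolbox", which = "cache")
cache_dir

# List cached files if the directory already exists
if (dir.exists(cache_dir)) {
    cached_files <- list.files(cache_dir)
    print(cached_files)
} else {
    message("No cache directory yet: ", cache_dir)
}
```

Passing a different destdir to getFirehoseData (for example tempdir()) keeps downloads out of this cache.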
A FirehoseData object that stores data for the selected data types.
getLinks, https://gdac.broadinstitute.org/
# Sample Dataset
data(accmini)
accmini
## Not run:
BRCAdata <- getFirehoseData(dataset="BRCA", runDate="20140416", gistic2Date="20140115",
    RNASeqGene=TRUE, clinical=TRUE, mRNAArray=TRUE, Mutation=TRUE)
## End(Not run)
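The run date does not have to be hard-coded; it can be looked up with getFirehoseRunningDates first. The following sketch wraps that lookup in a helper. The function `fetch_latest_clinical` is hypothetical and not part of RTCGAToolbox, and calling it needs network access and an installed RTCGAToolbox. It assumes that the value returned by getFirehoseRunningDates(last = 1) can be passed directly as runDate.

```r
# Hypothetical helper (not part of RTCGAToolbox): download only the clinical
# data for one cohort, using the most recent standard data run date.
fetch_latest_clinical <- function(dataset, destdir = tempdir()) {
    # Most recent standard data run date; contacts the Firehose servers
    run_date <- RTCGAToolbox::getFirehoseRunningDates(last = 1)

    # Clinical data only; every other data type keeps its default of FALSE
    RTCGAToolbox::getFirehoseData(
        dataset = dataset,
        runDate = run_date,
        clinical = TRUE,
        destdir = destdir
    )
}
```

A call such as `acc <- fetch_latest_clinical("ACC")` would then return a FirehoseData object. Using tempdir() as the destination keeps the download out of the default cache directory.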
getFirehoseDatasets returns a character vector of TCGA disease codes. A reference table can be seen at https://gdac.broadinstitute.org/.
getFirehoseDatasets()
A character vector of TCGA disease codes
https://gdac.broadinstitute.org/
getFirehoseDatasets()
getFirehoseRunningDates returns a character vector of standard data release dates.
getFirehoseRunningDates(last = NULL)
last |
To list last n dates. (Default NULL) |
A character vector for dates.
getFirehoseRunningDates()
getFirehoseRunningDates(last=2)
Access GISTIC2 level 4 copy number data through gdac.broadinstitute.org
getGISTICPeaks(object, peak = c("wide", "narrow", "full"), rm.chrX = TRUE)
object |
A FirehoseData GISTIC type object |
peak |
The peak type, select from "wide", "narrow", "full". |
rm.chrX |
(logical default TRUE) Whether to remove observations in the X chromosome |
A data.frame of peak values
Ludwig Geistlinger
co <- getFirehoseData("COAD", clinical = FALSE, GISTIC = TRUE)
peaks <- getGISTICPeaks(co, "wide")
class(peaks)
head(peaks)[1:6]
This function provides a reference to the resources downloaded from the GDAC Firehose pipeline. Based on the input, the function returns a URL location to the resource if there exists one.
getLinks( dataset, data_date = "20160128", RNASeqGene = FALSE, RNASeq2Gene = FALSE, clinical = FALSE, miRNASeqGene = FALSE, RNASeq2GeneNorm = FALSE, RNAseq2Norm = c("normalized_counts", "RSEM_normalized_log2", "raw_counts", "scaled_estimate"), CNASNP = FALSE, CNVSNP = FALSE, CNASeq = FALSE, CNACGH = FALSE, Methylation = FALSE, Mutation = FALSE, mRNAArray = FALSE, miRNAArray = FALSE, RPPAArray = FALSE, GISTIC = FALSE )
dataset |
A cohort disease code. TCGA cancer codes can be obtained via getFirehoseDatasets |
data_date |
Either a runDate or analysisDate typically entered in
|
RNASeqGene |
Logical (default FALSE) RNAseq TPM data. |
RNASeq2Gene |
Logical (default FALSE) RNAseq v2 (RSEM processed) data;
see |
clinical |
Logical (default FALSE) clinical data. |
miRNASeqGene |
Logical (default FALSE) smallRNAseq data. |
RNASeq2GeneNorm |
Logical (default FALSE) RNAseq v2 (RSEM processed) data. |
RNAseq2Norm |
RNAseq v2 data normalization method (default "normalized_counts"; otherwise one of "RSEM_normalized_log2", "raw_counts", or "scaled_estimate") |
CNASNP |
Logical (default FALSE) somatic copy number alterations data from SNP array. |
CNVSNP |
Logical (default FALSE) germline copy number variants data from SNP array. |
CNASeq |
Logical (default FALSE) somatic copy number alterations data from sequencing. |
CNACGH |
Logical (default FALSE) somatic copy number alterations data from CGH. |
Methylation |
Logical (default FALSE) methylation data. |
Mutation |
Logical (default FALSE) mutation data from sequencing. |
mRNAArray |
Logical (default FALSE) mRNA expression data from microarray. |
miRNAArray |
Logical (default FALSE) miRNA expression data from microarray. |
RPPAArray |
Logical (default FALSE) RPPA data |
GISTIC |
logical (default FALSE) processed copy number data |
A character URL to a dataset location
getLinks("BRCA", CNASeq = TRUE)
Make a table for mutation rate of each gene in the cohort
getMutationRate(dataObject)
dataObject |
This must be |
Returns a data table
data(accmini)
mutRate <- getMutationRate(dataObject=accmini)
mutRate <- mutRate[order(mutRate[,2],decreasing = TRUE),]
head(mutRate)
A dataset containing the gene coordinates. The variables are as follows:
A data frame with 28454 rows and 5 variables
GeneSymbol: Gene symbols
Chromosome: Chromosome name
Strand: Gene strand on chromosome
Start: Gene location on chromosome
End: Gene location on chromosome
Use the output of getFirehoseData to create a SummarizedExperiment. This can be done for three types of data: G-scores thresholded by gene, copy number by gene, and copy number by peak regions.
makeSummarizedExperimentFromGISTIC( gistic, dataType = c("AllByGene", "ThresholdedByGene", "Peaks"), rownameCol = "Gene.Symbol", ... )
gistic |
A FirehoseGISTIC object |
dataType |
character(1) One of "AllByGene" (default), "ThresholdedByGene", or "Peaks" |
rownameCol |
character(1) The name of the column in the data to use as rownames in the data matrix (default: 'Gene.Symbol'). The row names are only set when the column name is found in the data and all values are unique. |
... |
Additional arguments passed to 'getGISTICPeaks'. |
A SummarizedExperiment object
L. Geistlinger, M. Ramos
co <- getFirehoseData("COAD", clinical = FALSE, GISTIC = TRUE, destdir = tempdir())
makeSummarizedExperimentFromGISTIC(co, "AllByGene")
Managing data from large-scale projects such as The Cancer Genome Atlas (TCGA) for further analysis is an important and time-consuming step for research projects. Several efforts, such as the Firehose project, make TCGA pre-processed data publicly available via web services and data portals, but this information must be managed, downloaded and prepared for subsequent steps. We have developed an open source and extensible R based data client for pre-processed data from the Firehose, and demonstrate its use with sample case studies. Results show that our RTCGAToolbox can facilitate data management for researchers interested in working with TCGA data. The RTCGAToolbox can also be integrated with other analysis pipelines for further data processing.
The main function you're likely to need from RTCGAToolbox is getFirehoseData. Otherwise refer to the vignettes to see how to use the RTCGAToolbox.
Mehmet Kemal Samur
Useful links:
Report bugs at https://github.com/mksamur/RTCGAToolbox/issues
An accessor function for FirehoseData objects. An argument specifies the data type to return. See FirehoseData for more details.
selectType(object, dataType)
object |
A FirehoseData object |
dataType |
A data type, see details. |
clinical: Get the clinical data slot
RNASeqGene: RNASeqGene
RNASeq2GeneNorm: Normalized
miRNASeqGene: micro RNA SeqGene
CNASNP: Copy Number Alteration
CNVSNP: Copy Number Variation
CNASeq: Copy Number Alteration
CNACGH: Copy Number Alteration
Methylation: Methylation
mRNAArray: Messenger RNA
miRNAArray: micro RNA
RPPAArray: Reverse Phase Protein Array
Mutation: Mutations
GISTIC: GISTIC v2 scores and probabilities
The data type element of the FirehoseData object
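As a brief illustration of the accessor, the sketch below pulls two of the listed data types out of the bundled accmini object. It assumes RTCGAToolbox is installed. The variable names `clin` and `mut` are illustrative.

```r
library(RTCGAToolbox)
data(accmini)

# Extract the clinical element of the FirehoseData object
clin <- selectType(accmini, "clinical")
class(clin)

# Extract the mutation element in the same way
mut <- selectType(accmini, "Mutation")
class(mut)
```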
Export toptable or correlation data frame
showResults(object)
object |
Returns toptable or correlation data frame
data(accmini)
Export toptable or correlation data frame
## S4 method for signature 'CorResult'
showResults(object)
object |
Returns correlation results data frame
data(accmini)
Export toptable or correlation data frame
## S4 method for signature 'DGEResult'
showResults(object)
object |
Returns toptable for DGE results
data(accmini)