TCGAbiolinks is providing functions to query and download TARGET data and also get SummarizedExperiments objects ready to use for downstream analysis. Moreover, a new design has been suggested for function TCGAanalyze_DEA to allow more user customization and support of limma pipeline
TARGET applies a comprehensive genomic approach to determine molecular changes that drive childhood cancers. Investigators form a collaborative network to facilitate discovery of molecular targets and translate those findings into the clinic. TARGET is managed by NCI’s Office of Cancer Genomics and Cancer Therapy Evaluation Program.
Below is an example to run to query, download, and prepare data from the TARGET initiative through tcgabiolinks package:
Available TARGET projects:
## grep("TARGET", projects, value = TRUE)
## 1 TARGET-AML
## 2 TARGET-ALL-P1
## 3 TARGET-ALL-P2
## 4 TARGET-ALL-P3
## 5 TARGET-NBL
## 6 TARGET-WT
## 7 TARGET-OS
## 8 TARGET-RT
## 9 TARGET-CCSK
#Downloading and prepare TARGET CASE
query_Target <- GDCquery(project = "TARGET-AML",
data.category = "Transcriptome Profiling",
data.type = "Gene Expression Quantification",
workflow.type = "HTSeq - Counts")
samplesDown_Target <- getResults(query_Target, cols = c("cases"))
dataSmTB_Target <- TCGAquery_SampleTypes(barcode = samplesDown_Target,
typesample = "TB")
dataSmNB_Target <- TCGAquery_SampleTypes(barcode = samplesDown_Target,
typesample = "TBM")
dataSmTB_short_Target <- dataSmTB_Target[1:10]
dataSmNB_short_Target <- dataSmNB_Target[1:10]
queryDown_Target <- GDCquery(project = "TARGET-AML",
data.category = "Transcriptome Profiling",
data.type = "Gene Expression Quantification",
workflow.type = "HTSeq - Counts",
barcode = c(dataSmTB_short_Target, dataSmNB_short_Target))
GDCdownload(queryDown_Target)
### SummarizedExperiment = TRUE
data <- GDCprepare(queryDown_Target)
dataPrep_Target <- TCGAanalyze_Preprocessing(object = data,
cor.cut = 0.6,
datatype = "HTSeq - Counts")
dataNorm_Target <- TCGAanalyze_Normalization(tabDF = dataPrep_Target,
geneInfo = geneInfoHT,
method = "gcContent")
boxplot(dataPrep_Target, outline = FALSE)
boxplot(dataNorm_Target, outline = FALSE)
dataFilt_Target <- TCGAanalyze_Filtering(tabDF = dataNorm_Target,
method = "quantile",
qnt.cut = 0.25)
CancerProject <- "TCGA-BRCA"
DataDirectory <- paste0("../GDC/",gsub("-","_",CancerProject))
FileNameData <- paste0(DataDirectory, "_","HTSeq_Counts",".rda")
query <- GDCquery(project = CancerProject,
data.category = "Transcriptome Profiling",
data.type = "Gene Expression Quantification",
workflow.type = "HTSeq - Counts")
samplesDown <- getResults(query,cols = c("cases"))
dataSmTP <- TCGAquery_SampleTypes(barcode = samplesDown,
typesample = "TP")
dataSmNT <- TCGAquery_SampleTypes(barcode = samplesDown,
typesample = "NT")
dataSmTP_short <- dataSmTP[1:10]
dataSmNT_short <- dataSmNT[1:10]
queryDown <- GDCquery(project = CancerProject,
data.category = "Transcriptome Profiling",
data.type = "Gene Expression Quantification",
workflow.type = "HTSeq - Counts",
barcode = c(dataSmTP_short, dataSmNT_short))
GDCdownload(query = queryDown)
dataPrep1 <- GDCprepare(query = queryDown,
save = TRUE,
save.filename = "TCGA_BRCA_HTSeq_Countds.rda")
dataPrep <- TCGAanalyze_Preprocessing(object = dataPrep1,
cor.cut = 0.6,
datatype = "HTSeq - Counts")
dataNorm <- TCGAanalyze_Normalization(tabDF = dataPrep,
geneInfo = geneInfoHT,
method = "gcContent")
dataNorm <- TCGAanalyze_Normalization(tabDF = dataPrep,
geneInfo = geneInfoHT,
method = "geneLength")
dataFilt <- TCGAanalyze_Filtering(tabDF = dataNorm,
method = "quantile",
qnt.cut = 0.25)
After applying filtering and normalization steps, this function reassigns raw counts to the filtered data frame of genesXsamples. This step could be useful for methods requiring to work with raw counts only (i.e edgeR pipeline) Arguments are:
This function takes a vector of TCGA barcodes as an input and returns a 2 elements list.
–If more data is needed, please check TabSubtypesCol_merged data frame available in the package data.
TCGAbiolinks is providing a new design for the TCGAanalyze_DEA() function to perform differential expression analysis (DEA) either contrasting two groups or multiple sets by providing a formula for contrast. Limma pipeline was also added on top of the previously implemented one in EdgeR.
The function takes the following arguments for the version bump:
This example will perform differential expression analysis between control and case groups matrices downloaded through tcgabiolinks from GDC portal. Other pipeline option could be edgeR.
contrast.formula argument is a string formula formatted in a way to be used as an input for DEA using limma pipeline. Elements of contrast in the formula should be one of the unique elements in CondTypes vector. The example below shows the syntax contrast.formula argument should have. We are contrasting here between Pam50 molecular subtypes of some lung cancer samples. N.B: Syntax for the contrast formula should be under this form *formula<-“contrast1=type1-type2, contrast2=type2-type3, contrast3=type1-(type2+type3)/2,…”
CondTypes vector should preferably have no space character in its elements.
### ##download and prepare lung data through GDC### ##
query.lung <- GDCquery(project = "TCGA-LUSC",
data.category = "Transcriptome Profiling",
data.type = "Gene Expression Quantification",
workflow.type = "HTSeq - Counts")
samplesDown.lusc <- getResults(query.lung,cols=c("cases"))
dataSmTP.lusc <- TCGAquery_SampleTypes(barcode = samplesDown.lusc,
typesample = "TP")
dataSmNT.lusc <- TCGAquery_SampleTypes(barcode = samplesDown.lusc,
typesample = "NT")
dataSmTP_short.lusc <- dataSmTP.lusc[1:10]
dataSmNT_short.lusc <- dataSmNT.lusc[1:10]
length(dataSmTP.lusc)
queryDown.lung <- GDCquery(project = "TCGA-LUSC",
data.category = "Transcriptome Profiling",
data.type = "Gene Expression Quantification",
workflow.type = "HTSeq - Counts",
barcode = c(dataSmTP.lusc, dataSmNT.lusc))
GDCdownload(query = queryDown.lung)
dataPrep1.tcga <- GDCprepare(query = queryDown.lung,
save = TRUE,
save.filename = "TCGA_LUSC_HTSeq_Countds.rda")
dataPrep.tcga <- TCGAanalyze_Preprocessing(object = dataPrep1.tcga,
cor.cut = 0.6,
datatype = "HTSeq - Counts")
dataNorm.tcga <- TCGAanalyze_Normalization(tabDF = dataPrep.tcga,
geneInfo = geneInfoHT,
method = "gcContent")
dataNorm.tcga <- TCGAanalyze_Normalization(tabDF = dataNorm.tcga,
geneInfo = geneInfoHT,
method = "geneLength")
dataFilt.tcga <- TCGAanalyze_Filtering(tabDF = dataNorm.tcga,
method = "quantile",
qnt.cut = 0.25)
### #Filtering data so all samples have a pam50 subtype for LUSC
diff<-setdiff(colnames(dataFilt.tcga), TCGA_MolecularSubtype(colnames(dataFilt.tcga))$filtered)
TCGA_MolecularSubtype(diff)$subtypes$barcodes
dataFilt.tcga.lusc<-dataFilt.tcga[,diff]
LUSC.subtypes<-TCGA_MolecularSubtype(colnames(dataFilt.tcga.lusc))$subtypes$subtype
### Differential expression analysis after correcting for "Plate" factor.
DEG.lusc <- TCGAanalyze_DEA(MAT=dataFilt.tcga.lusc,
pipeline="limma",
batch.factors = c("Plate"),
Cond1type = "Normal",
Cond2type = "Tumor",
fdr.cut = 0.01 ,
logFC.cut = 1,
method = "glmLRT", ClinicalDF = data.frame(),
Condtypes = LUSC.subtypes,
contrast.formula = "LUSC.basalvsLUSC.classical=LUSC.basal - LUSC.classical")
This function calls ComBat from sva package to handle batch correction, display some plots for exploratory data analysis purposes. More specifically, it shows a plot with black as a kernel estimate of the empirical batch effect density and red as the parametric.
On top of data exploration through ComBat plots, if no batch factor is provided nothing will be performed. Run the code below to see an example:
This function calls ComBat from sva package to handle batch correction, display some plots for exploratory data analysis purposes. More specifically, it shows a plot with black as a kernel estimate of the empirical batch effect density and red as the parametric.
On top of data exploration through ComBat plots, if no batch factor is provided nothing will be performed. Run the code below to see an example:
The following example shows how to consider two cancer type LUAD and LUSC and normal and tumor samples. The end-user can replace LUSC with unpublished data for example.
#dataFilt is a gene expression matrix from pancancer33 RNASeq data that can be generated following our previous described workflow
sampleTP <- TCGAquery_SampleTypes(barcode = colnames(dataFilt), typesample = "TP")
sampleNT <- TCGAquery_SampleTypes(barcode = colnames(dataFilt), typesample = "NT")
dataFiltTP <- dataFilt[,sampleTP]
dataFiltNT <- dataFilt[,sampleNT]
dataSubt_LUAD <-TCGAquery_subtype(tumor = "LUAD")
sampleStageI_LUAD <- dataSubt_LUAD[dataSubt_LUAD$Tumor.stage %in% c("Stage I","Stage IA","Stage IB"),]
dataFilt.LUAD.stageI.TP <- dataFiltTP[,substr(colnames(dataFiltTP),1,12) %in% sampleStageI_LUAD$patient]
dataFilt.LUAD.stageI.NT <- dataFiltNT[,substr(colnames(dataFiltNT),1,12) %in% sampleStageI_LUAD$patient]
# here for example the end-user can replace LUSC with unpublished data in a dataframe containing counts.
# and adding the information for LUSC.stageI.TP or LUSC.stageI.NT with unpublished data
dataSubt_LUSC <-TCGAquery_subtype(tumor = "LUSC")
sampleStageI_LUSC <- dataSubt_LUSC[dataSubt_LUSC$T.stage %in% c("T1","T1a","T1b"),]
dataFilt.LUSC.stageI.TP <- dataFiltTP[,substr(colnames(dataFiltTP),1,12) %in% sampleStageI_LUSC$patient]
dataFilt.LUSC.stageI.NT <- dataFiltNT[,substr(colnames(dataFiltNT),1,12) %in% sampleStageI_LUSC$patient]
countsTable <-cbind(dataFilt.LUAD.stageI.TP,dataFilt.LUAD.stageI.NT,
dataFilt.LUSC.stageI.TP, dataFilt.LUSC.stageI.NT)
AnnotationCounts <- matrix(0,ncol(countsTable),2)
colnames(AnnotationCounts) <- c("Samples","Batch")
rownames(AnnotationCounts) <- colnames(countsTable)
AnnotationCounts <- as.data.frame(AnnotationCounts)
AnnotationCounts$Samples <- colnames(countsTable)
AnnotationCounts[colnames(dataFilt.LUAD.stageI.TP),"Batch"] <- "LUAD.stageI.TP"
AnnotationCounts[colnames(dataFilt.LUAD.stageI.NT),"Batch"] <- "LUAD.stageI.NT"
AnnotationCounts[colnames(dataFilt.LUSC.stageI.TP),"Batch"] <- "LUSC.stageI.TP"
AnnotationCounts[colnames(dataFilt.LUSC.stageI.NT),"Batch"] <- "LUSC.stageI.NT"
countsCorrected <- TCGAbatch_Correction(tabDF =countsTable,
UnpublishedData = TRUE,
AnnotationDF = AnnotationCounts)
Function takes TCGA barcodes as input and filter them according to values specified by user. Default value for all of them is 0. Arguments are estimates using different measures imported from the TCGA consortium, and they are as the following (Refer to [1]):
The function returns a list object with “pure_barcodes”” attribute as a vector of pure samples and “filtered”” attribute as filtered samples with no purity info. –For tumor purity info of all measured TCGA samples, please check Tumor.purity data frame available with the package.–
[1] D. Aran, M. Sirota, and A. J. Butte. Systematic pan-cancer analysis of tumour purity. Nat Commun, 6:8971, 2015.
Recount2 project has made gene count data available for the scientific community to download. Recount2 is an online resource consisting of RNA-seq gene and exon counts as well as coverage bigWig files for 2041 different studies. It is the second generation of the ReCount project. The raw sequencing data were processed with Rail-RNA as described in the recount2 paper. The purpose of this function is to get data processed under one single pipeline to avoid any technical variabilities affecting any downstream integrative analysis. The returned object will be a list with elements named “project_tissue” and as a SummarizedExperiment (SE) of gene counts per sample. This is particularly useful to study batch-free data or to increase the pool of healthy/normal samples.
Arguments are: