Title: | TCGAbiolinks: An R/Bioconductor package for integrative analysis with GDC data |
---|---|
Description: | The aim of TCGAbiolinks is : i) facilitate the GDC open-access data retrieval, ii) prepare the data using the appropriate pre-processing strategies, iii) provide the means to carry out different standard analyses and iv) to easily reproduce earlier research results. In more detail, the package provides multiple methods for analysis (e.g., differential expression analysis, identifying differentially methylated regions) and methods for visualization (e.g., survival plots, volcano plots, starburst plots) in order to easily develop complete analysis pipelines. |
Authors: | Antonio Colaprico, Tiago Chedraoui Silva, Catharina Olsen, Luciano Garofano, Davide Garolini, Claudia Cava, Thais Sabedot, Tathiane Malta, Stefano M. Pagnotta, Isabella Castiglioni, Michele Ceccarelli, Gianluca Bontempi, Houtan Noushmehr |
Maintainer: | Tiago Chedraoui Silva <[email protected]>, Antonio Colaprico <[email protected]> |
License: | GPL (>= 3) |
Version: | 2.35.0 |
Built: | 2024-11-19 04:49:42 UTC |
Source: | https://github.com/bioc/TCGAbiolinks |
Create samples information matrix for GDC samples add subtype information
colDataPrepare(barcode)
colDataPrepare(barcode)
barcode |
TCGA or TARGET barcode |
metadata <- colDataPrepare(c("TCGA-OR-A5K3-01A","C3N-00321-01")) metadata <- colDataPrepare(c("BLGSP-71-06-00157-01A", "BLGSP-71-22-00332-01A"))
metadata <- colDataPrepare(c("TCGA-OR-A5K3-01A","C3N-00321-01")) metadata <- colDataPrepare(c("BLGSP-71-06-00157-01A", "BLGSP-71-22-00332-01A"))
Perform non-parametrix wilcoxon test
dmc.non.parametric( matrix, idx1 = NULL, idx2 = NULL, paired = FALSE, adj.method = "BH", alternative = "two.sided", cores = 1 )
dmc.non.parametric( matrix, idx1 = NULL, idx2 = NULL, paired = FALSE, adj.method = "BH", alternative = "two.sided", cores = 1 )
matrix |
A matrix |
idx1 |
Index columns group1 |
idx2 |
Index columns group2 |
paired |
Do a paired wilcoxon test? Default: True |
adj.method |
P-value adjustment method. Default:"BH" Benjamini-Hochberg |
alternative |
wilcoxon test alternative |
cores |
Number of cores to be used |
Data frame with p-values and diff mean
nrows <- 200; ncols <- 20 counts <- matrix( runif(nrows * ncols, 1, 1e4), nrows, dimnames = list(paste0("cg",1:200),paste0("S",1:20)) ) TCGAbiolinks:::dmc.non.parametric(counts,1:10,11:20)
nrows <- 200; ncols <- 20 counts <- matrix( runif(nrows * ncols, 1, 1e4), nrows, dimnames = list(paste0("cg",1:200),paste0("S",1:20)) ) TCGAbiolinks:::dmc.non.parametric(counts,1:10,11:20)
This function is a auxiliary function to visualize GAIA output (all significant aberrant regions.)
gaiaCNVplot(calls, threshold = 0.01)
gaiaCNVplot(calls, threshold = 0.01)
calls |
A matrix with the following columns: Chromossome, Aberration Kind Region Start, Region End, Region Size and score |
threshold |
Score threshold (orange horizontal line in the plot) |
A plot with all significant aberrant regions.
call <- data.frame("Chromossome" = rep(9,100), "Aberration Kind" = rep(c(-2,-1,0,1,2),20), "Region Start [bp]" = 18259823:18259922, "Region End [bp]" = 18259823:18259922, "score" = rep(c(1,2,3,4),25)) gaiaCNVplot(call,threshold = 0.01) call <- data.frame("Chromossome" = rep(c(1,9),50), "Aberration Kind" = rep(c(-2,-1,0,1,2),20), "Region Start [bp]" = 18259823:18259922, "Region End [bp]" = 18259823:18259922, "score" = rep(c(1,2,3,4),25)) gaiaCNVplot(call,threshold = 0.01)
call <- data.frame("Chromossome" = rep(9,100), "Aberration Kind" = rep(c(-2,-1,0,1,2),20), "Region Start [bp]" = 18259823:18259922, "Region End [bp]" = 18259823:18259922, "score" = rep(c(1,2,3,4),25)) gaiaCNVplot(call,threshold = 0.01) call <- data.frame("Chromossome" = rep(c(1,9),50), "Aberration Kind" = rep(c(-2,-1,0,1,2),20), "Region Start [bp]" = 18259823:18259922, "Region End [bp]" = 18259823:18259922, "score" = rep(c(1,2,3,4),25)) gaiaCNVplot(call,threshold = 0.01)
Uses GDC API or GDC transfer tool to download gdc data The user can use query argument The data from query will be save in a folder: project/data.category
GDCdownload( query, token.file, method = "api", directory = "GDCdata", files.per.chunk = NULL )
GDCdownload( query, token.file, method = "api", directory = "GDCdata", files.per.chunk = NULL )
query |
A query for GDCquery function |
token.file |
Token file to download controlled data (only for method = "client") |
method |
Uses the API (POST method) or gdc client tool. Options "api", "client". API is faster, but the data might get corrupted in the download, and it might need to be executed again |
directory |
Directory/Folder where the data was downloaded. Default: GDCdata |
files.per.chunk |
This will make the API method only download n (files.per.chunk) files at a time. This may reduce the download problems when the data size is too large. Expected a integer number (example files.per.chunk = 6) |
Shows the output from the GDC transfer tools
Tiago Chedraoui Silva
## Not run: # Download clinical data from XML query <- GDCquery(project = "TCGA-COAD", data.category = "Clinical") GDCdownload(query, files.per.chunk = 200) query <- GDCquery( project = "TARGET-AML", data.category = "Transcriptome Profiling", data.type = "miRNA Expression Quantification", workflow.type = "BCGSC miRNA Profiling", barcode = c("TARGET-20-PARUDL-03A-01R", "TARGET-20-PASRRB-03A-01R") ) # data will be saved in: # example_data_dir/TARGET-AML/harmonized/Transcriptome_Profiling/miRNA_Expression_Quantification GDCdownload(query, method = "client", directory = "example_data_dir") query_acc_gbm <- GDCquery( project = c("TCGA-ACC", "TCGA-GBM"), data.category = "Transcriptome Profiling", data.type = "Gene Expression Quantification", workflow.type = "STAR - Counts" ) GDCdownload( query = query_acc_gbm, method = "api", directory = "example", files.per.chunk = 50 ) ## End(Not run)
## Not run: # Download clinical data from XML query <- GDCquery(project = "TCGA-COAD", data.category = "Clinical") GDCdownload(query, files.per.chunk = 200) query <- GDCquery( project = "TARGET-AML", data.category = "Transcriptome Profiling", data.type = "miRNA Expression Quantification", workflow.type = "BCGSC miRNA Profiling", barcode = c("TARGET-20-PARUDL-03A-01R", "TARGET-20-PASRRB-03A-01R") ) # data will be saved in: # example_data_dir/TARGET-AML/harmonized/Transcriptome_Profiling/miRNA_Expression_Quantification GDCdownload(query, method = "client", directory = "example_data_dir") query_acc_gbm <- GDCquery( project = c("TCGA-ACC", "TCGA-GBM"), data.category = "Transcriptome Profiling", data.type = "Gene Expression Quantification", workflow.type = "STAR - Counts" ) GDCdownload( query = query_acc_gbm, method = "api", directory = "example", files.per.chunk = 50 ) ## End(Not run)
Reads the data downloaded and prepare it into an R object
GDCprepare( query, save = FALSE, save.filename, directory = "GDCdata", summarizedExperiment = TRUE, remove.files.prepared = FALSE, add.gistic2.mut = NULL, mutant_variant_classification = c("Frame_Shift_Del", "Frame_Shift_Ins", "Missense_Mutation", "Nonsense_Mutation", "Splice_Site", "In_Frame_Del", "In_Frame_Ins", "Translation_Start_Site", "Nonstop_Mutation") )
GDCprepare( query, save = FALSE, save.filename, directory = "GDCdata", summarizedExperiment = TRUE, remove.files.prepared = FALSE, add.gistic2.mut = NULL, mutant_variant_classification = c("Frame_Shift_Del", "Frame_Shift_Ins", "Missense_Mutation", "Nonsense_Mutation", "Splice_Site", "In_Frame_Del", "In_Frame_Ins", "Translation_Start_Site", "Nonstop_Mutation") )
query |
A query for GDCquery function |
save |
Save result as RData object? |
save.filename |
Name of the file to be save if empty an automatic will be created |
directory |
Directory/Folder where the data was downloaded. Default: GDCdata |
summarizedExperiment |
Create a summarizedExperiment? Default TRUE (if possible) |
remove.files.prepared |
Remove the files read? Default: FALSE This argument will be considered only if save argument is set to true |
add.gistic2.mut |
If a list of genes (gene symbol) is given, columns with gistic2 results from GDAC firehose (hg19) and a column indicating if there is or not mutation in that gene (hg38) (TRUE or FALSE - use the MAF file for more information) will be added to the sample matrix in the summarized Experiment object. |
mutant_variant_classification |
List of mutant_variant_classification that will be consider a sample mutant or not. Default: "Frame_Shift_Del", "Frame_Shift_Ins", "Missense_Mutation", "Nonsense_Mutation", "Splice_Site", "In_Frame_Del", "In_Frame_Ins", "Translation_Start_Site", "Nonstop_Mutation" |
A summarizedExperiment or a data.frame
Tiago Chedraoui Silva
## Not run: query <- GDCquery( project = "TCGA-KIRP", data.category = "Simple Nucleotide Variation", data.type = "Masked Somatic Mutation" ) GDCdownload(query, method = "api", directory = "maf") maf <- GDCprepare(query, directory = "maf") ## End(Not run)
## Not run: query <- GDCquery( project = "TCGA-KIRP", data.category = "Simple Nucleotide Variation", data.type = "Masked Somatic Mutation" ) GDCdownload(query, method = "api", directory = "maf") maf <- GDCprepare(query, directory = "maf") ## End(Not run)
This function receives the query argument and parses the clinical xml files based on the desired information
GDCprepare_clinic(query, clinical.info, directory = "GDCdata")
GDCprepare_clinic(query, clinical.info, directory = "GDCdata")
query |
Result from GDCquery, with data.category set to Clinical |
clinical.info |
Which information should be retrieved. Options Clinical: drug, admin, follow_up,radiation, patient, stage_event or new_tumor_event Options Biospecimen: protocol, admin, aliquot, analyte, bio_patient, sample, portion, slide |
directory |
Directory/Folder where the data was downloaded. Default: GDCdata |
A data frame with the parsed values from the XML
query <- GDCquery( project = "TCGA-COAD", data.category = "Clinical", data.format = "bcr xml", barcode = c("TCGA-RU-A8FL","TCGA-AA-3972") ) GDCdownload(query) clinical <- GDCprepare_clinic(query,"patient") clinical.drug <- GDCprepare_clinic(query,"drug") clinical.radiation <- GDCprepare_clinic(query,"radiation") clinical.admin <- GDCprepare_clinic(query,"admin") ## Not run: query <- GDCquery( project = "TCGA-COAD", data.category = "Biospecimen", data.format = "bcr xml", data.type = "Biospecimen Supplement", barcode = c("TCGA-RU-A8FL","TCGA-AA-3972") ) GDCdownload(query) clinical <- GDCprepare_clinic(query,"admin") clinical.drug <- GDCprepare_clinic(query,"sample") clinical.radiation <- GDCprepare_clinic(query,"portion") clinical.admin <- GDCprepare_clinic(query,"slide") ## End(Not run)
query <- GDCquery( project = "TCGA-COAD", data.category = "Clinical", data.format = "bcr xml", barcode = c("TCGA-RU-A8FL","TCGA-AA-3972") ) GDCdownload(query) clinical <- GDCprepare_clinic(query,"patient") clinical.drug <- GDCprepare_clinic(query,"drug") clinical.radiation <- GDCprepare_clinic(query,"radiation") clinical.admin <- GDCprepare_clinic(query,"admin") ## Not run: query <- GDCquery( project = "TCGA-COAD", data.category = "Biospecimen", data.format = "bcr xml", data.type = "Biospecimen Supplement", barcode = c("TCGA-RU-A8FL","TCGA-AA-3972") ) GDCdownload(query) clinical <- GDCprepare_clinic(query,"admin") clinical.drug <- GDCprepare_clinic(query,"sample") clinical.radiation <- GDCprepare_clinic(query,"portion") clinical.admin <- GDCprepare_clinic(query,"slide") ## End(Not run)
Uses GDC API to search for search, it searches for both controlled and open-access data. For GDC data arguments project, data.category, data.type and workflow.type should be used Please, see the vignette for a table with the possibilities.
GDCquery( project, data.category, data.type, workflow.type, access, platform, barcode, data.format, experimental.strategy, sample.type )
GDCquery( project, data.category, data.type, workflow.type, access, platform, barcode, data.format, experimental.strategy, sample.type )
project |
A list of valid project (see list with TCGAbiolinks:::getGDCprojects()$project_id)]
|
||||||||||||||||||||||||||||||||||||||||||||||
data.category |
A valid project (see list with TCGAbiolinks:::getProjectSummary(project)) For the complete list please check the vignette. List for harmonized database:
|
||||||||||||||||||||||||||||||||||||||||||||||
data.type |
A data type to filter the files to download For the complete list please check the vignette. |
||||||||||||||||||||||||||||||||||||||||||||||
workflow.type |
GDC workflow type |
||||||||||||||||||||||||||||||||||||||||||||||
access |
Filter by access type. Possible values: controlled, open |
||||||||||||||||||||||||||||||||||||||||||||||
platform |
Example:
|
||||||||||||||||||||||||||||||||||||||||||||||
barcode |
A list of barcodes to filter the files to download |
||||||||||||||||||||||||||||||||||||||||||||||
data.format |
Data format filter ("VCF", "TXT", "BAM","SVS","BCR XML","BCR SSF XML", "TSV", "BCR Auxiliary XML", "BCR OMF XML", "BCR Biotab", "MAF", "BCR PPS XML", "XLSX") |
||||||||||||||||||||||||||||||||||||||||||||||
experimental.strategy |
Filter to experimental strategy. Harmonized: WXS, RNA-Seq, miRNA-Seq, Genotyping Array. |
||||||||||||||||||||||||||||||||||||||||||||||
sample.type |
A sample type to filter the files to download |
A data frame with the results and the parameters used
Tiago Chedraoui Silva
query <- GDCquery( project = "TCGA-ACC", data.category = "Copy Number Variation", data.type = "Copy Number Segment" ) ## Not run: query <- GDCquery( project = "TARGET-AML", data.category = "Transcriptome Profiling", data.type = "miRNA Expression Quantification", workflow.type = "BCGSC miRNA Profiling", barcode = c("TARGET-20-PARUDL-03A-01R","TARGET-20-PASRRB-03A-01R") ) query <- GDCquery( project = "TARGET-AML", data.category = "Transcriptome Profiling", data.type = "Gene Expression Quantification", workflow.type = "STAR - Counts", barcode = c("TARGET-20-PADZCG-04A-01R","TARGET-20-PARJCR-09A-01R") ) query <- GDCquery( project = "TCGA-ACC", data.category = "Copy Number Variation", data.type = "Masked Copy Number Segment", sample.type = c("Primary Tumor") ) ## End(Not run)
query <- GDCquery( project = "TCGA-ACC", data.category = "Copy Number Variation", data.type = "Copy Number Segment" ) ## Not run: query <- GDCquery( project = "TARGET-AML", data.category = "Transcriptome Profiling", data.type = "miRNA Expression Quantification", workflow.type = "BCGSC miRNA Profiling", barcode = c("TARGET-20-PARUDL-03A-01R","TARGET-20-PASRRB-03A-01R") ) query <- GDCquery( project = "TARGET-AML", data.category = "Transcriptome Profiling", data.type = "Gene Expression Quantification", workflow.type = "STAR - Counts", barcode = c("TARGET-20-PADZCG-04A-01R","TARGET-20-PARJCR-09A-01R") ) query <- GDCquery( project = "TCGA-ACC", data.category = "Copy Number Variation", data.type = "Masked Copy Number Segment", sample.type = c("Primary Tumor") ) ## End(Not run)
Retrieve open access ATAC-seq files from GDC server https://gdc.cancer.gov/about-data/publications/ATACseq-AWG Manifest available at: https://gdc.cancer.gov/files/public/file/ATACseq-AWG_Open_GDC-Manifest.txt
GDCquery_ATAC_seq(tumor = NULL, file.type = NULL)
GDCquery_ATAC_seq(tumor = NULL, file.type = NULL)
tumor |
a valid tumor |
file.type |
Write maf file into a csv document |
A data frame with the maf file information
query <- GDCquery_ATAC_seq(file.type = "txt") ## Not run: GDCdownload(query) ## End(Not run) query <- GDCquery_ATAC_seq(tumor = "BRCA",file.type = "bigWigs") ## Not run: GDCdownload(query,method = "client") ## End(Not run)
query <- GDCquery_ATAC_seq(file.type = "txt") ## Not run: GDCdownload(query) ## End(Not run) query <- GDCquery_ATAC_seq(tumor = "BRCA",file.type = "bigWigs") ## Not run: GDCdownload(query,method = "client") ## End(Not run)
GDCquery_clinic will download all clinical information from the API as the one with using the button from each project
GDCquery_clinic(project, type = "clinical", save.csv = FALSE)
GDCquery_clinic(project, type = "clinical", save.csv = FALSE)
project |
A valid project (see list with getGDCprojects()$project_id)]
|
type |
A valid type. Options "clinical", "Biospecimen" (see list with getGDCprojects()$project_id)] |
save.csv |
Write clinical information into a csv document |
A data frame with the clinical information
Tiago Chedraoui Silva
clinical <- GDCquery_clinic( project = "TCGA-ACC", type = "clinical", save.csv = FALSE ) clinical <- GDCquery_clinic( project = "TCGA-ACC", type = "biospecimen", save.csv = FALSE ) ## Not run: clinical_cptac_3 <- GDCquery_clinic( project = "CPTAC-3", type = "clinical" ) clinical_cptac_2 <- GDCquery_clinic( project = "CPTAC-2", type = "clinical" ) clinical_HCMI_CMDC <- GDCquery_clinic( project = "HCMI-CMDC", type = "clinical" ) clinical_GCI_HTMCP_CC <- GDCquery_clinic( project = "CGCI-HTMCP-CC", type = "clinical" ) clinical <- GDCquery_clinic( project = "NCICCR-DLBCL", type = "clinical" ) clinical <- GDCquery_clinic( project = "ORGANOID-PANCREATIC", type = "clinical" ) ## End(Not run)
clinical <- GDCquery_clinic( project = "TCGA-ACC", type = "clinical", save.csv = FALSE ) clinical <- GDCquery_clinic( project = "TCGA-ACC", type = "biospecimen", save.csv = FALSE ) ## Not run: clinical_cptac_3 <- GDCquery_clinic( project = "CPTAC-3", type = "clinical" ) clinical_cptac_2 <- GDCquery_clinic( project = "CPTAC-2", type = "clinical" ) clinical_HCMI_CMDC <- GDCquery_clinic( project = "HCMI-CMDC", type = "clinical" ) clinical_GCI_HTMCP_CC <- GDCquery_clinic( project = "CGCI-HTMCP-CC", type = "clinical" ) clinical <- GDCquery_clinic( project = "NCICCR-DLBCL", type = "clinical" ) clinical <- GDCquery_clinic( project = "ORGANOID-PANCREATIC", type = "clinical" ) ## End(Not run)
get_IDs allows user to extract metadata from barcodes. The dataframe returned has columns for 'project', 'tss','participant', 'sample', "portion", "plate", and "center"
get_IDs(data)
get_IDs(data)
data |
numeric matrix, each row represents a gene, each column represents a sample |
data frame with columns 'project', 'tss','participant', 'sample', "portion", "plate", "center", "condition"
Get hg19 (from biomart) or hg38 (gencode v36) gene annotation
get.GRCh.bioMart(genome = c("hg19", "hg38"), as.granges = FALSE)
get.GRCh.bioMart(genome = c("hg19", "hg38"), as.granges = FALSE)
genome |
hg38 or hg19 |
as.granges |
Output as GRanges or data.frame |
Using biogrid database, it will create a matrix of gene interactions. If columns A and row B has value 1, it means the gene A and gene B interacts.
getAdjacencyBiogrid(tmp.biogrid, names.genes = NULL)
getAdjacencyBiogrid(tmp.biogrid, names.genes = NULL)
tmp.biogrid |
Biogrid table |
names.genes |
List of genes to filter from output. Default: consider all genes |
A matrix with 1 for genes that interacts, 0 for no interaction.
names.genes.de <- c("PLCB1","MCL1","PRDX4","TTF2","TACC3", "PARP4","LSM1") tmp.biogrid <- data.frame("Official.Symbol.Interactor.A" = names.genes.de, "Official.Symbol.Interactor.B" = rev(names.genes.de)) net.biogrid.de <- getAdjacencyBiogrid(tmp.biogrid, names.genes.de) ## Not run: file <- paste0("http://thebiogrid.org/downloads/archives/", "Release%20Archive/BIOGRID-3.4.133/BIOGRID-ALL-3.4.133.tab2.zip") downloader::download(file,basename(file)) unzip(basename(file),junkpaths =TRUE) tmp.biogrid <- read.csv(gsub("zip","txt",basename(file)), header=TRUE, sep="\t", stringsAsFactors=FALSE) names.genes.de <- c("PLCB1","MCL1","PRDX4","TTF2","TACC3", "PARP4","LSM1") net.biogrid.de <- getAdjacencyBiogrid(tmp.biogrid, names.genes.de) ## End(Not run)
names.genes.de <- c("PLCB1","MCL1","PRDX4","TTF2","TACC3", "PARP4","LSM1") tmp.biogrid <- data.frame("Official.Symbol.Interactor.A" = names.genes.de, "Official.Symbol.Interactor.B" = rev(names.genes.de)) net.biogrid.de <- getAdjacencyBiogrid(tmp.biogrid, names.genes.de) ## Not run: file <- paste0("http://thebiogrid.org/downloads/archives/", "Release%20Archive/BIOGRID-3.4.133/BIOGRID-ALL-3.4.133.tab2.zip") downloader::download(file,basename(file)) unzip(basename(file),junkpaths =TRUE) tmp.biogrid <- read.csv(gsub("zip","txt",basename(file)), header=TRUE, sep="\t", stringsAsFactors=FALSE) names.genes.de <- c("PLCB1","MCL1","PRDX4","TTF2","TACC3", "PARP4","LSM1") net.biogrid.de <- getAdjacencyBiogrid(tmp.biogrid, names.genes.de) ## End(Not run)
Create a Summary table for each sample in a project saying if it contains or not files for a certain data category
getDataCategorySummary(project)
getDataCategorySummary(project)
project |
A GDC project |
A data frame
Tiago Chedraoui Silva
summary <- getDataCategorySummary("TCGA-ACC")
summary <- getDataCategorySummary("TCGA-ACC")
Check GDC server status using the api https://api.gdc.cancer.gov/status
getGDCInfo()
getGDCInfo()
Return true all status
info <- getGDCInfo()
info <- getGDCInfo()
getGDCprojects uses the following api to get projects https://api.gdc.cancer.gov/projects
getGDCprojects()
getGDCprojects()
A data frame with last GDC projects
projects <- getGDCprojects()
projects <- getGDCprojects()
Download GISTIC data from firehose from http://gdac.broadinstitute.org/runs/analyses__latest/data/
getGistic(disease, type = "thresholded")
getGistic(disease, type = "thresholded")
disease |
TCGA disease. Option available in http://gdac.broadinstitute.org/runs/analyses__latest/data/ |
type |
Results type: thresholded or data |
Retrieve linkedOmics data from http://linkedomics.org/
getLinkedOmicsData(project, dataset)
getLinkedOmicsData(project, dataset)
project |
A linkedOmics project:
|
dataset |
A dataset from the list below
|
A matrix with the data
## Not run: TCGA_COAD_protein <- getLinkedOmicsData( project = "TCGA-COADREAD", dataset = "Proteome (Gene level)" ) TCGA_COAD_RNASeq_hiseq <- getLinkedOmicsData( project = "TCGA-COADREAD", dataset = "RNAseq (HiSeq, Gene level)" ) TCGA_COAD_RNASeq_ga <- getLinkedOmicsData( project = "TCGA-COADREAD", dataset = "RNAseq (GA, Gene level)" ) TCGA_COAD_RPPA <- getLinkedOmicsData( project = "TCGA-COADREAD", dataset = "RPPA (Gene level)" ) ## End(Not run)
## Not run: TCGA_COAD_protein <- getLinkedOmicsData( project = "TCGA-COADREAD", dataset = "Proteome (Gene level)" ) TCGA_COAD_RNASeq_hiseq <- getLinkedOmicsData( project = "TCGA-COADREAD", dataset = "RNAseq (HiSeq, Gene level)" ) TCGA_COAD_RNASeq_ga <- getLinkedOmicsData( project = "TCGA-COADREAD", dataset = "RNAseq (GA, Gene level)" ) TCGA_COAD_RPPA <- getLinkedOmicsData( project = "TCGA-COADREAD", dataset = "RPPA (Gene level)" ) ## End(Not run)
Get a Manifest from GDCquery output that can be used with GDC-client
getManifest(query, save = FALSE)
getManifest(query, save = FALSE)
query |
A query for GDCquery function |
save |
Write Manifest to a txt file (tab separated) |
query <- GDCquery( project = "TARGET-AML", data.category = "Transcriptome Profiling", data.type = "Gene Expression Quantification", workflow.type = "STAR - Counts", barcode = c("TARGET-20-PADZCG-04A-01R", "TARGET-20-PARJCR-09A-01R") ) getManifest(query)
query <- GDCquery( project = "TARGET-AML", data.category = "Transcriptome Profiling", data.type = "Gene Expression Quantification", workflow.type = "STAR - Counts", barcode = c("TARGET-20-PADZCG-04A-01R", "TARGET-20-PARJCR-09A-01R") ) getManifest(query)
Download data from https://gdc.cancer.gov/about-data/publications/mc3-2017 https://gdc-docs.nci.nih.gov/Data/Release_Notes/Data_Release_Notes/
getMC3MAF()
getMC3MAF()
A data frame with the MAF file information from https://gdc.cancer.gov/about-data/publications/mc3-2017
## Not run: maf <- getMC3MAF() ## End(Not run)
## Not run: maf <- getMC3MAF() ## End(Not run)
Get Number of cases in GDC for a project
getNbCases(project, data.category)
getNbCases(project, data.category)
project |
A GDC project |
data.category |
A GDC project data category |
Tiago Chedraoui Silva
## Not run: getNbCases("TCGA-ACC","Clinical") getNbCases("CPTAC-2","Clinical") ## End(Not run)
## Not run: getNbCases("TCGA-ACC","Clinical") getNbCases("CPTAC-2","Clinical") ## End(Not run)
Get Number of files in GDC for a project
getNbFiles(project, data.category)
getNbFiles(project, data.category)
project |
A GDC project |
data.category |
A GDC project data category |
Tiago Chedraoui Silva
## Not run: getNbFiles("TCGA-ACC","Clinical") getNbFiles("CPTAC-2","Clinical") ## End(Not run)
## Not run: getNbFiles("TCGA-ACC","Clinical") getNbFiles("CPTAC-2","Clinical") ## End(Not run)
Get Project Summary from GDC
getProjectSummary(project)
getProjectSummary(project)
project |
A GDC project |
Tiago Chedraoui Silva
getProjectSummary("TCGA-ACC") ## Not run: getProjectSummary("CPTAC-2") ## End(Not run)
getProjectSummary("TCGA-ACC") ## Not run: getProjectSummary("CPTAC-2") ## End(Not run)
Get the results table from query, it can select columns with cols argument and return a number of rows using rows argument.
getResults(query, rows, cols)
getResults(query, rows, cols)
query |
A object from GDCquery |
rows |
Rows identifiers (row numbers) |
cols |
Columns identifiers (col names) |
Table with query results
query <- GDCquery( project = "TCGA-GBM", data.category = "Transcriptome Profiling", data.type = "Gene Expression Quantification", workflow.type = "STAR - Counts", barcode = c("TCGA-14-0736-02A-01R-2005-01", "TCGA-06-0211-02A-02R-2005-01") ) results <- getResults(query)
query <- GDCquery( project = "TCGA-GBM", data.category = "Transcriptome Profiling", data.type = "Gene Expression Quantification", workflow.type = "STAR - Counts", barcode = c("TCGA-14-0736-02A-01R-2005-01", "TCGA-06-0211-02A-02R-2005-01") ) results <- getResults(query)
Retrieve the numner of files under each data_category + data_type + experimental_strategy + platform Almost like https://portal.gdc.cancer.gov/exploration
getSampleFilesSummary(project, files.access = NA)
getSampleFilesSummary(project, files.access = NA)
project |
A GDC project |
files.access |
Filter by file access ("open" or "controlled"). Default: no filter |
A data frame with the maf file information
Tiago Chedraoui Silva
summary <- getSampleFilesSummary("TCGA-UCS") ## Not run: summary <- getSampleFilesSummary(c("TCGA-OV","TCGA-ACC")) ## End(Not run)
summary <- getSampleFilesSummary("TCGA-UCS") ## Not run: summary <- getSampleFilesSummary(c("TCGA-OV","TCGA-ACC")) ## End(Not run)
getTSS to fetch GENCODE gene annotation (transcripts level) from Bioconductor package biomaRt If upstream and downstream are specified in TSS list, promoter regions of GENCODE gene will be generated.
getTSS( genome = c("hg38", "hg19"), TSS = list(upstream = NULL, downstream = NULL) )
getTSS( genome = c("hg38", "hg19"), TSS = list(upstream = NULL, downstream = NULL) )
genome |
Which genome build will be used: hg38 (default) or hg19. |
TSS |
A list. Contains upstream and downstream like TSS=list(upstream, downstream). When upstream and downstream is specified, coordinates of promoter regions with gene annotation will be generated. |
GENCODE gene annotation if TSS is not specified. Coordinates of GENCODE gene promoter regions if TSS is specified.
# get GENCODE gene annotation (transcripts level) ## Not run: getTSS <- getTSS() getTSS <- getTSS(genome.build = "hg38", TSS=list(upstream=1000, downstream=1000)) ## End(Not run)
# get GENCODE gene annotation (transcripts level) ## Not run: getTSS <- getTSS() getTSS <- getTSS(genome.build = "hg38", TSS=list(upstream=1000, downstream=1000)) ## End(Not run)
Classify DNA methylation gliomas using data from https://doi.org/10.1016/j.cell.2015.12.028
gliomaClassifier(data)
gliomaClassifier(data)
data |
DNA methylation matrix or Summarized Experiments with samples on columns and probes on the rows |
A list of 3 data frames: 1) Sample final classification 2) Each model final classification 3) Each class probability of classification
Tiago Chedraoui Silva, Tathiane Malta, Houtan Noushmehr
## Not run: query <- GDCquery( project= "TCGA-GBM", data.category = "DNA methylation", barcode = c("TCGA-06-0122","TCGA-14-1456"), platform = "Illumina Human Methylation 27", legacy = TRUE ) GDCdownload(query) data.hg19 <- GDCprepare(query) classification <- gliomaClassifier(data.hg19) # Comparing reslts TCGAquery_subtype("GBM") %>% dplyr::filter(patient %in% c("TCGA-06-0122","TCGA-14-1456")) %>% dplyr::select("patient","Supervised.DNA.Methylation.Cluster") ## End(Not run)
## Not run: query <- GDCquery( project= "TCGA-GBM", data.category = "DNA methylation", barcode = c("TCGA-06-0122","TCGA-14-1456"), platform = "Illumina Human Methylation 27", legacy = TRUE ) GDCdownload(query) data.hg19 <- GDCprepare(query) classification <- gliomaClassifier(data.hg19) # Comparing reslts TCGAquery_subtype("GBM") %>% dplyr::filter(patient %in% c("TCGA-06-0122","TCGA-14-1456")) %>% dplyr::select("patient","Supervised.DNA.Methylation.Cluster") ## End(Not run)
Check GDC server status using the api https://api.gdc.cancer.gov/status
isServeOK()
isServeOK()
Return true if status is ok
status <- isServeOK()
status <- isServeOK()
For a given TCGA project it gets the primary tumors samples (barcode) with both DNA methylation and Gene expression data from GDC database
matchedMetExp(project, n = NULL)
matchedMetExp(project, n = NULL)
project |
A GDC project |
n |
Number of samples to return. If NULL return all (default) |
A vector of barcodes
# Get ACC samples with both DNA methylation (HM450K) and gene expression aligned to hg19 samples <- matchedMetExp("TCGA-UCS")
# Get ACC samples with both DNA methylation (HM450K) and gene expression aligned to hg19 samples <- matchedMetExp("TCGA-UCS")
PanCancerAtlas_subtypes is a curated table with molecular subtypes for 24 TCGA cancer types
PanCancerAtlas_subtypes()
PanCancerAtlas_subtypes()
a data.frame with barcode and molecular subtypes for 24 cancer types
molecular.subtypes <- PanCancerAtlas_subtypes()
molecular.subtypes <- PanCancerAtlas_subtypes()
internal function to break a huge API call into smaller ones so it repects the max character limit of a string
splitAPICall(FUN, step = 20, items)
splitAPICall(FUN, step = 20, items)
FUN |
function that calls the API |
step |
How many items to be evaluated per API call |
items |
vector of items to be using within the function (list of barcodes, aliquot ids, etc) |
Tiago Chedraoui Silva
A dataset containing the Sample Ids from TCGA and PAM50 subtyping attributes of 4768 tumor patients
TabSubtypesCol_merged
TabSubtypesCol_merged
A data frame with 4768 rows and 3 variables:
Sample ID from TCGA barcodes, character string
Pam50 classification, character string
color, character string
...
TCGA_MolecularSubtype Retrieve molecular subtypes from TCGA consortium for a given set of barcodes
TCGA_MolecularSubtype(barcodes)
TCGA_MolecularSubtype(barcodes)
barcodes |
is a vector of TCGA barcodes |
List with $subtypes attribute as a dataframe with barcodes, samples, subtypes, and colors. The $filtered attribute is returned as filtered samples with no subtype info
TCGA_MolecularSubtype("TCGA-60-2721-01A-01R-0851-07")
TCGA_MolecularSubtype("TCGA-60-2721-01A-01R-0851-07")
TCGAanalyze_analyseGRN perform gene regulatory network.
TCGAanalyze_analyseGRN(TFs, normCounts, kNum)
TCGAanalyze_analyseGRN(TFs, normCounts, kNum)
TFs |
a vector of genes. |
normCounts |
is a matrix of gene expression with genes in rows and samples in columns. |
kNum |
the number of nearest neighbors to consider to estimate the mutual information. Must be less than the number of columns of normCounts. |
an adjacent matrix
Hierarchical cluster analysis using several methods such as ward.D", "ward.D2", "single", "complete", "average" (= UPGMA), "mcquitty" (= WPGMA), "median" (= WPGMC) or "centroid" (= UPGMC).
TCGAanalyze_Clustering(tabDF, method, methodHC = "ward.D2")
TCGAanalyze_Clustering(tabDF, method, methodHC = "ward.D2")
tabDF |
is a dataframe or numeric matrix, each row represents a gene, each column represents a sample come from TCGAPrepare. |
method |
is method to be used for generic cluster such as 'hclust' or 'consensus' |
methodHC |
is method to be used for Hierarchical cluster. |
object of class hclust if method selected is 'hclust'. If method selected is 'Consensus' returns a list of length maxK (maximum cluster number to evaluate.). Each element is a list containing consensusMatrix (numerical matrix), consensusTree (hclust), consensusClass (consensus class assignments). ConsensusClusterPlus also produces images.
TCGAanalyze_DEA allows user to perform Differentially expression analysis (DEA), using edgeR package or limma to identify differentially expressed genes (DEGs). It is possible to do a two-class analysis.
TCGAanalyze_DEA performs DEA using following functions from edgeR:
edgeR::DGEList converts the count matrix into an edgeR object.
edgeR::estimateCommonDisp each gene gets assigned the same dispersion estimate.
edgeR::exactTest performs pair-wise tests for differential expression between two groups.
edgeR::topTags takes the output from exactTest(), adjusts the raw p-values using the False Discovery Rate (FDR) correction, and returns the top differentially expressed genes.
TCGAanalyze_DEA performs DEA using following functions from limma:
limma::makeContrasts construct matrix of custom contrasts.
limma::lmFit Fit linear model for each gene given a series of arrays.
limma::contrasts.fit Given a linear model fit to microarray data, compute estimated coefficients and standard errors for a given set of contrasts.
limma::eBayes Given a microarray linear model fit, compute moderated t-statistics, moderated F-statistic, and log-odds of differential expression by empirical Bayes moderation of the standard errors towards a common value.
limma::toptable Extract a table of the top-ranked genes from a linear model fit.
TCGAanalyze_DEA( mat1, mat2, metadata = TRUE, Cond1type, Cond2type, pipeline = "edgeR", method = "exactTest", fdr.cut = 1, logFC.cut = 0, batch.factors = NULL, ClinicalDF = data.frame(), paired = FALSE, log.trans = FALSE, voom = FALSE, trend = FALSE, MAT = data.frame(), contrast.formula = "", Condtypes = c() )
TCGAanalyze_DEA( mat1, mat2, metadata = TRUE, Cond1type, Cond2type, pipeline = "edgeR", method = "exactTest", fdr.cut = 1, logFC.cut = 0, batch.factors = NULL, ClinicalDF = data.frame(), paired = FALSE, log.trans = FALSE, voom = FALSE, trend = FALSE, MAT = data.frame(), contrast.formula = "", Condtypes = c() )
mat1 |
numeric matrix, each row represents a gene, each column represents a sample with Cond1type |
mat2 |
numeric matrix, each row represents a gene, each column represents a sample with Cond2type |
metadata |
Add metadata |
Cond1type |
a string containing the class label of the samples in mat1 (e.g., control group) |
Cond2type |
a string containing the class label of the samples in mat2 (e.g., case group) |
pipeline |
a string to specify which package to use ("limma" or "edgeR") |
method |
is 'glmLRT' (1) or 'exactTest' (2) used for edgeR (1) Fit a negative binomial generalized log-linear model to the read counts for each gene (2) Compute genewise exact tests for differences in the means between two groups of negative-binomially distributed counts. |
fdr.cut |
is a threshold to filter DEGs according their p-value corrected |
logFC.cut |
is a threshold to filter DEGs according their logFC |
batch.factors |
a vector containing strings to specify options for batch correction. Options are "Plate", "TSS", "Year", "Portion", "Center", and "Patients" |
ClinicalDF |
a dataframe returned by GDCquery_clinic() to be used to extract year data |
paired |
boolean to account for paired or non-paired samples. Set to TRUE for paired case |
log.trans |
boolean to perform log cpm transformation. Set to TRUE for log transformation |
voom |
boolean to perform voom transformation for limma-voom pipeline. Set to TRUE for voom transformation |
trend |
boolean to perform limma-trend pipeline. Set to TRUE to go through limma-trend |
MAT |
matrix containing expression set as all samples in columns and genes as rows. Do not provide if mat1 and mat2 are used |
contrast.formula |
string input to determine coefficients and to design contrasts in a customized way |
Condtypes |
vector of grouping for samples in MAT |
table with DEGs containing for each gene logFC, logCPM, pValue,and FDR, also for each contrast
dataNorm <- TCGAbiolinks::TCGAanalyze_Normalization(dataBRCA, geneInfo) dataFilt <- TCGAanalyze_Filtering(tabDF = dataBRCA, method = "quantile", qnt.cut = 0.25) samplesNT <- TCGAquery_SampleTypes(colnames(dataFilt), typesample = c("NT")) samplesTP <- TCGAquery_SampleTypes(colnames(dataFilt), typesample = c("TP")) dataDEGs <- TCGAanalyze_DEA( mat1 = dataFilt[,samplesNT], mat2 = dataFilt[,samplesTP], Cond1type = "Normal", Cond2type = "Tumor" )
dataNorm <- TCGAbiolinks::TCGAanalyze_Normalization(dataBRCA, geneInfo) dataFilt <- TCGAanalyze_Filtering(tabDF = dataBRCA, method = "quantile", qnt.cut = 0.25) samplesNT <- TCGAquery_SampleTypes(colnames(dataFilt), typesample = c("NT")) samplesTP <- TCGAquery_SampleTypes(colnames(dataFilt), typesample = c("TP")) dataDEGs <- TCGAanalyze_DEA( mat1 = dataFilt[,samplesNT], mat2 = dataFilt[,samplesTP], Cond1type = "Normal", Cond2type = "Tumor" )
Differentially expression analysis (DEA) using limma package.
TCGAanalyze_DEA_Affy(AffySet, FC.cut = 0.01)
TCGAanalyze_DEA_Affy(AffySet, FC.cut = 0.01)
AffySet |
A matrix-like data object containing log-ratios or log-expression values for a series of arrays, with rows corresponding to genes and columns to samples |
FC.cut |
write |
List of list with tables in 2 by 2 comparison of the top-ranked genes from a linear model fitted by DEA's limma
## Not run: to add example ## End(Not run)
## Not run: to add example ## End(Not run)
This function will search for differentially methylated CpG sites, which are regarded as possible functional regions involved in gene transcriptional regulation.
In order to find these regions we use the beta-values (methylation values ranging from 0.0 to 1.0) to compare two groups.
Firstly, it calculates the difference between the mean methylation of each group for each probes. Secondly, it calculates the p-value using the wilcoxon test using the Benjamini-Hochberg adjustment method. The default parameters will require a minimum absolute beta values delta of 0.2 and a false discovery rate (FDR)-adjusted Wilcoxon rank-sum P-value of < 0.01 for the difference.
After these analysis, we save a volcano plot (x-axis:diff mean methylation, y-axis: significance) that will help the user identify the differentially methylated CpG sites and return the object with the calculus in the rowRanges.
If the calculus already exists in the object it will not recalculated. You should set overwrite parameter to TRUE to force it, or remove the columns with the results from the object.
TCGAanalyze_DMC( data, groupCol = NULL, group1 = NULL, group2 = NULL, alternative = "two.sided", diffmean.cut = 0.2, paired = FALSE, adj.method = "BH", plot.filename = "methylation_volcano.pdf", ylab = expression(paste(-Log[10], " (FDR corrected P-values)")), xlab = expression(paste("DNA Methylation difference (", beta, "-values)")), title = NULL, legend = "Legend", color = c("black", "red", "darkgreen"), label = NULL, xlim = NULL, ylim = NULL, p.cut = 0.01, probe.names = FALSE, cores = 1, save = TRUE, save.directory = ".", filename = NULL )
TCGAanalyze_DMC( data, groupCol = NULL, group1 = NULL, group2 = NULL, alternative = "two.sided", diffmean.cut = 0.2, paired = FALSE, adj.method = "BH", plot.filename = "methylation_volcano.pdf", ylab = expression(paste(-Log[10], " (FDR corrected P-values)")), xlab = expression(paste("DNA Methylation difference (", beta, "-values)")), title = NULL, legend = "Legend", color = c("black", "red", "darkgreen"), label = NULL, xlim = NULL, ylim = NULL, p.cut = 0.01, probe.names = FALSE, cores = 1, save = TRUE, save.directory = ".", filename = NULL )
data |
SummarizedExperiment obtained from the TCGAPrepare |
groupCol |
Columns with the groups inside the SummarizedExperiment object. (This will be obtained by the function colData(data)) |
group1 |
In case our object has more than 2 groups, you should set the name of the group |
group2 |
In case our object has more than 2 groups, you should set the name of the group |
alternative |
wilcoxon test alternative |
diffmean.cut |
diffmean threshold. Default: 0.2 |
paired |
Wilcoxon paired parameter. Default: FALSE |
adj.method |
Adjusted method for the p-value calculation |
plot.filename |
Filename. Default: volcano.pdf, volcano.svg, volcano.png. If set to FALSE, there will be no plot. |
ylab |
y axis text |
xlab |
x axis text |
title |
main title. If not specified it will be "Volcano plot (group1 vs group2) |
legend |
Legend title |
color |
vector of colors to be used in graph |
label |
vector of labels to be used in the figure. Example: c("Not Significant","Hypermethylated in group1", "Hypomethylated in group1")) |
xlim |
x limits to cut image |
ylim |
y limits to cut image |
p.cut |
p values threshold. Default: 0.01 |
probe.names |
is probe.names |
cores |
Number of cores to be used in the non-parametric test Default = groupCol.group1.group2.rda |
save |
Save object with results? Default: TRUE |
save.directory |
Directory to save the files. Default: working directory |
filename |
Name of the file to save the object. |
Volcano plot saved and the given data with the results (diffmean.group1.group2,p.value.group1.group2, p.value.adj.group1.group2,status.group1.group2) in the rowRanges where group1 and group2 are the names of the groups
nrows <- 200; ncols <- 20 counts <- matrix( runif(nrows * ncols, 1, 1e4), nrows, dimnames = list(paste0("cg",1:200),paste0("S",1:20)) ) rowRanges <- GenomicRanges::GRanges( rep(c("chr1", "chr2"), c(50, 150)), IRanges::IRanges(floor(runif(200, 1e5, 1e6)), width = 100), strand = sample(c("+", "-"), 200, TRUE), feature_id = sprintf("ID%03d", 1:200) ) names(rowRanges) <- paste0("cg",1:200) colData <- S4Vectors::DataFrame( Treatment = rep(c("ChIP", "Input"), 5), row.names = paste0("S",1:20), group = rep(c("group1","group2"),c(10,10)) ) data <- SummarizedExperiment::SummarizedExperiment( assays=S4Vectors::SimpleList(counts=counts), rowRanges = rowRanges, colData = colData ) SummarizedExperiment::colData(data)$group <- c(rep("group 1",ncol(data)/2), rep("group 2",ncol(data)/2)) hypo.hyper <- TCGAanalyze_DMC(data, p.cut = 0.85,"group","group 1","group 2") SummarizedExperiment::colData(data)$group2 <- c(rep("group_1",ncol(data)/2), rep("group_2",ncol(data)/2)) hypo.hyper <- TCGAanalyze_DMC( data = data, p.cut = 0.85, groupCol = "group2", group1 = "group_1", group2 = "group_2" )
nrows <- 200; ncols <- 20 counts <- matrix( runif(nrows * ncols, 1, 1e4), nrows, dimnames = list(paste0("cg",1:200),paste0("S",1:20)) ) rowRanges <- GenomicRanges::GRanges( rep(c("chr1", "chr2"), c(50, 150)), IRanges::IRanges(floor(runif(200, 1e5, 1e6)), width = 100), strand = sample(c("+", "-"), 200, TRUE), feature_id = sprintf("ID%03d", 1:200) ) names(rowRanges) <- paste0("cg",1:200) colData <- S4Vectors::DataFrame( Treatment = rep(c("ChIP", "Input"), 5), row.names = paste0("S",1:20), group = rep(c("group1","group2"),c(10,10)) ) data <- SummarizedExperiment::SummarizedExperiment( assays=S4Vectors::SimpleList(counts=counts), rowRanges = rowRanges, colData = colData ) SummarizedExperiment::colData(data)$group <- c(rep("group 1",ncol(data)/2), rep("group 2",ncol(data)/2)) hypo.hyper <- TCGAanalyze_DMC(data, p.cut = 0.85,"group","group 1","group 2") SummarizedExperiment::colData(data)$group2 <- c(rep("group_1",ncol(data)/2), rep("group_2",ncol(data)/2)) hypo.hyper <- TCGAanalyze_DMC( data = data, p.cut = 0.85, groupCol = "group2", group1 = "group_1", group2 = "group_2" )
The rational behind a enrichment analysis ( gene-set, pathway etc) is to compute statistics of whether the overlap between the focus list (signature) and the gene-set is significant. ie the confidence that overlap between the list is not due to chance. The Gene Ontology project describes genes (gene products) using terms from three structured vocabularies: biological process, cellular component and molecular function. The Gene Ontology Enrichment component, also referred to as the GO Terms" component, allows the genes in any such "changed-gene" list to be characterized using the Gene Ontology terms annotated to them. It asks, whether for any particular GO term, the fraction of genes assigned to it in the "changed-gene" list is higher than expected by chance (is over-represented), relative to the fraction of genes assigned to that term in the reference set. In statistical terms it perform the analysis tests the null hypothesis that, for any particular ontology term, there is no difference in the proportion of genes annotated to it in the reference list and the proportion annotated to it in the test list. We adopted a Fisher Exact Test to perform the EA.
TCGAanalyze_EA( GeneName, RegulonList, TableEnrichment, EAGenes, GOtype, FDRThresh = 0.01, GeneSymbolsTable = FALSE )
TCGAanalyze_EA( GeneName, RegulonList, TableEnrichment, EAGenes, GOtype, FDRThresh = 0.01, GeneSymbolsTable = FALSE )
GeneName |
is the name of gene signatures list |
RegulonList |
is a gene signature (lisf of genes) in which perform EA. |
TableEnrichment |
is a table related to annotations of gene symbols such as GO[BP,MF,CC] and Pathways. It was created from DAVID gene ontology on-line. |
EAGenes |
is a table with informations about genes such as ID, Gene, Description, Location and Family. |
GOtype |
is type of gene ontology Biological process (BP), Molecular Function (MF), Cellular componet (CC) |
FDRThresh |
pvalue corrected (FDR) as threshold to selected significant BP, MF,CC, or pathways. (default FDR < 0.01) |
GeneSymbolsTable |
if it is TRUE will return a table with GeneSymbols in common GO or pathways. |
Table with enriched GO or pathways by selected gene signature.
## Not run: EAGenes <- get("EAGenes") RegulonList <- rownames(dataDEGsFiltLevel) ResBP <- TCGAanalyze_EA( GeneName="DEA genes Normal Vs Tumor", RegulonList = RegulonList, TableEnrichment = DAVID_BP_matrix, EAGenes = EAGenes, GOtype = "DavidBP" ) ## End(Not run)
## Not run: EAGenes <- get("EAGenes") RegulonList <- rownames(dataDEGsFiltLevel) ResBP <- TCGAanalyze_EA( GeneName="DEA genes Normal Vs Tumor", RegulonList = RegulonList, TableEnrichment = DAVID_BP_matrix, EAGenes = EAGenes, GOtype = "DavidBP" ) ## End(Not run)
Researchers, in order to better understand the underlying biological processes, often want to retrieve a functional profile of a set of genes that might have an important role. This can be done by performing an enrichment analysis.
We will perform an enrichment analysis on gene sets using the TCGAanalyze_EAcomplete function. Given a set of genes that are up-regulated under certain conditions, an enrichment analysis will find identify classes of genes or proteins that are #'over-represented using annotations for that gene set.
TCGAanalyze_EAcomplete(TFname, RegulonList)
TCGAanalyze_EAcomplete(TFname, RegulonList)
TFname |
is the name of the list of genes or TF's regulon. |
RegulonList |
List of genes such as TF's regulon or DEGs where to find enrichment. |
Enrichment analysis GO[BP,MF,CC] and Pathways complete table enriched by genelist.
Genelist <- c("FN1","COL1A1") ansEA <- TCGAanalyze_EAcomplete(TFname="DEA genes Normal Vs Tumor",Genelist) ## Not run: Genelist <- rownames(dataDEGsFiltLevel) system.time(ansEA <- TCGAanalyze_EAcomplete(TFname="DEA genes Normal Vs Tumor",Genelist)) ## End(Not run)
Genelist <- c("FN1","COL1A1") ansEA <- TCGAanalyze_EAcomplete(TFname="DEA genes Normal Vs Tumor",Genelist) ## Not run: Genelist <- rownames(dataDEGsFiltLevel) system.time(ansEA <- TCGAanalyze_EAcomplete(TFname="DEA genes Normal Vs Tumor",Genelist)) ## End(Not run)
TCGAanalyze_Filtering allows user to filter mRNA transcripts and miRNA, samples, higher than the threshold defined quantile mean across all samples.
TCGAanalyze_Filtering( tabDF, method, qnt.cut = 0.25, var.func = IQR, var.cutoff = 0.75, eta = 0.05, foldChange = 1 )
TCGAanalyze_Filtering( tabDF, method, qnt.cut = 0.25, var.func = IQR, var.cutoff = 0.75, eta = 0.05, foldChange = 1 )
tabDF |
is a dataframe or numeric matrix, each row represents a gene, each column represents a sample come from TCGAPrepare |
method |
is method of filtering such as 'quantile', 'varFilter', 'filter1', 'filter2' |
qnt.cut |
is threshold selected as mean for filtering |
var.func |
is function used as the per-feature filtering statistic. See genefilter documentation |
var.cutoff |
is a numeric value. See genefilter documentation |
eta |
is a parameter for filter1. default eta = 0.05. |
foldChange |
is a parameter for filter2. default foldChange = 1. |
A filtered dataframe or numeric matrix where each row represents a gene, each column represents a sample
dataNorm <- TCGAbiolinks::TCGAanalyze_Normalization(dataBRCA, geneInfo) dataNorm <- TCGAanalyze_Normalization(tabDF = dataBRCA, geneInfo = geneInfo, method = "geneLength") dataFilt <- TCGAanalyze_Filtering(tabDF = dataNorm, method = "quantile", qnt.cut = 0.25)
dataNorm <- TCGAbiolinks::TCGAanalyze_Normalization(dataBRCA, geneInfo) dataNorm <- TCGAanalyze_Normalization(tabDF = dataBRCA, geneInfo = geneInfo, method = "geneLength") dataFilt <- TCGAanalyze_Filtering(tabDF = dataNorm, method = "quantile", qnt.cut = 0.25)
TCGAanalyze_LevelTab allows user to add information related to DEGs genes from Differentially expression analysis (DEA) such as mean values and in two conditions.
TCGAanalyze_LevelTab( FC_FDR_table_mRNA, typeCond1, typeCond2, TableCond1, TableCond2, typeOrder = TRUE )
TCGAanalyze_LevelTab( FC_FDR_table_mRNA, typeCond1, typeCond2, TableCond1, TableCond2, typeOrder = TRUE )
FC_FDR_table_mRNA |
Output of dataDEGs filter by abs(LogFC) >=1 |
typeCond1 |
a string containing the class label of the samples in TableCond1 (e.g., control group) |
typeCond2 |
a string containing the class label of the samples in TableCond2 (e.g., case group) |
TableCond1 |
numeric matrix, each row represents a gene, each column represents a sample with Cond1type |
TableCond2 |
numeric matrix, each row represents a gene, each column represents a sample with Cond2type |
typeOrder |
typeOrder |
table with DEGs, log Fold Change (FC), false discovery rate (FDR), the gene expression level for samples in Cond1type, and Cond2type, and Delta value (the difference of gene expression between the two conditions multiplied logFC)
dataNorm <- TCGAbiolinks::TCGAanalyze_Normalization(dataBRCA, geneInfo) dataFilt <- TCGAanalyze_Filtering(tabDF = dataBRCA, method = "quantile", qnt.cut = 0.25) samplesNT <- TCGAquery_SampleTypes(colnames(dataFilt), typesample = c("NT")) samplesTP <- TCGAquery_SampleTypes(colnames(dataFilt), typesample = c("TP")) dataDEGs <- TCGAanalyze_DEA( dataFilt[,samplesNT], dataFilt[,samplesTP], Cond1type = "Normal", Cond2type = "Tumor" ) dataDEGsFilt <- dataDEGs[abs(dataDEGs$logFC) >= 1,] dataTP <- dataFilt[,samplesTP] dataTN <- dataFilt[,samplesNT] dataDEGsFiltLevel <- TCGAanalyze_LevelTab( FC_FDR_table_mRNA = dataDEGsFilt, typeCond1 = "Tumor", typeCond2 = "Normal", TableCond1 = dataTP, TableCond2 = dataTN )
dataNorm <- TCGAbiolinks::TCGAanalyze_Normalization(dataBRCA, geneInfo) dataFilt <- TCGAanalyze_Filtering(tabDF = dataBRCA, method = "quantile", qnt.cut = 0.25) samplesNT <- TCGAquery_SampleTypes(colnames(dataFilt), typesample = c("NT")) samplesTP <- TCGAquery_SampleTypes(colnames(dataFilt), typesample = c("TP")) dataDEGs <- TCGAanalyze_DEA( dataFilt[,samplesNT], dataFilt[,samplesTP], Cond1type = "Normal", Cond2type = "Tumor" ) dataDEGsFilt <- dataDEGs[abs(dataDEGs$logFC) >= 1,] dataTP <- dataFilt[,samplesTP] dataTN <- dataFilt[,samplesNT] dataDEGsFiltLevel <- TCGAanalyze_LevelTab( FC_FDR_table_mRNA = dataDEGsFilt, typeCond1 = "Tumor", typeCond2 = "Normal", TableCond1 = dataTP, TableCond2 = dataTN )
TCGAanalyze_networkInference taking expression data as input, this will return an adjacency matrix of interactions
TCGAanalyze_networkInference(data, optionMethod = "clr")
TCGAanalyze_networkInference(data, optionMethod = "clr")
data |
expression data, genes in columns, samples in rows |
optionMethod |
inference method, chose from aracne, c3net, clr and mrnet |
an adjacent matrix
TCGAanalyze_Normalization allows user to normalize mRNA transcripts and miRNA, using EDASeq package.
Normalization for RNA-Seq Numerical and graphical summaries of RNA-Seq read data. Within-lane normalization procedures to adjust for GC-content effect (or other gene-level effects) on read counts: loess robust local regression, global-scaling, and full-quantile normalization (Risso et al., 2011). Between-lane normalization procedures to adjust for distributional differences between lanes (e.g., sequencing depth): global-scaling and full-quantile normalization (Bullard et al., 2010).
For istance returns all mRNA or miRNA with mean across all samples, higher than the threshold defined quantile mean across all samples.
TCGAanalyze_Normalization performs normalization using following functions from EDASeq
EDASeq::newSeqExpressionSet
EDASeq::withinLaneNormalization
EDASeq::betweenLaneNormalization
EDASeq::counts
TCGAanalyze_Normalization(tabDF, geneInfo, method = "geneLength")
TCGAanalyze_Normalization(tabDF, geneInfo, method = "geneLength")
tabDF |
Rnaseq numeric matrix, each row represents a gene, each column represents a sample |
geneInfo |
Information matrix of 20531 genes about geneLength and gcContent. Two objects are provided: TCGAbiolinks::geneInfoHT,TCGAbiolinks::geneInfo |
method |
is method of normalization such as 'gcContent' or 'geneLength' |
Rnaseq matrix normalized with counts slot holds the count data as a matrix of non-negative integer count values, one row for each observational unit (gene or the like), and one column for each sample.
dataNorm <- TCGAbiolinks::TCGAanalyze_Normalization(dataBRCA, geneInfo)
dataNorm <- TCGAbiolinks::TCGAanalyze_Normalization(dataBRCA, geneInfo)
TCGAanalyze_Pathview pathway based data integration and visualization.
TCGAanalyze_Pathview(dataDEGs, pathwayKEGG = "hsa05200")
TCGAanalyze_Pathview(dataDEGs, pathwayKEGG = "hsa05200")
dataDEGs |
dataDEGs |
pathwayKEGG |
pathwayKEGG |
an adjacent matrix
## Not run: dataDEGs <- data.frame(mRNA = c("TP53","TP63","TP73"), logFC = c(1,2,3)) TCGAanalyze_Pathview(dataDEGs) ## End(Not run)
## Not run: dataDEGs <- data.frame(mRNA = c("TP53","TP63","TP73"), logFC = c(1,2,3)) TCGAanalyze_Pathview(dataDEGs) ## End(Not run)
TCGAanalyze_Preprocessing perform Array Array Intensity correlation (AAIC). It defines a square symmetric matrix of spearman correlation among samples. According this matrix and boxplot of correlation samples by samples it is possible to find samples with low correlation that can be identified as possible outliers.
TCGAanalyze_Preprocessing( object, cor.cut = 0, datatype = names(assays(object))[1], filename = NULL, width = 1000, height = 1000 )
TCGAanalyze_Preprocessing( object, cor.cut = 0, datatype = names(assays(object))[1], filename = NULL, width = 1000, height = 1000 )
object |
gene expression of class RangedSummarizedExperiment from TCGAprepare |
cor.cut |
is a threshold to filter samples according their spearman correlation in samples by samples. default cor.cut is 0 |
datatype |
is a string from RangedSummarizedExperiment assay |
filename |
Filename of the image file |
width |
Image width |
height |
Image height |
Plot with array array intensity correlation and boxplot of correlation samples by samples
TCGAanalyze_Stemness generate the mRNAsi score
TCGAanalyze_Stemness(stemSig, dataGE, colname.score = "stemness_score")
TCGAanalyze_Stemness(stemSig, dataGE, colname.score = "stemness_score")
stemSig |
is a vector of the stemness Signature generated using gelnet package. Please check the data from https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5902191/
|
dataGE |
is a matrix of Gene expression (genes in rows, samples in cols) from TCGAprepare |
colname.score |
Column name of the output. Default "stemness_score" |
table with samples and selected score
# Selecting TCGA breast cancer (10 samples) for example stored in dataBRCA dataNorm <- TCGAanalyze_Normalization( tabDF = dataBRCA, geneInfo = geneInfo ) # quantile filter of genes dataFilt <- TCGAanalyze_Filtering( tabDF = dataNorm, method = "quantile", qnt.cut = 0.25 ) Stemness_score <- TCGAanalyze_Stemness( stemSig = SC_PCBC_stemSig, dataGE = dataFilt, colname.score = "SC_PCBC_stem_score" ) ECTO_score <- TCGAanalyze_Stemness( stemSig = ECTO_PCBC_stemSig, dataGE = dataFilt, colname.score = "ECTO_PCBC_stem_score" ) MESO_score <- TCGAanalyze_Stemness( stemSig = MESO_PCBC_stemSig, dataGE = dataFilt, colname.score = "MESO_PCBC_stem_score" )
# Selecting TCGA breast cancer (10 samples) for example stored in dataBRCA dataNorm <- TCGAanalyze_Normalization( tabDF = dataBRCA, geneInfo = geneInfo ) # quantile filter of genes dataFilt <- TCGAanalyze_Filtering( tabDF = dataNorm, method = "quantile", qnt.cut = 0.25 ) Stemness_score <- TCGAanalyze_Stemness( stemSig = SC_PCBC_stemSig, dataGE = dataFilt, colname.score = "SC_PCBC_stem_score" ) ECTO_score <- TCGAanalyze_Stemness( stemSig = ECTO_PCBC_stemSig, dataGE = dataFilt, colname.score = "ECTO_PCBC_stem_score" ) MESO_score <- TCGAanalyze_Stemness( stemSig = MESO_PCBC_stemSig, dataGE = dataFilt, colname.score = "MESO_PCBC_stem_score" )
Creates a survival plot from TCGA patient clinical data using survival library. It uses the fields days_to_death and vital, plus a columns for groups.
TCGAanalyze_survival( data, clusterCol = NULL, legend = "Legend", labels = NULL, risk.table = TRUE, xlim = NULL, main = "Kaplan-Meier Overall Survival Curves", ylab = "Probability of survival", xlab = "Time since diagnosis (days)", filename = "survival.pdf", color = NULL, height = 8, width = 12, dpi = 300, pvalue = TRUE, conf.int = TRUE, ... )
TCGAanalyze_survival( data, clusterCol = NULL, legend = "Legend", labels = NULL, risk.table = TRUE, xlim = NULL, main = "Kaplan-Meier Overall Survival Curves", ylab = "Probability of survival", xlab = "Time since diagnosis (days)", filename = "survival.pdf", color = NULL, height = 8, width = 12, dpi = 300, pvalue = TRUE, conf.int = TRUE, ... )
data |
TCGA Clinical patient with the information days_to_death |
clusterCol |
Column with groups to plot. This is a mandatory field, the caption will be based in this column |
legend |
Legend title of the figure |
labels |
labels of the plot |
risk.table |
show or not the risk table |
xlim |
x axis limits e.g. xlim = c(0, 1000). Present narrower X axis, but not affect survival estimates. |
main |
main title of the plot |
ylab |
y axis text of the plot |
xlab |
x axis text of the plot |
filename |
The name of the pdf file. |
color |
Define the colors/Pallete for lines. |
height |
Image height |
width |
Image width |
dpi |
Figure quality |
pvalue |
show p-value of log-rank test |
conf.int |
show confidence intervals for point estimates of survival curves. |
... |
Further arguments passed to ggsurvplot. |
Survival plot
# clin <- GDCquery_clinic("TCGA-BRCA","clinical") clin <- data.frame( vital_status = c("alive","alive","alive","dead","alive", "alive","dead","alive","dead","alive"), days_to_death = c(NA,NA,NA,172,NA,NA,3472,NA,786,NA), days_to_last_follow_up = c(3011,965,718,NA,1914,423,NA,5,656,1417), gender = c(rep("male",5),rep("female",5)) ) TCGAanalyze_survival(clin, clusterCol="gender") TCGAanalyze_survival(clin, clusterCol="gender", xlim = 1000) TCGAanalyze_survival(clin, clusterCol="gender", risk.table = FALSE, conf.int = FALSE, color = c("pink","blue")) TCGAanalyze_survival(clin, clusterCol="gender", risk.table = FALSE, xlim = c(100,1000), conf.int = FALSE, color = c("Dark2"))
# clin <- GDCquery_clinic("TCGA-BRCA","clinical") clin <- data.frame( vital_status = c("alive","alive","alive","dead","alive", "alive","dead","alive","dead","alive"), days_to_death = c(NA,NA,NA,172,NA,NA,3472,NA,786,NA), days_to_last_follow_up = c(3011,965,718,NA,1914,423,NA,5,656,1417), gender = c(rep("male",5),rep("female",5)) ) TCGAanalyze_survival(clin, clusterCol="gender") TCGAanalyze_survival(clin, clusterCol="gender", xlim = 1000) TCGAanalyze_survival(clin, clusterCol="gender", risk.table = FALSE, conf.int = FALSE, color = c("pink","blue")) TCGAanalyze_survival(clin, clusterCol="gender", risk.table = FALSE, xlim = c(100,1000), conf.int = FALSE, color = c("Dark2"))
TCGAanalyze_SurvivalKM perform an univariate Kaplan-Meier (KM) survival analysis (SA). It performed Kaplan-Meier survival univariate using complete follow up with all days taking one gene a time from Genelist of gene symbols. For each gene according its level of mean expression in cancer samples, defining two thresholds for quantile expression of that gene in all samples (default ThreshTop=0.67,ThreshDown=0.33) it is possible to define a threshold of intensity of gene expression to divide the samples in 3 groups (High, intermediate, low). TCGAanalyze_SurvivalKM performs SA between High and low groups using following functions from survival package
survival::Surv
survival::survdiff
survival::survfit
TCGAanalyze_SurvivalKM( clinical_patient, dataGE, Genelist, Survresult = FALSE, ThreshTop = 0.67, ThreshDown = 0.33, p.cut = 0.05, group1, group2 )
TCGAanalyze_SurvivalKM( clinical_patient, dataGE, Genelist, Survresult = FALSE, ThreshTop = 0.67, ThreshDown = 0.33, p.cut = 0.05, group1, group2 )
clinical_patient |
is a data.frame using function 'clinic' with information related to barcode / samples such as bcr_patient_barcode, days_to_death , days_to_last_follow_up , vital_status, etc |
dataGE |
is a matrix of Gene expression (genes in rows, samples in cols) from TCGAprepare |
Genelist |
is a list of gene symbols where perform survival KM. |
Survresult |
is a parameter (default = FALSE) if is TRUE will show KM plot and results. |
ThreshTop |
is a quantile threshold to identify samples with high expression of a gene |
ThreshDown |
is a quantile threshold to identify samples with low expression of a gene |
p.cut |
p.values threshold. Default: 0.05 |
group1 |
a string containing the barcode list of the samples in in control group |
group2 |
a string containing the barcode list of the samples in in disease group |
table with survival genes pvalues from KM.
# Selecting only 20 genes for example dataBRCAcomplete <- log2(dataBRCA[1:20,] + 1) # clinical_patient_Cancer <- GDCquery_clinic("TCGA-BRCA","clinical") clinical_patient_Cancer <- data.frame( bcr_patient_barcode = substr(colnames(dataBRCAcomplete),1,12), vital_status = c(rep("alive",3),"dead",rep("alive",2),rep(c("dead","alive"),2)), days_to_death = c(NA,NA,NA,172,NA,NA,3472,NA,786,NA), days_to_last_follow_up = c(3011,965,718,NA,1914,423,NA,5,656,1417) ) group1 <- TCGAquery_SampleTypes(colnames(dataBRCAcomplete), typesample = c("NT")) group2 <- TCGAquery_SampleTypes(colnames(dataBRCAcomplete), typesample = c("TP")) tabSurvKM <- TCGAanalyze_SurvivalKM( clinical_patient = clinical_patient_Cancer, dataGE = dataBRCAcomplete, Genelist = rownames(dataBRCAcomplete), Survresult = FALSE, p.cut = 0.4, ThreshTop = 0.67, ThreshDown = 0.33, group1 = group1, # Control group group2 = group2 ) # Disease group # If the groups are not specified group1 == group2 and all samples are used ## Not run: tabSurvKM <- TCGAanalyze_SurvivalKM( clinical_patient_Cancer, dataBRCAcomplete, Genelist = rownames(dataBRCAcomplete), Survresult = TRUE, p.cut = 0.2, ThreshTop = 0.67, ThreshDown = 0.33 ) ## End(Not run)
# Selecting only 20 genes for example dataBRCAcomplete <- log2(dataBRCA[1:20,] + 1) # clinical_patient_Cancer <- GDCquery_clinic("TCGA-BRCA","clinical") clinical_patient_Cancer <- data.frame( bcr_patient_barcode = substr(colnames(dataBRCAcomplete),1,12), vital_status = c(rep("alive",3),"dead",rep("alive",2),rep(c("dead","alive"),2)), days_to_death = c(NA,NA,NA,172,NA,NA,3472,NA,786,NA), days_to_last_follow_up = c(3011,965,718,NA,1914,423,NA,5,656,1417) ) group1 <- TCGAquery_SampleTypes(colnames(dataBRCAcomplete), typesample = c("NT")) group2 <- TCGAquery_SampleTypes(colnames(dataBRCAcomplete), typesample = c("TP")) tabSurvKM <- TCGAanalyze_SurvivalKM( clinical_patient = clinical_patient_Cancer, dataGE = dataBRCAcomplete, Genelist = rownames(dataBRCAcomplete), Survresult = FALSE, p.cut = 0.4, ThreshTop = 0.67, ThreshDown = 0.33, group1 = group1, # Control group group2 = group2 ) # Disease group # If the groups are not specified group1 == group2 and all samples are used ## Not run: tabSurvKM <- TCGAanalyze_SurvivalKM( clinical_patient_Cancer, dataBRCAcomplete, Genelist = rownames(dataBRCAcomplete), Survresult = TRUE, p.cut = 0.2, ThreshTop = 0.67, ThreshDown = 0.33 ) ## End(Not run)
TCGAbatch_correction allows user to perform a Voom correction on gene expression data and have it ready for DEA. One can also use ComBat for batch correction for exploratory analysis. If batch.factor or adjustment argument is "Year" please provide clinical data. If no batch factor is provided, the data will be voom corrected only
TCGAanalyze_DEA performs DEA using following functions from sva and limma:
limma::voom Transform RNA-Seq Data Ready for Linear Modelling.
sva::ComBat Adjust for batch effects using an empirical Bayes framework.
TCGAbatch_Correction( tabDF, batch.factor = NULL, adjustment = NULL, ClinicalDF = data.frame(), UnpublishedData = FALSE, AnnotationDF = data.frame() )
TCGAbatch_Correction( tabDF, batch.factor = NULL, adjustment = NULL, ClinicalDF = data.frame(), UnpublishedData = FALSE, AnnotationDF = data.frame() )
tabDF |
numeric matrix, each row represents a gene, each column represents a sample |
batch.factor |
a string containing the batch factor to use for correction. Options are "Plate", "TSS", "Year", "Portion", "Center" |
adjustment |
vector containing strings for factors to adjust for using ComBat. Options are "Plate", "TSS", "Year", "Portion", "Center" |
ClinicalDF |
a dataframe returned by GDCquery_clinic() to be used to extract year data |
UnpublishedData |
if TRUE perform a batch correction after adding new data |
AnnotationDF |
a dataframe with column Batch indicating different batches of the samples in the tabDF |
data frame with ComBat batch correction applied
Prepare CEL files into an AffyBatch.
TCGAprepare_Affy(ClinData, PathFolder, TabCel)
TCGAprepare_Affy(ClinData, PathFolder, TabCel)
ClinData |
write |
PathFolder |
write |
TabCel |
write |
Normalized Expression data from Affy eSets
## Not run: to add example ## End(Not run)
## Not run: to add example ## End(Not run)
TCGAquery_MatchedCoupledSampleTypes
TCGAquery_MatchedCoupledSampleTypes(barcode, typesample)
TCGAquery_MatchedCoupledSampleTypes(barcode, typesample)
barcode |
barcode |
typesample |
typesample |
a list of samples / barcode filtered by type sample selected
TCGAquery_MatchedCoupledSampleTypes(c("TCGA-B0-4698-01Z-00-DX1", "TCGA-B0-4698-02Z-00-DX1"), c("TP","TR")) barcode <- c("TARGET-20-PANSBH-02A-02D","TARGET-20-PANSBH-01A-02D", "TCGA-B0-4698-01Z-00-DX1","TCGA-CZ-4863-02Z-00-DX1", "TARGET-20-PANSZZ-02A-02D","TARGET-20-PANSZZ-11A-02D", "TCGA-B0-4699-01Z-00-DX1","TCGA-B0-4699-02Z-00-DX1" ) TCGAquery_MatchedCoupledSampleTypes(barcode,c("TR","TP"))
TCGAquery_MatchedCoupledSampleTypes(c("TCGA-B0-4698-01Z-00-DX1", "TCGA-B0-4698-02Z-00-DX1"), c("TP","TR")) barcode <- c("TARGET-20-PANSBH-02A-02D","TARGET-20-PANSBH-01A-02D", "TCGA-B0-4698-01Z-00-DX1","TCGA-CZ-4863-02Z-00-DX1", "TARGET-20-PANSZZ-02A-02D","TARGET-20-PANSZZ-11A-02D", "TCGA-B0-4699-01Z-00-DX1","TCGA-B0-4699-02Z-00-DX1" ) TCGAquery_MatchedCoupledSampleTypes(barcode,c("TR","TP"))
TCGArecount2_query queries and downloads data produced by the Recount2 project. User can specify which project and which tissue to query
TCGAquery_recount2(project, tissue = c())
TCGAquery_recount2(project, tissue = c())
project |
is a string denoting which project the user wants. Options are "tcga" and "gtex" |
tissue |
a vector of tissue(s) to download. Options are "adipose tissue", "adrenal gland", "bladder","blood", "blood vessel", "bone marrow", "brain", "breast","cervix uteri", "colon", "esophagus", "fallopian tube","heart", "kidney", "liver", "lung", "muscle", "nerve", "ovary","pancreas", "pituitary", "prostate", "salivary gland", "skin", "small intestine", "spleen", "stomach", "testis", "thyroid", "uterus", "vagina" |
List with $subtypes attribute as a dataframe with barcodes, samples, subtypes, and colors. The $filtered attribute is returned as filtered samples with no subtype info
## Not run: brain.rec<-TCGAquery_recount2(project = "gtex", tissue = "brain") ## End(Not run)
## Not run: brain.rec<-TCGAquery_recount2(project = "gtex", tissue = "brain") ## End(Not run)
TCGAquery_SampleTypes for a given list of samples and types, return the union of samples that are from theses type.
TCGAquery_SampleTypes(barcode, typesample)
TCGAquery_SampleTypes(barcode, typesample)
barcode |
is a list of samples as TCGA barcodes |
|||||||||||||||||||||||||||||
typesample |
a character vector indicating tissue type to query. Example:
|
a list of samples / barcode filtered by type sample selected
# selection of normal samples "NT" barcode <- c("TCGA-B0-4698-01Z-00-DX1","TCGA-CZ-4863-02Z-00-DX1") # Returns the second barcode TCGAquery_SampleTypes(barcode,"TR") # Returns both barcode TCGAquery_SampleTypes(barcode,c("TR","TP")) barcode <- c("TARGET-20-PANSBH-14A-02D","TARGET-20-PANSBH-01A-02D", "TCGA-B0-4698-01Z-00-DX1","TCGA-CZ-4863-02Z-00-DX1") TCGAquery_SampleTypes(barcode,c("TR","TP"))
# selection of normal samples "NT" barcode <- c("TCGA-B0-4698-01Z-00-DX1","TCGA-CZ-4863-02Z-00-DX1") # Returns the second barcode TCGAquery_SampleTypes(barcode,"TR") # Returns both barcode TCGAquery_SampleTypes(barcode,c("TR","TP")) barcode <- c("TARGET-20-PANSBH-14A-02D","TARGET-20-PANSBH-01A-02D", "TCGA-B0-4698-01Z-00-DX1","TCGA-CZ-4863-02Z-00-DX1") TCGAquery_SampleTypes(barcode,c("TR","TP"))
TCGAquery_subtype Retrieve molecular subtypes for a given tumor
TCGAquery_subtype(tumor)
TCGAquery_subtype(tumor)
tumor |
is a cancer Examples:
|
a data.frame with barcode and molecular subtypes
dataSubt <- TCGAquery_subtype(tumor = "lgg")
dataSubt <- TCGAquery_subtype(tumor = "lgg")
TCGAtumor_purity Filters TCGA samples using 5 estimates from 5 methods as thresholds.
TCGAtumor_purity(barcodes, estimate, absolute, lump, ihc, cpe)
TCGAtumor_purity(barcodes, estimate, absolute, lump, ihc, cpe)
barcodes |
is a vector of TCGA barcodes |
estimate |
uses gene expression profiles of 141 immune genes and 141 stromal genes |
absolute |
which uses somatic copy-number data (estimations were available for only 11 cancer types) |
lump |
(leukocytes unmethylation for purity), which averages 44 non-methylated immune-specific CpG sites |
ihc |
as estimated by image analysis of haematoxylin and eosin stain slides produced by the Nationwide Childrens Hospital Biospecimen Core Resource |
cpe |
CPE is a derived consensus measurement as the median purity level after normalizing levels from all methods to give them equal means and s.ds |
List with $pure_barcodes attribute as a vector of pure samples and $filtered attribute as filtered samples with no purity info
dataTableSubt <- TCGAtumor_purity("TCGA-60-2721-01A-01R-0851-07", estimate = 0.6, absolute = 0.6, ihc = 0.8, lump = 0.8, cpe = 0.7)
dataTableSubt <- TCGAtumor_purity("TCGA-60-2721-01A-01R-0851-07", estimate = 0.6, absolute = 0.6, ihc = 0.8, lump = 0.8, cpe = 0.7)
Barplot of subtypes and clinical info in groups of gene expression clustered.
TCGAvisualize_BarPlot( DFfilt, DFclin, DFsubt, data_Hc2, Subtype, cbPalette, filename, width, height, dpi )
TCGAvisualize_BarPlot( DFfilt, DFclin, DFsubt, data_Hc2, Subtype, cbPalette, filename, width, height, dpi )
DFfilt |
write |
DFclin |
write |
DFsubt |
write |
data_Hc2 |
write |
Subtype |
write |
cbPalette |
Define the colors of the bar. |
filename |
The name of the pdf file |
width |
Image width |
height |
Image height |
dpi |
Image dpi |
barplot image in pdf or png file
The figure shows canonical pathways significantly overrepresented (enriched) by the DEGs (differentially expressed genes). The most statistically significant canonical pathways identified in DEGs list are listed according to their p value corrected FDR (-Log) (colored bars) and the ratio of list genes found in each pathway over the total number of genes in that pathway (Ratio, red line).
TCGAvisualize_EAbarplot( tf, GOMFTab, GOBPTab, GOCCTab, PathTab, nBar, nRGTab, filename = "TCGAvisualize_EAbarplot_Output.pdf", text.size = 1, mfrow = c(2, 2), xlim = NULL, fig.width = 30, fig.height = 15, color = c("orange", "cyan", "green", "yellow") )
TCGAvisualize_EAbarplot( tf, GOMFTab, GOBPTab, GOCCTab, PathTab, nBar, nRGTab, filename = "TCGAvisualize_EAbarplot_Output.pdf", text.size = 1, mfrow = c(2, 2), xlim = NULL, fig.width = 30, fig.height = 15, color = c("orange", "cyan", "green", "yellow") )
tf |
is a list of gene symbols |
GOMFTab |
is results from TCGAanalyze_EAcomplete related to Molecular Function (MF) |
GOBPTab |
is results from TCGAanalyze_EAcomplete related to Biological Process (BP) |
GOCCTab |
is results from TCGAanalyze_EAcomplete related to Cellular Component (CC) |
PathTab |
is results from TCGAanalyze_EAcomplete related to Pathways EA |
nBar |
is the number of bar histogram selected to show (default = 10) |
nRGTab |
is the gene signature list with gene symbols. |
filename |
Name for the pdf. If null it will return the plot. |
text.size |
Text size |
mfrow |
Vector with number of rows/columns of the plot. Default 2 rows/2 columns "c(2,2)" |
xlim |
Upper limit of the x-axis. |
fig.width |
Default 30 |
fig.height |
Default 15 |
color |
A vector of colors for each barplot. Deafult: c("orange", "cyan","green","yellow") |
Complete barPlot from Enrichment Analysis showing significant (default FDR < 0.01) BP,CC,MF and pathways enriched by list of genes.
Genelist <- c("FN1","COL1A1") ansEA <- TCGAanalyze_EAcomplete(TFname="DEA genes Normal Vs Tumor",Genelist) TCGAvisualize_EAbarplot(tf = rownames(ansEA$ResBP), GOBPTab = ansEA$ResBP, GOCCTab = ansEA$ResCC, GOMFTab = ansEA$ResMF, PathTab = ansEA$ResPat, nRGTab = Genelist, nBar = 10, filename="a.pdf") ## Not run: Genelist <- rownames(dataDEGsFiltLevel) system.time(ansEA <- TCGAanalyze_EAcomplete(TFname="DEA genes Normal Vs Tumor",Genelist)) # Enrichment Analysis EA (TCGAVisualize) # Gene Ontology (GO) and Pathway enrichment barPlot TCGAvisualize_EAbarplot(tf = rownames(ansEA$ResBP), GOBPTab = ansEA$ResBP, GOCCTab = ansEA$ResCC, GOMFTab = ansEA$ResMF, PathTab = ansEA$ResPat, nRGTab = Genelist, nBar = 10) ## End(Not run)
Genelist <- c("FN1","COL1A1") ansEA <- TCGAanalyze_EAcomplete(TFname="DEA genes Normal Vs Tumor",Genelist) TCGAvisualize_EAbarplot(tf = rownames(ansEA$ResBP), GOBPTab = ansEA$ResBP, GOCCTab = ansEA$ResCC, GOMFTab = ansEA$ResMF, PathTab = ansEA$ResPat, nRGTab = Genelist, nBar = 10, filename="a.pdf") ## Not run: Genelist <- rownames(dataDEGsFiltLevel) system.time(ansEA <- TCGAanalyze_EAcomplete(TFname="DEA genes Normal Vs Tumor",Genelist)) # Enrichment Analysis EA (TCGAVisualize) # Gene Ontology (GO) and Pathway enrichment barPlot TCGAvisualize_EAbarplot(tf = rownames(ansEA$ResBP), GOBPTab = ansEA$ResBP, GOCCTab = ansEA$ResCC, GOMFTab = ansEA$ResMF, PathTab = ansEA$ResPat, nRGTab = Genelist, nBar = 10) ## End(Not run)
Heatmap with more sensible behavior using heatmap.plus
TCGAvisualize_Heatmap( data, col.metadata, row.metadata, col.colors = NULL, row.colors = NULL, show_column_names = FALSE, show_row_names = FALSE, cluster_rows = FALSE, cluster_columns = FALSE, sortCol, extremes = NULL, rownames.size = 12, title = NULL, color.levels = NULL, values.label = NULL, filename = "heatmap.pdf", width = 10, height = 10, type = "expression", scale = "none", heatmap.legend.color.bar = "continuous" )
TCGAvisualize_Heatmap( data, col.metadata, row.metadata, col.colors = NULL, row.colors = NULL, show_column_names = FALSE, show_row_names = FALSE, cluster_rows = FALSE, cluster_columns = FALSE, sortCol, extremes = NULL, rownames.size = 12, title = NULL, color.levels = NULL, values.label = NULL, filename = "heatmap.pdf", width = 10, height = 10, type = "expression", scale = "none", heatmap.legend.color.bar = "continuous" )
data |
The object to with the heatmap data (expression, methylation) |
col.metadata |
Metadata for the columns (samples). It should have on of the following columns: barcode (28 characters) column to match with the samples. It will also work with "bcr_patient_barcode"(12 chars),"patient"(12 chars),"sample"(16 chars) columns but as one patient might have more than one sample, this coul lead to errors in the annotation. The code will throw a warning in case two samples are from the same patient. |
row.metadata |
Metadata for the rows genes (expression) or probes (methylation) |
col.colors |
A list of names colors |
row.colors |
A list of named colors |
show_column_names |
Show column names names? Default: FALSE |
show_row_names |
Show row names? Default: FALSE |
cluster_rows |
Cluster rows ? Default: FALSE |
cluster_columns |
Cluster columns ? Default: FALSE |
sortCol |
Name of the column to be used to sort the columns |
extremes |
Extremes of colors (vector of 3 values) |
rownames.size |
Rownames size |
title |
Title of the plot |
color.levels |
A vector with the colors (low level, middle level, high level) |
values.label |
Text of the levels in the heatmap |
filename |
Filename to save the heatmap. Default: heatmap.png |
width |
figure width |
height |
figure height |
type |
Select the colors of the heatmap values. Possible values are "expression" (default), "methylation" |
scale |
Use z-score to make the heatmap? If we want to show differences between genes, it is good to make Z-score by samples (force each sample to have zero mean and standard deviation=1). If we want to show differences between samples, it is good to make Z-score by genes (force each gene to have zero mean and standard deviation=1). Possibilities: "row", "col". Default "none" |
heatmap.legend.color.bar |
Heatmap legends values type. Options: "continuous", "discrete" |
Heatmap plotted in the device
row.mdat <- matrix(c("FALSE","FALSE", "TRUE","TRUE", "FALSE","FALSE", "TRUE","FALSE", "FALSE","TRUE" ), nrow = 5, ncol = 2, byrow = TRUE, dimnames = list( c("probe1", "probe2","probe3","probe4","probe5"), c("duplicated", "Enhancer region"))) dat <- matrix(c(0.3,0.2,0.3,1,1,0.1,1,1,0, 0.8,1,0.7,0.7,0.3,1), nrow = 5, ncol = 3, byrow = TRUE, dimnames = list( c("probe1", "probe2","probe3","probe4","probe5"), c("TCGA-DU-6410", "TCGA-DU-A5TS", "TCGA-HT-7688"))) mdat <- data.frame(patient=c("TCGA-DU-6410","TCGA-DU-A5TS","TCGA-HT-7688"), Sex=c("Male","Female","Male"), COCCluster=c("coc1","coc1","coc1"), IDHtype=c("IDHwt","IDHMut-cod","IDHMut-noncod")) TCGAvisualize_Heatmap(dat, col.metadata = mdat, row.metadata = row.mdat, row.colors = list(duplicated = c("FALSE" = "pink", "TRUE"="green"), "Enhancer region" = c("FALSE" = "purple", "TRUE"="grey")), col.colors = list(Sex = c("Male" = "blue", "Female"="red"), COCCluster=c("coc1"="grey"), IDHtype=c("IDHwt"="cyan", "IDHMut-cod"="tomato" ,"IDHMut-noncod"="gold")), type = "methylation", show_row_names=TRUE)
row.mdat <- matrix(c("FALSE","FALSE", "TRUE","TRUE", "FALSE","FALSE", "TRUE","FALSE", "FALSE","TRUE" ), nrow = 5, ncol = 2, byrow = TRUE, dimnames = list( c("probe1", "probe2","probe3","probe4","probe5"), c("duplicated", "Enhancer region"))) dat <- matrix(c(0.3,0.2,0.3,1,1,0.1,1,1,0, 0.8,1,0.7,0.7,0.3,1), nrow = 5, ncol = 3, byrow = TRUE, dimnames = list( c("probe1", "probe2","probe3","probe4","probe5"), c("TCGA-DU-6410", "TCGA-DU-A5TS", "TCGA-HT-7688"))) mdat <- data.frame(patient=c("TCGA-DU-6410","TCGA-DU-A5TS","TCGA-HT-7688"), Sex=c("Male","Female","Male"), COCCluster=c("coc1","coc1","coc1"), IDHtype=c("IDHwt","IDHMut-cod","IDHMut-noncod")) TCGAvisualize_Heatmap(dat, col.metadata = mdat, row.metadata = row.mdat, row.colors = list(duplicated = c("FALSE" = "pink", "TRUE"="green"), "Enhancer region" = c("FALSE" = "purple", "TRUE"="grey")), col.colors = list(Sex = c("Male" = "blue", "Female"="red"), COCCluster=c("coc1"="grey"), IDHtype=c("IDHwt"="cyan", "IDHMut-cod"="tomato" ,"IDHMut-noncod"="gold")), type = "methylation", show_row_names=TRUE)
Creates a mean methylation boxplot for groups (groupCol), subgroups will be highlighted as shapes if the subgroupCol was set.
Observation: Data is a summarizedExperiment.
TCGAvisualize_meanMethylation( data, groupCol = NULL, subgroupCol = NULL, shapes = NULL, print.pvalue = FALSE, plot.jitter = TRUE, jitter.size = 3, filename = "groupMeanMet.pdf", ylab = expression(paste("Mean DNA methylation (", beta, "-values)")), xlab = NULL, title = "Mean DNA methylation", labels = NULL, group.legend = NULL, subgroup.legend = NULL, color = NULL, y.limits = NULL, sort, order, legend.position = "top", legend.title.position = "top", legend.ncols = 3, add.axis.x.text = TRUE, width = 10, height = 10, dpi = 600, axis.text.x.angle = 90 )
TCGAvisualize_meanMethylation( data, groupCol = NULL, subgroupCol = NULL, shapes = NULL, print.pvalue = FALSE, plot.jitter = TRUE, jitter.size = 3, filename = "groupMeanMet.pdf", ylab = expression(paste("Mean DNA methylation (", beta, "-values)")), xlab = NULL, title = "Mean DNA methylation", labels = NULL, group.legend = NULL, subgroup.legend = NULL, color = NULL, y.limits = NULL, sort, order, legend.position = "top", legend.title.position = "top", legend.ncols = 3, add.axis.x.text = TRUE, width = 10, height = 10, dpi = 600, axis.text.x.angle = 90 )
data |
SummarizedExperiment object obtained from TCGAPrepare |
groupCol |
Columns in colData(data) that defines the groups. If no columns defined a columns called "Patients" will be used |
subgroupCol |
Columns in colData(data) that defines the subgroups. |
shapes |
Shape vector of the subgroups. It must have the size of the levels of the subgroups. Example: shapes = c(21,23) if for two levels |
print.pvalue |
Print p-value for two groups |
plot.jitter |
Plot jitter? Default TRUE |
jitter.size |
Plot jitter size? Default 3 |
filename |
The name of the pdf that will be saved |
ylab |
y axis text in the plot |
xlab |
x axis text in the plot |
title |
main title in the plot |
labels |
Labels of the groups |
group.legend |
Name of the group legend. DEFAULT: groupCol |
subgroup.legend |
Name of the subgroup legend. DEFAULT: subgroupCol |
color |
vector of colors to be used in graph |
y.limits |
Change lower/upper y-axis limit |
sort |
Sort boxplot by mean or median. Possible values: mean.asc, mean.desc, median.asc, median.desc |
order |
Order of the boxplots |
legend.position |
Legend position ("top", "right","left","bottom") |
legend.title.position |
Legend title position ("top", "right","left","bottom") |
legend.ncols |
Number of columns of the legend |
add.axis.x.text |
Add text to x-axis? Default: FALSE |
width |
Plot width default:10 |
height |
Plot height default:10 |
dpi |
Pdf dpi default:600 |
axis.text.x.angle |
Angle of text in the x axis |
Save the pdf survival plot
nrows <- 200; ncols <- 21 counts <- matrix(runif(nrows * ncols, 0, 1), nrows) rowRanges <- GenomicRanges::GRanges(rep(c("chr1", "chr2"), c(50, 150)), IRanges::IRanges(floor(runif(200, 1e5, 1e6)), width=100), strand=sample(c("+", "-"), 200, TRUE), feature_id=sprintf("ID%03d", 1:200)) colData <- S4Vectors::DataFrame(Treatment=rep(c("ChIP", "Input","Other"), 7), row.names=LETTERS[1:21], group=rep(c("group1","group2","group3"),c(7,7,7)), subgroup=rep(c("subgroup1","subgroup2","subgroup3"),7)) data <- SummarizedExperiment::SummarizedExperiment( assays=S4Vectors::SimpleList(counts=counts), rowRanges=rowRanges, colData=colData) TCGAvisualize_meanMethylation(data,groupCol = "group") # change lower/upper y-axis limit TCGAvisualize_meanMethylation(data,groupCol = "group", y.limits = c(0,1)) # change lower y-axis limit TCGAvisualize_meanMethylation(data,groupCol = "group", y.limits = 0) TCGAvisualize_meanMethylation(data,groupCol = "group", subgroupCol="subgroup") TCGAvisualize_meanMethylation(data,groupCol = "group") TCGAvisualize_meanMethylation(data,groupCol = "group",sort="mean.desc",filename="meandesc.pdf") TCGAvisualize_meanMethylation(data,groupCol = "group",sort="mean.asc",filename="meanasc.pdf") TCGAvisualize_meanMethylation(data,groupCol = "group",sort="median.asc",filename="medianasc.pdf") TCGAvisualize_meanMethylation(data,groupCol = "group",sort="median.desc",filename="mediandesc.pdf")
nrows <- 200; ncols <- 21 counts <- matrix(runif(nrows * ncols, 0, 1), nrows) rowRanges <- GenomicRanges::GRanges(rep(c("chr1", "chr2"), c(50, 150)), IRanges::IRanges(floor(runif(200, 1e5, 1e6)), width=100), strand=sample(c("+", "-"), 200, TRUE), feature_id=sprintf("ID%03d", 1:200)) colData <- S4Vectors::DataFrame(Treatment=rep(c("ChIP", "Input","Other"), 7), row.names=LETTERS[1:21], group=rep(c("group1","group2","group3"),c(7,7,7)), subgroup=rep(c("subgroup1","subgroup2","subgroup3"),7)) data <- SummarizedExperiment::SummarizedExperiment( assays=S4Vectors::SimpleList(counts=counts), rowRanges=rowRanges, colData=colData) TCGAvisualize_meanMethylation(data,groupCol = "group") # change lower/upper y-axis limit TCGAvisualize_meanMethylation(data,groupCol = "group", y.limits = c(0,1)) # change lower y-axis limit TCGAvisualize_meanMethylation(data,groupCol = "group", y.limits = 0) TCGAvisualize_meanMethylation(data,groupCol = "group", subgroupCol="subgroup") TCGAvisualize_meanMethylation(data,groupCol = "group") TCGAvisualize_meanMethylation(data,groupCol = "group",sort="mean.desc",filename="meandesc.pdf") TCGAvisualize_meanMethylation(data,groupCol = "group",sort="mean.asc",filename="meanasc.pdf") TCGAvisualize_meanMethylation(data,groupCol = "group",sort="median.asc",filename="medianasc.pdf") TCGAvisualize_meanMethylation(data,groupCol = "group",sort="median.desc",filename="mediandesc.pdf")
Creating a oncoprint
TCGAvisualize_oncoprint( mut, genes, filename, color, annotation.position = "bottom", annotation, height, width = 10, rm.empty.columns = FALSE, show.column.names = FALSE, show.row.barplot = TRUE, label.title = "Mutation", column.names.size = 8, label.font.size = 16, rows.font.size = 16, dist.col = 0.5, dist.row = 0.5, information = "Variant_Type", row.order = TRUE, col.order = TRUE, heatmap.legend.side = "bottom", annotation.legend.side = "bottom" )
TCGAvisualize_oncoprint( mut, genes, filename, color, annotation.position = "bottom", annotation, height, width = 10, rm.empty.columns = FALSE, show.column.names = FALSE, show.row.barplot = TRUE, label.title = "Mutation", column.names.size = 8, label.font.size = 16, rows.font.size = 16, dist.col = 0.5, dist.row = 0.5, information = "Variant_Type", row.order = TRUE, col.order = TRUE, heatmap.legend.side = "bottom", annotation.legend.side = "bottom" )
mut |
A dataframe from the mutation annotation file (see TCGAquery_maf from TCGAbiolinks) |
genes |
Gene list |
filename |
name of the pdf |
color |
named vector for the plot |
annotation.position |
Position of the annotation "bottom" or "top" |
annotation |
Matrix or data frame with the annotation. Should have a column bcr_patient_barcode with the same ID of the mutation object |
height |
pdf height |
width |
pdf width |
rm.empty.columns |
If there is no alteration in that sample, whether remove it on the oncoprint |
show.column.names |
Show column names? Default: FALSE |
show.row.barplot |
Show barplot annotation on rows? |
label.title |
Title of the label |
column.names.size |
Size of the fonts of the columns names |
label.font.size |
Size of the fonts |
rows.font.size |
Size of the fonts |
dist.col |
distance between columns in the plot |
dist.row |
distance between rows in the plot |
information |
Which column to use as information from MAF. Options: 1) "Variant_Classification" (The information will be "Frame_Shift_Del", "Frame_Shift_Ins", "In_Frame_Del", "In_Frame_Ins", "Missense_Mutation", "Nonsense_Mutation", "Nonstop_Mutation", "RNA", "Silent" , "Splice_Site", "Targeted_Region", "Translation_Start_Site") 2) "Variant_Type" (The information will be INS,DEL,SNP) |
row.order |
Order the genes (rows) Default:TRUE. Genes with more mutations will be in the first rows |
col.order |
Order columns. Default:TRUE. |
heatmap.legend.side |
Position of the heatmap legend |
annotation.legend.side |
Position of the annotation legend |
A oncoprint plot
## Not run: library(dplyr) query <- GDCquery( project = "TCGA-CHOL", data.category = "Simple Nucleotide Variation", access = "open", legacy = FALSE, data.type = "Masked Somatic Mutation", workflow.type = "Aliquot Ensemble Somatic Variant Merging and Masking" ) GDCdownload(query) mut <- GDCprepare(query) TCGAvisualize_oncoprint(mut = mut, genes = mut$Hugo_Symbol[1:10], rm.empty.columns = TRUE) TCGAvisualize_oncoprint( mut = mut, genes = mut$Hugo_Symbol[1:10], filename = "onco.pdf", color = c("background"="#CCCCCC","DEL"="purple","INS"="yellow","SNP"="brown") ) clin <- GDCquery_clinic("TCGA-ACC","clinical") clin <- clin[,c("bcr_patient_barcode","disease","gender","tumor_stage","race","vital_status")] TCGAvisualize_oncoprint( mut = mut, genes = mut$Hugo_Symbol[1:20], filename = "onco.pdf", annotation = clin, color=c("background"="#CCCCCC","DEL"="purple","INS"="yellow","SNP"="brown"), rows.font.size=10, heatmap.legend.side = "right", dist.col = 0, label.font.size = 10 ) ## End(Not run)
## Not run: library(dplyr) query <- GDCquery( project = "TCGA-CHOL", data.category = "Simple Nucleotide Variation", access = "open", legacy = FALSE, data.type = "Masked Somatic Mutation", workflow.type = "Aliquot Ensemble Somatic Variant Merging and Masking" ) GDCdownload(query) mut <- GDCprepare(query) TCGAvisualize_oncoprint(mut = mut, genes = mut$Hugo_Symbol[1:10], rm.empty.columns = TRUE) TCGAvisualize_oncoprint( mut = mut, genes = mut$Hugo_Symbol[1:10], filename = "onco.pdf", color = c("background"="#CCCCCC","DEL"="purple","INS"="yellow","SNP"="brown") ) clin <- GDCquery_clinic("TCGA-ACC","clinical") clin <- clin[,c("bcr_patient_barcode","disease","gender","tumor_stage","race","vital_status")] TCGAvisualize_oncoprint( mut = mut, genes = mut$Hugo_Symbol[1:20], filename = "onco.pdf", annotation = clin, color=c("background"="#CCCCCC","DEL"="purple","INS"="yellow","SNP"="brown"), rows.font.size=10, heatmap.legend.side = "right", dist.col = 0, label.font.size = 10 ) ## End(Not run)
TCGAvisualize_PCA performs a principal components analysis (PCA) on the given data matrix and returns the results as an object of class prcomp, and shows results in PCA level.
TCGAvisualize_PCA(dataFilt, dataDEGsFiltLevel, ntopgenes, group1, group2)
TCGAvisualize_PCA(dataFilt, dataDEGsFiltLevel, ntopgenes, group1, group2)
dataFilt |
A filtered dataframe or numeric matrix where each row represents a gene, each column represents a sample from function TCGAanalyze_Filtering |
dataDEGsFiltLevel |
table with DEGs, log Fold Change (FC), false discovery rate (FDR), the gene expression level, etc, from function TCGAanalyze_LevelTab. |
ntopgenes |
number of DEGs genes to plot in PCA |
group1 |
a string containing the barcode list of the samples in in control group |
group2 |
a string containing the barcode list of the samples in in disease group the name of the group |
principal components analysis (PCA) plot of PC1 and PC2
# normalization of genes dataNorm <- TCGAbiolinks::TCGAanalyze_Normalization(tabDF = dataBRCA, geneInfo = geneInfo, method = "geneLength") # quantile filter of genes dataFilt <- TCGAanalyze_Filtering(tabDF = dataBRCA, method = "quantile", qnt.cut = 0.25) # Principal Component Analysis plot for ntop selected DEGs # selection of normal samples "NT" group1 <- TCGAquery_SampleTypes(colnames(dataFilt), typesample = c("NT")) # selection of normal samples "TP" group2 <- TCGAquery_SampleTypes(colnames(dataFilt), typesample = c("TP")) pca <- TCGAvisualize_PCA(dataFilt,dataDEGsFiltLevel, ntopgenes = 200, group1, group2)
# normalization of genes dataNorm <- TCGAbiolinks::TCGAanalyze_Normalization(tabDF = dataBRCA, geneInfo = geneInfo, method = "geneLength") # quantile filter of genes dataFilt <- TCGAanalyze_Filtering(tabDF = dataBRCA, method = "quantile", qnt.cut = 0.25) # Principal Component Analysis plot for ntop selected DEGs # selection of normal samples "NT" group1 <- TCGAquery_SampleTypes(colnames(dataFilt), typesample = c("NT")) # selection of normal samples "TP" group2 <- TCGAquery_SampleTypes(colnames(dataFilt), typesample = c("TP")) pca <- TCGAvisualize_PCA(dataFilt,dataDEGsFiltLevel, ntopgenes = 200, group1, group2)
Create Starburst plot for comparison of DNA methylation and gene expression. The log10 (FDR-corrected P value) is plotted for beta value for DNA methylation (x axis) and gene expression (y axis) for each gene.
The black dashed line shows the FDR-adjusted P value of 0.01.
You can set names to TRUE to get the names of the significant genes.
Candidate biologically significant genes will be circled in the plot.
Candidate biologically significant are the genes that respect the expression (logFC.cut), DNA methylation (diffmean.cut) and significance thresholds (exp.p.cut, met.p.cut)
TCGAvisualize_starburst( met, exp, group1 = NULL, group2 = NULL, exp.p.cut = 0.01, met.p.cut = 0.01, diffmean.cut = 0, logFC.cut = 0, met.platform = c("Illumina Human Methylation 450", "Illumina Human Methylation 27", "Illumina Methylation Epic"), genome, names = FALSE, names.fill = TRUE, filename = "starburst.png", return.plot = FALSE, ylab = expression(atop("Gene Expression", paste(-Log[10], " (FDR corrected P values)"))), xlab = expression(atop("DNA Methylation", paste(-Log[10], " (FDR corrected P values)"))), title = "Starburst Plot", legend = "DNA Methylation/Expression Relation", color = NULL, label = c("Not Significant", "Up regulated & Hypo methylated", "Down regulated & Hypo methylated", "hypo methylated", "hyper methylated", "Up regulated", "Down regulated", "Up regulated & Hyper methylated", "Down regulated & Hyper methylated"), xlim = NULL, ylim = NULL, height = 10, width = 20, dpi = 600 )
TCGAvisualize_starburst( met, exp, group1 = NULL, group2 = NULL, exp.p.cut = 0.01, met.p.cut = 0.01, diffmean.cut = 0, logFC.cut = 0, met.platform = c("Illumina Human Methylation 450", "Illumina Human Methylation 27", "Illumina Methylation Epic"), genome, names = FALSE, names.fill = TRUE, filename = "starburst.png", return.plot = FALSE, ylab = expression(atop("Gene Expression", paste(-Log[10], " (FDR corrected P values)"))), xlab = expression(atop("DNA Methylation", paste(-Log[10], " (FDR corrected P values)"))), title = "Starburst Plot", legend = "DNA Methylation/Expression Relation", color = NULL, label = c("Not Significant", "Up regulated & Hypo methylated", "Down regulated & Hypo methylated", "hypo methylated", "hyper methylated", "Up regulated", "Down regulated", "Up regulated & Hyper methylated", "Down regulated & Hyper methylated"), xlim = NULL, ylim = NULL, height = 10, width = 20, dpi = 600 )
met |
A SummarizedExperiment with methylation data obtained from the TCGAPrepare or Data frame from DMR_results file. Expected colData columns: diffmean, p.value.adj and p.value Execute volcanoPlot function in order to obtain these values for the object. |
exp |
Object obtained by DEArnaSEQ function |
group1 |
The name of the group 1 Obs: Column p.value.adj.group1.group2 should exist |
group2 |
The name of the group 2. Obs: Column p.value.adj.group1.group2 should exist |
exp.p.cut |
expression p value cut-off |
met.p.cut |
methylation p value cut-off |
diffmean.cut |
If set, the probes with diffmean higher than methylation cut-off will be highlighted in the plot. And the data frame return will be subseted. |
logFC.cut |
If set, the probes with expression fold change higher than methylation cut-off will be highlighted in the plot. And the data frame return will be subseted. |
met.platform |
DNA methylation platform "Illumina Human Methylation 450", "Illumina Human Methylation 27", "Illumina Methylation Epic" |
genome |
Genome of reference ("hg38" or "hg19") used to identify nearest probes TSS |
names |
Add the names of the significant genes? Default: FALSE |
names.fill |
Names should be filled in a color box? Default: TRUE |
filename |
The filename of the file (it can be pdf, svg, png, etc) |
return.plot |
If true only plot object will be returned (pdf will not be created) |
ylab |
y axis text |
xlab |
x axis text |
title |
main title |
legend |
legend title |
color |
vector of colors to be used in graph |
label |
vector of labels to be used in graph |
xlim |
x limits to cut image |
ylim |
y limits to cut image |
height |
Figure height |
width |
Figure width |
dpi |
Figure dpi |
Input: data with gene expression/methylation expression Output: starburst plot
Save a starburst plot
## Not run: library(SummarizedExperiment) met <- TCGAbiolinks:::getMetPlatInfo( genome = "hg38", platform = "Illumina Human Methylation 27" ) values(met) <- NULL met$probeID <- names(met) nrows <- length(met); ncols <- 20 counts <- matrix(runif(nrows * ncols, 1, 1e4), nrows) colData <- S4Vectors::DataFrame( Treatment = rep(c("ChIP", "Input"), 5), row.names = LETTERS[1:20], group = rep(c("group1","group2"),c(10,10)) ) met <- SummarizedExperiment::SummarizedExperiment( assays = S4Vectors::SimpleList(counts=counts), rowRanges = met, colData = colData ) rowRanges(met)$diffmean.g1.g2 <- c(runif(nrows, -0.1, 0.1)) rowRanges(met)$diffmean.g2.g1 <- -1*(rowRanges(met)$diffmean.g1.g2) rowRanges(met)$p.value.g1.g2 <- c(runif(nrows, 0, 1)) rowRanges(met)$p.value.adj.g1.g2 <- c(runif(nrows, 0, 1)) exp <- TCGAbiolinks:::get.GRCh.bioMart("hg38") exp$logFC <- runif(nrow(exp), -5, 5) exp$FDR <- runif(nrow(exp), 0.01, 1) result <- TCGAvisualize_starburst( met, exp, exp.p.cut = 0.05, met.p.cut = 0.05, logFC.cut = 2, group1 = "g1", group2 = "g2", genome = "hg38", met.platform = "27k", diffmean.cut = 0.0, names = TRUE ) ## End(Not run)
## Not run: library(SummarizedExperiment) met <- TCGAbiolinks:::getMetPlatInfo( genome = "hg38", platform = "Illumina Human Methylation 27" ) values(met) <- NULL met$probeID <- names(met) nrows <- length(met); ncols <- 20 counts <- matrix(runif(nrows * ncols, 1, 1e4), nrows) colData <- S4Vectors::DataFrame( Treatment = rep(c("ChIP", "Input"), 5), row.names = LETTERS[1:20], group = rep(c("group1","group2"),c(10,10)) ) met <- SummarizedExperiment::SummarizedExperiment( assays = S4Vectors::SimpleList(counts=counts), rowRanges = met, colData = colData ) rowRanges(met)$diffmean.g1.g2 <- c(runif(nrows, -0.1, 0.1)) rowRanges(met)$diffmean.g2.g1 <- -1*(rowRanges(met)$diffmean.g1.g2) rowRanges(met)$p.value.g1.g2 <- c(runif(nrows, 0, 1)) rowRanges(met)$p.value.adj.g1.g2 <- c(runif(nrows, 0, 1)) exp <- TCGAbiolinks:::get.GRCh.bioMart("hg38") exp$logFC <- runif(nrow(exp), -5, 5) exp$FDR <- runif(nrow(exp), 0.01, 1) result <- TCGAvisualize_starburst( met, exp, exp.p.cut = 0.05, met.p.cut = 0.05, logFC.cut = 2, group1 = "g1", group2 = "g2", genome = "hg38", met.platform = "27k", diffmean.cut = 0.0, names = TRUE ) ## End(Not run)
TCGAvisualize_SurvivalCoxNET can help an user to identify a group of survival genes that are significant from univariate Kaplan Meier Analysis and also for Cox Regression. It shows in the end a network build with community of genes with similar range of pvalues from Cox regression (same color) and that interaction among those genes is already validated in literatures using the STRING database (version 9.1). TCGAvisualize_SurvivalCoxNET perform survival analysis with univariate Cox regression and package (dnet) using following functions wrapping from these packages:
survival::coxph
igraph::subgraph.edges
igraph::layout.fruchterman.reingold
igraph::spinglass.community
igraph::communities
dnet::dRDataLoader
dnet::dNetInduce
dnet::dNetPipeline
dnet::visNet
dnet::dCommSignif
TCGAvisualize_SurvivalCoxNET( clinical_patient, dataGE, Genelist, org.Hs.string, scoreConfidence = 700, titlePlot = "TCGAvisualize_SurvivalCoxNET Example" )
TCGAvisualize_SurvivalCoxNET( clinical_patient, dataGE, Genelist, org.Hs.string, scoreConfidence = 700, titlePlot = "TCGAvisualize_SurvivalCoxNET Example" )
clinical_patient |
is a data.frame using function 'clinic' with information related to barcode / samples such as bcr_patient_barcode, days_to_death , days_to_last_followup , vital_status, etc |
dataGE |
is a matrix of Gene expression (genes in rows, samples in cols) from TCGAprepare |
Genelist |
is a list of gene symbols where perform survival KM. |
org.Hs.string |
an igraph object that contains a functional protein association network in human. The network is extracted from the STRING database (version 10). |
scoreConfidence |
restrict to those edges with high confidence (eg. score>=700) |
titlePlot |
is the title to show in the final plot. |
TCGAvisualize_SurvivalCoxNET allow user to perform the complete workflow using coxph and dnet package related to survival analysis with an identification of gene-active networks from high-throughput omics data using gene expression and clinical data.
Cox regression survival analysis to obtain hazard ratio (HR) and p-values
fit a Cox proportional hazards model and ANOVA (Chisq test)
Network comunites
An igraph object that contains a functional protein association network in human. The network is extracted from the STRING database (version 9.1). Only those associations with medium confidence (score>=400) are retained.
restrict to those edges with high confidence (score>=700)
extract network that only contains genes in pvals
Identification of gene-active network
visualisation of the gene-active network itself
the layout of the network visualisation (fixed in different visuals)
color nodes according to communities (identified via a spin-glass model and simulated annealing)
node sizes according to degrees
highlight different communities
visualize the subnetwork
net IGRAPH with related Cox survival genes in community (same pval and color) and with interactions from STRING database.
Creates a volcano plot from the gene expression and DNA methylation analysis.
TCGAVisualize_volcano( x, y, filename = "volcano.pdf", ylab = expression(paste(-Log[10], " (FDR corrected P-values)")), xlab = NULL, title = "Volcano plot", legend = NULL, label = NULL, xlim = NULL, ylim = NULL, color = c("black", "red", "green"), names = NULL, names.fill = TRUE, show.names = "significant", x.cut = 0, y.cut = 0.01, height = 5, width = 10, highlight = NULL, highlight.color = "orange", names.size = 4, dpi = 300 )
TCGAVisualize_volcano( x, y, filename = "volcano.pdf", ylab = expression(paste(-Log[10], " (FDR corrected P-values)")), xlab = NULL, title = "Volcano plot", legend = NULL, label = NULL, xlim = NULL, ylim = NULL, color = c("black", "red", "green"), names = NULL, names.fill = TRUE, show.names = "significant", x.cut = 0, y.cut = 0.01, height = 5, width = 10, highlight = NULL, highlight.color = "orange", names.size = 4, dpi = 300 )
x |
x-axis data (i.e. Diff mean beta-values or Log2FC). |
y |
FDR adjusted p-value (q-value). This data will be transformed to -log10 values. |
filename |
File name: volcano.pdf, volcano.svg, volcano.png. If NULL returns the ggplot object. |
ylab |
y axis text. Default: -Log10 FDR corrected P-values |
xlab |
x axis text. Default: No text. Examples of input: expression(paste(Log[2], "FoldChange")) |
title |
main title. If not specified it will be "Volcano plot (group1 vs group2) |
legend |
Legend title |
label |
vector of labels to be used in the figure. Example: c("Not Significant","Hypermethylated in group1", "Hypomethylated in group1"))#' |
xlim |
x limits to cut image (i.e. c(-4,4)) |
ylim |
y limits to cut image (i.e. c(-1,10)) |
color |
vector of colors to be used in graph |
names |
Names to be plotted if significant. Should be the same size of x and y |
names.fill |
Names should be filled in a color box? Default: TRUE |
show.names |
What names will be showed? Possibilities: "both", "significant", "highlighted" |
x.cut |
x-axis threshold. Default: 0.0 If you give only one number (e.g. 0.2) the cut-offs will be -0.2 and 0.2. Or you can give different cut-offs as a vector (e.g. c(-0.3,0.4)) |
y.cut |
q-values threshold (i.e. 0.01, 10^-10) |
height |
Figure height |
width |
Figure width |
highlight |
List of genes/probes to be highlighted. It should be in the names argument. |
highlight.color |
Color of the points highlighted |
names.size |
Size of the names text |
dpi |
Figure dpi |
Creates a volcano plot from the gene expression and DNA methylation analysis. Please see the vignette for more information
Saves the volcano plot in the current folder
log2_foldchange <- runif(200, -2, 2) fdr <- runif(200, 0.01, 1) TCGAVisualize_volcano( x = log2_foldchange, y = fdr, x.cut = 1.5, y.cut = 0.01, title = "Title example", xlab = expression(paste(Log[2], "FoldChange")) ) ## Not run: beta_diff <- runif(200, -1, 1) fdr <- runif(200, 0.01, 1) TCGAVisualize_volcano( x = beta_diff, y = fdr, x.cut = 1.5, y.cut = 0.01, title = "Title example", xlab = expression(paste("DNA Methylation difference (", beta, "-values)")) ) TCGAVisualize_volcano( x, y, filename = NULL, y.cut = 10000000, x.cut=0.8, names = rep("AAAA",length(x)), legend = "Status", names.fill = FALSE ) TCGAVisualize_volcano( x, y, filename = NULL, y.cut = 10000000, x.cut = 0.8, names = as.character(1:length(x)), legend = "Status", names.fill = TRUE, highlight = c("1","2"), show = "both" ) TCGAVisualize_volcano( x, y, filename = NULL, y.cut = 10000000, x.cut = c(-0.3,0.8), names = as.character(1:length(x)), legend = "Status", names.fill = TRUE, highlight = c("1","2"), show = "both" ) ## End(Not run) while (!(is.null(dev.list()["RStudioGD"]))){dev.off()}
log2_foldchange <- runif(200, -2, 2) fdr <- runif(200, 0.01, 1) TCGAVisualize_volcano( x = log2_foldchange, y = fdr, x.cut = 1.5, y.cut = 0.01, title = "Title example", xlab = expression(paste(Log[2], "FoldChange")) ) ## Not run: beta_diff <- runif(200, -1, 1) fdr <- runif(200, 0.01, 1) TCGAVisualize_volcano( x = beta_diff, y = fdr, x.cut = 1.5, y.cut = 0.01, title = "Title example", xlab = expression(paste("DNA Methylation difference (", beta, "-values)")) ) TCGAVisualize_volcano( x, y, filename = NULL, y.cut = 10000000, x.cut=0.8, names = rep("AAAA",length(x)), legend = "Status", names.fill = FALSE ) TCGAVisualize_volcano( x, y, filename = NULL, y.cut = 10000000, x.cut = 0.8, names = as.character(1:length(x)), legend = "Status", names.fill = TRUE, highlight = c("1","2"), show = "both" ) TCGAVisualize_volcano( x, y, filename = NULL, y.cut = 10000000, x.cut = c(-0.3,0.8), names = as.character(1:length(x)), legend = "Status", names.fill = TRUE, highlight = c("1","2"), show = "both" ) ## End(Not run) while (!(is.null(dev.list()["RStudioGD"]))){dev.off()}
A dataset containing the Sample Ids from TCGA tumor purity measured according to 4 estimates attributes of 9364 tumor patients
Tumor.purity
Tumor.purity
A data frame with 9364 rows and 7 variables:
Sample ID from TCGA barcodes, character string
Cancer type, character string
uses gene expression profiles of 141 immune genes and 141 stromal genes, 0-1 value
uses somatic copy-number data (estimations were available for only 11 cancer types), 0-1 value
(leukocytes unmethylation for purity), which averages 44 non-methylated immune-specific CpG sites, 0-1value
as estimated by image analysis of haematoxylin and eosin stain slides produced by the Nationwide Childrens Hospital Biospecimen Core Resource, 0-1 value
derived consensus measurement as the median purity level after normalizing levels from all methods to give them equal means and s.ds, 0-1 value
...
function to keep raw counts after filtering and/or normalizing.
UseRaw_afterFilter(DataPrep, DataFilt)
UseRaw_afterFilter(DataPrep, DataFilt)
DataPrep |
DataPrep object returned by TCGAanalyze_Preprocessing() |
DataFilt |
Filtered data frame containing samples in columns and genes in rows after normalization and/or filtering steps |
Filtered return object similar to DataPrep with genes removed after normalization and filtering process.
## Not run: dataPrep_raw <- UseRaw_afterFilter(dataPrep, dataFilt) ## End(Not run)
## Not run: dataPrep_raw <- UseRaw_afterFilter(dataPrep, dataFilt) ## End(Not run)