Title: | Programmatic access to the DEE2 RNA expression dataset |
---|---|
Description: | Digital Expression Explorer 2 (or DEE2 for short) is a repository of processed RNA-seq data in the form of counts. It was designed so that researchers could undertake re-analysis and meta-analysis of published RNA-seq studies quickly and easily. As of April 2020, over 1 million SRA datasets have been processed. This package provides an R interface to access these expression data. More information about the DEE2 project can be found at the project homepage (http://dee2.io) and main publication (https://doi.org/10.1093/gigascience/giz022). |
Authors: | Mark Ziemann [aut, cre], Antony Kaspi [aut] |
Maintainer: | Mark Ziemann <[email protected]> |
License: | GPL-3 |
Version: | 1.17.0 |
Built: | 2024-11-29 08:12:18 UTC |
Source: | https://github.com/bioc/getDEE2 |
The getDEE2 function fetches gene expression data from the DEE2 database of RNA sequencing data and returns it as a SummarizedExperiment object.
getDEE2( species, SRRvec, counts = "GeneCounts", metadata = NULL, outfile = NULL, legacy = FALSE, baseURL = "http://dee2.io/cgi-bin/request.sh?", ... )
species |
A character string matching the species of interest. |
SRRvec |
A character string or vector of SRA run accession numbers. |
counts |
A string, either 'GeneCounts', 'TxCounts' or 'Tx2Gene'. When 'GeneCounts' is specified, STAR gene level counts are returned. When 'TxCounts' is specified, kallisto transcript counts are returned. When 'Tx2Gene' is specified, kallisto counts aggregated (by sum) on gene are returned. If left blank, "GeneCounts" will be fetched. |
metadata |
(Optional) Name of the R object holding the metadata. Providing the metadata will speed up performance if multiple queries are made in a session. If left blank, the metadata will be fetched again. |
outfile |
An optional file name for the downloaded dataset. |
legacy |
Whether data should be returned in the legacy (list) format. Default is FALSE. Leave this FALSE if you want to receive data as a SummarizedExperiment. |
baseURL |
The base URL of the service. Leave this as the default URL unless you want to download from a third-party mirror. |
... |
Additional parameters to be passed to download.file. |
a SummarizedExperiment object.
# Example workflow
# Fetch metadata
mdat <- getDEE2Metadata("celegans")
# filter metadata for SRA project SRP009256
mdat1 <- mdat[which(mdat$SRP_accession %in% "SRP009256"), ]
# create a vector of SRA run accessions to fetch
SRRvec <- as.vector(mdat1$SRR_accession)
# obtain the data as a SummarizedExperiment
x <- getDEE2("celegans", SRRvec, metadata = mdat, counts = "GeneCounts")
# Next, downstream analysis with your favourite Bioconductor tools :)
x <- getDEE2("ecoli", c("SRR1613487", "SRR1613488"))
The getDEE2_bundle function fetches gene expression data from DEE2. This function will only work if all SRA runs have been successfully processed for an SRA project. This function returns a SummarizedExperiment object.
getDEE2_bundle( species, query, col, counts = "GeneCounts", bundles = NULL, legacy = FALSE, baseURL = "http://dee2.io/huge/", ... )
species |
A character string matching the species of interest. |
query |
A character string, such as the SRA project accession number or the GEO series accession number. |
col |
The column name to be queried, usually "SRP_accession" for SRA project accession or "GSE_accession" for GEO series accession. |
counts |
A string, either 'GeneCounts', 'TxCounts' or 'Tx2Gene'. When 'GeneCounts' is specified, STAR gene level counts are returned. When 'TxCounts' is specified, kallisto transcript counts are returned. When 'Tx2Gene' is specified, kallisto counts aggregated (by sum) on gene are returned. If left blank, "GeneCounts" will be fetched. |
bundles |
Optional table of previously downloaded bundles. Providing this will speed up performance if multiple queries are made in a session. If left blank, the bundle list will be fetched again. |
legacy |
Whether data should be returned in the legacy (list) format. Default is FALSE. Leave this FALSE if you want to receive data as a SummarizedExperiment. |
baseURL |
The base URL of the service. Leave this as the default URL unless you want to download from a third-party mirror. |
... |
Additional parameters to be passed to download.file. |
a SummarizedExperiment object.
x <- getDEE2_bundle("celegans", "SRP133403",col="SRP_accession")
This function fetches the short metadata for the species of interest.
getDEE2Metadata(species, outfile = NULL, ...)
species |
A character string matching a species of interest. |
outfile |
Optional filename. |
... |
Additional parameters to be passed to download.file. |
a table of metadata.
ecoli_metadata <- getDEE2Metadata("ecoli")
This function fetches a table listing all completed projects that are available at DEE2.
list_bundles(species)
species |
A character string matching a species of interest. |
a table of project bundles available at DEE2.io/huge
bundles <- list_bundles("celegans")
This function loads the full metadata, which contains many fields.
loadFullMeta(zipname)
zipname |
Path to the zipfile. |
a dataframe of full metadata.
x <- getDEE2("ecoli", c("SRR1613487", "SRR1613488"), outfile = "mydata.zip")
y <- loadFullMeta("mydata.zip")
This function loads STAR gene level counts from a downloaded zip file.
loadGeneCounts(zipname)
zipname |
Path to the zipfile. |
a dataframe of gene expression counts.
x <- getDEE2("ecoli", c("SRR1613487", "SRR1613488"), outfile = "mydata.zip")
y <- loadGeneCounts("mydata.zip")
This function loads gene information. This information includes gene names and lengths, which are useful for downstream analysis.
loadGeneInfo(zipname)
zipname |
Path to the zipfile. |
a dataframe of gene information.
x <- getDEE2("ecoli", c("SRR1613487", "SRR1613488"), outfile = "mydata.zip")
y <- loadGeneInfo("mydata.zip")
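As one illustration of why gene lengths matter downstream, the sketch below converts a toy count matrix to transcripts per million (TPM), a common length normalisation. It is not part of getDEE2: the matrix, the gene identifiers and the `gene_len` vector are invented for the example, and the actual column names of the table returned by loadGeneInfo should be checked before adapting it.

```r
# Toy gene-level counts: rows are genes, columns are SRA runs.
# All identifiers and values are invented for illustration.
counts <- matrix(
  c(100, 200, 300,   # runA
     50,  50, 100),  # runB
  nrow = 3,
  dimnames = list(c("g1", "g2", "g3"), c("runA", "runB"))
)

# Hypothetical gene lengths in base pairs, in the same order as the rows.
gene_len <- c(g1 = 1000, g2 = 2000, g3 = 3000)

# Step 1: reads per kilobase. The length vector is recycled down each column.
rate <- counts / (gene_len[rownames(counts)] / 1000)

# Step 2: scale each run so that its values sum to one million.
tpm <- sweep(rate, 2, colSums(rate), "/") * 1e6

print(round(tpm, 1))
```

Whether TPM or another normalisation is appropriate depends on the downstream tool; many differential expression packages expect raw counts instead.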
This function loads quality control data. More information about the QC metrics is available from the project github page: https://github.com/markziemann/dee2/blob/master/qc/qc_metrics.md
loadQcMx(zipname)
zipname |
Path to the zipfile. |
a dataframe of quality control metrics.
x <- getDEE2("ecoli", c("SRR1613487", "SRR1613488"), outfile = "mydata.zip")
y <- loadQcMx("mydata.zip")
This function loads the summary metadata, which contains the most relevant SRA accession numbers.
loadSummaryMeta(zipname)
zipname |
Path to the zipfile. |
a dataframe of summary metadata.
x <- getDEE2("ecoli", c("SRR1613487", "SRR1613488"), outfile = "mydata.zip")
y <- loadSummaryMeta("mydata.zip")
This function loads Kallisto transcript level counts from a downloaded zip file.
loadTxCounts(zipname)
zipname |
Path to the zipfile. |
a dataframe of transcript expression counts.
x <- getDEE2("ecoli", c("SRR1613487", "SRR1613488"), outfile = "mydata.zip")
y <- loadTxCounts("mydata.zip")
This function loads transcript information. This information includes transcript lengths, corresponding parent gene accession and gene symbol that might be useful for downstream analysis.
loadTxInfo(zipname)
zipname |
Path to the zipfile. |
a dataframe of transcript info
x <- getDEE2("ecoli", c("SRR1613487", "SRR1613488"), outfile = "mydata.zip")
y <- loadTxInfo("mydata.zip")
This function sends a query to check whether a dataset is available or not.
query_bundles(species, query, col, bundles = NULL)
species |
A character string matching a species of interest. |
query |
A character string, such as the SRA project accession number or the GEO series accession number. |
col |
The column name to be queried, usually "SRP_accession" for SRA project accession or "GSE_accession" for GEO series accession. |
bundles |
optional table of previously downloaded bundles. |
a list of datasets that are present and absent.
query_bundles("celegans", c("SRP133403","SRP133439"), col = "SRP_accession")
This function sends a query to check whether a dataset is available or not.
queryDEE2(species, SRRvec, metadata = NULL, ...)
species |
A character string matching a species of interest. |
SRRvec |
A character string or vector of SRA run accession numbers. |
metadata |
optional R object of DEE2 metadata to query. |
... |
Additional parameters to be passed to download.file. |
a list of datasets that are present and absent.
x <- queryDEE2("ecoli",c("SRR1067773","SRR5350513"))
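The idea of splitting a query into present and absent accessions can be sketched with base R on a toy metadata table. This is only an illustration of the concept, not the package's implementation: the accessions are invented, and the list element names `present` and `absent` are assumptions chosen to mirror the description of the return value. The `SRR_accession` column name is taken from the example workflow for getDEE2.

```r
# Toy stand-in for a DEE2 metadata table (accessions are invented).
mdat <- data.frame(
  SRR_accession = c("SRR0000001", "SRR0000002", "SRR0000003"),
  stringsAsFactors = FALSE
)

# Run accessions we would like to fetch.
SRRvec <- c("SRR0000002", "SRR0000009")

# Split the query into accessions found in the metadata and those missing.
res <- list(
  present = intersect(SRRvec, mdat$SRR_accession),
  absent  = setdiff(SRRvec, mdat$SRR_accession)
)

print(res)
```

Checking the absent set before a large download makes it easy to see which runs have not been processed.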
This function creates a SummarizedExperiment object from a legacy getDEE2 dataset.
se(x, counts = "GeneCounts")
x |
a getDEE2 object. |
counts |
select "GeneCounts" for STAR based gene counts, "TxCounts" for kallisto transcript level counts or "Tx2Gene" for transcript counts aggregated to gene level. Default is "GeneCounts" |
a SummarizedExperiment object
x <- getDEE2("ecoli", c("SRR1613487", "SRR1613488"), legacy = TRUE)
y <- se(x)
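Sometimes the data for a single SRA experiment are split across two or more runs, and the runs need to be aggregated. This function sums run-level counts so that each column corresponds to one SRA experiment.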
srx_agg(x, counts = "GeneCounts")
x |
a getDEE2 object. |
counts |
select "GeneCounts" for STAR based gene counts, "TxCounts" for kallisto transcript level counts or "Tx2Gene" for transcript counts aggregated to gene level. Default is "GeneCounts" |
a dataframe with gene expression data summarised to SRA experiment accession numbers rather than run accession numbers.
x <- getDEE2("ecoli", c("SRR1613487", "SRR1613488"), legacy = TRUE)
y <- srx_agg(x)
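To make the run-to-experiment aggregation concrete, here is a minimal base R sketch on toy data. It assumes that aggregation means summing the counts of runs that share an experiment accession; the accessions, the `run_to_srx` mapping and the values are invented, and this is not the code used inside srx_agg.

```r
# Toy gene counts: rows are genes, columns are SRA runs (invented values).
run_counts <- matrix(
  c(4, 6, 1,
    0, 2, 5),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("gene1", "gene2"), c("SRR_a", "SRR_b", "SRR_c"))
)

# Hypothetical mapping of runs to experiments: SRR_a and SRR_b share SRX_1.
run_to_srx <- c(SRR_a = "SRX_1", SRR_b = "SRX_1", SRR_c = "SRX_2")

# rowsum() sums rows by group, so transpose to group the run columns,
# then transpose back to the genes-by-experiments layout.
groups <- unname(run_to_srx[colnames(run_counts)])
srx_counts <- as.data.frame(t(rowsum(t(run_counts), group = groups)))

print(srx_counts)
```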
This function converts Kallisto transcript-level expression estimates to gene-level estimates. Counts for each transcript are summed to get an aggregated gene level score.
Tx2Gene(x)
x |
a getDEE2 object. |
a dataframe of gene expression counts.
x <- getDEE2("scerevisiae", c("SRR1755149", "SRR1755150"), legacy = TRUE)
x <- Tx2Gene(x)
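The sum-by-gene aggregation described for Tx2Gene can be illustrated on toy data with base R. This is a sketch of the general technique only: the transcript and gene identifiers, the `tx_gene` mapping and the counts are invented, and the package's own implementation may differ.

```r
# Toy transcript-level counts: rows are transcripts, columns are SRA runs.
# All identifiers and values are invented for illustration.
tx_counts <- matrix(
  c(10, 5,
     3, 2,
     7, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("tx1", "tx2", "tx3"), c("runA", "runB"))
)

# Hypothetical transcript-to-gene mapping: tx1 and tx2 belong to geneA.
tx_gene <- c(tx1 = "geneA", tx2 = "geneA", tx3 = "geneB")

# rowsum() adds up the rows that share a group label, giving one row per gene.
groups <- unname(tx_gene[rownames(tx_counts)])
gene_counts <- rowsum(tx_counts, group = groups)

print(gene_counts)
```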