Title: | Explore and download data from the recount project |
---|---|
Description: | Explore and download data from the recount project available at https://jhubiostatistics.shinyapps.io/recount/. Using the recount package you can download RangedSummarizedExperiment objects at the gene, exon or exon-exon junctions level, the raw counts, the phenotype metadata used, the urls to the sample coverage bigWig files or the mean coverage bigWig file for a particular study. The RangedSummarizedExperiment objects can be used by different packages for performing differential expression analysis. Using http://bioconductor.org/packages/derfinder you can perform annotation-agnostic differential expression analyses with the data from the recount project as described at http://www.nature.com/nbt/journal/v35/n4/full/nbt.3838.html. |
Authors: | Leonardo Collado-Torres [aut, cre] , Abhinav Nellore [ctb], Andrew E. Jaffe [ctb] , Margaret A. Taub [ctb], Kai Kammers [ctb], Shannon E. Ellis [ctb] , Kasper Daniel Hansen [ctb] , Ben Langmead [ctb] , Jeffrey T. Leek [aut, ths] |
Maintainer: | Leonardo Collado-Torres <[email protected]> |
License: | Artistic-2.0 |
Version: | 1.33.0 |
Built: | 2024-12-14 03:11:25 UTC |
Source: | https://github.com/bioc/recount |
Given a text query, find the SRA project ids (study accession numbers) that contain the text in their abstract as provided by the SRAdb Bioconductor package.
abstract_search(query, id_only = FALSE, ...)
abstract_search(query, id_only = FALSE, ...)
query |
A character vector with the text to search for via grep in the abstract info available at recount_abstract. |
id_only |
Whether to only return the project id or to return summary information for the project(s) that match the query. |
... |
Additional arguments passed to grep. |
Both the query and the abstracts are searched in lower case.
For a more powerful search use the recount project website at https://jhubiostatistics.shinyapps.io/recount/.
If id_only = TRUE
it returns a character vector with the
project SRA ids (accession numbers). If id_only = FALSE
it returns a
subset of recount_abstract for the abstracts that contained the query.
Leonardo Collado-Torres
browse_study, recount_abstract
## Find the Geuvadis consortium project project_info <- abstract_search("Geuvadis consortium") ## See some summary information for this project project_info
## Find the Geuvadis consortium project project_info <- abstract_search("Geuvadis consortium") ## See some summary information for this project project_info
This function appends sample metadata information to a RangedSummarizedExperiment-class from the recount2 project. The sample metadata comes from curated efforts independent from the original recount2 project. Currently the only information comes from the recount_brain project described in more detail at http://lieberinstitute.github.io/recount-brain/.
add_metadata(rse, source = "recount_brain_v2", is_tcga = FALSE, verbose = TRUE)
add_metadata(rse, source = "recount_brain_v2", is_tcga = FALSE, verbose = TRUE)
rse |
A RangedSummarizedExperiment-class object as downloaded with download_study. If this argument is not specified, the function will return the raw metadata table. |
source |
A valid source name. The only supported options at this
moment are |
is_tcga |
Set to |
verbose |
If |
For source = "recount_brain_v1"
and
source = "recount_brain_v2"
, the metadata columns are
described at http://lieberinstitute.github.io/recount-brain/.
Alternatively, you can explore recount_brain_v2
interactively at
https://jhubiostatistics.shinyapps.io/recount-brain/.
If you use the recount_brain data please cite the Razmara et al. bioRxiv, 2019 https://www.biorxiv.org/content/10.1101/618025v1. A bib file is available via citation('recount').
A RangedSummarizedExperiment-class
object with the sample metadata columns appended to the colData()
slot.
Leonardo Collado-Torres
Razmara et al, bioRxiv, 2019. https://www.biorxiv.org/content/10.1101/618025v1
## Add the sample metadata to an example rse_gene object rse_gene <- add_metadata(rse_gene_SRP009615, "recount_brain_v2") ## Explore the metadata colData(rse_gene) ## For a list of studies present in recount_brain check ## http://lieberinstitute.github.io/recount-brain/. ## recount_brain_v2 includes GTEx and TCGA brain samples in addition to the ## recount_brain_v1 data, plus ontology information. ## Obtain all the recount_brain_v2 metadata if you want to ## explore the metadata manually recount_brain_v2 <- add_metadata(source = "recount_brain_v2")
## Add the sample metadata to an example rse_gene object rse_gene <- add_metadata(rse_gene_SRP009615, "recount_brain_v2") ## Explore the metadata colData(rse_gene) ## For a list of studies present in recount_brain check ## http://lieberinstitute.github.io/recount-brain/. ## recount_brain_v2 includes GTEx and TCGA brain samples in addition to the ## recount_brain_v1 data, plus ontology information. ## Obtain all the recount_brain_v2 metadata if you want to ## explore the metadata manually recount_brain_v2 <- add_metadata(source = "recount_brain_v2")
Shannon Ellis et al (2017) predicted phenotypes based on expression data for
the samples in the recount2 project. Using this function you can add the
predictions to a
RangedSummarizedExperiment-class object
to the colData()
slot.
add_predictions(rse, is_tcga = FALSE, version = "latest", verbose = TRUE)
add_predictions(rse, is_tcga = FALSE, version = "latest", verbose = TRUE)
rse |
A RangedSummarizedExperiment-class object as downloaded with download_study. If this argument is not specified, the function will return the full predictions table. |
is_tcga |
Set to |
version |
The version number for the predicted phenotypes data. It has to match one of the available numbers at https://github.com/leekgroup/recount-website/blob/master/predictions/. Feel free to check if there is a newer version than the default. The version used is printed as part of the file name. |
verbose |
If |
If you use these predicted phenotypes please cite the Ellis et al bioRxiv pre-print available at https://www.biorxiv.org/content/early/2017/06/03/145656. See citation details with citation('recount').
A RangedSummarizedExperiment-class
object with the prediction columns appended to the colData()
slot.
The predicted phenotypes are:
male or female,
cell_line or tissue,
tissue predicted based off of 30 tissues in GTEx,
single or paired end sequencing.
For each of the predicted phenotypes there are several columns as described next:
NA
when not available,
NA
when we did not predict, "Unassigned"
when prediction was ambiguous,
accuracy is assigned per dataset based on comparison to samples for which we had reported phenotype information so there are three distinct values per predictor (GTEx, TCGA, SRA) across all studies.
Leonardo Collado-Torres
Ellis et al, bioRxiv, 2017. https://www.biorxiv.org/content/early/2017/06/03/145656
## Add the predictions to an example rse_gene object rse_gene <- add_predictions(rse_gene_SRP009615) ## Explore the predictions colData(rse_gene) ## Download all the latest predictions PredictedPhenotypes <- add_predictions()
## Add the predictions to an example rse_gene object rse_gene <- add_predictions(rse_gene_SRP009615) ## Explore the predictions colData(rse_gene) ## Download all the latest predictions PredictedPhenotypes <- add_predictions()
Download the metadata from all the projects. This can be useful for finding samples of interests across all projects.
all_metadata(subset = "sra", verbose = TRUE)
all_metadata(subset = "sra", verbose = TRUE)
subset |
Either |
verbose |
If |
Note that for subset = 'gtex'
, there are more variables than
the ones we have for 'sra'. This information corresponds to file
GTEx_Data_V6_Annotations_SampleAttributesDS.txt available at
http://www.gtexportal.org/home/datasets. There you can find the
information describing these variables.
For TCGA we acquired metadata information from 3 different sources:
GDC: via a json query
CGC: via json queries and a custom script to merge the tables
TCGAbiolinks: we used to to parse GDC's XML files For more information, check https://github.com/leekgroup/recount-website/tree/master/metadata/tcga_prep.
A DataFrame-class object with the phenotype metadata.
Leonardo Collado-Torres
metadata <- all_metadata()
metadata <- all_metadata()
Given a SRA study id get the url to browse the study using the SRA website.
browse_study(project, browse = interactive())
browse_study(project, browse = interactive())
project |
A character vector with at least one SRA study id. |
browse |
Whether to open the resulting URL in the browser. |
Returns invisibly the URL for exploring the study in the SRA website.
Leonardo Collado-Torres
## Find the Geuvadis consortium project id <- abstract_search("Geuvadis consortium", id_only = TRUE) id ## Explore the Geuvadis consortium project url <- browse_study(id) ## See the actual URL url
## Find the Geuvadis consortium project id <- abstract_search("Geuvadis consortium", id_only = TRUE) id ## Explore the Geuvadis consortium project url <- browse_study(id) ## See the actual URL url
Given a set of genomic regions as created by expressed_regions, this function computes the coverage matrix for a library size of 40 million 100 bp reads for a given SRA study.
coverage_matrix( project, chr, regions, chunksize = 1000, bpparam = NULL, outdir = NULL, chrlen = NULL, verbose = TRUE, verboseLoad = verbose, scale = TRUE, round = FALSE, ... )
coverage_matrix( project, chr, regions, chunksize = 1000, bpparam = NULL, outdir = NULL, chrlen = NULL, verbose = TRUE, verboseLoad = verbose, scale = TRUE, round = FALSE, ... )
project |
A character vector with one SRA study id. |
chr |
A character vector with the name of the chromosome. |
regions |
A GRanges-class object with regions
for |
chunksize |
A single integer vector defining the chunksize to use for
computing the coverage matrix. Regions will be split into different chunks
which can be useful when using a parallel instance as defined by
|
bpparam |
A BiocParallelParam-class instance which will be used to calculate the coverage matrix in parallel. By default, SerialParam-class will be used. |
outdir |
The destination directory for the downloaded file(s) that were
previously downloaded with download_study. If the files are missing,
but |
chrlen |
The chromosome length in base pairs. If it's |
verbose |
If |
verboseLoad |
If |
scale |
If |
round |
If |
... |
Additional arguments passed to download_study when
|
When using outdir = NULL
the information will be accessed
from the web on the fly. If you encounter internet access problems, it might
be best to first download the BigWig files using download_study. This
might be the best option if you are accessing all chromosomes for a given
project and/or are thinking of using different sets of regions
(for
example, from different cutoffs applied to expressed_regions).
Alternatively check the SciServer
section on the vignette to see
how to access all the recount data via a R Jupyter Notebook.
If you have bwtool
installed, you can use
https://github.com/LieberInstitute/recount.bwtool for faster results.
Note that you will need to run scale_counts after running
coverage_matrix_bwtool()
.
A RangedSummarizedExperiment-class object with the counts stored in the assays slot.
Leonardo Collado-Torres
download_study, findRegions, railMatrix
## Workaround for https://github.com/lawremi/rtracklayer/issues/83 download_study("SRP002001", type = "mean") download_study("SRP002001", type = "samples") ## Define expressed regions for study DRP002835, chrY regions <- expressed_regions("SRP002001", "chrY", cutoff = 5L, maxClusterGap = 3000L, outdir = "SRP002001" ) ## Now calculate the coverage matrix for this study rse <- coverage_matrix("SRP002001", "chrY", regions, outdir = "SRP002001") ## One row per region identical(length(regions), nrow(rse))
## Workaround for https://github.com/lawremi/rtracklayer/issues/83 download_study("SRP002001", type = "mean") download_study("SRP002001", type = "samples") ## Define expressed regions for study DRP002835, chrY regions <- expressed_regions("SRP002001", "chrY", cutoff = 5L, maxClusterGap = 3000L, outdir = "SRP002001" ) ## Now calculate the coverage matrix for this study rse <- coverage_matrix("SRP002001", "chrY", regions, outdir = "SRP002001") ## One row per region identical(length(regions), nrow(rse))
This function is based on the Bioconductor guidelines for querying data from
the web at http://bioconductor.org/developers/how-to/web-query/. It
will run download a set of N.TRIES
times before
giving up. We implemented this function to reduce the number of Bioconductor
build errors due to the occassional errors from our data hosting server.
download_retry(url, destfile = basename(url), mode = "wb", N.TRIES = 3L, ...)
download_retry(url, destfile = basename(url), mode = "wb", N.TRIES = 3L, ...)
url |
The URL to download. Passed to download. |
destfile |
The destination file. Defaults to the base name of the URL. Passed to download. |
mode |
Mode for writing the file. The default |
N.TRIES |
The number of download attempts before giving up; default: 3. Should be an integer of length one with a value greater than 0. |
... |
Additional arguments passed to download. |
An invisible integer code as specified in download.file.
## Download the first files_info.tsv file (version 1) download_retry( recount_url$url[which(recount_url$file_name == "files_info.tsv")[1]] )
## Download the first files_info.tsv file (version 1) download_retry( recount_url$url[which(recount_url$file_name == "files_info.tsv")[1]] )
Download the gene or exon level RangedSummarizedExperiment-class objects provided by the recount project. Alternatively download the counts, metadata or file information for a given SRA study id. You can also download the sample bigWig files or the mean coverage bigWig file.
download_study( project, type = "rse-gene", outdir = project, download = TRUE, version = 2, ... )
download_study( project, type = "rse-gene", outdir = project, download = TRUE, version = 2, ... )
project |
A character vector with one SRA study id. |
type |
Specifies which files to download. The options are:
|
outdir |
The destination directory for the downloaded file(s).
Alternatively check the |
download |
Whether to download the files or just get the download urls. |
version |
A single integer specifying which version of the files to
download. Valid options are 1 and 2, as described in
https://jhubiostatistics.shinyapps.io/recount/ under the
documentation tab. Briefly, version 1 are counts based on reduced exons while
version 2 are based on disjoint exons. This argument mostly just matters for
the exon counts. Defaults to version 2 (disjoint exons).
Use |
... |
Additional arguments passed to download. |
Check http://stackoverflow.com/a/34383991 if you need to find the effective URLs. For example, http://duffel.rail.bio/recount/DRP000366/bw/mean_DRP000366.bw points to a link from SciServer.
Transcript quantifications are described in Fu et al, bioRxiv, 2018. https://www.biorxiv.org/content/10.1101/247346v2
FANTOM-CAT/recount2 quantifications are described in Imada, Sanchez, et al., bioRxiv, 2019. https://www.biorxiv.org/content/10.1101/659490v1
Returns invisibly the URL(s) for the files that were downloaded.
Leonardo Collado-Torres
## Find the URL to download the RangedSummarizedExperiment for the ## Geuvadis consortium study. url <- download_study("ERP001942", download = FALSE) ## See the actual URL url ## Not run: ## Download the example data included in the package for study SRP009615 url2 <- download_study("SRP009615") url2 ## Load the data load(file.path("SRP009615", "rse_gene.Rdata")) ## Compare the data library("testthat") expect_equivalent(rse_gene, rse_gene_SRP009615) ## End(Not run)
## Find the URL to download the RangedSummarizedExperiment for the ## Geuvadis consortium study. url <- download_study("ERP001942", download = FALSE) ## See the actual URL url ## Not run: ## Download the example data included in the package for study SRP009615 url2 <- download_study("SRP009615") url2 ## Load the data load(file.path("SRP009615", "rse_gene.Rdata")) ## Compare the data library("testthat") expect_equivalent(rse_gene, rse_gene_SRP009615) ## End(Not run)
This function uses the pre-computed mean coverage for a given SRA project to identify the expressed regions (ERs) for a given chromosome. It returns a GRanges-class object with the expressed regions as defined by findRegions.
expressed_regions( project, chr, cutoff, outdir = NULL, maxClusterGap = 300L, chrlen = NULL, verbose = TRUE, ... )
expressed_regions( project, chr, cutoff, outdir = NULL, maxClusterGap = 300L, chrlen = NULL, verbose = TRUE, ... )
project |
A character vector with one SRA study id. |
chr |
A character vector with the name of the chromosome. |
cutoff |
The base-pair level cutoff to use. |
outdir |
The destination directory for the downloaded file(s) that were
previously downloaded with download_study. If the files are missing,
but |
maxClusterGap |
This determines the maximum gap between candidate ERs. |
chrlen |
The chromosome length in base pairs. If it's |
verbose |
If |
... |
Additional arguments passed to download_study when
|
A GRanges-class object as created by findRegions.
Leonardo Collado-Torres
download_study, findRegions, railMatrix
## Define expressed regions for study SRP002001, chrY ## Workaround for https://github.com/lawremi/rtracklayer/issues/83 download_study("SRP002001", type = "mean") regions <- expressed_regions("SRP002001", "chrY", cutoff = 5L, maxClusterGap = 3000L, outdir = "SRP002001" ) ## Not run: ## Define the regions for multiple chrs regs <- sapply(chrs, expressed_regions, project = "SRP002001", cutoff = 5L) ## You can then combine them into a single GRanges object if you want to library("GenomicRanges") single <- unlist(GRangesList(regs)) ## End(Not run)
## Define expressed regions for study SRP002001, chrY ## Workaround for https://github.com/lawremi/rtracklayer/issues/83 download_study("SRP002001", type = "mean") regions <- expressed_regions("SRP002001", "chrY", cutoff = 5L, maxClusterGap = 3000L, outdir = "SRP002001" ) ## Not run: ## Define the regions for multiple chrs regs <- sapply(chrs, expressed_regions, project = "SRP002001", cutoff = 5L) ## You can then combine them into a single GRanges object if you want to library("GenomicRanges") single <- unlist(GRangesList(regs)) ## End(Not run)
Given a SRA run id, this function will retrieve the GEO accession id
(starting with GSM) if it's available. Otherwise it will return NA
.
find_geo(run, verbose = FALSE, sleep = 1/2)
find_geo(run, verbose = FALSE, sleep = 1/2)
run |
A character vector of length 1 with the SRA run accession id. |
verbose |
Whether to print a message for the run. Useful when looping over a larger number of SRA run ids. |
sleep |
The number of seconds (or fraction) to wait before downloading
data using getGEO. This is important if you are looking over
|
Although the phenotype information already includes the GEO accession ids, not all projects had GEO entries at the time these tables were created. This function will then be useful to check if there is a GEO accession id for a given sample (run). If there is, you can then retrieve the information using geo_info.
The GEO accession id for the corresponding sample.
Leonardo Collado-Torres
## Find the GEO accession id for for SRX110461 find_geo("SRX110461")
## Find the GEO accession id for for SRX110461 find_geo("SRX110461")
This function builds a data.frame from the GEO characteristics extracted for a given sample. The names of the of columns correspond to the field names. For a given SRA project, this information can be combined for all samples as shown in the examples section.
geo_characteristics(pheno)
geo_characteristics(pheno)
pheno |
A DataFrame-class as created by geo_info. |
A 1 row data.frame with the characteristic fields as column names and the values as the entries on the first row. If the authors of the study used the same names for all samples, you can then combine them using rbind.
Leonardo Collado-Torres
## Load required library library("SummarizedExperiment") ## Get the GEO accession ids # geoids <- sapply(colData(rse_gene_SRP009615)$run[1:2], find_geo) ## The previous example code works nearly all the time but it ## can occassionally fail depending on how rentrez is doing. ## This code makes sure that the example code runs. geoids <- tryCatch( sapply(colData(rse_gene_SRP009615)$run[1:2], find_geo), error = function(e) { c( "SRR387777" = "GSM836270", "SRR387778" = "GSM836271" ) } ) ## Get the data from GEO geodata <- do.call(rbind, sapply(geoids, geo_info)) ## Add characteristics in a way that we can access easily later on geodata <- cbind(geodata, geo_characteristics(geodata)) ## Explore the original characteristics and the result from ## geo_characteristics() geodata[, c("characteristics", "cells", "shrna.expression", "treatment")]
## Load required library library("SummarizedExperiment") ## Get the GEO accession ids # geoids <- sapply(colData(rse_gene_SRP009615)$run[1:2], find_geo) ## The previous example code works nearly all the time but it ## can occassionally fail depending on how rentrez is doing. ## This code makes sure that the example code runs. geoids <- tryCatch( sapply(colData(rse_gene_SRP009615)$run[1:2], find_geo), error = function(e) { c( "SRR387777" = "GSM836270", "SRR387778" = "GSM836271" ) } ) ## Get the data from GEO geodata <- do.call(rbind, sapply(geoids, geo_info)) ## Add characteristics in a way that we can access easily later on geodata <- cbind(geodata, geo_characteristics(geodata)) ## Explore the original characteristics and the result from ## geo_characteristics() geodata[, c("characteristics", "cells", "shrna.expression", "treatment")]
This function uses GEOquery to extract information for a given sample. The GEO accession ids for the sample can be found in the study phenotype table.
geo_info( geoid, verbose = FALSE, sleep = 1/2, getGPL = FALSE, destdir = tempdir(), ... )
geo_info( geoid, verbose = FALSE, sleep = 1/2, getGPL = FALSE, destdir = tempdir(), ... )
geoid |
A character vector of length 1 with the GEO accession id for a given sample. |
verbose |
If |
sleep |
The number of seconds (or fraction) to wait before downloading
data using getGEO. This is important if you are looking over
|
getGPL |
This argument is passed to getGEO and is set
to |
destdir |
This argument is passed to getGEO. |
... |
Additional arguments passed to getGEO. |
Returns a DataFrame-class with the information from GEO available for the given sample.
Leonardo Collado-Torres, Andrew Jaffe
geo_info("GSM836270")
geo_info("GSM836270")
For some analyses you might be interested in transforming the counts into RPKMs which you can do with this function.
getRPKM(rse, length_var = "bp_length", mapped_var = NULL)
getRPKM(rse, length_var = "bp_length", mapped_var = NULL)
rse |
A RangedSummarizedExperiment-class object as downloaded with download_study. |
length_var |
A length 1 character vector with the column name from
|
mapped_var |
A length 1 character vector with the column name from
|
For gene RSE objects, you will want to specify the length_var
because otherwise you will be adjusting for the total gene length instead
of the total exonic sequence length of the gene.
A matrix with the RPKM values.
Andrew Jaffe, Leonardo Collado-Torres
## get RPKM matrix rpkm <- getRPKM(rse_gene_SRP009615) ## You can also get an RPKM matrix after running scale_counts() ## with similar RPKM values rpkm2 <- getRPKM(scale_counts(rse_gene_SRP009615)) rpkm3 <- getRPKM(scale_counts(rse_gene_SRP009615, by = "mapped_reads")) summary(rpkm - rpkm2) summary(rpkm - rpkm3) summary(rpkm2 - rpkm3)
## get RPKM matrix rpkm <- getRPKM(rse_gene_SRP009615) ## You can also get an RPKM matrix after running scale_counts() ## with similar RPKM values rpkm2 <- getRPKM(scale_counts(rse_gene_SRP009615)) rpkm3 <- getRPKM(scale_counts(rse_gene_SRP009615, by = "mapped_reads")) summary(rpkm - rpkm2) summary(rpkm - rpkm3) summary(rpkm2 - rpkm3)
For some analyses you might be interested in transforming the counts into TPMs which you can do with this function. This function uses the gene-level RPKMs to derive TPM values (see Details).
getTPM(rse, length_var = "bp_length", mapped_var = NULL)
getTPM(rse, length_var = "bp_length", mapped_var = NULL)
rse |
A RangedSummarizedExperiment-class object as downloaded with download_study. |
length_var |
A length 1 character vector with the column name from
|
mapped_var |
A length 1 character vector with the column name from
|
For gene RSE objects, you will want to specify the length_var
because otherwise you will be adjusting for the total gene length instead
of the total exonic sequence length of the gene.
As noted in https://support.bioconductor.org/p/124265/, Sonali Arora et al computed TPMs in https://www.biorxiv.org/content/10.1101/445601v2 using the formula: TPM = FPKM / (sum of FPKM over all genes/transcripts) * 10^6
Arora et al mention in their code that the formula comes from
https://doi.org/10.1093/bioinformatics/btp692; specifically
1.1.1 Comparison to RPKM estimation
where they mention an important
assumption: Under the assumption of uniformly distributed reads, we note
that RPKM measures are estimates of ...
There's also a blog post by Harold Pimentel explaining the relationship between FPKM and TPM: https://haroldpimentel.wordpress.com/2014/05/08/what-the-fpkm-a-review-rna-seq-expression-units/.
A matrix with the TPM values.
Sonali Arora, Leonardo Collado-Torres
https://www.biorxiv.org/content/10.1101/445601v2 https://arxiv.org/abs/1104.3889
## Compute the TPM matrix from the raw gene-level base-pair counts. tpm <- getTPM(rse_gene_SRP009615)
## Compute the TPM matrix from the raw gene-level base-pair counts. tpm <- getTPM(rse_gene_SRP009615)
As described in the recount workflow, the counts provided by the recount2 project are base-pair counts. You can scale them using scale_counts or compute the read counts using the area under coverage information (AUC). We use the AUC because Rail-RNA soft clips some reads.
read_counts(rse, use_paired_end = TRUE, round = FALSE)
read_counts(rse, use_paired_end = TRUE, round = FALSE)
rse |
A RangedSummarizedExperiment-class object as downloaded with download_study. |
use_paired_end |
A logical vector. When |
round |
Whether to round the counts to integers or not. |
Check the recount workflow for details about the counts provided by the recount2 project. Note that this function should not be used after scale_counts or it will return non-sensical results.
Returns a RangedSummarizedExperiment-class object with the read counts.
Leonardo Collado-Torres
Collado-Torres L, Nellore A and Jaffe AE. recount workflow: Accessing over 70,000 human RNA-seq samples with Bioconductor version 1; referees: 1 approved, 2 approved with reservations. F1000Research 2017, 6:1558 doi: 10.12688/f1000research.12223.1.
## Difference between read counts and reads downloaded by Rail-RNA colSums(assays(read_counts(rse_gene_SRP009615))$counts) / 1e6 - colData(rse_gene_SRP009615)$reads_downloaded / 1e6 ## Paired-end read counts vs fragment counts (single-end counts) download_study("DRP000499") load("DRP000499/rse_gene.Rdata") colSums(assays(read_counts(rse_gene, use_paired_end = FALSE))$counts) / colSums(assays(read_counts(rse_gene))$counts) ## Difference between paired-end read counts vs paired-end reads downloaded colSums(assays(read_counts(rse_gene))$counts) / 1e6 - colData(rse_gene)$reads_downloaded / 1e6 / 2
## Difference between read counts and reads downloaded by Rail-RNA colSums(assays(read_counts(rse_gene_SRP009615))$counts) / 1e6 - colData(rse_gene_SRP009615)$reads_downloaded / 1e6 ## Paired-end read counts vs fragment counts (single-end counts) download_study("DRP000499") load("DRP000499/rse_gene.Rdata") colSums(assays(read_counts(rse_gene, use_paired_end = FALSE))$counts) / colSums(assays(read_counts(rse_gene))$counts) ## Difference between paired-end read counts vs paired-end reads downloaded colSums(assays(read_counts(rse_gene))$counts) / 1e6 - colData(rse_gene)$reads_downloaded / 1e6 / 2
A data.frame with summary information at the project level for the studies analyzed in the recount project.
A data.frame with 4 columns.
the number of samples in the study,
the species of the study,
the abstract text as provided by the SRAdb Bioconductor package,
the SRA project id.
https://jhubiostatistics.shinyapps.io/recount/
Exon annotation extracted from Gencode v25 (GRCh38.p7) used in recount. Data extraced on October 12th, 2017.
A GRangesList-class with one element per gene. The names match the gene Gencode v25 ids.
https://jhubiostatistics.shinyapps.io/recount/
reproduce_ranges, recount_genes
Gene annotation extracted from Gencode v25 (GRCh38.p7) used in recount. Data extraced on October 12th, 2017. The symbols were updated compared to the version from January 17th, 2017. It includes the sum of the width of the disjoint exons which can be used for normalizing the counts provided in the RangedSummarizedExperiment-class objects.
A GRanges-class with one range per gene. The names match their Gencode v25 ids. The GRanges-class has three metadata columns.
the Gencode v25 ids, identical to the names.
the sum of the width of the disjoint exons for that given gene.
a CharacterList with the corresponding gene symbols.
https://jhubiostatistics.shinyapps.io/recount/
reproduce_ranges, recount_exons
Files and URLs as provided by the recount project. This information is used internally in download_study.
A data.frame with 6 columns.
the original path to the file before being uploaded,
the file name,
the SRA project id,
A logical vector indicating whether the file was part of version 1 (reduced exons)
.
A logical vector indicating whether the file was updated in version 2 (disjoint exons)
. Further details in the recount website documentation tab.
the public URL for the given file.
https://jhubiostatistics.shinyapps.io/recount/
This function reproduces the gene or exon level information used for creating
the RangedSummarizedExperiment-class
objects provided by recount. The annotation is based on
Gencode v25 with the gene-level
information extracted with genes()
(see
transcripts with default arguments.
reproduce_ranges(level = "gene", db = "Gencode.v25")
reproduce_ranges(level = "gene", db = "Gencode.v25")
level |
Either |
db |
Either |
For Gencode.v25, we use the comprehensive gene annotation (regions:
CHR
) from https://www.gencodegenes.org/releases/25.html
(GRCh38.p7).
Note that gene symbols have changed over time. This answer in the Bioconductor support forum details how to obtain the latest gene symbol mappings: https://support.bioconductor.org/p/126148/#126173.
Either a GRanges-class object like recount_genes or a GRangesList-class object like recount_exons.
Leonardo Collado-Torres
recount_genes, recount_exons, https://github.com/nellore, https://jhubiostatistics.shinyapps.io/recount/
## Not run: ## Reproduce gene level information genes <- reproduce_ranges() ## Compare against recount_genes length(genes) length(recount_genes) ## End(Not run)
## Not run: ## Reproduce gene level information genes <- reproduce_ranges() ## Compare against recount_genes length(genes) length(recount_genes) ## End(Not run)
RangedSummarizedExperiment-class at the gene level for study SRP009615. Used as an example in scale_counts. Matches the version2 file (details at https://jhubiostatistics.shinyapps.io/recount/) under the documentation tab.
A RangedSummarizedExperiment-class as created by the recount project for study with SRA id (accession number) SRP009615.
https://jhubiostatistics.shinyapps.io/recount/
In preparation for a differential expression analysis, you will have to choose how to scale the raw counts provided by the recount project. Note that the raw counts are the sum of the base level coverage so you have to take into account the read length or simply the total coverage for the given sample (default option). You might want to do some further scaling to take into account the gene or exon lengths. If you prefer to calculate read counts without scaling check the function read_counts.
scale_counts( rse, by = "auc", targetSize = 4e+07, L = 100, factor_only = FALSE, round = TRUE )
scale_counts( rse, by = "auc", targetSize = 4e+07, L = 100, factor_only = FALSE, round = TRUE )
rse |
A RangedSummarizedExperiment-class object as downloaded with download_study. |
by |
Either |
targetSize |
The target library size in number of single end reads. |
L |
The target read length. Only used when |
factor_only |
Whether to only return the numeric scaling factor or
to return a RangedSummarizedExperiment-class
object with the counts scaled. If set to |
round |
Whether to round the counts to integers or not. |
Rail-RNA http://rail.bio uses soft clipping when aligning
which is why we recommed using by = 'auc'
.
If the reads are from a paired-end library, then the avg_read_length
is the average fragment length. This is taken into account when using
by = 'mapped_reads'
.
If factor_only = TRUE
it returns a numeric vector with the
scaling factor for each sample. If factor_only = FALSE
it returns a
RangedSummarizedExperiment-class object with
the counts already scaled.
Leonardo Collado-Torres
## Load an example rse_gene object rse_gene <- rse_gene_SRP009615 ## Scale counts rse <- scale_counts(rse_gene) ## Find the project used as an example project_info <- abstract_search("GSE32465") ## See some summary information for this project project_info ## Use the following code to re-download this file ## Not run: ## Download download_study(project_info$project) ## Load file load(file.path(project_info$project, "rse_gene.Rdata")) identical(rse_gene, rse_gene_SRP009615) ## End(Not run)
## Load an example rse_gene object rse_gene <- rse_gene_SRP009615 ## Scale counts rse <- scale_counts(rse_gene) ## Find the project used as an example project_info <- abstract_search("GSE32465") ## See some summary information for this project project_info ## Use the following code to re-download this file ## Not run: ## Download download_study(project_info$project) ## Load file load(file.path(project_info$project, "rse_gene.Rdata")) identical(rse_gene, rse_gene_SRP009615) ## End(Not run)
This function uses the Snaptron API to query specific exon-exon junctions that are available via Intropolis as described in the vignette.
snaptron_query(junctions, version = "srav1", verbose = TRUE, async = TRUE)
snaptron_query(junctions, version = "srav1", verbose = TRUE, async = TRUE)
junctions |
A GRanges-class object with the exon-exon junctions of interest. The chromosome names should be in UCSC format, such as 'chr1'. The strand information is ignored in the query. |
version |
Either |
verbose |
If |
async |
Defaults to |
A GRanges-class object with the results from the Snaptron query. For information on the different columns please see http://snaptron.cs.jhu.edu.
Leonardo Collado-Torres
Please cite http://snaptron.cs.jhu.edu if you use this function as Snaptron is a separate project from recount. Thank you!
library("GenomicRanges") ## Define some exon-exon junctions (hg19 coordinates) junctions <- GRanges(seqnames = "chr2", IRanges( start = c(28971710:28971712, 29555081:29555083, 29754982:29754984), end = c(29462417:29462419, 29923338:29923340, 29917714:29917716) )) ## Check against Snaptron SRA version 1 (hg19 coordinates) snaptron_query(junctions) ## Not run: ## Check another set of junctions against SRA version 2 (more data, hg38 ## coordinates) junctions_v2 <- GRanges(seqnames = "chr2", IRanges( start = 29532116:29532118, end = 29694848:29694850 )) snaptron_query(junctions_v2, version = "srav2") ## Check these junctions in GTEx and TCGA data snaptron_query(junctions_v2, version = "gtex") snaptron_query(junctions_v2, version = "tcga") ## End(Not run)
library("GenomicRanges") ## Define some exon-exon junctions (hg19 coordinates) junctions <- GRanges(seqnames = "chr2", IRanges( start = c(28971710:28971712, 29555081:29555083, 29754982:29754984), end = c(29462417:29462419, 29923338:29923340, 29917714:29917716) )) ## Check against Snaptron SRA version 1 (hg19 coordinates) snaptron_query(junctions) ## Not run: ## Check another set of junctions against SRA version 2 (more data, hg38 ## coordinates) junctions_v2 <- GRanges(seqnames = "chr2", IRanges( start = 29532116:29532118, end = 29694848:29694850 )) snaptron_query(junctions_v2, version = "srav2") ## Check these junctions in GTEx and TCGA data snaptron_query(junctions_v2, version = "gtex") snaptron_query(junctions_v2, version = "tcga") ## End(Not run)