recount3 quick start guide

Overview

The recount3 R/Bioconductor package is an interface to the recount3 project. recount3 provides uniformly processed RNA-seq data for hundreds of thousands of samples. The R package makes it possible to easily retrieve this data in standard Bioconductor containers, including RangedSummarizedExperiment. The sections on terminology and available data contains more detail on those subjects.

The main documentation website for all the recount3-related projects is available at recount.bio. Please check that website for more information about how this R/Bioconductor package and other tools are related to each other.

Basics

Installing recount3

R is an open-source statistical environment which can be easily modified to enhance its functionality via packages. recount3 is a R package available via the Bioconductor repository for packages. R can be installed on any operating system from CRAN after which you can install recount3 by using the following commands in your R session:

if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
}

BiocManager::install("recount3")

## Check that you have a valid Bioconductor installation
BiocManager::valid()

You can install the development version from GitHub with:

BiocManager::install("LieberInstitute/recount3")

Required knowledge

recount3 is based on many other packages and in particular in those that have implemented the infrastructure needed for dealing with RNA-seq data. A recount3 user will benefit from being familiar with SummarizedExperiment to understand the objects recount3 generates. It might also prove to be highly beneficial to check the

If you are asking yourself the question “Where do I start using Bioconductor?” you might be interested in this blog post.

Asking for help

As package developers, we try to explain clearly how to use our packages and in which order to use the functions. But R and Bioconductor have a steep learning curve so it is critical to learn where to ask for help. The blog post quoted above mentions some but we would like to highlight the Bioconductor support site as the main resource for getting help: remember to use the recount3 tag and check the older posts. Other alternatives are available such as creating GitHub issues and tweeting. However, please note that if you want to receive help you should adhere to the posting guidelines. It is particularly critical that you provide a small reproducible example and your session information so package developers can track down the source of the error.

Citing recount3

We hope that recount3 will be useful for your research. Please use the following information to cite the package and the overall approach. Thank you!

## Citation info
citation("recount3")
#> To cite package 'recount3' in publications use:
#> 
#>   Collado-Torres L (2024). _Explore and download data from the recount3
#>   project_. doi:10.18129/B9.bioc.recount3
#>   <https://doi.org/10.18129/B9.bioc.recount3>,
#>   https://github.com/LieberInstitute/recount3 - R package version
#>   1.17.0, <http://www.bioconductor.org/packages/recount3>.
#> 
#>   Wilks C, Zheng SC, Chen FY, Charles R, Solomon B, Ling JP, Imada EL,
#>   Zhang D, Joseph L, Leek JT, Jaffe AE, Nellore A, Collado-Torres L,
#>   Hansen KD, Langmead B (2021). "recount3: summaries and queries for
#>   large-scale RNA-seq expression and splicing." _Genome Biol_.
#>   doi:10.1186/s13059-021-02533-6
#>   <https://doi.org/10.1186/s13059-021-02533-6>,
#>   <https://doi.org/10.1186/s13059-021-02533-6>.
#> 
#> To see these entries in BibTeX format, use 'print(<citation>,
#> bibtex=TRUE)', 'toBibtex(.)', or set
#> 'options(citation.bibtex.max=999)'.

Quick start

After installing recount3 (Wilks, Zheng, Chen, Charles et al., 2021), we need to load the package, which will automatically load the required dependencies.

## Load recount3 R package
library("recount3")

If you have identified a study of interest and want to access the gene level expression data, use create_rse() as shown below. create_rse() has arguments that will allow you to specify the annotation of interest for the given organism, and whether you want to download gene, exon or exon-exon junction expression data.

## Find all available human projects
human_projects <- available_projects()
#> 2024-11-19 04:13:30.934035 caching file sra.recount_project.MD.gz.
#> adding rname 'http://duffel.rail.bio/recount3/human/data_sources/sra/metadata/sra.recount_project.MD.gz'
#> 2024-11-19 04:13:32.824509 caching file gtex.recount_project.MD.gz.
#> adding rname 'http://duffel.rail.bio/recount3/human/data_sources/gtex/metadata/gtex.recount_project.MD.gz'
#> 2024-11-19 04:13:34.142923 caching file tcga.recount_project.MD.gz.
#> adding rname 'http://duffel.rail.bio/recount3/human/data_sources/tcga/metadata/tcga.recount_project.MD.gz'

## Find the project you are interested in,
## here we use SRP009615 as an example
proj_info <- subset(
    human_projects,
    project == "SRP009615" & project_type == "data_sources"
)

## Create a RangedSummarizedExperiment (RSE) object at the gene level
rse_gene_SRP009615 <- create_rse(proj_info)
#> 2024-11-19 04:13:38.826849 downloading and reading the metadata.
#> 2024-11-19 04:13:39.306234 caching file sra.sra.SRP009615.MD.gz.
#> adding rname 'http://duffel.rail.bio/recount3/human/data_sources/sra/metadata/15/SRP009615/sra.sra.SRP009615.MD.gz'
#> 2024-11-19 04:13:40.693078 caching file sra.recount_project.SRP009615.MD.gz.
#> adding rname 'http://duffel.rail.bio/recount3/human/data_sources/sra/metadata/15/SRP009615/sra.recount_project.SRP009615.MD.gz'
#> 2024-11-19 04:13:42.091205 caching file sra.recount_qc.SRP009615.MD.gz.
#> adding rname 'http://duffel.rail.bio/recount3/human/data_sources/sra/metadata/15/SRP009615/sra.recount_qc.SRP009615.MD.gz'
#> 2024-11-19 04:13:43.405852 caching file sra.recount_seq_qc.SRP009615.MD.gz.
#> adding rname 'http://duffel.rail.bio/recount3/human/data_sources/sra/metadata/15/SRP009615/sra.recount_seq_qc.SRP009615.MD.gz'
#> 2024-11-19 04:13:44.617273 caching file sra.recount_pred.SRP009615.MD.gz.
#> adding rname 'http://duffel.rail.bio/recount3/human/data_sources/sra/metadata/15/SRP009615/sra.recount_pred.SRP009615.MD.gz'
#> 2024-11-19 04:13:45.484657 downloading and reading the feature information.
#> 2024-11-19 04:13:45.977617 caching file human.gene_sums.G026.gtf.gz.
#> adding rname 'http://duffel.rail.bio/recount3/human/annotations/gene_sums/human.gene_sums.G026.gtf.gz'
#> 2024-11-19 04:13:47.274143 downloading and reading the counts: 12 samples across 63856 features.
#> 2024-11-19 04:13:47.757026 caching file sra.gene_sums.SRP009615.G026.gz.
#> adding rname 'http://duffel.rail.bio/recount3/human/data_sources/sra/gene_sums/15/SRP009615/sra.gene_sums.SRP009615.G026.gz'
#> 2024-11-19 04:13:48.837017 constructing the RangedSummarizedExperiment (rse) object.

## Explore that RSE object
rse_gene_SRP009615
#> class: RangedSummarizedExperiment 
#> dim: 63856 12 
#> metadata(8): time_created recount3_version ... annotation recount3_url
#> assays(1): raw_counts
#> rownames(63856): ENSG00000278704.1 ENSG00000277400.1 ...
#>   ENSG00000182484.15_PAR_Y ENSG00000227159.8_PAR_Y
#> rowData names(10): source type ... havana_gene tag
#> colnames(12): SRR387777 SRR387778 ... SRR389077 SRR389078
#> colData names(175): rail_id external_id ...
#>   recount_pred.curated.cell_line BigWigURL

You can also interactively choose your study of interest

## Note that you can interactively explore the available projects
proj_info_interactive <- interactiveDisplayBase::display(human_projects)

## Select a single row, then hit "send". The following code checks this.
stopifnot(nrow(proj_info_interactive) == 1)

## Then create the RSE object
rse_gene_interactive <- create_rse(proj_info_interactive)

Once you have a RSE file, you can use transform_counts() to transform the raw coverage counts.

## Once you have your RSE object, you can transform the raw coverage
## base-pair coverage counts using transform_counts().
## For RPKM, TPM or read outputs, check the details in transform_counts().
assay(rse_gene_SRP009615, "counts") <- transform_counts(rse_gene_SRP009615)

Now you are ready to continue with downstream analysis software.

recount3 also supports accessing the BigWig raw coverage files as well as specific study or collection sample metadata. Please continue to the users guide for more detailed information.

Users guide

recount3 (Wilks, Zheng, Chen, Charles et al., 2021) provides an interface for downloading the recount3 raw files and building Bioconductor-friendly R objects (Huber, Carey, Gentleman, Anders et al., 2015; Morgan, Obenchain, Hester, and Pagès, 2019) that can be used with many downstream packages. To achieve this, the raw data is organized by study from a specific data source. That same study can be a part of one or more collections, which is a manually curated set of studies with collection-specific sample metadata (see the Data source vs collection for details). To get started with recount3, you will need to identify the ID for the study of interest from either human or mouse for a particular annotation of interest. Once you have identified study, data source or collection, and annotation, recount3 can be used to build a RangedSummarizedExperiment object (Morgan, Obenchain, Hester, and Pagès, 2019) for either gene, exon or exon-exon junction expression feature data. Furthermore, recount3 provides access to the coverage BigWig files that can be quantified for custom set of genomic regions using megadepth. Furthermore, snapcount allows fast-queries for custom exon-exon junctions and other custom input.

Available data

recount3 provides access to most of the recount3 raw files in a form that is R/Bioconductor-friendly. As a summary of the data provided by the recount3 project (Figure @ref(fig:recountWorkflowFig1)), the main data files provided are:

  • metadata: information about the samples in recount3, which can come from sources such as the Sequence Read Archive as well as recount3 quality metrics.
  • gene: RNA expression data quantified at the gene annotation level. This information is provided for multiple annotations that are organism-specific. Similar to the recount2 project, the recount3 project provides counts at the base-pair coverage level (Collado-Torres, Nellore, and Jaffe, 2017).
  • exon: RNA expression data quantified at the exon annotation level. The data is also annotation-specific and the counts are also at the base-pair coverage level.
  • exon-exon junctions: RNA expression data quantified at the exon-exon junction level. This data is annotation-agnostic (it does not depend on the annotation) and is variable across each study because different sets of exon-exon junctions are measured in each study.
  • bigWig: RNA expression data in raw format at the base-pair coverage resolution. This raw data when coupled with a given annotation can be used to generate gene and exon level counts using software such as megadepth. It enables exploring the RNA expression landscape in an annotation-agnostic way.
Overview of the data available in recount2 and recount3. Reads (pink boxes) aligned to the reference genome can be used to compute a base-pair coverage curve and identify exon-exon junctions (split reads). Gene and exon count matrices are generated using annotation information providing the gene (green boxes) and exon (blue boxes) coordinates together with the base-level coverage curve. The reads spanning exon-exon junctions (jx) are used to compute a third count matrix that might include unannotated junctions (jx 3 and 4). Without using annotation information, expressed regions (orange box) can be determined from the base-level coverage curve to then construct data-driven count matrices. DOI: < https://doi.org/10.12688/f1000research.12223.1>.

Overview of the data available in recount2 and recount3. Reads (pink boxes) aligned to the reference genome can be used to compute a base-pair coverage curve and identify exon-exon junctions (split reads). Gene and exon count matrices are generated using annotation information providing the gene (green boxes) and exon (blue boxes) coordinates together with the base-level coverage curve. The reads spanning exon-exon junctions (jx) are used to compute a third count matrix that might include unannotated junctions (jx 3 and 4). Without using annotation information, expressed regions (orange box) can be determined from the base-level coverage curve to then construct data-driven count matrices. DOI: < https://doi.org/10.12688/f1000research.12223.1>.

Terminology

Here we describe some of the common terminology and acronyms used throughout the rest of the documentation. recount3 enables creating RangedSummarizedExperiment objects that contain expression quantitative data (Figure @ref(fig:recountWorkflowFig2)). As a quick overview, some of the main terms are:

  • rse: a RangedSummarizedExperiment object from SummarizedExperiment (Morgan, Obenchain, Hester, and Pagès, 2019) that contains:
    • counts: a matrix with the expression feature data (either: gene, exon, or exon-exon junctions) and that can be accessed using assays(counts).
    • metadata: a table with information about the samples and quality metrics that can be accessed using colData(rse).
    • annotation: a table-like object with information about the expression features which can be annotation-specific (gene and exons) or annotation-agnostic (exon-exon junctions). This information can be accessed using rowRanges(rse).
recount2 and recount3 provide coverage count matrices in RangedSummarizedExperiment (rse) objects. Once the rse object has been downloaded and loaded into R (using `recount3::create_rse()` or `recount2::download_study()`), the feature information is accessed with `rowRanges(rse)` (blue box), the counts with assays(rse) (pink box) and the sample metadata with `colData(rse)` (green box). The sample metadata stored in `colData(rse)` can be expanded with custom code (brown box) matching by a unique sample identifier such as the SRA Run ID. The rse object is inside the purple box and matching data is highlighted in each box. DOI: < https://doi.org/10.12688/f1000research.12223.1>.

recount2 and recount3 provide coverage count matrices in RangedSummarizedExperiment (rse) objects. Once the rse object has been downloaded and loaded into R (using recount3::create_rse() or recount2::download_study()), the feature information is accessed with rowRanges(rse) (blue box), the counts with assays(rse) (pink box) and the sample metadata with colData(rse) (green box). The sample metadata stored in colData(rse) can be expanded with custom code (brown box) matching by a unique sample identifier such as the SRA Run ID. The rse object is inside the purple box and matching data is highlighted in each box. DOI: < https://doi.org/10.12688/f1000research.12223.1>.

recount3 enables accessing data from multiple reference organisms from public projects. To identify these projects, the key terms we use are:

  • project: the project ID, which is typically the same ID that is used externally to identify that project in databases like the Sequence Read Archive 1
  • project_home: the location for where the project is located at in the main recount3 data host: IDIES SciServer. We have two types of project homes:
    • data_source: this identifies where the raw data was downloaded from for creating recount3.
    • collection: this identifies a human-curated set of projects or samples. The main difference compared to a data_source is that a collection has collection-specific metadata. For example, someone might have created a table of information about samples in several projects by reading the papers describing these studies as in recount-brain (Razmara, Ellis, Sokolowski, Davis et al., 2019).
  • organism: the reference organism (species) for the RNA-seq data, which is either human (hg38) or mouse (mm10).

Many of the recount3 raw files include three columns that are used to identify each sample and that allow merging the data across these files. Those are:

Find a study

In order to access data from recount3, the first step is to identify a project that you are interested in working with. Most of the project IDs are the ones you can find on the Sequence Read Archive (SRA). For example, SRP009615 which we use in the examples in this vignette.The exceptions are the Genotype-Expression and The Cancer Genome Atlas human studies, commonly known as GTEx and TCGA. Both GTEx and TCGA are available in recount3 by tissue.

Through available_projects()

While you can use external websites to find a study of interest, you can also use available_projects() to list the projects that are available in recount3 as shown below. This will return a data.frame() object that lists the unique project IDs.

human_projects <- available_projects()
#> 2024-11-19 04:13:49.474414 caching file sra.recount_project.MD.gz.
#> 2024-11-19 04:13:49.892237 caching file gtex.recount_project.MD.gz.
#> 2024-11-19 04:13:50.263645 caching file tcga.recount_project.MD.gz.
dim(human_projects)
#> [1] 8742    6
head(human_projects)
#>     project organism file_source     project_home project_type n_samples
#> 1 SRP107565    human         sra data_sources/sra data_sources       216
#> 2 SRP149665    human         sra data_sources/sra data_sources         4
#> 3 SRP017465    human         sra data_sources/sra data_sources        23
#> 4 SRP119165    human         sra data_sources/sra data_sources         6
#> 5 SRP133965    human         sra data_sources/sra data_sources        12
#> 6 SRP096765    human         sra data_sources/sra data_sources         7

## Select a study of interest
project_info <- subset(
    human_projects,
    project == "SRP009615" & project_type == "data_sources"
)
project_info
#>        project organism file_source     project_home project_type n_samples
#> 1838 SRP009615    human         sra data_sources/sra data_sources        12

Let’s say that you are interested in the GTEx projects, you could then filter by file_source. We’ll focus only on those entries that from a data source, and not from a collection for now.

subset(human_projects, file_source == "gtex" & project_type == "data_sources")
#>              project organism file_source      project_home project_type
#> 8678  ADIPOSE_TISSUE    human        gtex data_sources/gtex data_sources
#> 8679          MUSCLE    human        gtex data_sources/gtex data_sources
#> 8680    BLOOD_VESSEL    human        gtex data_sources/gtex data_sources
#> 8681           HEART    human        gtex data_sources/gtex data_sources
#> 8682           OVARY    human        gtex data_sources/gtex data_sources
#> 8683          UTERUS    human        gtex data_sources/gtex data_sources
#> 8684          VAGINA    human        gtex data_sources/gtex data_sources
#> 8685          BREAST    human        gtex data_sources/gtex data_sources
#> 8686            SKIN    human        gtex data_sources/gtex data_sources
#> 8687  SALIVARY_GLAND    human        gtex data_sources/gtex data_sources
#> 8688           BRAIN    human        gtex data_sources/gtex data_sources
#> 8689   ADRENAL_GLAND    human        gtex data_sources/gtex data_sources
#> 8690         THYROID    human        gtex data_sources/gtex data_sources
#> 8691            LUNG    human        gtex data_sources/gtex data_sources
#> 8692          SPLEEN    human        gtex data_sources/gtex data_sources
#> 8693        PANCREAS    human        gtex data_sources/gtex data_sources
#> 8694       ESOPHAGUS    human        gtex data_sources/gtex data_sources
#> 8695         STOMACH    human        gtex data_sources/gtex data_sources
#> 8696           COLON    human        gtex data_sources/gtex data_sources
#> 8697 SMALL_INTESTINE    human        gtex data_sources/gtex data_sources
#> 8698        PROSTATE    human        gtex data_sources/gtex data_sources
#> 8699          TESTIS    human        gtex data_sources/gtex data_sources
#> 8700           NERVE    human        gtex data_sources/gtex data_sources
#> 8701       PITUITARY    human        gtex data_sources/gtex data_sources
#> 8702           BLOOD    human        gtex data_sources/gtex data_sources
#> 8703           LIVER    human        gtex data_sources/gtex data_sources
#> 8704          KIDNEY    human        gtex data_sources/gtex data_sources
#> 8705    CERVIX_UTERI    human        gtex data_sources/gtex data_sources
#> 8706  FALLOPIAN_TUBE    human        gtex data_sources/gtex data_sources
#> 8707         BLADDER    human        gtex data_sources/gtex data_sources
#> 8708        STUDY_NA    human        gtex data_sources/gtex data_sources
#> 8709     BONE_MARROW    human        gtex data_sources/gtex data_sources
#>      n_samples
#> 8678      1293
#> 8679       881
#> 8680      1398
#> 8681       942
#> 8682       195
#> 8683       159
#> 8684       173
#> 8685       482
#> 8686      1940
#> 8687       178
#> 8688      2931
#> 8689       274
#> 8690       706
#> 8691       655
#> 8692       255
#> 8693       360
#> 8694      1577
#> 8695       384
#> 8696       822
#> 8697       193
#> 8698       263
#> 8699       410
#> 8700       659
#> 8701       301
#> 8702      1048
#> 8703       251
#> 8704        98
#> 8705        19
#> 8706         9
#> 8707        21
#> 8708       133
#> 8709       204

Note that one of the projects for GTEx is STUDY_NA, that’s because in recount3 we processed all GTEx samples, including some that had no tissue assigned and were not used by the GTEx consortium.

If you prefer to view the list of available studies interactively, you can do so with interactiveDisplayBase as shown below. You’ll want to assign the output of interactiveDisplayBase::display() to an object so you can save your selections and use them later. By doing so, you’ll be able to select a study of interest, and save the information for later use after you hit the “send” button.

## Alternatively, interactively browse the human projects,
## select one, then hit send
selected_study <- interactiveDisplayBase::display(human_projects)

Ultimately, we need three pieces of information in order to download a specific dataset from recount3. Those are:

  • project: the project ID,
  • organism: either human or mouse,
  • project_home: where the data lives which varies between data sources and collections.
project_info[, c("project", "organism", "project_home")]
#>        project organism     project_home
#> 1838 SRP009615    human data_sources/sra

Available annotations

Now that we have identified our project of interest, the next step is to choose an annotation that we want to work with. The annotation files available depend on the organism. To facilitate finding the specific names we use in recount3, we have provided the function annotation_options().

annotation_options("human")
#> [1] "gencode_v26" "gencode_v29" "fantom6_cat" "refseq"      "ercc"       
#> [6] "sirv"
annotation_options("mouse")
#> [1] "gencode_v23"

The main sources are:

In recount3 we have provided multiple annotations, which is different from recount (recount2 R/Bioconductor package) where all files were computed using GENCODE version 25. However, in both, you might be interested in quantifying your annotation of interest, as described further below in the BigWig files section.

Build a RSE

Once you have chosen an annotation and a project, you can now build a RangedSummarizedExperiment object (Huber, Carey, Gentleman, Anders et al., 2015; Morgan, Obenchain, Hester, and Pagès, 2019). To do so, we recommend using the create_rse() function as shown below for GENCODE v26 (the default annotation for human files). create_rse() internally uses several other recount3 functions for locating the necessary raw files, downloading them, reading them, and building the RangedSummarizedExperiment (RSE) object. create_rse() shows several status message updates that you can silence with suppressMessages(create_rse()) if you want to.

## Create a RSE object at the gene level
rse_gene_SRP009615 <- create_rse(project_info)
#> 2024-11-19 04:13:52.87458 downloading and reading the metadata.
#> 2024-11-19 04:13:53.247372 caching file sra.sra.SRP009615.MD.gz.
#> 2024-11-19 04:13:53.648427 caching file sra.recount_project.SRP009615.MD.gz.
#> 2024-11-19 04:13:54.030213 caching file sra.recount_qc.SRP009615.MD.gz.
#> 2024-11-19 04:13:54.413135 caching file sra.recount_seq_qc.SRP009615.MD.gz.
#> 2024-11-19 04:13:54.808313 caching file sra.recount_pred.SRP009615.MD.gz.
#> 2024-11-19 04:13:54.865161 downloading and reading the feature information.
#> 2024-11-19 04:13:55.236595 caching file human.gene_sums.G026.gtf.gz.
#> 2024-11-19 04:13:55.650685 downloading and reading the counts: 12 samples across 63856 features.
#> 2024-11-19 04:13:56.009675 caching file sra.gene_sums.SRP009615.G026.gz.
#> 2024-11-19 04:13:56.128985 constructing the RangedSummarizedExperiment (rse) object.

## Explore the resulting RSE gene object
rse_gene_SRP009615
#> class: RangedSummarizedExperiment 
#> dim: 63856 12 
#> metadata(8): time_created recount3_version ... annotation recount3_url
#> assays(1): raw_counts
#> rownames(63856): ENSG00000278704.1 ENSG00000277400.1 ...
#>   ENSG00000182484.15_PAR_Y ENSG00000227159.8_PAR_Y
#> rowData names(10): source type ... havana_gene tag
#> colnames(12): SRR387777 SRR387778 ... SRR389077 SRR389078
#> colData names(175): rail_id external_id ...
#>   recount_pred.curated.cell_line BigWigURL

Explore the RSE

Because the RSE object is created at run-time in recount3, unlike the static files provided by recount for recount2, create_rse() stores information about how this RSE object was made under metadata(). This information is useful in case you share the RSE object and someone else wants to be able to re-make the object with the latest data 4.

## Information about how this RSE object was made
metadata(rse_gene_SRP009615)
#> $time_created
#> [1] "2024-11-19 04:13:56 UTC"
#> 
#> $recount3_version
#>           package ondiskversion loadedversion
#> recount3 recount3        1.17.0        1.17.0
#>                                                path
#> recount3 /tmp/RtmprSvjM2/Rinst22aa62102ff7/recount3
#>                                          loadedpath attached is_base       date
#> recount3 /tmp/RtmprSvjM2/Rinst22aa62102ff7/recount3     TRUE   FALSE 2024-11-19
#>                                         source md5ok
#> recount3 https://bioc.r-universe.dev (R 4.4.2)    NA
#>                                    library
#> recount3 /tmp/RtmprSvjM2/Rinst22aa62102ff7
#> 
#> $project
#> [1] "SRP009615"
#> 
#> $project_home
#> [1] "data_sources/sra"
#> 
#> $type
#> [1] "gene"
#> 
#> $organism
#> [1] "human"
#> 
#> $annotation
#> [1] "gencode_v26"
#> 
#> $recount3_url
#> [1] "http://duffel.rail.bio/recount3"

The SRP009615 study was composed of 12 samples, for which we have 63,856 genes in GENCODE v26. The annotation-specific information is available through rowRanges() as shown below with the gene_id column used to identify genes in each of the annotations 5.

## Number of genes by number of samples
dim(rse_gene_SRP009615)
#> [1] 63856    12

## Information about the genes
rowRanges(rse_gene_SRP009615)
#> GRanges object with 63856 ranges and 10 metadata columns:
#>                              seqnames            ranges strand |   source
#>                                 <Rle>         <IRanges>  <Rle> | <factor>
#>          ENSG00000278704.1 GL000009.2       56140-58376      - |  ENSEMBL
#>          ENSG00000277400.1 GL000194.1      53590-115018      - |  ENSEMBL
#>          ENSG00000274847.1 GL000194.1      53594-115055      - |  ENSEMBL
#>          ENSG00000277428.1 GL000195.1       37434-37534      - |  ENSEMBL
#>          ENSG00000276256.1 GL000195.1       42939-49164      - |  ENSEMBL
#>                        ...        ...               ...    ... .      ...
#>   ENSG00000124334.17_PAR_Y       chrY 57184101-57197337      + |   HAVANA
#>   ENSG00000185203.12_PAR_Y       chrY 57201143-57203357      - |   HAVANA
#>    ENSG00000270726.6_PAR_Y       chrY 57190738-57208756      + |   HAVANA
#>   ENSG00000182484.15_PAR_Y       chrY 57207346-57212230      + |   HAVANA
#>    ENSG00000227159.8_PAR_Y       chrY 57212184-57214397      - |   HAVANA
#>                                type bp_length     phase                gene_id
#>                            <factor> <numeric> <integer>            <character>
#>          ENSG00000278704.1     gene      2237      <NA>      ENSG00000278704.1
#>          ENSG00000277400.1     gene      2179      <NA>      ENSG00000277400.1
#>          ENSG00000274847.1     gene      1599      <NA>      ENSG00000274847.1
#>          ENSG00000277428.1     gene       101      <NA>      ENSG00000277428.1
#>          ENSG00000276256.1     gene      2195      <NA>      ENSG00000276256.1
#>                        ...      ...       ...       ...                    ...
#>   ENSG00000124334.17_PAR_Y     gene      2504      <NA> ENSG00000124334.17_P..
#>   ENSG00000185203.12_PAR_Y     gene      1054      <NA> ENSG00000185203.12_P..
#>    ENSG00000270726.6_PAR_Y     gene       773      <NA> ENSG00000270726.6_PA..
#>   ENSG00000182484.15_PAR_Y     gene      4618      <NA> ENSG00000182484.15_P..
#>    ENSG00000227159.8_PAR_Y     gene      1306      <NA> ENSG00000227159.8_PA..
#>                                         gene_type   gene_name       level
#>                                       <character> <character> <character>
#>          ENSG00000278704.1         protein_coding  BX004987.1           3
#>          ENSG00000277400.1         protein_coding  AC145212.2           3
#>          ENSG00000274847.1         protein_coding  AC145212.1           3
#>          ENSG00000277428.1               misc_RNA       Y_RNA           3
#>          ENSG00000276256.1         protein_coding  AC011043.1           3
#>                        ...                    ...         ...         ...
#>   ENSG00000124334.17_PAR_Y         protein_coding        IL9R           2
#>   ENSG00000185203.12_PAR_Y              antisense      WASIR1           2
#>    ENSG00000270726.6_PAR_Y   processed_transcript AJ271736.10           2
#>   ENSG00000182484.15_PAR_Y transcribed_unproces..      WASH6P           2
#>    ENSG00000227159.8_PAR_Y unprocessed_pseudogene    DDX11L16           2
#>                                     havana_gene         tag
#>                                     <character> <character>
#>          ENSG00000278704.1                 <NA>        <NA>
#>          ENSG00000277400.1                 <NA>        <NA>
#>          ENSG00000274847.1                 <NA>        <NA>
#>          ENSG00000277428.1                 <NA>        <NA>
#>          ENSG00000276256.1                 <NA>        <NA>
#>                        ...                  ...         ...
#>   ENSG00000124334.17_PAR_Y OTTHUMG00000022720.1         PAR
#>   ENSG00000185203.12_PAR_Y OTTHUMG00000022676.3         PAR
#>    ENSG00000270726.6_PAR_Y OTTHUMG00000184987.2         PAR
#>   ENSG00000182484.15_PAR_Y OTTHUMG00000022677.5         PAR
#>    ENSG00000227159.8_PAR_Y OTTHUMG00000022678.1         PAR
#>   -------
#>   seqinfo: 374 sequences from an unspecified genome; no seqlengths

Sample metadata

The sample metadata provided in recount3 is much more extensive than the one in recount for the recount2 project because it’s includes for quality control metrics, predictions, and information used internally by recount3 functions such as create_rse(). All individual metadata tables include the columns rail_id, external_id and study which are used for merging the different tables. Finally, **BigWigUrl* provides the URL for the BigWig file for the given sample.

## Sample metadata
recount3_cols <- colnames(colData(rse_gene_SRP009615))

## How many are there?
length(recount3_cols)
#> [1] 175

## View the first few ones
head(recount3_cols)
#> [1] "rail_id"            "external_id"        "study"             
#> [4] "sra.sample_acc.x"   "sra.experiment_acc" "sra.submission_acc"

## Group them by source
recount3_cols_groups <- table(gsub("\\..*", "", recount3_cols))

## Common sources and number of columns per source
recount3_cols_groups[recount3_cols_groups > 1]
#> 
#>    recount_pred recount_project      recount_qc  recount_seq_qc             sra 
#>               7               5             109              12              38

## Unique columns
recount3_cols_groups[recount3_cols_groups == 1]
#> 
#>   BigWigURL external_id     rail_id       study 
#>           1           1           1           1
## Explore them all
recount3_cols

For studies from SRA, we can further extract the SRA attributes using expand_sra_attributes() as shown below.

rse_gene_SRP009615_expanded <-
    expand_sra_attributes(rse_gene_SRP009615)
colData(rse_gene_SRP009615_expanded)[, ncol(colData(rse_gene_SRP009615)):ncol(colData(rse_gene_SRP009615_expanded))]
#> DataFrame with 12 rows and 5 columns
#>                        BigWigURL sra_attribute.cells
#>                      <character>         <character>
#> SRR387777 http://duffel.rail.b..                K562
#> SRR387778 http://duffel.rail.b..                K562
#> SRR387779 http://duffel.rail.b..                K562
#> SRR387780 http://duffel.rail.b..                K562
#> SRR389079 http://duffel.rail.b..                K562
#> ...                          ...                 ...
#> SRR389082 http://duffel.rail.b..                K562
#> SRR389083 http://duffel.rail.b..                K562
#> SRR389084 http://duffel.rail.b..                K562
#> SRR389077 http://duffel.rail.b..                K562
#> SRR389078 http://duffel.rail.b..                K562
#>           sra_attribute.shRNA_expression sra_attribute.source_name
#>                              <character>               <character>
#> SRR387777                             no                    SL2933
#> SRR387778             yes, targeting SRF                    SL2934
#> SRR387779                             no                    SL5265
#> SRR387780              yes targeting SRF                    SL3141
#> SRR389079            no shRNA expression                    SL6485
#> ...                                  ...                       ...
#> SRR389082         expressing shRNA tar..                    SL2592
#> SRR389083            no shRNA expression                    SL4337
#> SRR389084         expressing shRNA tar..                    SL4326
#> SRR389077            no shRNA expression                    SL1584
#> SRR389078         expressing shRNA tar..                    SL1583
#>           sra_attribute.treatment
#>                       <character>
#> SRR387777               Puromycin
#> SRR387778  Puromycin, doxycycline
#> SRR387779               Puromycin
#> SRR387780  Puromycin, doxycycline
#> SRR389079               Puromycin
#> ...                           ...
#> SRR389082  Puromycin, doxycycline
#> SRR389083               Puromycin
#> SRR389084  Puromycin, doxycycline
#> SRR389077               Puromycin
#> SRR389078  Puromycin, doxycycline

Counts

The counts in recount3 are raw base-pair coverage counts, similar to those in recount. To further understand them, check the recountWorkflow DOI 10.12688/f1000research.12223.1. To highlight that these are raw base-pair coverage counts, they are stored in the “raw_counts” assay

assayNames(rse_gene_SRP009615)
#> [1] "raw_counts"

Using transform_counts() you can scale the counts and assign them to the “counts” assays slot to use them in downstream packages such as DESeq2 and limma.

## Once you have your RSE object, you can transform the raw coverage
## base-pair coverage counts using transform_counts().
## For RPKM, TPM or read outputs, check the details in transform_counts().
assay(rse_gene_SRP009615, "counts") <- transform_counts(rse_gene_SRP009615)

Just like with recount for recount2, you can transform the raw base-pair coverage counts (Collado-Torres, Nellore, and Jaffe, 2017) to read counts with compute_read_counts(), RPKM with recount::getRPKM() or TPM values with recount::getTPM(). Check transform_counts() from recount3 for more details.

Exon

recount3 provides an interface to raw files that go beyond gene counts, as well as other features you might be interested in. For instance, you might want to study expression at the exon expression level instead of gene. To do so, use the type argument in create_rse() as shown below.

## Create a RSE object at the exon level
rse_exon_SRP009615 <- create_rse(
    proj_info,
    type = "exon"
)
#> 2024-11-19 04:13:56.42206 downloading and reading the metadata.
#> 2024-11-19 04:13:56.766378 caching file sra.sra.SRP009615.MD.gz.
#> 2024-11-19 04:13:57.150363 caching file sra.recount_project.SRP009615.MD.gz.
#> 2024-11-19 04:13:57.594617 caching file sra.recount_qc.SRP009615.MD.gz.
#> 2024-11-19 04:13:57.989075 caching file sra.recount_seq_qc.SRP009615.MD.gz.
#> 2024-11-19 04:13:58.385188 caching file sra.recount_pred.SRP009615.MD.gz.
#> 2024-11-19 04:13:58.43891 downloading and reading the feature information.
#> 2024-11-19 04:13:58.887101 caching file human.exon_sums.G026.gtf.gz.
#> adding rname 'http://duffel.rail.bio/recount3/human/annotations/exon_sums/human.exon_sums.G026.gtf.gz'
#> 2024-11-19 04:14:19.077249 downloading and reading the counts: 12 samples across 1299686 features.
#> 2024-11-19 04:14:19.746057 caching file sra.exon_sums.SRP009615.G026.gz.
#> adding rname 'http://duffel.rail.bio/recount3/human/data_sources/sra/exon_sums/15/SRP009615/sra.exon_sums.SRP009615.G026.gz'
#> 2024-11-19 04:14:23.567664 constructing the RangedSummarizedExperiment (rse) object.

## Explore the resulting RSE exon object
rse_exon_SRP009615
#> class: RangedSummarizedExperiment 
#> dim: 1299686 12 
#> metadata(8): time_created recount3_version ... annotation recount3_url
#> assays(1): raw_counts
#> rownames(1299686): GL000009.2|56140|58376|- GL000194.1|53594|54832|-
#>   ... chrY|57213880|57213964|- chrY|57214350|57214397|-
#> rowData names(21): source type ... ont ccdsid
#> colnames(12): SRR387777 SRR387778 ... SRR389077 SRR389078
#> colData names(175): rail_id external_id ...
#>   recount_pred.curated.cell_line BigWigURL

## Explore the object
dim(rse_exon_SRP009615)
#> [1] 1299686      12

Each exon is shown in this output, so, you might have to filter the exons of interest. Unlike in recount2, these are actual exons and not disjoint exons 6.

## Exon annotation information
rowRanges(rse_exon_SRP009615)
#> GRanges object with 1299686 ranges and 21 metadata columns:
#>                                seqnames            ranges strand |   source
#>                                   <Rle>         <IRanges>  <Rle> | <factor>
#>     GL000009.2|56140|58376|- GL000009.2       56140-58376      - |  ENSEMBL
#>     GL000194.1|53594|54832|- GL000194.1       53594-54832      - |  ENSEMBL
#>     GL000194.1|55446|55676|- GL000194.1       55446-55676      - |  ENSEMBL
#>     GL000194.1|53590|55676|- GL000194.1       53590-55676      - |  ENSEMBL
#>   GL000194.1|112792|112850|- GL000194.1     112792-112850      - |  ENSEMBL
#>                          ...        ...               ...    ... .      ...
#>     chrY|57212184|57213125|-       chrY 57212184-57213125      - |   HAVANA
#>     chrY|57213204|57213357|-       chrY 57213204-57213357      - |   HAVANA
#>     chrY|57213526|57213602|-       chrY 57213526-57213602      - |   HAVANA
#>     chrY|57213880|57213964|-       chrY 57213880-57213964      - |   HAVANA
#>     chrY|57214350|57214397|-       chrY 57214350-57214397      - |   HAVANA
#>                                  type bp_length     phase
#>                              <factor> <numeric> <integer>
#>     GL000009.2|56140|58376|-     exon      2237      <NA>
#>     GL000194.1|53594|54832|-     exon      1239      <NA>
#>     GL000194.1|55446|55676|-     exon       231      <NA>
#>     GL000194.1|53590|55676|-     exon      2087      <NA>
#>   GL000194.1|112792|112850|-     exon        59      <NA>
#>                          ...      ...       ...       ...
#>     chrY|57212184|57213125|-     exon       942      <NA>
#>     chrY|57213204|57213357|-     exon       154      <NA>
#>     chrY|57213526|57213602|-     exon        77      <NA>
#>     chrY|57213880|57213964|-     exon        85      <NA>
#>     chrY|57214350|57214397|-     exon        48      <NA>
#>                                             gene_id          transcript_id
#>                                         <character>            <character>
#>     GL000009.2|56140|58376|-      ENSG00000278704.1      ENST00000618686.1
#>     GL000194.1|53594|54832|-      ENSG00000274847.1      ENST00000400754.4
#>     GL000194.1|55446|55676|-      ENSG00000274847.1      ENST00000400754.4
#>     GL000194.1|53590|55676|-      ENSG00000277400.1      ENST00000613230.1
#>   GL000194.1|112792|112850|-      ENSG00000274847.1      ENST00000400754.4
#>                          ...                    ...                    ...
#>     chrY|57212184|57213125|- ENSG00000227159.8_PA.. ENST00000507418.6_PA..
#>     chrY|57213204|57213357|- ENSG00000227159.8_PA.. ENST00000507418.6_PA..
#>     chrY|57213526|57213602|- ENSG00000227159.8_PA.. ENST00000507418.6_PA..
#>     chrY|57213880|57213964|- ENSG00000227159.8_PA.. ENST00000507418.6_PA..
#>     chrY|57214350|57214397|- ENSG00000227159.8_PA.. ENST00000507418.6_PA..
#>                                           gene_type   gene_name
#>                                         <character> <character>
#>     GL000009.2|56140|58376|-         protein_coding  BX004987.1
#>     GL000194.1|53594|54832|-         protein_coding  AC145212.1
#>     GL000194.1|55446|55676|-         protein_coding  AC145212.1
#>     GL000194.1|53590|55676|-         protein_coding  AC145212.2
#>   GL000194.1|112792|112850|-         protein_coding  AC145212.1
#>                          ...                    ...         ...
#>     chrY|57212184|57213125|- unprocessed_pseudogene    DDX11L16
#>     chrY|57213204|57213357|- unprocessed_pseudogene    DDX11L16
#>     chrY|57213526|57213602|- unprocessed_pseudogene    DDX11L16
#>     chrY|57213880|57213964|- unprocessed_pseudogene    DDX11L16
#>     chrY|57214350|57214397|- unprocessed_pseudogene    DDX11L16
#>                                     transcript_type transcript_name exon_number
#>                                         <character>     <character> <character>
#>     GL000009.2|56140|58376|-         protein_coding  BX004987.1-201           1
#>     GL000194.1|53594|54832|-         protein_coding  AC145212.1-201           4
#>     GL000194.1|55446|55676|-         protein_coding  AC145212.1-201           3
#>     GL000194.1|53590|55676|-         protein_coding  AC145212.2-201           3
#>   GL000194.1|112792|112850|-         protein_coding  AC145212.1-201           2
#>                          ...                    ...             ...         ...
#>     chrY|57212184|57213125|- unprocessed_pseudogene    DDX11L16-001           5
#>     chrY|57213204|57213357|- unprocessed_pseudogene    DDX11L16-001           4
#>     chrY|57213526|57213602|- unprocessed_pseudogene    DDX11L16-001           3
#>     chrY|57213880|57213964|- unprocessed_pseudogene    DDX11L16-001           2
#>     chrY|57214350|57214397|- unprocessed_pseudogene    DDX11L16-001           1
#>                                        exon_id       level        protein_id
#>                                    <character> <character>       <character>
#>     GL000009.2|56140|58376|- ENSE00003753029.1           3 ENSP00000484918.1
#>     GL000194.1|53594|54832|- ENSE00002218789.2           3 ENSP00000478910.1
#>     GL000194.1|55446|55676|- ENSE00003714436.1           3 ENSP00000478910.1
#>     GL000194.1|53590|55676|- ENSE00003723764.1           3 ENSP00000483280.1
#>   GL000194.1|112792|112850|- ENSE00003713687.1           3 ENSP00000478910.1
#>                          ...               ...         ...               ...
#>     chrY|57212184|57213125|- ENSE00002023900.1           2              <NA>
#>     chrY|57213204|57213357|- ENSE00002036959.1           2              <NA>
#>     chrY|57213526|57213602|- ENSE00002021169.1           2              <NA>
#>     chrY|57213880|57213964|- ENSE00002046926.1           2              <NA>
#>     chrY|57214350|57214397|- ENSE00002072208.1           2              <NA>
#>                              transcript_support_level         tag
#>                                           <character> <character>
#>     GL000009.2|56140|58376|-                       NA       basic
#>     GL000194.1|53594|54832|-                        1       basic
#>     GL000194.1|55446|55676|-                        1       basic
#>     GL000194.1|53590|55676|-                        1       basic
#>   GL000194.1|112792|112850|-                        1       basic
#>                          ...                      ...         ...
#>     chrY|57212184|57213125|-                       NA         PAR
#>     chrY|57213204|57213357|-                       NA         PAR
#>     chrY|57213526|57213602|-                       NA         PAR
#>     chrY|57213880|57213964|-                       NA         PAR
#>     chrY|57214350|57214397|-                       NA         PAR
#>                                     recount_exon_id          havana_gene
#>                                         <character>          <character>
#>     GL000009.2|56140|58376|- GL000009.2|56140|583..                 <NA>
#>     GL000194.1|53594|54832|- GL000194.1|53594|548..                 <NA>
#>     GL000194.1|55446|55676|- GL000194.1|55446|556..                 <NA>
#>     GL000194.1|53590|55676|- GL000194.1|53590|556..                 <NA>
#>   GL000194.1|112792|112850|- GL000194.1|112792|11..                 <NA>
#>                          ...                    ...                  ...
#>     chrY|57212184|57213125|- chrY|57212184|572131.. OTTHUMG00000022678.1
#>     chrY|57213204|57213357|- chrY|57213204|572133.. OTTHUMG00000022678.1
#>     chrY|57213526|57213602|- chrY|57213526|572136.. OTTHUMG00000022678.1
#>     chrY|57213880|57213964|- chrY|57213880|572139.. OTTHUMG00000022678.1
#>     chrY|57214350|57214397|- chrY|57214350|572143.. OTTHUMG00000022678.1
#>                                 havana_transcript         ont      ccdsid
#>                                       <character> <character> <character>
#>     GL000009.2|56140|58376|-                 <NA>        <NA>        <NA>
#>     GL000194.1|53594|54832|-                 <NA>        <NA>        <NA>
#>     GL000194.1|55446|55676|-                 <NA>        <NA>        <NA>
#>     GL000194.1|53590|55676|-                 <NA>        <NA>        <NA>
#>   GL000194.1|112792|112850|-                 <NA>        <NA>        <NA>
#>                          ...                  ...         ...         ...
#>     chrY|57212184|57213125|- OTTHUMT00000058841.1 PGO:0000005        <NA>
#>     chrY|57213204|57213357|- OTTHUMT00000058841.1 PGO:0000005        <NA>
#>     chrY|57213526|57213602|- OTTHUMT00000058841.1 PGO:0000005        <NA>
#>     chrY|57213880|57213964|- OTTHUMT00000058841.1 PGO:0000005        <NA>
#>     chrY|57214350|57214397|- OTTHUMT00000058841.1 PGO:0000005        <NA>
#>   -------
#>   seqinfo: 374 sequences from an unspecified genome; no seqlengths

## Exon ids are repeated because a given exon can be part of
## more than one transcript
length(unique(rowRanges(rse_exon_SRP009615)$exon_id))
#> [1] 742049

Because there are many more exons than genes, this type of analysis uses more computational resources. Thus, for some large projects you might need to use a high performance computing environment. To help you proceed with caution, create_rse() shows how many features and samples it’s trying to access. So if you get an out of memory error, you’ll know why that happened.

## Check how much memory the gene and exon RSE objects use
pryr::object_size(rse_gene_SRP009615)
#> 30.94 MB
pryr::object_size(rse_exon_SRP009615)
#> 528.18 MB

Exon-exon junctions

In recount3 we have also provided the option to create RSE files for exon-exon junctions. Unlike the gene/exon RSE files, only the junctions present in a given project are included in the files, so you’ll have to be more careful when merging exon-exon junction RSE files. Furthermore, these are actual read counts and not raw base-pair counts. Given the sparsity of the data, the counts are provided using a sparse matrix object from Matrix. Thus exon-exon junction files can be less memory demanding than the exon RSE files.

## Create a RSE object at the exon-exon junction level
rse_jxn_SRP009615 <- create_rse(
    proj_info,
    type = "jxn"
)
#> 2024-11-19 04:14:27.814055 downloading and reading the metadata.
#> 2024-11-19 04:14:28.347004 caching file sra.sra.SRP009615.MD.gz.
#> 2024-11-19 04:14:28.743799 caching file sra.recount_project.SRP009615.MD.gz.
#> 2024-11-19 04:14:29.137834 caching file sra.recount_qc.SRP009615.MD.gz.
#> 2024-11-19 04:14:29.530861 caching file sra.recount_seq_qc.SRP009615.MD.gz.
#> 2024-11-19 04:14:29.946315 caching file sra.recount_pred.SRP009615.MD.gz.
#> 2024-11-19 04:14:30.000928 downloading and reading the feature information.
#> 2024-11-19 04:14:30.453603 caching file sra.junctions.SRP009615.ALL.RR.gz.
#> adding rname 'http://duffel.rail.bio/recount3/human/data_sources/sra/junctions/15/SRP009615/sra.junctions.SRP009615.ALL.RR.gz'
#> 2024-11-19 04:14:33.55437 downloading and reading the counts: 12 samples across 281448 features.
#> 2024-11-19 04:14:34.025664 caching file sra.junctions.SRP009615.ALL.MM.gz.
#> adding rname 'http://duffel.rail.bio/recount3/human/data_sources/sra/junctions/15/SRP009615/sra.junctions.SRP009615.ALL.MM.gz'
#> 2024-11-19 04:14:35.967826 matching exon-exon junction counts with the metadata.
#> 2024-11-19 04:14:36.480661 caching file sra.junctions.SRP009615.ALL.ID.gz.
#> adding rname 'http://duffel.rail.bio/recount3/human/data_sources/sra/junctions/15/SRP009615/sra.junctions.SRP009615.ALL.ID.gz'
#> 2024-11-19 04:14:37.316421 constructing the RangedSummarizedExperiment (rse) object.

## Explore the resulting RSE exon-exon junctions object
rse_jxn_SRP009615
#> class: RangedSummarizedExperiment 
#> dim: 281448 12 
#> metadata(9): time_created recount3_version ... jxn_format recount3_url
#> assays(1): counts
#> rownames(281448): chr1:11845-12009:+ chr1:12698-13220:+ ...
#>   chrY:56848810-56851543:- chrY:56850515-56850921:+
#> rowData names(6): length annotated ... left_annotated right_annotated
#> colnames(12): SRR387777 SRR387778 ... SRR389077 SRR389078
#> colData names(175): rail_id external_id ...
#>   recount_pred.curated.cell_line BigWigURL
dim(rse_jxn_SRP009615)
#> [1] 281448     12

## Exon-exon junction information
rowRanges(rse_jxn_SRP009615)
#> GRanges object with 281448 ranges and 6 metadata columns:
#>                            seqnames            ranges strand |    length
#>                               <Rle>         <IRanges>  <Rle> | <integer>
#>         chr1:11845-12009:+     chr1       11845-12009      + |       165
#>         chr1:12698-13220:+     chr1       12698-13220      + |       523
#>        chr1:14696-185174:-     chr1      14696-185174      - |    170479
#>         chr1:14830-14969:-     chr1       14830-14969      - |       140
#>         chr1:14830-15020:-     chr1       14830-15020      - |       191
#>                        ...      ...               ...    ... .       ...
#>   chrY:56846131-56846553:+     chrY 56846131-56846553      + |       423
#>   chrY:56846268-56846553:+     chrY 56846268-56846553      + |       286
#>   chrY:56846486-56846553:+     chrY 56846486-56846553      + |        68
#>   chrY:56848810-56851543:-     chrY 56848810-56851543      - |      2734
#>   chrY:56850515-56850921:+     chrY 56850515-56850921      + |       407
#>                            annotated  left_motif right_motif
#>                            <integer> <character> <character>
#>         chr1:11845-12009:+         0          GT          AG
#>         chr1:12698-13220:+         1          GT          AG
#>        chr1:14696-185174:-         0          CT          AC
#>         chr1:14830-14969:-         1          CT          AC
#>         chr1:14830-15020:-         0          CT          AC
#>                        ...       ...         ...         ...
#>   chrY:56846131-56846553:+         0          GT          AG
#>   chrY:56846268-56846553:+         0          GT          AG
#>   chrY:56846486-56846553:+         0          GT          AG
#>   chrY:56848810-56851543:-         0          CT          AC
#>   chrY:56850515-56850921:+         0          GT          AG
#>                                    left_annotated        right_annotated
#>                                       <character>            <character>
#>         chr1:11845-12009:+                      0              aC19,sG19
#>         chr1:12698-13220:+ aC19,gC19,gC24,gC25,.. aC19,cH38,gC19,gC24,..
#>        chr1:14696-185174:-                      0                      0
#>         chr1:14830-14969:- aC19,cH38,gC19,kG19,.. aC19,cH38,gC19,kG19,..
#>         chr1:14830-15020:- aC19,cH38,gC19,kG19,..                      0
#>                        ...                    ...                    ...
#>   chrY:56846131-56846553:+                      0                      0
#>   chrY:56846268-56846553:+                      0                      0
#>   chrY:56846486-56846553:+                      0                      0
#>   chrY:56848810-56851543:-                      0                      0
#>   chrY:56850515-56850921:+                      0                      0
#>   -------
#>   seqinfo: 97 sequences from an unspecified genome; no seqlengths

## Memory used
pryr::object_size(rse_jxn_SRP009615)
#> 60.30 MB

BigWig files

Internally we used GenomicFeatures::exonicParts() when processing all annotations in recount3 instead of GenomicRanges::disjoin() that was used in recount2. We then re-assembled the counts for each exon/gene to create the count files provided in recount3. However, you might want to exclude the overlapping exonic parts from the counts. If that’s the case or if you are interested in specific regions of the hg38/mm10 genomes, you might want to access the coverage BigWig files.

## BigWig file names
## The full URL is stored in BigWigUrl
basename(head(rse_gene_SRP009615$BigWigURL))
#> [1] "sra.base_sums.SRP009615_SRR387777.ALL.bw"
#> [2] "sra.base_sums.SRP009615_SRR387778.ALL.bw"
#> [3] "sra.base_sums.SRP009615_SRR387779.ALL.bw"
#> [4] "sra.base_sums.SRP009615_SRR387780.ALL.bw"
#> [5] "sra.base_sums.SRP009615_SRR389079.ALL.bw"
#> [6] "sra.base_sums.SRP009615_SRR389080.ALL.bw"

These BigWig files can be accessed using rtracklayer::import.bw() from R, or other tools such as bwtool that we’ve used in the past 7. Using them, you can compute a coverage matrix for a given set of regions.

One new software we developed is megadepth for which we have provided an R/Bioconductor package interface. megadepth is faster at accessing BigWig files and is the software we used internally for generating the recount3 data. megadepth provides convenient to use functions for quantifying a set of regions, which might be of interest for co-expression analyses where double counting exonic parts can be problematic.

You can also use derfinder and derfinderPlot if you are interested in visualizing the base-pair coverage data for a specific region using these BigWig coverage files.

Local files

recount3 depends on BiocFileCache (Shepherd and Morgan, 2024) for organizing the raw files and caching them, such that if you use the same file more than once, you only have to download it once. BiocFileCache will automatically update the file if it detects that the file has changed in the source

If you want to inspect which files you have downloaded or even delete them, them you’ll want to use recount3_cache_files() and recount3_cache_rm() as illustrated below.

## List the URLs of the recount3 data that you have downloaded
recount3_cache_files()
#>  [1] "http://duffel.rail.bio/recount3/human/data_sources/sra/metadata/sra.recount_project.MD.gz"                       
#>  [2] "http://duffel.rail.bio/recount3/human/data_sources/gtex/metadata/gtex.recount_project.MD.gz"                     
#>  [3] "http://duffel.rail.bio/recount3/human/data_sources/tcga/metadata/tcga.recount_project.MD.gz"                     
#>  [4] "http://duffel.rail.bio/recount3/human/data_sources/sra/metadata/15/SRP009615/sra.sra.SRP009615.MD.gz"            
#>  [5] "http://duffel.rail.bio/recount3/human/data_sources/sra/metadata/15/SRP009615/sra.recount_project.SRP009615.MD.gz"
#>  [6] "http://duffel.rail.bio/recount3/human/data_sources/sra/metadata/15/SRP009615/sra.recount_qc.SRP009615.MD.gz"     
#>  [7] "http://duffel.rail.bio/recount3/human/data_sources/sra/metadata/15/SRP009615/sra.recount_seq_qc.SRP009615.MD.gz" 
#>  [8] "http://duffel.rail.bio/recount3/human/data_sources/sra/metadata/15/SRP009615/sra.recount_pred.SRP009615.MD.gz"   
#>  [9] "http://duffel.rail.bio/recount3/human/annotations/gene_sums/human.gene_sums.G026.gtf.gz"                         
#> [10] "http://duffel.rail.bio/recount3/human/data_sources/sra/gene_sums/15/SRP009615/sra.gene_sums.SRP009615.G026.gz"   
#> [11] "http://duffel.rail.bio/recount3/human/annotations/exon_sums/human.exon_sums.G026.gtf.gz"                         
#> [12] "http://duffel.rail.bio/recount3/human/data_sources/sra/exon_sums/15/SRP009615/sra.exon_sums.SRP009615.G026.gz"   
#> [13] "http://duffel.rail.bio/recount3/human/data_sources/sra/junctions/15/SRP009615/sra.junctions.SRP009615.ALL.RR.gz" 
#> [14] "http://duffel.rail.bio/recount3/human/data_sources/sra/junctions/15/SRP009615/sra.junctions.SRP009615.ALL.MM.gz" 
#> [15] "http://duffel.rail.bio/recount3/human/data_sources/sra/junctions/15/SRP009615/sra.junctions.SRP009615.ALL.ID.gz"
## Delete the recount3 files from your cache
recount3_cache_rm()

## Check that no files are listed
recount3_cache_files()

Your own mirror

recount3 functions such as create_rse() have a recount3_url argument that can be changed to point to a mirror or to a path in your computing system. This argument enables using recount3 with data stored in other locations, or even generated using the same processing pipeline that was used for recount3 but for your own/private data.

The main documentation website documents how the raw files should be organized in your mirror or for your own data. You can inspect the structure of the data by checking the internals of locate_url() and locate_url_ann(). Both functions can be used to get the full list of URLs. In addition, for a given mirror, available_projects() will show the local data sources and collections. Finally, file_retrieve() won’t cache the data if it detects that the data already exists in the local disk.

In particular, this functionality can be useful if you want to access the data through Registry of Open Data on AWS or at IDIES using SciServer Compute, which are the two official mirrors for recount3 data.

Teams involved

The ReCount family involves the following teams:

Project history

To clarify the relationship between the R/Bioconductor packages and the phases of ReCount (Frazee, Langmead, and Leek, 2011) please check the table below:

Year Phase Main references R/Bioconductor
2011 ReCount DOI: 10.1186/1471-2105-12-449 none
2017 recount2 DOI: 10.1038/nbt.3838 10.12688/f1000research.12223.1 recount
2021 recount3 DOI: 10.1186/s13059-021-02533-6 recount3

Reproducibility

The recount3 package (Collado-Torres, 2024) was made possible thanks to:

Code for creating the vignette

## Create the vignette
library("rmarkdown")
system.time(render("recount3-quickstart.Rmd", "BiocStyle::html_document"))

## Extract the R code
library("knitr")
knit("recount3-quickstart.Rmd", tangle = TRUE)
## Clean up
file.remove("quickstartRef.bib")
#> [1] TRUE

Date the vignette was generated.

#> [1] "2024-11-19 04:14:38 UTC"

Wallclock time spent generating the vignette.

#> Time difference of 1.313 mins

R session information.

#> ─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.4.2 (2024-10-31)
#>  os       Ubuntu 24.04.1 LTS
#>  system   x86_64, linux-gnu
#>  ui       X11
#>  language (EN)
#>  collate  C
#>  ctype    en_US.UTF-8
#>  tz       Etc/UTC
#>  date     2024-11-19
#>  pandoc   3.2.1 @ /usr/local/bin/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────
#>  package              * version   date (UTC) lib source
#>  abind                  1.4-8     2024-09-12 [2] RSPM (R 4.4.0)
#>  backports              1.5.0     2024-05-23 [2] RSPM (R 4.4.0)
#>  bibtex                 0.5.1     2023-01-26 [2] RSPM (R 4.4.0)
#>  Biobase              * 2.67.0    2024-10-31 [2] https://bioc.r-universe.dev (R 4.4.1)
#>  BiocFileCache          2.15.0    2024-11-19 [2] https://bioc.r-universe.dev (R 4.4.2)
#>  BiocGenerics         * 0.53.3    2024-11-15 [2] https://bioc.r-universe.dev (R 4.4.2)
#>  BiocIO                 1.17.0    2024-10-30 [2] https://bioc.r-universe.dev (R 4.4.1)
#>  BiocManager            1.30.25   2024-08-28 [2] RSPM (R 4.4.0)
#>  BiocParallel           1.41.0    2024-10-30 [2] https://bioc.r-universe.dev (R 4.4.1)
#>  BiocStyle            * 2.35.0    2024-11-19 [2] https://bioc.r-universe.dev (R 4.4.2)
#>  Biostrings             2.75.1    2024-11-07 [2] https://bioc.r-universe.dev (R 4.4.2)
#>  bit                    4.5.0     2024-09-20 [2] RSPM (R 4.4.0)
#>  bit64                  4.5.2     2024-09-22 [2] RSPM (R 4.4.0)
#>  bitops                 1.0-9     2024-10-03 [2] RSPM (R 4.4.0)
#>  blob                   1.2.4     2023-03-17 [2] RSPM (R 4.4.0)
#>  bslib                  0.8.0     2024-07-29 [2] RSPM (R 4.4.0)
#>  buildtools             1.0.0     2024-11-18 [3] local (/pkg)
#>  cachem                 1.1.0     2024-05-16 [2] RSPM (R 4.4.0)
#>  cli                    3.6.3     2024-06-21 [2] RSPM (R 4.4.0)
#>  codetools              0.2-20    2024-03-31 [2] RSPM (R 4.4.0)
#>  crayon                 1.5.3     2024-06-20 [2] RSPM (R 4.4.0)
#>  curl                   6.0.1     2024-11-14 [2] RSPM (R 4.4.0)
#>  data.table             1.16.2    2024-10-10 [2] RSPM (R 4.4.0)
#>  DBI                    1.2.3     2024-06-02 [2] RSPM (R 4.4.0)
#>  dbplyr                 2.5.0     2024-03-19 [2] RSPM (R 4.4.0)
#>  DelayedArray           0.33.2    2024-11-15 [2] https://bioc.r-universe.dev (R 4.4.2)
#>  digest                 0.6.37    2024-08-19 [2] RSPM (R 4.4.0)
#>  dplyr                  1.1.4     2023-11-17 [2] RSPM (R 4.4.0)
#>  evaluate               1.0.1     2024-10-10 [2] RSPM (R 4.4.0)
#>  fansi                  1.0.6     2023-12-08 [2] RSPM (R 4.4.0)
#>  fastmap                1.2.0     2024-05-15 [2] RSPM (R 4.4.0)
#>  filelock               1.0.3     2023-12-11 [2] RSPM (R 4.4.0)
#>  generics             * 0.1.3     2022-07-05 [2] RSPM (R 4.4.0)
#>  GenomeInfoDb         * 1.43.1    2024-11-18 [2] https://bioc.r-universe.dev (R 4.4.2)
#>  GenomeInfoDbData       1.2.13    2024-11-19 [2] Bioconductor
#>  GenomicAlignments      1.43.0    2024-10-30 [2] https://bioc.r-universe.dev (R 4.4.1)
#>  GenomicRanges        * 1.59.1    2024-11-19 [2] https://bioc.r-universe.dev (R 4.4.2)
#>  glue                   1.8.0     2024-09-30 [2] RSPM (R 4.4.0)
#>  htmltools              0.5.8.1   2024-04-04 [2] RSPM (R 4.4.0)
#>  httr                   1.4.7     2023-08-15 [2] RSPM (R 4.4.0)
#>  IRanges              * 2.41.1    2024-11-17 [2] https://bioc.r-universe.dev (R 4.4.2)
#>  jquerylib              0.1.4     2021-04-26 [2] RSPM (R 4.4.0)
#>  jsonlite               1.8.9     2024-09-20 [2] RSPM (R 4.4.0)
#>  knitcitations        * 1.0.12    2021-01-10 [2] RSPM (R 4.4.0)
#>  knitr                  1.49      2024-11-08 [2] RSPM (R 4.4.0)
#>  lattice                0.22-6    2024-03-20 [2] RSPM (R 4.4.0)
#>  lifecycle              1.0.4     2023-11-07 [2] RSPM (R 4.4.0)
#>  lobstr                 1.1.2     2022-06-22 [2] RSPM (R 4.4.0)
#>  lubridate              1.9.3     2023-09-27 [2] RSPM (R 4.4.0)
#>  magrittr               2.0.3     2022-03-30 [2] RSPM (R 4.4.0)
#>  maketools              1.3.1     2024-10-04 [3] RSPM (R 4.4.0)
#>  Matrix                 1.7-1     2024-10-18 [2] RSPM (R 4.4.0)
#>  MatrixGenerics       * 1.19.0    2024-11-06 [2] https://bioc.r-universe.dev (R 4.4.2)
#>  matrixStats          * 1.4.1     2024-09-08 [2] RSPM (R 4.4.0)
#>  memoise                2.0.1     2021-11-26 [2] RSPM (R 4.4.0)
#>  pillar                 1.9.0     2023-03-22 [2] RSPM (R 4.4.0)
#>  pkgconfig              2.0.3     2019-09-22 [2] RSPM (R 4.4.0)
#>  plyr                   1.8.9     2023-10-02 [2] RSPM (R 4.4.0)
#>  prettyunits            1.2.0     2023-09-24 [2] RSPM (R 4.4.0)
#>  pryr                   0.1.6     2023-01-17 [2] RSPM (R 4.4.0)
#>  purrr                  1.0.2     2023-08-10 [2] RSPM (R 4.4.0)
#>  R.methodsS3            1.8.2     2022-06-13 [2] RSPM (R 4.4.0)
#>  R.oo                   1.27.0    2024-11-01 [2] RSPM (R 4.4.0)
#>  R.utils                2.12.3    2023-11-18 [2] RSPM (R 4.4.0)
#>  R6                     2.5.1     2021-08-19 [2] RSPM (R 4.4.0)
#>  Rcpp                   1.0.13-1  2024-11-02 [2] RSPM (R 4.4.0)
#>  RCurl                  1.98-1.16 2024-07-11 [2] RSPM (R 4.4.0)
#>  recount3             * 1.17.0    2024-11-19 [1] https://bioc.r-universe.dev (R 4.4.2)
#>  RefManageR             1.4.0     2022-09-30 [2] RSPM (R 4.4.0)
#>  restfulr               0.0.15    2022-06-16 [2] RSPM (R 4.4.2)
#>  rjson                  0.2.23    2024-09-16 [2] RSPM (R 4.4.0)
#>  rlang                  1.1.4     2024-06-04 [2] RSPM (R 4.4.0)
#>  rmarkdown              2.29      2024-11-04 [2] RSPM (R 4.4.0)
#>  Rsamtools              2.23.0    2024-11-18 [2] https://bioc.r-universe.dev (R 4.4.2)
#>  RSQLite                2.3.8     2024-11-17 [2] RSPM (R 4.4.0)
#>  rtracklayer            1.67.0    2024-11-06 [2] https://bioc.r-universe.dev (R 4.4.2)
#>  S4Arrays               1.7.1     2024-11-18 [2] https://bioc.r-universe.dev (R 4.4.2)
#>  S4Vectors            * 0.45.2    2024-11-16 [2] https://bioc.r-universe.dev (R 4.4.2)
#>  sass                   0.4.9     2024-03-15 [2] RSPM (R 4.4.0)
#>  sessioninfo          * 1.2.2     2021-12-06 [2] RSPM (R 4.4.0)
#>  SparseArray            1.7.2     2024-11-15 [2] https://bioc.r-universe.dev (R 4.4.2)
#>  stringi                1.8.4     2024-05-06 [2] RSPM (R 4.4.0)
#>  stringr                1.5.1     2023-11-14 [2] RSPM (R 4.4.0)
#>  SummarizedExperiment * 1.37.0    2024-10-31 [2] https://bioc.r-universe.dev (R 4.4.1)
#>  sys                    3.4.3     2024-10-04 [2] RSPM (R 4.4.0)
#>  tibble                 3.2.1     2023-03-20 [2] RSPM (R 4.4.0)
#>  tidyselect             1.2.1     2024-03-11 [2] RSPM (R 4.4.0)
#>  timechange             0.3.0     2024-01-18 [2] RSPM (R 4.4.0)
#>  UCSC.utils             1.3.0     2024-10-31 [2] https://bioc.r-universe.dev (R 4.4.1)
#>  utf8                   1.2.4     2023-10-22 [2] RSPM (R 4.4.0)
#>  vctrs                  0.6.5     2023-12-01 [2] RSPM (R 4.4.0)
#>  withr                  3.0.2     2024-10-28 [2] RSPM (R 4.4.0)
#>  xfun                   0.49      2024-10-31 [2] RSPM (R 4.4.0)
#>  XML                    3.99-0.17 2024-06-25 [2] RSPM (R 4.4.0)
#>  xml2                   1.3.6     2023-12-04 [2] RSPM (R 4.4.0)
#>  XVector                0.47.0    2024-10-31 [2] https://bioc.r-universe.dev (R 4.4.1)
#>  yaml                   2.3.10    2024-07-26 [2] RSPM (R 4.4.0)
#>  zlibbioc               1.52.0    2024-10-29 [2] Bioconductor 3.20 (R 4.4.2)
#> 
#>  [1] /tmp/RtmprSvjM2/Rinst22aa62102ff7
#>  [2] /github/workspace/pkglib
#>  [3] /usr/local/lib/R/site-library
#>  [4] /usr/lib/R/site-library
#>  [5] /usr/lib/R/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

Bibliography

This vignette was generated using BiocStyle (Oleś, 2024) with knitr (Xie, 2014) and rmarkdown (Allaire, Xie, Dervieux, McPherson et al., 2024) running behind the scenes.

Citations made with knitcitations (Boettiger, 2021).

[1] J. Allaire, Y. Xie, C. Dervieux, J. McPherson, et al. rmarkdown: Dynamic Documents for R. R package version 2.29. 2024. https://github.com/rstudio/rmarkdown.

[2] S. Balcome. interactiveDisplayBase: Base package for enabling powerful shiny web displays of Bioconductor objects. R package version 1.45.0. 2021.

[3] T. Barrett, M. Dowle, A. Srinivasan, J. Gorecki, et al. data.table: Extension of data.frame. R package version 1.16.2, https://Rdatatable.gitlab.io/data.table, https://github.com/Rdatatable/data.table. 2024. https://r-datatable.com.

[4] D. Bates, M. Maechler, and M. Jagan. Matrix: Sparse and Dense Matrix Classes and Methods. R package version 1.7-1. 2024. https://Matrix.R-forge.R-project.org.

[5] H. Bengtsson. R.utils: Various Programming Utilities. R package version 2.12.3, https://github.com/HenrikBengtsson/R.utils. 2023. https://henrikbengtsson.github.io/R.utils/.

[6] C. Boettiger. knitcitations: Citations for ‘Knitr’ Markdown Files. R package version 1.0.12. 2021. https://github.com/cboettig/knitcitations.

[7] L. Collado-Torres. Explore and download data from the recount3 project. https://github.com/LieberInstitute/recount3 - R package version 1.17.0. 2024. DOI: 10.18129/B9.bioc.recount3. http://www.bioconductor.org/packages/recount3.

[8] L. Collado-Torres, A. Nellore, and A. E. Jaffe. “recount workflow: Accessing over 70,000 human RNA-seq samples with Bioconductor [version 1; referees: 1 approved, 2 approved with reservations]”. In: F1000Research (2017). DOI: 10.12688/f1000research.12223.1. https://f1000research.com/articles/6-1558/v1.

[9] L. Collado-Torres, A. Nellore, K. Kammers, S. E. Ellis, et al. “Reproducible RNA-seq analysis using recount2”. In: Nature Biotechnology (2017). DOI: 10.1038/nbt.3838. http://www.nature.com/nbt/journal/v35/n4/full/nbt.3838.html.

[10] A. C. Frazee, B. Langmead, and J. T. Leek. “ReCount: A multi-experiment resource of analysis-ready RNA-seq gene count datasets”. In: BMC Bioinformatics (2011). DOI: 10.1186/1471-2105-12-449. https://doi.org/10.1186/1471-2105-12-449.

[11] J. Hester. covr: Test Coverage for Packages. R package version 3.6.4, https://github.com/r-lib/covr. 2023. https://covr.r-lib.org.

[12] W. Huber, V. J. Carey, R. Gentleman, S. Anders, et al. “Orchestrating high-throughput genomic analysis with Bioconductor”. In: Nat Methods (2015). DOI: 10.1038/nmeth.3252.

[13] E. Imada, D. F. Sanchez, L. Collado-Torres, C. Wilks, et al. “Recounting the FANTOM CAGE–Associated Transcriptome”. In: Genome Research (2020). DOI: 10.1101/gr.254656.119. https://doi.org/10.1101/gr.254656.119.

[14] M. Lawrence, R. Gentleman, and V. Carey. “rtracklayer: an R package for interfacing with genome browsers”. In: Bioinformatics 25 (2009), pp. 1841-1842. DOI: 10.1093/bioinformatics/btp328. http://bioinformatics.oxfordjournals.org/content/25/14/1841.abstract.

[15] M. Lawrence, W. Huber, H. Pagès, P. Aboyoun, et al. “Software for Computing and Annotating Genomic Ranges”. In: PLoS Computational Biology 9 (8 2013). DOI: 10.1371/journal.pcbi.1003118. http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1003118}.

[16] M. W. McLean. “RefManageR: Import and Manage BibTeX and BibLaTeX References in R”. In: The Journal of Open Source Software (2017). DOI: 10.21105/joss.00338.

[17] M. Morgan, V. Obenchain, J. Hester, and H. Pagès. SummarizedExperiment: SummarizedExperiment container. 2019. DOI: 10.18129/B9.bioc.SummarizedExperiment.

[18] A. Oleś. BiocStyle: Standard styles for vignettes and other Bioconductor documents. R package version 2.35.0. 2024. https://github.com/Bioconductor/BiocStyle.

[19] H. Pagès, M. Lawrence, and P. Aboyoun. S4Vectors: Foundation of vector-like and list-like containers in Bioconductor. R package version 0.45.2. 2024. https://bioconductor.org/packages/S4Vectors.

[20] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing. Vienna, Austria, 2024. https://www.R-project.org/.

[21] A. Razmara, S. E. Ellis, D. J. Sokolowski, S. Davis, et al. “recount-brain: a curated repository of human brain RNA-seq datasets metadata”. In: bioRxiv (2019). DOI: 10.1101/618025. https://doi.org/10.1101/618025.

[22] L. Shepherd and M. Morgan. BiocFileCache: Manage Files Across Sessions. R package version 2.15.0. 2024.

[23] H. Wickham. httr: Tools for Working with URLs and HTTP. R package version 1.4.7, https://github.com/r-lib/httr. 2023. https://httr.r-lib.org/.

[24] H. Wickham. pryr: Tools for Computing on the Language. R package version 0.1.6. 2023. https://github.com/hadley/pryr.

[25] H. Wickham. “testthat: Get Started with Testing”. In: The R Journal 3 (2011), pp. 5-10. https://journal.r-project.org/archive/2011-1/RJournal_2011-1_Wickham.pdf.

[26] H. Wickham, W. Chang, R. Flight, K. Müller, et al. sessioninfo: R Session Information. R package version 1.2.2, https://r-lib.github.io/sessioninfo/. 2021. https://github.com/r-lib/sessioninfo#readme.

[27] C. Wilks, S. C. Zheng, F. Y. Chen, R. Charles, et al. “recount3: summaries and queries for large-scale RNA-seq expression and splicing”. In: Genome Biol (2021). DOI: 10.1186/s13059-021-02533-6. https://doi.org/10.1186/s13059-021-02533-6.

[28] Y. Xie. “knitr: A Comprehensive Tool for Reproducible Research in R”. In: Implementing Reproducible Computational Research. Ed. by V. Stodden, F. Leisch and R. D. Peng. ISBN 978-1466561595. Chapman and Hall/CRC, 2014.


  1. GTEx and TCGA are broken up by tissue as described later in this vignette.↩︎

  2. These are 92 control spike-in sequences that are commonly used in bulk RNA-seq projects.↩︎

  3. 69 controls sequences.↩︎

  4. This new design allows us to couple the expression data with metadata on the fly, as well as have flexibility in case we uncover an error in the files.↩︎

  5. Although ERCC and SIRV are technically not genes.↩︎

  6. Check the BigWig files section further below.↩︎

  7. For example in recount.bwtool .↩︎