This vignette provides the implementation of the procedure described in point 7 of our Guidelines for RNA-Seq data analysis1 protocol available from the Epigenesys website.
Briefly, it details the step necessary to: 1. create a non-redundant annotation
count reads per feature
pre-analyse the data, i.e. assess the pertinence of the samples’ charateristics in the light of their biological provenance; i.e. in other words perform a so called “biological QA” using assessment methods such as Principal Component Analysis, Multi-dimensional Scaling, Hierarcical Clustering, etc.
The aim of this vignette is to go through these steps using the easyRNASeq package, hence the rationale of the implementation will not be discussed, albeit relevant litterature will be pointed at when necessarry.
Throughout this vignette we are going to replicate the analysis conducted in Robinson, Delhomme et al.(Robinson et al. 2014), a study looking at sexual dimorphism in Eurasian aspen.
To perform the listed steps, we need to instantiate a number of objects to store the minimal set of parameters describing the conducted RNA-Seq experiment, e.g. the BAM files location, the annotation location and type, the sequencing parameters, etc.
To get started with this process, we load the package into our R session:
library(easyRNASeq)
before instantiating an AnnotParam object informing on the location and type of the annotation to be used.
The AnnotParam class is meant to store the minimal set of information necessary to retrieve the annotation
The minimal information to provide is:
In this tutorial, we will reproduce the analysis performed in Robinson, Delhomme et al. (Robinson et al. 2014). For that we will start by downloading the original annotation gff3 file for P. trichocarpa, a close related species of the trees used in the study into the current directory.
download.file(url=paste0("ftp://ftp.plantgenie.org/Data/PopGenIE/",
"Populus_trichocarpa/v3.0/v10.1/GFF3/",
"Ptrichocarpa_210_v3.0_gene_exons.gff3.gz"),
destfile=,"./Ptrichocarpa_210_v3.0_gene_exons.gff3.gz")
## [1] TRUE
Before instantiating an “AnnotParam” object.
annotParam <- AnnotParam(
datasource="./Ptrichocarpa_210_v3.0_gene_exons.gff3.gz")
This annotation file however contains multiple copy of the same exons, i.e. when exons are shared by several isoforms of a gene. This might result in so-called “multiple-counting” and as described in these guidelines2, we will to circumvent that issue create a set of synthetic transcripts.
One major caveat of estimating gene expression using aligned RNA-Seq reads is that a single read, which originated from a single mRNA molecule, might sometimes align to several features (e.g. transcripts or genes) with alignments of equivalent quality.
This, for example, might happen as a result of gene duplication and the presence of repetitive or common domains. To avoid counting unique mRNA fragments multiple times, the stringent approach is to keep only uniquely mapping reads - being aware of potential consequences, see the note below.
Not only can multiple counting arise from a biological reason, but also from technical artifacts, introduced mostly by poorly formatted gff3/gtf annotation files. To avoid this, it is best practice to adopt a conservative approach by collapsing all existing transcripts of a single gene locus into a synthetic transcript containing every exon of that gene. In the case of overlapping exons, the longest genomic interval is kept, i.e. an artificial exon is created. This process results in a flattened transcript: a gene structure with a one to one relationship.
To create such a structure, we use the createSyntheticTranscripts function on the file we just downloaded, simply by passing our annotParam object as argument.
annotParam <- createSyntheticTranscripts(annotParam,verbose=FALSE)
This function returns an updated annotParam object that contains the newly created, flattened transcript annotation. This object can then be saved as an rda file for later re-use or for sharing with collaborators.
save(annotParam,
file="./Ptrichocarpa_210_v3.0_gene_exons_synthetic-transcripts_annotParam.rda")
Instead of updating the annotParam object, we could have created an object of class Genome_Intervals from the genomeIntervals package, using the same function but using the actual datasource of the previous annotParam object as argument rather than the object itself.
gI <- createSyntheticTranscripts(
"./Ptrichocarpa_210_v3.0_gene_exons.gff3.gz",
verbose=FALSE)
This gI object can then be exported as a gff3 file.
writeGff3(gI,
file="./Ptrichocarpa_210_v3.0_gene_exons_synthetic-transcripts.gff3.gz")
Note: Ignoring multi-mapping reads may introduce biases in the read counts of some genes (such as that of paralogs or of very conserved gene families), but in the context of a conservative first analysis we are of the current opinion that they are best ignored. One should of course assess how many reads are multi-mapping (check for example the STAR output) and possibly extract them from the alignment read file to visualize them using a genome browser so as to understand where they are located and how they may affect any analysis. Based on this, one may, at a later stage, decide to relax the counting parameters to accept multi-mapping reads.
High throughput sequencing RNA-Seq data comes in a multitude of flavours, i.e. even from a single provider, protocol - e.g. strand specific, paired-end - reads characteristics - e.g. read length - will vary.
The easyRNASeq simpleRNASeq method will infer these information based on excerpts sampled from the data. However, it is always best to provide these information, as 1. the inference is done on small excerpt and can fail 2. it is always good to document an analysis
By default easyRNASeq simpleRNASeq will keep the inferred parameters over the user-provided parameters if these do not agree and emit corresponding warnings. The choice to rely on inferred parameters over user-provided one is to enforce user to cross-validate their knowledge of the data characteristics, as these are crucial for an adequate processing. Remember GIGO3.
If the automatic inference does fail, please let me know, so that I optimise it. Meanwhile, you can use the override argument to enforce the use of user-passed parameters.
To reproduce the results from Robinson, Delhomme et al. (Robinson et al. 2014), we first need to download an excerpt of the data.
We first retrieve the file listing and md5 codes
download.file(url=paste0("ftp://ftp.plantgenie.org/Tutorials/RnaSeqTutorial/",
"data/star/md5.txt"),
destfile="md5.txt")
In this part of the vignette, we will NOT process all the data, albeit it would be possible, but for the sake of brevity, we will only retrieve the six first datasets. We get these from the sample information contained within this package.
data(RobinsonDelhomme2014)
lapply(RobinsonDelhomme2014[1:6,"Filename"],function(f){
# BAM files
download.file(url=paste0("ftp://ftp.plantgenie.org/Tutorials/",
"RnaSeqTutorial/data/star/",f),
destfile=as.character(f))
# BAM index files
download.file(url=paste0("ftp://ftp.plantgenie.org/Tutorials/",
"RnaSeqTutorial/data/star/",f,".bai"),
destfile=as.character(paste0(f,".bai")))
})
## [[1]]
## [1] TRUE
##
## [[2]]
## [1] TRUE
##
## [[3]]
## [1] TRUE
##
## [[4]]
## [1] TRUE
##
## [[5]]
## [1] TRUE
##
## [[6]]
## [1] TRUE
These six files - as the rest of the dataset - have been sequenced on an Illumina HiSeq 2500 in paired-end mode using a non-strand specific library protocol with a read length of 100 bp. The raw data have been processed as described in the aforementioned guidelines4 and as such have been filtered for rRNA sequences, trimmed for adapters and clipped for quality. The resulting reads (of length 50-100bp) have then been aligned using STAR [Dobin:2013p5293].
Using these information, we finally generate the BamParam object.
bamParam <- BamParam(paired = TRUE,
stranded = FALSE)
A third parameter yieldSize can be set to speed up the processing on multi-CPU or multi-core computers. It splits and processed the BAM files in chunk of size yieldSize with a default of 1M reads.
Finally, we create the list of BAM files we just retrieved.
bamFiles <- getBamFileList(dir(".","*\\.bam$"),
dir(".","*\\.bai$"))
The final set of parameters we need to define encapsulate the AnnotParam and BamParam and detail how the read summarization should be performed. simpleRNASeq supports A) 2 modes of counting:
by read
by bp
the latter of which, was the default counting method the easyRNASeq function. Due to the more complex implementation required, the non-evidence of increase in counting accuracy and the extended support of the read-based approach by the mainstream, standardised Bioconductor package has led the read method to be the default in simpleRNASeq. Due to lack of time for maintenance and improvement, the bp-based method is also not recommended.
over B) 4 feature types: exon, transcript, gene or any feature provided by the user. The latter may be for example used for counting reads in promoter regions.
Given a flattened transcript structure - as created in a previous section - summarizing by transcripts or genes is equivalent. Note that using a non flattened annotation with any feature type will result in multiple counting!! i.e. the product of a single mRNA fragment will be counted for every features it overlap, hence introducing a significant bias in downstream analyses.
Given a flattened transcript structure, summarizing by exon enables the use of the resulting count table for processes such as differential exon usage analyses, as implemented in the DEXSeq package.
For the Robinson, Delhomme et al. dataset, we are interested in the gene expression, hence we create our RnaSeqParam object as follows:
rnaSeqParam <- RnaSeqParam(annotParam = annotParam,
bamParam = bamParam,
countBy = "genes",
precision = "read")
We can now run the quantification
sexp1 <- simpleRNASeq(bamFiles=bamFiles,
param=rnaSeqParam,
verbose=TRUE)
## ==========================
## simpleRNASeq version 2.43.0
## ==========================
## Creating a RangedSummarizedExperiment.
## ==========================
## Processing the alignments.
## ==========================
## Pre-processing 6 BAM files.
## Validating the BAM files.
## Extracted 1446 reference sequences information.
## Extracting parameter from 202_subset_sortmerna_trimmomatic_sorted.bam
## Extracting parameter from 207_subset_sortmerna_trimmomatic_sorted.bam
## Extracting parameter from 213.1_subset_sortmerna_trimmomatic_sorted.bam
## Extracting parameter from 221_subset_sortmerna_trimmomatic_sorted.bam
## Extracting parameter from 226.1_subset_sortmerna_trimmomatic_sorted.bam
## Extracting parameter from 229.1_subset_sortmerna_trimmomatic_sorted.bam
## Found 0 single-end BAM files.
## Found 6 paired-end BAM files.
## Bam file: 202_subset_sortmerna_trimmomatic_sorted.bam has reads of length 50bp-101bp
## Bam file: 207_subset_sortmerna_trimmomatic_sorted.bam has reads of length 50bp-101bp
## Bam file: 213.1_subset_sortmerna_trimmomatic_sorted.bam has reads of length 50bp-101bp
## Bam file: 221_subset_sortmerna_trimmomatic_sorted.bam has reads of length 50bp-101bp
## Bam file: 226.1_subset_sortmerna_trimmomatic_sorted.bam has reads of length 50bp-101bp
## Bam file: 229.1_subset_sortmerna_trimmomatic_sorted.bam has reads of length 50bp-101bp
## Warning in FUN(X[[i]], ...): Bam file:
## 202_subset_sortmerna_trimmomatic_sorted.bam is considered unstranded.
## Warning in FUN(X[[i]], ...): Bam file:
## 202_subset_sortmerna_trimmomatic_sorted.bam Strandedness could not be
## determined using 4492 regions spanning 1379593 bp on either strand at a 90%
## cutoff; 23.35 percent appear to be stranded.
## Warning in FUN(X[[i]], ...): Bam file:
## 207_subset_sortmerna_trimmomatic_sorted.bam is considered unstranded.
## Warning in FUN(X[[i]], ...): Bam file:
## 207_subset_sortmerna_trimmomatic_sorted.bam Strandedness could not be
## determined using 4706 regions spanning 1462563 bp on either strand at a 90%
## cutoff; 20.61 percent appear to be stranded.
## Warning in FUN(X[[i]], ...): Bam file:
## 213.1_subset_sortmerna_trimmomatic_sorted.bam is considered unstranded.
## Warning in FUN(X[[i]], ...): Bam file:
## 213.1_subset_sortmerna_trimmomatic_sorted.bam Strandedness could not be
## determined using 4457 regions spanning 1338599 bp on either strand at a 90%
## cutoff; 23.02 percent appear to be stranded.
## Warning in FUN(X[[i]], ...): Bam file:
## 221_subset_sortmerna_trimmomatic_sorted.bam is considered unstranded.
## Warning in FUN(X[[i]], ...): Bam file:
## 221_subset_sortmerna_trimmomatic_sorted.bam Strandedness could not be
## determined using 4458 regions spanning 1368624 bp on either strand at a 90%
## cutoff; 23.1 percent appear to be stranded.
## Warning in FUN(X[[i]], ...): Bam file:
## 226.1_subset_sortmerna_trimmomatic_sorted.bam is considered unstranded.
## Warning in FUN(X[[i]], ...): Bam file:
## 226.1_subset_sortmerna_trimmomatic_sorted.bam Strandedness could not be
## determined using 4431 regions spanning 1368899 bp on either strand at a 90%
## cutoff; 22.3 percent appear to be stranded.
## Warning in FUN(X[[i]], ...): Bam file:
## 229.1_subset_sortmerna_trimmomatic_sorted.bam is considered unstranded.
## Warning in FUN(X[[i]], ...): Bam file:
## 229.1_subset_sortmerna_trimmomatic_sorted.bam Strandedness could not be
## determined using 4378 regions spanning 1346291 bp on either strand at a 90%
## cutoff; 19.96 percent appear to be stranded.
## Warning in simpleRNASeq(bamFiles = bamFiles, param = rnaSeqParam, verbose =
## TRUE): As of version 2.15.5, easyRNASeq assumes that, if the data is strand
## specific, the sequencing was done using a protocol such as the Illumina TruSeq,
## where the reverse strand is quantified - i.e. the strandProtocol argument of
## the BamParam class defaults to 'reverse'.
## Streaming 202_subset_sortmerna_trimmomatic_sorted.bam
## Read 44496 reads
## Streaming 207_subset_sortmerna_trimmomatic_sorted.bam
## Read 49318 reads
## Streaming 213.1_subset_sortmerna_trimmomatic_sorted.bam
## Read 43300 reads
## Streaming 221_subset_sortmerna_trimmomatic_sorted.bam
## Read 47521 reads
## Streaming 226.1_subset_sortmerna_trimmomatic_sorted.bam
## Read 45860 reads
## Streaming 229.1_subset_sortmerna_trimmomatic_sorted.bam
## Read 49077 reads
## Bam file: 202_subset_sortmerna_trimmomatic_sorted.bam has 44496 reads.
## Bam file: 207_subset_sortmerna_trimmomatic_sorted.bam has 49318 reads.
## Bam file: 213.1_subset_sortmerna_trimmomatic_sorted.bam has 43300 reads.
## Bam file: 221_subset_sortmerna_trimmomatic_sorted.bam has 47521 reads.
## Bam file: 226.1_subset_sortmerna_trimmomatic_sorted.bam has 45860 reads.
## Bam file: 229.1_subset_sortmerna_trimmomatic_sorted.bam has 49077 reads.
## ==========================
## Processing the annotation
## ==========================
## Validating the annotation source
## No validation performed at that stage
## Fetching the annotation
## Using the provided annotation as such
## ==========================
## Sanity checking
## ==========================
## ==========================
## Creating the count table
## ==========================
## Using 1 CPU core
## Streaming 202_subset_sortmerna_trimmomatic_sorted.bam
## The data is paired-end; only valid mates will be kept.
## The data is unstranded; overlapping features will be ignored.
## The summarization is by 'read' and the mode is: Union.
## Warning in .make_GAlignmentPairs_from_GAlignments(gal, strandMode = strandMode, : 542 alignments with ambiguous pairing were dumped.
## Use 'getDumpedAlignments()' to retrieve them from the dump environment.
## Processing 21977 reads
## Done with 202_subset_sortmerna_trimmomatic_sorted.bam
## Streaming 207_subset_sortmerna_trimmomatic_sorted.bam
## The data is paired-end; only valid mates will be kept.
## The data is unstranded; overlapping features will be ignored.
## The summarization is by 'read' and the mode is: Union.
## Warning in .make_GAlignmentPairs_from_GAlignments(gal, strandMode = strandMode, : 816 alignments with ambiguous pairing were dumped.
## Use 'getDumpedAlignments()' to retrieve them from the dump environment.
## Processing 24251 reads
## Done with 207_subset_sortmerna_trimmomatic_sorted.bam
## Streaming 213.1_subset_sortmerna_trimmomatic_sorted.bam
## The data is paired-end; only valid mates will be kept.
## The data is unstranded; overlapping features will be ignored.
## The summarization is by 'read' and the mode is: Union.
## Warning in .make_GAlignmentPairs_from_GAlignments(gal, strandMode = strandMode, : 476 alignments with ambiguous pairing were dumped.
## Use 'getDumpedAlignments()' to retrieve them from the dump environment.
## Processing 21412 reads
## Done with 213.1_subset_sortmerna_trimmomatic_sorted.bam
## Streaming 221_subset_sortmerna_trimmomatic_sorted.bam
## The data is paired-end; only valid mates will be kept.
## The data is unstranded; overlapping features will be ignored.
## The summarization is by 'read' and the mode is: Union.
## Warning in .make_GAlignmentPairs_from_GAlignments(gal, strandMode = strandMode, : 680 alignments with ambiguous pairing were dumped.
## Use 'getDumpedAlignments()' to retrieve them from the dump environment.
## Processing 23420 reads
## Done with 221_subset_sortmerna_trimmomatic_sorted.bam
## Streaming 226.1_subset_sortmerna_trimmomatic_sorted.bam
## The data is paired-end; only valid mates will be kept.
## The data is unstranded; overlapping features will be ignored.
## The summarization is by 'read' and the mode is: Union.
## Warning in .make_GAlignmentPairs_from_GAlignments(gal, strandMode = strandMode, : 532 alignments with ambiguous pairing were dumped.
## Use 'getDumpedAlignments()' to retrieve them from the dump environment.
## Processing 22664 reads
## Done with 226.1_subset_sortmerna_trimmomatic_sorted.bam
## Streaming 229.1_subset_sortmerna_trimmomatic_sorted.bam
## The data is paired-end; only valid mates will be kept.
## The data is unstranded; overlapping features will be ignored.
## The summarization is by 'read' and the mode is: Union.
## Warning in .make_GAlignmentPairs_from_GAlignments(gal, strandMode = strandMode, : 798 alignments with ambiguous pairing were dumped.
## Use 'getDumpedAlignments()' to retrieve them from the dump environment.
## Processing 24139 reads
## Done with 229.1_subset_sortmerna_trimmomatic_sorted.bam
## ==========================
## Returning a
## RangedSummarizedExperiment
## ==========================
## Warning in file.remove(c("./Ptrichocarpa_210_v3.0_gene_exons.gff3.gz",
## "./Ptrichocarpa_210_v3.0_gene_exons_synthetic-transcripts_annotParam.rda", :
## cannot remove file '1', reason 'No such file or directory'
## Warning in file.remove(c("./Ptrichocarpa_210_v3.0_gene_exons.gff3.gz",
## "./Ptrichocarpa_210_v3.0_gene_exons_synthetic-transcripts_annotParam.rda", :
## cannot remove file '2', reason 'No such file or directory'
## Warning in file.remove(c("./Ptrichocarpa_210_v3.0_gene_exons.gff3.gz",
## "./Ptrichocarpa_210_v3.0_gene_exons_synthetic-transcripts_annotParam.rda", :
## cannot remove file '3', reason 'No such file or directory'
## Warning in file.remove(c("./Ptrichocarpa_210_v3.0_gene_exons.gff3.gz",
## "./Ptrichocarpa_210_v3.0_gene_exons_synthetic-transcripts_annotParam.rda", :
## cannot remove file '4', reason 'No such file or directory'
## Warning in file.remove(c("./Ptrichocarpa_210_v3.0_gene_exons.gff3.gz",
## "./Ptrichocarpa_210_v3.0_gene_exons_synthetic-transcripts_annotParam.rda", :
## cannot remove file '5', reason 'No such file or directory'
## Warning in file.remove(c("./Ptrichocarpa_210_v3.0_gene_exons.gff3.gz",
## "./Ptrichocarpa_210_v3.0_gene_exons_synthetic-transcripts_annotParam.rda", :
## cannot remove file '6', reason 'No such file or directory'
## [1] TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE
## [13] TRUE TRUE TRUE
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] easyRNASeq_2.43.0 BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.2.1 dplyr_1.1.4
## [3] blob_1.2.4 genomeIntervals_1.63.0
## [5] filelock_1.0.3 Biostrings_2.75.3
## [7] bitops_1.0-9 fastmap_1.2.0
## [9] BiocFileCache_2.15.0 GenomicAlignments_1.43.0
## [11] digest_0.6.37 lifecycle_1.0.4
## [13] pwalign_1.3.1 statmod_1.5.0
## [15] KEGGREST_1.47.0 RSQLite_2.3.9
## [17] magrittr_2.0.3 compiler_4.4.2
## [19] progress_1.2.3 rlang_1.1.4
## [21] sass_0.4.9 tools_4.4.2
## [23] yaml_2.3.10 knitr_1.49
## [25] prettyunits_1.2.0 S4Arrays_1.7.1
## [27] bit_4.5.0.1 interp_1.1-6
## [29] curl_6.0.1 DelayedArray_0.33.3
## [31] xml2_1.3.6 RColorBrewer_1.1-3
## [33] abind_1.4-8 ShortRead_1.65.0
## [35] BiocParallel_1.41.0 withr_3.0.2
## [37] purrr_1.0.2 hwriter_1.3.2.1
## [39] BiocGenerics_0.53.3 sys_3.4.3
## [41] grid_4.4.2 stats4_4.4.2
## [43] latticeExtra_0.6-30 edgeR_4.5.1
## [45] biomaRt_2.63.0 SummarizedExperiment_1.37.0
## [47] cli_3.6.3 rmarkdown_2.29
## [49] crayon_1.5.3 intervals_0.15.5
## [51] generics_0.1.3 httr_1.4.7
## [53] DBI_1.2.3 cachem_1.1.0
## [55] stringr_1.5.1 zlibbioc_1.52.0
## [57] parallel_4.4.2 AnnotationDbi_1.69.0
## [59] BiocManager_1.30.25 XVector_0.47.0
## [61] matrixStats_1.4.1 vctrs_0.6.5
## [63] Matrix_1.7-1 jsonlite_1.8.9
## [65] hms_1.1.3 IRanges_2.41.2
## [67] S4Vectors_0.45.2 bit64_4.5.2
## [69] jpeg_0.1-10 maketools_1.3.1
## [71] locfit_1.5-9.10 LSD_4.1-0
## [73] limma_3.63.2 jquerylib_0.1.4
## [75] glue_1.8.0 codetools_0.2-20
## [77] stringi_1.8.4 GenomeInfoDb_1.43.2
## [79] deldir_2.0-4 GenomicRanges_1.59.1
## [81] UCSC.utils_1.3.0 tibble_3.2.1
## [83] pillar_1.10.0 rappdirs_0.3.3
## [85] htmltools_0.5.8.1 GenomeInfoDbData_1.2.13
## [87] httr2_1.0.7 R6_2.5.1
## [89] dbplyr_2.5.0 evaluate_1.0.1
## [91] lattice_0.22-6 Biobase_2.67.0
## [93] png_0.1-8 Rsamtools_2.23.1
## [95] memoise_2.0.1 bslib_0.8.0
## [97] Rcpp_1.0.13-1 SparseArray_1.7.2
## [99] xfun_0.49 MatrixGenerics_1.19.0
## [101] buildtools_1.0.0 pkgconfig_2.0.3
http://www.epigenesys.eu/en/protocols/bio-informatics/1283-guidelines-for-rna-seq-data-analysis↩︎
http://www.epigenesys.eu/en/protocols/bio-informatics/1283-guidelines-for-rna-seq-data-analysis↩︎
Garbage In Garbage Out↩︎
http://www.epigenesys.eu/en/protocols/bio-informatics/1283-guidelines-for-rna-seq-data-analysis↩︎