Title: | Diagnostics for assessing genomic DNA contamination in RNA-seq data |
---|---|
Description: | Provides diagnostics for assessing genomic DNA contamination in RNA-seq data, as well as plots representing these diagnostics. Moreover, the package can be used to get an insight into the strand library protocol used and, in case of strand-specific libraries, the strandedness of the data. Furthermore, it provides functionality to filter out reads of potential gDNA origin. |
Authors: | Beatriz Calvo-Serra [aut], Robert Castelo [aut, cre] |
Maintainer: | Robert Castelo <[email protected]> |
License: | Artistic-2.0 |
Version: | 1.5.0 |
Built: | 2024-11-21 06:16:44 UTC |
Source: | https://github.com/bioc/gDNAx |
The gDNAx package provides diagnostics for assessing genomic DNA contamination in RNA-seq data, as well as plots representing these diagnostics. Moreover, the package can be used to get an insight into the strand library protocol used and, in case of strand-specific libraries, the strandedness of the data. Furthermore, it provides functionality to filter out reads of potential gDNA origin.
The main functions are:
gDNAdx()
- calculate diagnostics for assessing the presence of genomic DNA in RNA-seq data over a subset of the alignments in the input BAM files.
getDx()
and plot()
- get and plot statistics on genomic DNA contamination levels, respectively.
strandedness()
- obtain estimates of strandedness in RNA-seq data samples based on the proportion of reads aligning to the same or opposite strand as transcripts in the annotations.
classifyStrandMode()
- classify the output of strandedness()
into strand modes for each BAM file.
filterBAMtxFlag
and filterBAMtx
- filter alignments in a BAM file using criteria based on a transcriptome annotation.
For detailed information on usage, see the package vignette, by typing
vignette("gDNAx")
.
All questions and bug reports should be posted to the Bioconductor Support Site:
https://support.bioconductor.org
The code of the development version of the package is available at the GitHub repository:
https://github.com/functionalgenomics/gDNAx
Maintainer: Beatriz Calvo-Serra [email protected]
Authors:
Robert Castelo [email protected]
Useful links:
Report bugs at https://github.com/functionalgenomics/gDNAx/issues
Filter alignments in a BAM file using criteria based on a transcriptome annotation.
Use 'filterBAMtxFlag()' to set what types of alignment in a BAM file should be filtered using the function 'filterBAMtx()', among being splice-compatible with one or more exon-exon junctions, splice-compatible exonic, splice-compatible exonic in a window, intronic or intergenic.
filterBAMtx( object, path = ".", txflag = filterBAMtxFlag(), param = ScanBamParam(), yieldSize = 1e+06, wsize = 1000, wstep = 100, pstrness = 0.6, p.value = 0.05, p.adj.method = p.adjust.methods, verbose = TRUE, BPPARAM = SerialParam(progressbar = verbose) ) filterBAMtxFlag( isSpliceCompatibleJunction = FALSE, isSpliceCompatibleExonic = FALSE, isInStrandedWindow = FALSE, isIntronic = FALSE, isIntergenic = FALSE ) testBAMtxFlag(flag, value)
filterBAMtx( object, path = ".", txflag = filterBAMtxFlag(), param = ScanBamParam(), yieldSize = 1e+06, wsize = 1000, wstep = 100, pstrness = 0.6, p.value = 0.05, p.adj.method = p.adjust.methods, verbose = TRUE, BPPARAM = SerialParam(progressbar = verbose) ) filterBAMtxFlag( isSpliceCompatibleJunction = FALSE, isSpliceCompatibleExonic = FALSE, isInStrandedWindow = FALSE, isIntronic = FALSE, isIntergenic = FALSE ) testBAMtxFlag(flag, value)
object |
gDNAx object obtained with the function 'gDNAdx()'. |
path |
Directory where to write the output BAM files. |
txflag |
A value from a call to the function 'filterBAMtxFlag()'. |
param |
A 'ScanBamParam' object. |
yieldSize |
(Default 1e6) Number of records in the input BAM file to yield each time the file is read. The lower the value, the smaller memory consumption, but in the case of large BAM files, values below 1e6 records may decrease the overall performance. |
wsize |
(Default 1000) Window size employed when the argument
|
wstep |
(Default 100) Window step employed when the argument
|
pstrness |
(Default 0.6) Strandedness value above which we consider a target read alignment to occur in stranded window. |
p.value |
(Default 0.05) Numeric value between 0 and 1 specifying the
adjusted p-value cutoff under which we reject the null hypothesis that a
target read alignment occurs in a window with an strandedness value below
the one given in the parameter |
p.adj.method |
(Default "holm") Method used to adjust p-values that
are compared against the cutoff value specified in the parameter
|
verbose |
(Default TRUE) Logical value indicating if progress should be reported through the execution of the code. |
BPPARAM |
An object of a BiocParallelParam subclass
to configure the parallel execution of the code. By default, a
SerialParam object is used, which does not use any
parallelization, with the flag |
isSpliceCompatibleJunction |
(Default FALSE) Logical value indicating
if spliced alignments overlapping a transcript in a
"splice compatible" way should be included in the BAM file. For
paired-end reads, one or both alignments must have one or more splice
sites compatible with splicing. See
|
isSpliceCompatibleExonic |
(Default FALSE) Logical value indicating
if alignments without a splice site, but that overlap a transcript
in a "splice compatible" way, should be included in the BAM file.
For paired-end reads, none of the alignments must be spliced, and
each pair can be in different exons (or in the same one), as long as
they are "splice compatible". See
|
isInStrandedWindow |
(Default FALSE) Logical value indicating whether an alignment occurs in a stranded windows. More concretely, for each alignment, strandedness will be tested with respect to the rest of the alignments occurring in overlapping windows. This filter will assign a TRUE value when those tests are passed. |
isIntronic |
(Default FALSE) Logical value indicating if alignments mapping to introns should be included in the BAM file. |
isIntergenic |
(Default FALSE) Logical value indicating if alignments aligned to intergenic regions should be included in the BAM file. |
flag |
A value from a call to the function 'filterBAMtxFlag()'. |
value |
A character vector with the name of a flag. |
A vector of output filename paths.
library(gDNAinRNAseqData) library(TxDb.Hsapiens.UCSC.hg38.knownGene) txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene # Getting the 'gDNAx' object bamfiles <- LiYu22subsetBAMfiles() bamfiles <- bamfiles[c(1,7)] # using a subset of samples gdnax <- gDNAdx(bamfiles, txdb, singleEnd=FALSE, strandMode=NA) # Filtering splice-compatible alignments and writing them into new BAM files fbf <- filterBAMtxFlag(isSpliceCompatibleJunction=TRUE, isSpliceCompatibleExonic=TRUE) dir <- tempdir() fstats <- filterBAMtx(gdnax, path=dir, txflag=fbf) list.files(dir, pattern="*.bam$") # Filtering splice-compatible alignments and writing them into new BAM files fbf <- filterBAMtxFlag(isSpliceCompatibleJunction=FALSE, isSpliceCompatibleExonic=FALSE, isInStrandedWindow=FALSE, isIntronic=FALSE, isIntergenic=FALSE) testBAMtxFlag(fbf, "isSpliceCompatibleJunction")
library(gDNAinRNAseqData) library(TxDb.Hsapiens.UCSC.hg38.knownGene) txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene # Getting the 'gDNAx' object bamfiles <- LiYu22subsetBAMfiles() bamfiles <- bamfiles[c(1,7)] # using a subset of samples gdnax <- gDNAdx(bamfiles, txdb, singleEnd=FALSE, strandMode=NA) # Filtering splice-compatible alignments and writing them into new BAM files fbf <- filterBAMtxFlag(isSpliceCompatibleJunction=TRUE, isSpliceCompatibleExonic=TRUE) dir <- tempdir() fstats <- filterBAMtx(gdnax, path=dir, txflag=fbf) list.files(dir, pattern="*.bam$") # Filtering splice-compatible alignments and writing them into new BAM files fbf <- filterBAMtxFlag(isSpliceCompatibleJunction=FALSE, isSpliceCompatibleExonic=FALSE, isInStrandedWindow=FALSE, isIntronic=FALSE, isIntergenic=FALSE) testBAMtxFlag(fbf, "isSpliceCompatibleJunction")
Calculate diagnostics for assessing the presence of genomic DNA (gDNA) in RNA-seq data over a subset of the alignments in the input BAM files.
Plot diagnostics calculated with gDNAdx()
Using the output from gDNAdx(), plot the genomic origin of the alignments.
Plot fragments length distributions estimated with gDNAdx()
gDNAdx( bfl, txdb, singleEnd, strandMode, stdChrom = TRUE, yieldSize = 100000L, exonsBy = c("gene", "tx"), minnaln = 2e+05, useRMSK = TRUE, verbose = TRUE, BPPARAM = SerialParam(progressbar = verbose) ) ## S4 method for signature 'gDNAx,ANY' plot(x, group = 1L, labelpoints = FALSE, ...) plotAlnOrigins(x, group = 1L) plotFrgLength(x)
gDNAdx( bfl, txdb, singleEnd, strandMode, stdChrom = TRUE, yieldSize = 100000L, exonsBy = c("gene", "tx"), minnaln = 2e+05, useRMSK = TRUE, verbose = TRUE, BPPARAM = SerialParam(progressbar = verbose) ) ## S4 method for signature 'gDNAx,ANY' plot(x, group = 1L, labelpoints = FALSE, ...) plotAlnOrigins(x, group = 1L) plotFrgLength(x)
bfl |
A |
txdb |
A character string of a |
singleEnd |
(missing by default) Logical value indicating if reads are
single ( |
strandMode |
(missing by default) Integer value either 0, 1, 2 or
|
stdChrom |
(Default TRUE) Logical value indicating whether only
alignments in the 'standard chromosomes' should be used. Consult the help
page of the function |
yieldSize |
(Default 1e5) Number of records to read from each input BAM file to calculate the diagnostics. |
exonsBy |
(Default 'gene') Character string, either |
minnaln |
(Default 200000) Minimum number of read alignments overlapping
exonic regions considered necessary for a reliable estimation of strandedness
values. A warning message is given if the number of such available alignments
is smaller than the one given through this parameter. This parameter only
applies if no argument is given for the |
useRMSK |
(Default TRUE) Logical value indicating if RepeatMasker
annotations should be used when building intergenic and intronic genomic
ranges for gDNA estimation. If |
verbose |
(Default TRUE) Logical value indicating if progress should be reported through the execution of the code. |
BPPARAM |
An object of a BiocParallelParam subclass
to configure the parallel execution of the code. By default, a
SerialParam object is used, which does not use any
parallelization, with the flag |
x |
A 'gDNAx' object. |
group |
A string character vector or a factor, with as many values as BAM files analyzed in 'x', whose values define groups among those BAM files. |
labelpoints |
(Default FALSE) A logical indicator that labels points in those plots where each point represents a BAM file. Labels correspond to the index number of the BAM file in 'x'. |
... |
Named arguments to be passed to |
A gDNAx object.
library(gDNAinRNAseqData) library(TxDb.Hsapiens.UCSC.hg38.knownGene) txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene # Retrieving BAM files bamfiles <- LiYu22subsetBAMfiles() bamfiles <- bamfiles[c(1,4,7)] # using a subset of samples # Getting information about the gDNA concentrations of each BAM file pdat <- LiYu22phenoData(bamfiles) gdnax <- gDNAdx(bamfiles, txdb, singleEnd=FALSE, strandMode=NA) gdnax # plot gDNA diagnostic measures plot(gdnax, group=pdat$gDNA, pch=19) # plot origin of alignments per sample plotAlnOrigins(gdnax, group=pdat$gDNA) # plot fragments length distributions plotFrgLength(gdnax)
library(gDNAinRNAseqData) library(TxDb.Hsapiens.UCSC.hg38.knownGene) txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene # Retrieving BAM files bamfiles <- LiYu22subsetBAMfiles() bamfiles <- bamfiles[c(1,4,7)] # using a subset of samples # Getting information about the gDNA concentrations of each BAM file pdat <- LiYu22phenoData(bamfiles) gdnax <- gDNAdx(bamfiles, txdb, singleEnd=FALSE, strandMode=NA) gdnax # plot gDNA diagnostic measures plot(gdnax, group=pdat$gDNA, pch=19) # plot origin of alignments per sample plotAlnOrigins(gdnax, group=pdat$gDNA) # plot fragments length distributions plotFrgLength(gdnax)
Remove gDNA contamination from RNA-seq data by filtering read alignments in
BAM files that putatively have a gDNA origin. This is currently a wrapper
with convenient default values for the function filterBAMtx()
,
please use that function if you need greater control on how to filter RNA-seq
alignments.
gDNAtx( x, path = ".", sbparam = NULL, yieldSize = 1000000L, verbose = TRUE, BPPARAM = SerialParam(progressbar = verbose) )
gDNAtx( x, path = ".", sbparam = NULL, yieldSize = 1000000L, verbose = TRUE, BPPARAM = SerialParam(progressbar = verbose) )
x |
|
path |
Directory where to write the filtered BAM files. |
sbparam |
Either |
yieldSize |
(Default 1e6) Number of records in the input BAM file to yield each time the file is read. The lower the value, the smaller memory consumption, but in the case of large BAM files, values below 1e6 records may decrease the overall performance. |
verbose |
(Default TRUE) Logical value indicating if progress should be reported through the execution of the code. |
BPPARAM |
An object of a BiocParallelParam subclass
to configure the parallel execution of the code. By default, a
SerialParam object is used, which does not use any
parallelization, with the flag |
A data.frame
object with the number of filtered read alignments
tallied by their origin.
library(gDNAinRNAseqData) library(TxDb.Hsapiens.UCSC.hg38.knownGene) txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene ## fetch sample BAM files bamfiles <- LiYu22subsetBAMfiles() bamfiles <- bamfiles[c(1,7)] # using a subset of samples ## diagnose gDNA contamination gdnax <- gDNAdx(bamfiles, txdb, singleEnd=FALSE, strandMode=NA) ## remove gDNA contamination dir <- tempdir() fstats <- gDNAtx(gdnax, path=dir) fstats list.files(dir, pattern="*.bam$")
library(gDNAinRNAseqData) library(TxDb.Hsapiens.UCSC.hg38.knownGene) txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene ## fetch sample BAM files bamfiles <- LiYu22subsetBAMfiles() bamfiles <- bamfiles[c(1,7)] # using a subset of samples ## diagnose gDNA contamination gdnax <- gDNAdx(bamfiles, txdb, singleEnd=FALSE, strandMode=NA) ## remove gDNA contamination dir <- tempdir() fstats <- gDNAtx(gdnax, path=dir) fstats list.files(dir, pattern="*.bam$")
This is a class for storing the results of a call to the 'gDNAdx()' function.
## S4 method for signature 'gDNAx' getDx(x) ## S4 method for signature 'gDNAx' show(object) ## S4 method for signature 'gDNAx' getIgc(x) ## S4 method for signature 'gDNAx' getInt(x) ## S4 method for signature 'gDNAx' singleEnd(x) ## S4 method for signature 'gDNAx' strandMode(x) ## S4 replacement method for signature 'gDNAx' strandMode(x) <- value ## S4 method for signature 'gDNAx' allStrandModes(x) ## S4 method for signature 'gDNAx' strandedness(x, ...)
## S4 method for signature 'gDNAx' getDx(x) ## S4 method for signature 'gDNAx' show(object) ## S4 method for signature 'gDNAx' getIgc(x) ## S4 method for signature 'gDNAx' getInt(x) ## S4 method for signature 'gDNAx' singleEnd(x) ## S4 method for signature 'gDNAx' strandMode(x) ## S4 replacement method for signature 'gDNAx' strandMode(x) <- value ## S4 method for signature 'gDNAx' allStrandModes(x) ## S4 method for signature 'gDNAx' strandedness(x, ...)
x |
A gDNAx object. |
object |
A gDNAx object. |
value |
Integer value either 0, 1, 2 or |
... |
Further arguments when |
getIgc()
: A GRanges
object with intergenic ranges.
getInt()
: A GRanges
object with intron ranges.
singleEnd()
: Logical value indicating whether the
gDNAx
object contains data from a single-end
(TRUE
) or a paired-end (FALSE
) RNA-seq experiment.
strandMode()
: Integer value indicating whether the
gDNAx
object contains data from an unstranded
(NA
), stranded with the first mate read indicating the real
strand (1
), or stranded with the last mate read indicating
the real strand (2
) from an RNA-seq experiment.
Vector of strand modes for each BAM file in the gDNAx object.
A data.frame
object with strandedness values for each
BAM file in the gDNAx object.
bfl
A BamFileList object.
txdbpkg
A TxDb object.
singleEnd
Logical value indicating if reads are single (TRUE
)
or paired-end (FALSE
).
strandMode
Integer value either 0, 1, 2 or NA
, indicating how
the strand of a pair of read alignments should be inferred from the strand
of the first and last alignments in the pair. See the
GAlignmentPairs
class for further detail.
allStrandModes
Vector of integer values each of them corresponding to
a strandMode
value estimated from a BAM file.
stdChrom
Logical value indicating whether only alignments in the
'standard chromosomes' should be used. Consult the help page of the function
keepStandardChromosomes
from the package
GenomeInfoDb
for further information.
readLength
Integer value storing the read length.
yieldSize
Integer value storing the number of alignments employed by
the function gDNAdx()
.
diagnostics
A 'data.frame' object storing the diagnostics calculated by the function 'gDNAdx()'.
strandedness
A 'data.frame' object storing the estimated values of
strandedness, calculated when the argument strandMode
is missing in
the call to the function 'gDNAdx()'.
igcfrglen
A 'list' object storing the fragment lengths derived from alignments in intergenic regions.
intfrglen
A 'list' object storing the fragment lengths derived from alignments in intronic regions.
scjfrglen
A 'list' object storing the fragment lengths derived from spliced-compatible junction alignments in transcripts.
scefrglen
A 'list' object storing the fragment lengths derived from spliced-compatible exonic alignments in transcripts.
sicfrglen
A 'list' object storing the fragment lengths derived from splice-incompatible alignments in transcripts.
intergenic
A 'GRanges' object storing the intergenic feature annotations.
intronic
A 'GRanges' object storing the intronic feature annotations.
transcripts
A 'GRangesList' object storing the transcript annotations.
tx2gene
A string character vector storing the correspondence between transcripts and genes according to an 'TxDb' object.
library(gDNAinRNAseqData) library(TxDb.Hsapiens.UCSC.hg38.knownGene) txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene bamfiles <- LiYu22subsetBAMfiles() # Retrieving BAM files ## one could simply call 'gDNAx(bamfiles, txdb)' but we give the arguments ## below to reduce time and verbosity when running this example gdnax <- gDNAdx(bamfiles, txdb, singleEnd=FALSE, strandMode=NA, useRMSK=FALSE, verbose=FALSE) gdnax # Getting statistics dx <- getDx(gdnax) head(dx) gdnax igc <- getIgc(gdnax) head(igc, n=3) int <- getInt(gdnax) head(int, n=3) singleEnd(gdnax) strandMode(gdnax) strandMode(gdnax) <- NA allStrandModes(gdnax) strandedness(gdnax)
library(gDNAinRNAseqData) library(TxDb.Hsapiens.UCSC.hg38.knownGene) txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene bamfiles <- LiYu22subsetBAMfiles() # Retrieving BAM files ## one could simply call 'gDNAx(bamfiles, txdb)' but we give the arguments ## below to reduce time and verbosity when running this example gdnax <- gDNAdx(bamfiles, txdb, singleEnd=FALSE, strandMode=NA, useRMSK=FALSE, verbose=FALSE) gdnax # Getting statistics dx <- getDx(gdnax) head(dx) gdnax igc <- getIgc(gdnax) head(igc, n=3) int <- getInt(gdnax) head(int, n=3) singleEnd(gdnax) strandMode(gdnax) strandMode(gdnax) <- NA allStrandModes(gdnax) strandedness(gdnax)
THIS FUNCTION HAS BEEN DEPRECATED, HAS BEEN REPLACED BY strandedness()
AND WILL WILL BE DEFUNCT AT THE NEXT RELEASE.
Identify strandMode
(strandedness) in RNA-seq data samples based on
Given strandedness values calculated assuming either the same or the opposite
strand of gene annotations, classify them into an strand mode according to
cutoff values specified in the parameters.
Compute strandedness for each feature in RNA-seq data samples based on the proportion of reads aligning to the same strand as feature annotations in relation to the total number of reads aligning to that feature.
identifyStrandMode( bfl, txdb, singleEnd, stdChrom = TRUE, exonsBy = c("gene", "tx"), minnaln = 2e+05, verbose = TRUE, BPPARAM = SerialParam(progressbar = verbose) ) ## S4 method for signature 'character' strandedness( x, txdb, singleEnd, stdChrom = TRUE, exonsBy = c("gene", "tx"), minnaln = 2e+05, verbose = TRUE, BPPARAM = SerialParam(progressbar = verbose) ) ## S4 method for signature 'BamFileList' strandedness( x, txdb, singleEnd, stdChrom = TRUE, exonsBy = c("gene", "tx"), minnaln = 2e+05, verbose = TRUE, BPPARAM = SerialParam(progressbar = verbose) ) classifyStrandMode( strnessdat, strcutoff = 0.9, weakstrcutoff = 0.6, warnweakstr = TRUE ) strnessByFeature( bfl, features, singleEnd = TRUE, strandMode = 1L, yieldSize = 1000000L, ambiguous = FALSE, p = 0.6, verbose = TRUE, BPPARAM = SerialParam(progressbar = verbose) )
identifyStrandMode( bfl, txdb, singleEnd, stdChrom = TRUE, exonsBy = c("gene", "tx"), minnaln = 2e+05, verbose = TRUE, BPPARAM = SerialParam(progressbar = verbose) ) ## S4 method for signature 'character' strandedness( x, txdb, singleEnd, stdChrom = TRUE, exonsBy = c("gene", "tx"), minnaln = 2e+05, verbose = TRUE, BPPARAM = SerialParam(progressbar = verbose) ) ## S4 method for signature 'BamFileList' strandedness( x, txdb, singleEnd, stdChrom = TRUE, exonsBy = c("gene", "tx"), minnaln = 2e+05, verbose = TRUE, BPPARAM = SerialParam(progressbar = verbose) ) classifyStrandMode( strnessdat, strcutoff = 0.9, weakstrcutoff = 0.6, warnweakstr = TRUE ) strnessByFeature( bfl, features, singleEnd = TRUE, strandMode = 1L, yieldSize = 1000000L, ambiguous = FALSE, p = 0.6, verbose = TRUE, BPPARAM = SerialParam(progressbar = verbose) )
bfl |
A |
txdb |
A character string of a |
singleEnd |
(Default FALSE) Logical value indicating if reads are
single ( |
stdChrom |
(Default TRUE) Logical value indicating whether only
alignments in the 'standard chromosomes' should be used. Consult the help
page of the function |
exonsBy |
(Default 'gene') Character string, either |
minnaln |
(Default 200000) Minimum number of read alignments overlapping exonic regions considered necessary for a reliable estimation of strandedness values. A warning message is given if the number of such available alignments is smaller than the one given through this parameter. |
verbose |
(Default TRUE) Logical value indicating if progress should be reported through the execution of the code. |
BPPARAM |
An object of a BiocParallelParam subclass
to configure the parallel execution of the code. By default, a
SerialParam object is used, which does not use any
parallelization, with the flag |
x |
A BamFileList object. |
strnessdat |
A |
strcutoff |
(Default 0.9) Minimum cutoff above which a strandedness value is considered to strongly support that read alignments originate from a specific strand. |
weakstrcutoff |
(Default 0.6) Minimum cutoff above which a strandedness value is considered to weakly support that read alignments originate from a specific strand. |
warnweakstr |
(Default TRUE) Logical value indicating whether to warn the user when strandedness values only provide a weakly support for a specific strand. |
features |
A |
strandMode |
(Default 1L) Numeric vector which can take values 0, 1,
or 2. The strand mode is a per-object switch on
|
yieldSize |
(Default 5e5) Field inherited from
|
ambiguous |
(Default FALSE) Logical value indicating if reads that overlap a region with features annotated to both strands should be included in the strandedness value computation. |
p |
(Default 0.6) Numeric value for the exact binomial test performed for the strandedness value of each feature, representing the hypothesized probability of success (i.e. the strandedness value expected for a non-stranded dataset). |
Identify strandMode
(strandedness) in RNA-seq data samples based on
the proportion of reads aligning to the same or opposite strand as
transcripts in the annotations.
If the value in the "strandMode1" column is > 0.90, strandMode
is set
to 1L. If "strandMode2" column is > 0.90, strandMode
is set to 2L. If
"strandMode1" and "strandMode2" are comprised between 0.40 and 0.60,
strandMode
is set to NA
If none of the three previous criteria
are met, strandMode
is set to "ambiguous". This criteria can be
conservative in some cases (e.g. when there is genomic DNA contamination),
for this reason we recommend to check the data.frame with strandedness
values.
In case of single-end data, the same criteria are used, but the
interpretation of strandMode = 1L
and strandMode = 2L
changes: when strandMode = 1L
the strand of the read is concordant
with the reference annotations, when strandMode = 2L
the correct
read strand is the opposite to the one of the read.
A subset of 200,000 alignments overlapping gene annotations are used to compute strandedness.
Strandedness is computed for each feature and BAM file according to the
strandMode
specified in case of paired-end data. For single-end,
the original strand of reads is used. All alignments from the BAM file(s)
are considered to compute the strandedness.
The p
value should be close to 0.5, representing the strandedness
expected for a non-stranded RNA-seq library.
A list object with two elements:
"strandMode": the strandMode
of the sample(s) following
GAlignmentPairs
class definition. If all samples have the same strandMode
,
the length of the vector is 1. It can take values: NA
(library is not strand-specific), 1 (strand of pair is strand of
its first alignment), 2 (strand of pair is strand of its second
alignment) or "ambiguous" (additional category used here for
samples not fitting any of the three previous categories).
See "Details" section below to know the classification criteria,
as well as to how interpret results for single-end data.
"Strandedness": data.frame with one row per sample and 3 columns. "strandMode1": proportion of alignments aligned to the same strand than a transcript according to the strand of its first alignment. "strandMode2": proportion of alignments aligned to the same strand than a transcript according to the strand of its second alignment. "ambiguous": alignments aligned to regions with transcripts in both strands.
A vector of integer values, NA
, 1
, or 2
,
A SummarizedExperiment with three assays:
"strness": contains strandedness values for each feature and sample.
"counts": number of reads aligning to each feature on the same
strand (according to strandMode
).
"counts_invstrand": number of reads aligning to each feature but on
the opposite strand (according to strandMode
).
library(gDNAinRNAseqData) library(TxDb.Hsapiens.UCSC.hg38.knownGene) txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene # Retrieving BAM files bamfiles <- LiYu22subsetBAMfiles() bamfiles <- bamfiles[c(1,7)] # using a subset of samples strandM <- identifyStrandMode(bamfiles, txdb, singleEnd=FALSE) strandM$strandMode head(strandM$Strandedness) strnessdat <- data.frame(strandMode1=c(0.91, 0.92, 0.93), strandMode2=c(0.09, 0.08, 0.07)) classifyStrandMode(strnessdat) features <- range(exonsBy(txdb, by="gene")) features <- features[1:100] sByFeature <- strnessByFeature(bamfiles, features, singleEnd=FALSE, strandMode=2L) sByFeature
library(gDNAinRNAseqData) library(TxDb.Hsapiens.UCSC.hg38.knownGene) txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene # Retrieving BAM files bamfiles <- LiYu22subsetBAMfiles() bamfiles <- bamfiles[c(1,7)] # using a subset of samples strandM <- identifyStrandMode(bamfiles, txdb, singleEnd=FALSE) strandM$strandMode head(strandM$Strandedness) strnessdat <- data.frame(strandMode1=c(0.91, 0.92, 0.93), strandMode2=c(0.09, 0.08, 0.07)) classifyStrandMode(strnessdat) features <- range(exonsBy(txdb, by="gene")) features <- features[1:100] sByFeature <- strnessByFeature(bamfiles, features, singleEnd=FALSE, strandMode=2L) sByFeature