Title: | Count summarization and normalization for RNA-Seq data |
---|---|
Description: | Calculates the coverage of high-throughput short-reads against a genome of reference and summarizes it per feature of interest (e.g. exon, gene, transcript). The data can be normalized as 'RPKM' or by the 'DESeq' or 'edgeR' package. |
Authors: | Nicolas Delhomme, Ismael Padioleau, Bastian Schiffthaler, Niklas Maehler |
Maintainer: | Nicolas Delhomme <[email protected]> |
License: | Artistic-2.0 |
Version: | 2.43.0 |
Built: | 2024-12-19 04:27:07 UTC |
Source: | https://github.com/bioc/easyRNASeq |
A class holding all the necessary parameters to retrieve the necessary annotation for processing an RNA-Seq experiment.
Objects can be created by calls of the
form new("AnnotParamCharacter", ...)
or new("AnnotParamObject", ...)
(both subject to API changes) or using the
AnnotParam
constructor (failsafe, prefered).
The class AnnotParam
in itself is virtual and hence cannot be instantiated.
Nicolas Delhomme
showClass("AnnotParam")
showClass("AnnotParam")
A class describing the parameters of a bam file issued from an RNA-Seq experiment.
Objects can be created by calls of the
form new("BamParam", ...)
or using the BamParam constructor.
The BamParam
class has
the following slots:
paired
stranded
strandProtocol
yieldSize
all of which can be accessed using the accordingly names accessor.
Nicolas Delhomme
showClass("BamParam")
showClass("BamParam")
Display the basename of the bam file represented by a BamFile
object.
## S4 method for signature 'BamFile' basename(path)
## S4 method for signature 'BamFile' basename(path)
path |
an object of class |
Display the basename of the bam file linked to by a
BamFile
object.
Manages the tutorial, example and vignette data using the
BiocFileCache
package
fetchData(fileURL) tutorialData(...)
fetchData(fileURL) tutorialData(...)
... |
unused for the time being |
fileURL |
The URL of the file to retrieve. Alternatively, the ID of the file in the BiocFileCache (i.e. the file basename), can be used. |
internal function to set up the cache
A function to fetch tutorial data, a file at a time
the function to retrieve all the tutorial data and cache it, if it is not already available
the function to retrieve all the tutorial data and cache it, if it is not already available
tdir <- tutorialData() gAnnot.path <- fetchData("gAnnot.rda") vdir <- vignetteData() md5.txt <- fetchData("md5.txt")
tdir <- tutorialData() gAnnot.path <- fetchData("gAnnot.rda") vdir <- vignetteData() md5.txt <- fetchData("md5.txt")
This function create a set of synthetic transcripts from a provided annotation file in "gff3" or "gtf" format. As detailed in http://www.epigenesys.eu/en/protocols/bio-informatics/1283-guidelines-for-rna-seq-data-analysis, one major caveat of estimating gene expression using aligned RNA-Seq reads is that a single read, which originated from a single mRNA molecule, might sometimes align to several features (e.g. transcripts or genes) with alignments of equivalent quality. This, for example, might happen as a result of gene duplication and the presence of repetitive or common domains. To avoid counting unique mRNA fragments multiple times, the stringent approach is to keep only uniquely mapping reads - being aware of potential consequences. Not only can "multiple counting" arise from a biological reason, but also from technical artifacts, introduced mostly by poorly formatted gff3/gtf annotation files. To avoid this, it is best practice to adopt a conservative approach by collapsing all existing transcripts of a single gene locus into a "synthetic" transcript containing every exon of that gene. In the case of overlapping exons, the longest genomic interval is kept, i.e. an artificial exon is created. This process results in a flattened transcript - a gene structure with a one (gene) to one (transcript) relationship.
## S4 method for signature 'AnnotParamCharacter' createSyntheticTranscripts( obj, features = c("mRNA", "miRNA", "tRNA", "transcript"), verbose = TRUE ) ## S4 method for signature 'character' createSyntheticTranscripts( obj, features = c("mRNA", "miRNA", "tRNA", "transcript"), verbose = TRUE, output = c("Genome_intervals", "GRanges"), input = c("gff3", "gtf") )
## S4 method for signature 'AnnotParamCharacter' createSyntheticTranscripts( obj, features = c("mRNA", "miRNA", "tRNA", "transcript"), verbose = TRUE ) ## S4 method for signature 'character' createSyntheticTranscripts( obj, features = c("mRNA", "miRNA", "tRNA", "transcript"), verbose = TRUE, output = c("Genome_intervals", "GRanges"), input = c("gff3", "gtf") )
obj |
a |
features |
one or more of 'mRNA', 'miRNA', 'tRNA', 'transcript' |
verbose |
increase the verbosity (default TRUE) |
output |
the output type, one of 'Genome_intervals' or 'GRanges' |
input |
the type of input, one of 'gff3' or 'gtf' |
The createSyntheticTranscripts
function implements this, taking
advantage of the hierarchical structure of the gff3/gtf file. Exon
features are related to their transcript (parent), which themselves derives
from their gene parents. Using this relationship, exons are combined per gene
into a flattened transcript structure. Note that this might not avoid multiple
counting if genes overlap on opposing strands. There, only strand specific
sequencing data has the power to disentangle these situations.
As gff3/gtf file can contain a large number of feature types, the
createSyntheticTranscripts
currently only supports: mRNA,
miRNA, tRNA and transcript. Please contact me if you
need additional features to be considered. Note however, that I will only
add features that are part of the sequenceontology.org SOFA
(SO_Feature_Annotation) ontology.
Depending on the obj
class.
AnnotParamCharacter
: a AnnotParamObject
object
a character
filename: depending on the selected output
value, a Genome_intervals
or a GRanges
object.
Nicolas Delhomme
For the input:
For the output:
# get the example file Dm.gtf <- fetchData("Drosophila_melanogaster.BDGP5.77.with-chr.gtf.gz") # create the AnnotParam annotParam <- AnnotParam( datasource=Dm.gtf, type="gtf") # create the synthetic transcripts annotParam <- createSyntheticTranscripts(annotParam,verbose=FALSE)
# get the example file Dm.gtf <- fetchData("Drosophila_melanogaster.BDGP5.77.with-chr.gtf.gz") # create the AnnotParam annotParam <- AnnotParam( datasource=Dm.gtf, type="gtf") # create the synthetic transcripts annotParam <- createSyntheticTranscripts(annotParam,verbose=FALSE)
fetchAnnotation
knownOrganisms
The easyRNASeq
function is superseded by the
simpleRNASeq
function to consolidate and
prune the overall package. The changes are based on user comments and on the
general standardization occuring in the field.
The fetchCoverage
function only had two
parameters deprecated as the consequence of the package consolidation. As the scanBam
function is not called directly anymore but through higher level functions (from the
GenomicRanges package), the 'what' and 'isUnmappedQuery' parameters were obsolete.
These functions and generics define 'accessors' (to get and set values) for objects in the easyRNASeq package.
genomicAnnotation(obj) readCounts(obj,count=c("exons","features","genes","islands","transcripts"), summarization=c("bestExons","geneModels"),unique=FALSE) genomicAnnotation(obj) <- value
genomicAnnotation(obj) readCounts(obj,count=c("exons","features","genes","islands","transcripts"), summarization=c("bestExons","geneModels"),unique=FALSE) genomicAnnotation(obj) <- value
obj |
An object derived from class |
count |
The type of count you want to access, 'genes','features','exons','transcripts' or 'islands' |
summarization |
If count is set to genes, precise the type of summarization, 'bestExons' or 'geneModels' |
unique |
For the 'exons' count only. Should the counts returned be unique for their identifier (i.e. the matrix row names)? |
value |
The replacement value. |
Usually, the value of the corresponding slot, or other simple content
described on the help page of easyRNASeq
.
Nicolas Delhomme
# This class is deprecated and as such there are no exmples of its use
# This class is deprecated and as such there are no exmples of its use
The annotation can be retrieved in two ways
biomaRtUse biomaRt and Ensembl to get organism specific annotation.
gff3/gtfUse a gff3 or gtf local annotation file.
When using biomaRt, it is
important that the organism
argument to
AnnotParam
is set the prefix of one of the value
available using the biomaRt
listDatasets
function, e.g.
"Dmelanogaster".
When reading from a gff3/gtf file, a version 3 formatted
gff or a gtf (an Ensembl defined gff2 version) is expected. The function
genomeIntervals genomeIntervals-readGff3
is
used to import the data.
## S4 method for signature 'AnnotParam' getAnnotation(obj, verbose = FALSE, ...)
## S4 method for signature 'AnnotParam' getAnnotation(obj, verbose = FALSE, ...)
obj |
An object of class |
verbose |
a boolean to turn on verbosity |
... |
See details |
... are for additional arguments, passed to the biomaRt
getBM
function or to the
readGffGtf
internal function that takes an optional arguments: annotation.type that
default to "exon". This is used to select the proper rows of the gff or gtf
file.
A GRanges
containing the fetched
annotations.
Nicolas Delhomme
## Not run: library("RnaSeqTutorial") getAnnotation( AnnotParam( organism="Dmelanogaster", datasource=system.file( "extdata", "Dmel-mRNA-exon-r5.52.gff3", package="RnaSeqTutorial"), type="gff3" )) ## End(Not run)
## Not run: library("RnaSeqTutorial") getAnnotation( AnnotParam( organism="Dmelanogaster", datasource=system.file( "extdata", "Dmel-mRNA-exon-r5.52.gff3", package="RnaSeqTutorial"), type="gff3" )) ## End(Not run)
These functions and generics define 'accessors' (to get and set values) for
AnnotParam
objects within the easyRNASeq package.
Implemented are:
datasource
type
datasource(object) ## S4 method for signature 'AnnotParam' type(x)
datasource(object) ## S4 method for signature 'AnnotParam' type(x)
object |
An object derived from class |
x |
An object derived from class |
The value of the corresponding slot.
Nicolas Delhomme
The AnnotParam
class.
The type and organism generics are imported from the BSgenome
and
Biostrings
package, respectively.
# fetch the example data Dm.annot <- fetchData("Dmel-mRNA-exon-r5.52.gff3.gz") annot <- AnnotParam(datasource=Dm.annot) # get the datasource Parameter datasource(annot)
# fetch the example data Dm.annot <- fetchData("Dmel-mRNA-exon-r5.52.gff3.gz") annot <- AnnotParam(datasource=Dm.annot) # get the datasource Parameter datasource(annot)
This constructs a AnnotParam
object.
The datasource parameter (see details) is mandatory, however
other parameters, i.e. when the datasource is not a
GRanges
default to "genes" and gff3", indicating that the datasource is in
the gff3 format and that the contained information needs to be grouped
by "genes". This representing the most common use case. Hence, it is
left to the user to refine the parameters accordingly to the annotation
he is providing or whishes to retrieve.
## S4 method for signature 'character' AnnotParam( datasource = character(0), type = c("gff3", "biomaRt", "gtf", "rda") )
## S4 method for signature 'character' AnnotParam( datasource = character(0), type = c("gff3", "biomaRt", "gtf", "rda") )
datasource |
a character or a |
type |
one of NULL, biomaRt, gff3, gtf or rda. Default to NULL. See details. |
Note that calling the constructor without argument fails, as the datasource
is a mandatory parameter. Calling the constructor with additional (not all)
parameters will affect the value of the selected parameters, leaving the other
parameters unaffected.
There are three parameters for an AnnotParam
object:
datasourceIf no type is provided, the datasource should
be GRanges
object containing the genic information. These can be obtained
using the getAnnotation
function.
typeOne of biomaRt, gff3, gtf or rda. The default is "gff3".
In all cases, the datasource is a character
describing:
For biomaRt, the name of the organism as known by the ensembl Mart, e.g. dmelanogaster or hsapiens.
For gff3, gtf or rda, the filename (including the full or relative path).
# create an object to retrieve annotation from biomaRt annotParam <- AnnotParam(datasource="Hsapiens",type="biomaRt") # get the datasource and type datasource(annotParam) type(annotParam) # create an object to retrieve annotation from an rda object # fetch the example data gAnnot.rda <- fetchData("gAnnot.rda") annotParam <- AnnotParam(datasource=gAnnot.rda,type="rda")
# create an object to retrieve annotation from biomaRt annotParam <- AnnotParam(datasource="Hsapiens",type="biomaRt") # get the datasource and type datasource(annotParam) type(annotParam) # create an object to retrieve annotation from an rda object # fetch the example data gAnnot.rda <- fetchData("gAnnot.rda") annotParam <- AnnotParam(datasource=gAnnot.rda,type="rda")
These functions and generics define 'accessors' (to get and set values) for
BamParam
objects within the easyRNASeq package.
yieldSize(object,...) paired(object) stranded(object) strandProtocol(object)
yieldSize(object,...) paired(object) stranded(object) strandProtocol(object)
object |
An object derived from class |
... |
Additional parameter inherited from the
|
The value of the corresponding slot.
Nicolas Delhomme
The BamParam
class
The RnaSeqParam yieldSize
accessor
bp <- BamParam() ## get the yieldSize Parameter ysize <-yieldSize(bp)
bp <- BamParam() ## get the yieldSize Parameter ysize <-yieldSize(bp)
This constructs a BamParam
object.
The default parameters are derived from the currently most
common RNA-Seq experimental use-case and are detailed below:
paired is TRUE, i.e. paired-end sequencing is expected.
stranded is FALSE i.e. stranded sequencing is not expected.
yieldSize is set to 1,000,000. This is the amount of reads iteratively processed from the bam file stream. It is a compromise between speed, process-parallelization and memory usage.
## S4 method for signature 'ANY' BamParam( paired = TRUE, stranded = FALSE, strandProtocol = c("reverse", "forward"), yieldSize = 1000000L )
## S4 method for signature 'ANY' BamParam( paired = TRUE, stranded = FALSE, strandProtocol = c("reverse", "forward"), yieldSize = 1000000L )
paired |
boolean whether the BAM file contains paired-end data or not |
stranded |
boolean whether the reads are strand specific |
strandProtocol |
factor with values 'reverse' and 'forward' specifying the type of strand specificity protocol. 'reverse', the reads are on the opposite strand to the gene; typical for Illumina TRUSEQ strand-specific protocol. |
yieldSize |
the amount of reads to be streamed at a time. Default to 1M |
Calling the constructor without argument result in the default parameter described above to be returned. Calling the constructor with any parameter will affect the value of the selected parameters, leaving the other parameters unaffected.
# the defaults BamParam() # change the default BamParam(paired=FALSE) BamParam(stranded=TRUE,yieldSize=1L) BamParam(stranded=TRUE,strandProtocol="forward",yieldSize=1L)
# the defaults BamParam() # change the default BamParam(paired=FALSE) BamParam(stranded=TRUE,yieldSize=1L) BamParam(stranded=TRUE,strandProtocol="forward",yieldSize=1L)
Convert a count table obtained from the easyRNASeq function into an RPKM corrected count table.
## S4 method for signature 'matrix,ANY,vector,vector' RPKM( obj, from = c("exons", "features", "transcripts", "bestExons", "geneModels", "islands"), lib.size = numeric(1), feature.size = integer(1), simplify = TRUE, ... )
## S4 method for signature 'matrix,ANY,vector,vector' RPKM( obj, from = c("exons", "features", "transcripts", "bestExons", "geneModels", "islands"), lib.size = numeric(1), feature.size = integer(1), simplify = TRUE, ... )
obj |
An object of class |
from |
Determine the kind of coverage to use, choice limited to: exons, features, transcripts, bestExons, geneModels or islands. |
lib.size |
Precise the library size. It should be a named numeric list, i.e. named after the sample names. |
feature.size |
Precise the feature (e.g. exons, genes) sizes. It should be a named numeric list, named after the feature names. |
simplify |
If set to TRUE, whenever a feature (exon, feature, ...) is duplicated in the count table, it is only returned once. |
... |
additional arguments. See details |
RPKM accepts two sets of arguments:
RNAseq,character the
... are additional arguments to be passed to the
readCounts
method.
matrix,named vectornormalize a count matrix by providing the feature sizes (e.g. gene sizes) as a named vector where the names match the row names of the count matrix and the lib sizes as a named vector where the names match the column names of the count matrix.
A matrix
containing RPKM corrected read counts.
Nicolas Delhomme
## Not run: ## get an RNAseq object rnaSeq <- easyRNASeq(filesDirectory= system.file( "extdata", package="RnaSeqTutorial"), pattern="[A,C,T,G]{6}\\.bam$", format="bam", readLength=36L, organism="Dmelanogaster", chr.sizes=as.list(seqlengths(Dmelanogaster)), annotationMethod="rda", annotationFile=system.file( "data", "gAnnot.rda", package="RnaSeqTutorial"), count="exons", outputFormat="RNAseq") ## get the RPKM rpkm <- RPKM(rnaSeq,from="exons") ## the same from a count table count.table <- readCounts(rnaSeq,count="exons") ## get the RPKM ## verify that the feature are sorted as the count.table all(.getName(rnaSeq,"exon") == rownames(count.table)) feature.size <- unlist(width(ranges(rnaSeq))) ## verify that the samples are ordered in the same way all(names(librarySize(rnaSeq)) == colnames(count.table)) ## get the RPKM rpkm <- RPKM(count.table, feature.size=feature.size, lib.size=librarySize(rnaSeq)) ## End(Not run)
## Not run: ## get an RNAseq object rnaSeq <- easyRNASeq(filesDirectory= system.file( "extdata", package="RnaSeqTutorial"), pattern="[A,C,T,G]{6}\\.bam$", format="bam", readLength=36L, organism="Dmelanogaster", chr.sizes=as.list(seqlengths(Dmelanogaster)), annotationMethod="rda", annotationFile=system.file( "data", "gAnnot.rda", package="RnaSeqTutorial"), count="exons", outputFormat="RNAseq") ## get the RPKM rpkm <- RPKM(rnaSeq,from="exons") ## the same from a count table count.table <- readCounts(rnaSeq,count="exons") ## get the RPKM ## verify that the feature are sorted as the count.table all(.getName(rnaSeq,"exon") == rownames(count.table)) feature.size <- unlist(width(ranges(rnaSeq))) ## verify that the samples are ordered in the same way all(names(librarySize(rnaSeq)) == colnames(count.table)) ## get the RPKM rpkm <- RPKM(count.table, feature.size=feature.size, lib.size=librarySize(rnaSeq)) ## End(Not run)
Computes the genomic reads' coverage from a read file in bam format or any format supported by ShortRead.
## S4 method for signature 'RNAseq' fetchCoverage( obj, format = c("aln", "bam"), filename = character(1), filter = srFilter(), type = "SolexaExport", chr.sel = c(), validity.check = TRUE, chr.map = data.frame(), ignoreWarnings = FALSE, gapped = TRUE, paired = FALSE, stranded = FALSE, bp.coverage = FALSE, ... )
## S4 method for signature 'RNAseq' fetchCoverage( obj, format = c("aln", "bam"), filename = character(1), filter = srFilter(), type = "SolexaExport", chr.sel = c(), validity.check = TRUE, chr.map = data.frame(), ignoreWarnings = FALSE, gapped = TRUE, paired = FALSE, stranded = FALSE, bp.coverage = FALSE, ... )
obj |
An |
format |
The format of the reads, one of "aln","bam". If not "bam", all the types supported by the ShortRead package are supported too. |
filename |
The full path of the file to use |
filter |
The filter to be applied when loading the data using the "aln" format |
type |
The type of data when using the "aln" format. See the ShortRead package. |
chr.sel |
A vector of chromosome names to subset the final results. |
validity.check |
Shall UCSC chromosome name convention be enforced |
chr.map |
A data.frame describing the mapping of original chromosome names towards wished chromosome names. See details. |
ignoreWarnings |
set to TRUE (bad idea! they have a good reason to be there) if you do not want warning messages. |
gapped |
Is the bam file provided containing gapped alignments? |
paired |
Is the bam file containing PE reads? |
stranded |
Is the bam file from a strand specific protocol? |
bp.coverage |
a boolean that default to FALSE to decide whether coverage is to be calculated and stored by bp |
... |
additional arguments. See details |
... for fetchCoverage: Can be used for readAligned method from package ShortRead. The use of the dots for the scanBamFlag method from package Rsamtools has been deprecated, as were the 'what' and 'isUnmappedQuery' argument to the function
An RNAseq
object. The slot readCoverage
contains a SimpleRleList object representing a list of coverage vectors,
one per chromosome.
Nicolas Delhomme
## Not run: library("RnaSeqTutorial") library(BSgenome.Dmelanogaster.UCSC.dm3) obj <- new('RNAseq', organismName="Dmelanogaster", readLength=36L, chrSize=as.list(seqlengths(Dmelanogaster)) ) obj <- fetchCoverage( obj, format="bam", filename=system.file( "extdata", "ACACTG.bam", package="RnaSeqTutorial") ) ## End(Not run)
## Not run: library("RnaSeqTutorial") library(BSgenome.Dmelanogaster.UCSC.dm3) obj <- new('RNAseq', organismName="Dmelanogaster", readLength=36L, chrSize=as.list(seqlengths(Dmelanogaster)) ) obj <- fetchCoverage( obj, format="bam", filename=system.file( "extdata", "ACACTG.bam", package="RnaSeqTutorial") ) ## End(Not run)
The fetchAnnotation
and knownOrganisms
function are now
defunct. The fetchAnnotation
function has been replaced by the
getAnnotation
method.
Nicolas Delhomme
Describes extensions to the GenomicRanges
package.
For GRanges
and
GRangesList
objects:
colnames
returns the column name of a GRanges
or
GRangesList
object.
unsafeAppend
appends two GAlignments
object together bypassing most sanity checks. Faster than the standard c
or
append
function.
colnames(x, do.NULL = TRUE, prefix = "col") unsafeAppend(obj1,obj2)
colnames(x, do.NULL = TRUE, prefix = "col") unsafeAppend(obj1,obj2)
x |
An object of the |
do.NULL |
see |
prefix |
see |
obj1 |
A |
obj2 |
A |
colnames
returns the actual column names of the elementMetadata slot of the
GRanges
or GRangesList
object.
The elementMetadata contains a DataFrame
object used
to store additional information provided by the user, such as exon ID in
our case.
unsafeAppend
appends two GAlignments
objects.
colnames
: A vector of column names.
unsafeAppend
: A GAlignments
object
Nicolas Delhomme
# an example of annotation grngs <- GRanges(seqnames=c("chr01","chr01","chr02"), ranges=IRanges( start=c(10,30,100), end=c(21,53,123)), strand=c("+","+","-"), transcripts=c("trA1","trA2","trB"), gene=c("gA","gA","gB"), exon=c("e1","e2","e3") ) # accessing the colnames colnames(grngs) # creating a GRangesList grngsList<-split(grngs,seqnames(grngs)) # accessing the colnames colnames(grngsList) # For unsafeAppend library(GenomicAlignments) unsafeAppend(GAlignments(),GAlignments())
# an example of annotation grngs <- GRanges(seqnames=c("chr01","chr01","chr02"), ranges=IRanges( start=c(10,30,100), end=c(21,53,123)), strand=c("+","+","-"), transcripts=c("trA1","trA2","trB"), gene=c("gA","gA","gB"), exon=c("e1","e2","e3") ) # accessing the colnames colnames(grngs) # creating a GRangesList grngsList<-split(grngs,seqnames(grngs)) # accessing the colnames colnames(grngsList) # For unsafeAppend library(GenomicAlignments) unsafeAppend(GAlignments(),GAlignments())
Process the coverage to locate regions with a minimum coverage (min.cov). If regions are separated by a gap shorter than a maximum length (max.gap), they are unified. Only islands longer than min.length are returned. These functions are now outdated and would need to be actualized.
## S4 method for signature 'RNAseq' findIslands( obj, max.gap = integer(1), min.cov = 1L, min.length = integer(1), plot = TRUE, ... )
## S4 method for signature 'RNAseq' findIslands( obj, max.gap = integer(1), min.cov = 1L, min.length = integer(1), plot = TRUE, ... )
obj |
An object of class |
max.gap |
Maximum gap between two peaks to build an island |
min.cov |
Minimum coverage for an island to be returned |
min.length |
Minimum size of an island to be returned |
plot |
If TRUE, draw plots of coverage distribution. Help the user to select an appropriate value for the minimum coverage. |
... |
See details |
... are for providing additional options to the
hist
plot function.
An RNAseq
object with the readIsland slot set with a
GRanges containing the selected islands and the readCount slot
actualized with a list containing the count table per island.
Nicolas Delhomme
## Not run: # NOTE that this function might need to be actualized obj <- new('RNAseq', organismName="Dmelanogaster", readLength=36L, chrSize=as.list(seqlengths(Dmelanogaster)) ) # fetch the example data bamFilePath <- fetchData("ACACTG.bam") obj <- fetchCoverage(obj,format="bam",filename=bamFilePath) obj <- findIslands( obj, max.gap=10L, min.cov=10L, min.length=200L) ## End(Not run)
## Not run: # NOTE that this function might need to be actualized obj <- new('RNAseq', organismName="Dmelanogaster", readLength=36L, chrSize=as.list(seqlengths(Dmelanogaster)) ) # fetch the example data bamFilePath <- fetchData("ACACTG.bam") obj <- fetchCoverage(obj,format="bam",filename=bamFilePath) obj <- findIslands( obj, max.gap=10L, min.cov=10L, min.length=200L) ## End(Not run)
These functions and generics define 'accessors' (to get and set values) for
RnaSeqParam
objects within the easyRNASeq package.
Implemented are:
annotParam
bamParam
countBy
datasource
paired
precision
stranded
strandProtocol
yieldSize
## S4 method for signature 'RnaSeqParam' yieldSize(object)
## S4 method for signature 'RnaSeqParam' yieldSize(object)
object |
An object derived from class |
The value of the corresponding slot.
Nicolas Delhomme
The AnnotParam
class
The BamParam
class
The RnaSeqParam
class
The BamParam yieldSize
accessor
## create the necessary AnnotParam annotParam <- AnnotParam( datasource=system.file( "extdata", "Dmel-mRNA-exon-r5.52.gff3", package="RnaSeqTutorial")) ## create the RnaSeqParam rsp <- RnaSeqParam(annotParam=annotParam) ## get the yieldSize Parameter ysize <-yieldSize(rsp)
## create the necessary AnnotParam annotParam <- AnnotParam( datasource=system.file( "extdata", "Dmel-mRNA-exon-r5.52.gff3", package="RnaSeqTutorial")) ## create the RnaSeqParam rsp <- RnaSeqParam(annotParam=annotParam) ## get the yieldSize Parameter ysize <-yieldSize(rsp)
This constructs a RnaSeqParam
object, that combines
all the necessary parameters for the analysis of RNA-Seq data. As much as
possible, these parameters are determined automa-gi/ti-cally. It describes
three sets of parameters:
parameters describing the annotation
parameters describing the BAM files, i.e. the type of sequencing that was conducted.
parameters describing how the counting should be done.
The first two are provided through sepcific objects: AnnotParam
and
BamParam
respectively. The third one is a set
constituted of:
countBy: the feature per which the counts should be summarized ( exon, transcript or gene. A forth possibility - feature - can be used to define arbitrary genomic loci)
precision: the precision at which the counts should be performed:
bp or reads. bp used to be the default in the easyRNASeq
package,
whereas now reads is, following the Bioconductor main stream development.
The default parameters for the BamParam
parameter are
derived from the currently most common RNA-Seq experimental use-case:
strand-specific paired-end Illumina sequencing. See the respective manual
pages of AnnotParam
and
BamParam
for more details.
## S4 method for signature 'ANY' RnaSeqParam( annotParam = AnnotParam(), bamParam = BamParam(), countBy = c("exons", "features", "genes", "transcripts"), precision = c("read", "bp") )
## S4 method for signature 'ANY' RnaSeqParam( annotParam = AnnotParam(), bamParam = BamParam(), countBy = c("exons", "features", "genes", "transcripts"), precision = c("read", "bp") )
annotParam |
An object derived from class |
bamParam |
An object derived from class |
countBy |
TODO |
precision |
A character value, either 'read' or 'bp' that defines the precision at which counting is done, either per read or per covered bp. 'read' is the default. |
annotParam <- AnnotParam( datasource=system.file( "extdata", "Dmel-mRNA-exon-r5.52.gff3", package="RnaSeqTutorial")) ## create the RnaSeqParam rsp <- RnaSeqParam(annotParam=annotParam) ## change some defaults RnaSeqParam(countBy="features",annotParam=annotParam) RnaSeqParam(bamParam=BamParam(stranded=TRUE,yieldSize=1L),annotParam=annotParam)
annotParam <- AnnotParam( datasource=system.file( "extdata", "Dmel-mRNA-exon-r5.52.gff3", package="RnaSeqTutorial")) ## create the RnaSeqParam rsp <- RnaSeqParam(annotParam=annotParam) ## change some defaults RnaSeqParam(countBy="features",annotParam=annotParam) RnaSeqParam(bamParam=BamParam(stranded=TRUE,yieldSize=1L),annotParam=annotParam)
Summarize the read counts per exon, feature, gene, transcript or island.
exonCounts
: for that summarization, reads are
summarized per exons. An "exon" field is necessary in the annotation object
for this to work. See
easyRNASeq annotation
methods
for more details on the annotation object.
featureCounts
is similar to the 'exons' one. This is just a
wrapper to summarize count for genomic features that are not exon related.
I.e. one could use it to measure eRNAs. Again, a "feature" field is
necessary in the annotation object for this to work.
geneCounts
sums the counts per either bestExons
or
geneModels
. In either case, the annotation object needs to contain
both an "exon" and a "gene" field.
islandCounts
sums the
counts per computed islands.
transcriptCounts
sums the counts
obtained by exons into their respective transcripts. Note that this often
result in counting some reads several times. For this function to work you
need both an "exon" and a "transcript" field in your annotation object. To
avoid this, one could create transcript specific synthetic exons, i.e.
features that would be unique to a transcript. To offer this possibility,
transcripts count can be summarized from "features", in which case the
annotation object need to have both the "feature" and "transcript" fields
defined.
exonCounts(obj) featureCounts(obj) transcriptCounts(obj,from="exons") geneCounts(obj,summarization=c("bestExons","geneModels"),...) islandCounts(obj,force=FALSE,...)
exonCounts(obj) featureCounts(obj) transcriptCounts(obj,from="exons") geneCounts(obj,summarization=c("bestExons","geneModels"),...) islandCounts(obj,force=FALSE,...)
obj |
An object derived from class |
force |
For |
from |
either "exons" or "features" can be used to summarize per transcript |
summarization |
Method use for summarize genes |
... |
See details |
... for
geneCounts: additional options for the
.geneModelSummarization
islandCounts: additional options for
findIslands
A numeric vector containing count per exon, feature, gene or transcript.
Nicolas Delhomme
easyRNASeq
annotation methods
.geneModelSummarization
findIslands
## Not run: library(BSgenome.Dmelanogaster.UCSC.dm3) # get the example data files tdir <- tutorialData() # get an example annotation file - we retrieve it from GitHub using curl gAnnot.rda <- fetchData("gAnnot.rda") # create an RNAseq object # summarizing 2 bam files by exons rnaSeq <- easyRNASeq(tdir, organism="Dmelanogaster", chr.sizes=seqlengths(Dmelanogaster), readLength=36L, annotationMethod="rda", annotationFile=gAnnot.rda, format="bam", count="exons", pattern="[A,C,T,G]{6}\\.bam$", outputFormat="RNAseq") # summing up the exons by transcript rnaSeq <- transcriptCounts(rnaSeq) ## End(Not run)
## Not run: library(BSgenome.Dmelanogaster.UCSC.dm3) # get the example data files tdir <- tutorialData() # get an example annotation file - we retrieve it from GitHub using curl gAnnot.rda <- fetchData("gAnnot.rda") # create an RNAseq object # summarizing 2 bam files by exons rnaSeq <- easyRNASeq(tdir, organism="Dmelanogaster", chr.sizes=seqlengths(Dmelanogaster), readLength=36L, annotationMethod="rda", annotationFile=gAnnot.rda, format="bam", count="exons", pattern="[A,C,T,G]{6}\\.bam$", outputFormat="RNAseq") # summing up the exons by transcript rnaSeq <- transcriptCounts(rnaSeq) ## End(Not run)
The package contains a dataset from the Robinson, Delhomme et al., 2014 publication.
RobinsonDelhomme2014a normalised expression count table. This dataset was generated from 17 Populus tremula - Eurasian aspen - trees used to assess the sexual dimorphism of this dioecious species. This count matrix has been generating following published pre-processing guidelines - see http://www.epigenesys.eu/en/protocols/bio-informatics/1283-guidelines-for-rna-seq-data-analysis - and the resulting HTSeq files have been collated and the obtained raw count matrix submitted to a variance stabilising transformation. Subsequently, the values have been transformed so that the minimal vst values - that corresponds to an absence of expression - is 0. Hence the counts in the matrix are library-size normalized, variance stabilised expression values, with a minimal value of 0.
This function is a wrapper around the more low level functionalities of the package. Is the easiest way to get a count matrix from a set of read files. It does the following:
use ShortRead/Rsamtools
methods
for loading/pre-processing the data.
fetch the annotations
depending on the provided arguments
get the reads coverage
from
the provided file(s)
summarize the
reads
according to the selected summarization features
optionally
apply
a data correction (i.e. generating RPKM).
use edgeR methods
for
post-processing the data, this being strongly recommended over RPKM).
## S4 method for signature 'character' easyRNASeq( filesDirectory = getwd(), organism = character(1), chr.sizes = c("auto"), readLength = integer(1), annotationMethod = c("biomaRt", "env", "gff", "gtf", "rda"), annotationFile = character(1), annotationObject = GRangesList(), format = c("bam", "aln"), gapped = FALSE, count = c("exons", "features", "genes", "islands", "transcripts"), outputFormat = c("matrix", "SummarizedExperiment", "edgeR", "RNAseq"), pattern = character(1), filenames = character(0), nbCore = 1, filter = srFilter(), type = "SolexaExport", chr.sel = c(), summarization = c("bestExons", "geneModels"), normalize = FALSE, max.gap = integer(1), min.cov = 1L, min.length = integer(1), plot = TRUE, conditions = c(), validity.check = TRUE, chr.map = data.frame(), ignoreWarnings = FALSE, silent = FALSE, ... )
## S4 method for signature 'character' easyRNASeq( filesDirectory = getwd(), organism = character(1), chr.sizes = c("auto"), readLength = integer(1), annotationMethod = c("biomaRt", "env", "gff", "gtf", "rda"), annotationFile = character(1), annotationObject = GRangesList(), format = c("bam", "aln"), gapped = FALSE, count = c("exons", "features", "genes", "islands", "transcripts"), outputFormat = c("matrix", "SummarizedExperiment", "edgeR", "RNAseq"), pattern = character(1), filenames = character(0), nbCore = 1, filter = srFilter(), type = "SolexaExport", chr.sel = c(), summarization = c("bestExons", "geneModels"), normalize = FALSE, max.gap = integer(1), min.cov = 1L, min.length = integer(1), plot = TRUE, conditions = c(), validity.check = TRUE, chr.map = data.frame(), ignoreWarnings = FALSE, silent = FALSE, ... )
filesDirectory |
The directory where the files to be used are located. Defaults to the current directory. |
organism |
A character string describing the organism |
chr.sizes |
A vector or a list containing the chromosomes' size of the selected organism or simply the string "auto". See details. |
readLength |
The read length in bp |
annotationMethod |
The method to fetch the annotation, one of "biomaRt","env","gff","gtf" or "rda". All methods but "biomaRt" and "env" require the annotationFile to be set. The "env" method requires the annotationObject to be set. |
annotationFile |
The location (full path) of the annotation file |
annotationObject |
A
|
format |
The format of the reads, one of "aln","bam". If not "bam", all the types supported by the ShortRead package are supported too. As of version 1.3.5, it defaults to bam. |
gapped |
Is the bam file provided containing gapped alignments? |
count |
The feature used to summarize the reads. One of 'exons','features','genes','islands' or 'transcripts'. See details. |
outputFormat |
By default, easyRNASeq returns a matrix.
If one of |
pattern |
For easyRNASeq, the pattern of file to look for, e.g. "bam$" |
filenames |
The name, not the path, of the files to use |
nbCore |
defines how many CPU core to use when computing the geneModels. Use the default parallel library |
filter |
The filter to be applied when loading the data using the "aln" format |
type |
The type of data when using the "aln" format. See the ShortRead library. |
chr.sel |
A vector of chromosome names to subset the final results. |
summarization |
A character defining which method to use when summarizing reads by genes. So far, only "geneModels" is available. |
normalize |
A boolean to convert the returned counts in RPKM. Valid
when the |
max.gap |
When computing read islands, the maximal gap size allowed between two islands to merge them |
min.cov |
When computing read islands, the minimal coverage to take into account for calling an island |
min.length |
The minimal size an island should have to be kept |
plot |
Whether or not to plot assessment graphs. |
conditions |
A vector of descriptor, each sample must have a descriptor if you use outputFormat edgeR. The size of this list must be equal to the number of sample. In addition the vector should be named with the filename of the corresponding samples. |
validity.check |
Shall UCSC chromosome name convention be enforced? This is only supported for a set of organisms, which are Dmelanogaster, Hsapiens, Mmusculus and Rnorvegicus; otherwise the argument 'chr.map' can be used to complement it. |
chr.map |
A data.frame describing the mapping of original chromosome names towards wished chromosome names. See details. |
ignoreWarnings |
set to TRUE (bad idea! they have a good reason to be there) if you do not want warning messages. |
silent |
set to TRUE if you do not want messages to be printed out. |
... |
additional arguments. See details |
... Additional arguments for different functions:
For the biomaRt getBM
function
For the readGffGtf
internal function that takes an optional arguments: annotation.type that
default to "exon" (used to select the proper rows of the gff or gtf file)
For to the list.files
function used to locate the read files.
the annotationObject When the
annotationMethods
is set to env
or rda
, a properly
formatted GRangesList
object need to be
provided. Check the vignette or the examples at
the bottom of this page for examples. The data.frame-like structure of
these objects is where easyRNASeq
will look for the exon, feature,
transcript, or gene identifier. Depending on the count method selected, it
is essential that the akin column name is present in the annotationObject.
E.g. when counting "features", the annotationObject has to contain a
"feature" field.
the chr.map The chr.map argument for the easyRNASeq function only works for an "organismName" of value 'custom' with the "validity.check" parameter set to 'TRUE'. This data.frame should contain two columns named 'from' and 'to'. The row should represent the chromosome name in your original data and the wished name in the output of the function.
count The count can be summarized by exons, features, genes, islands or transcripts. While exons, genes and transcripts are obvious, "features" describes any features provided by the user, e.g. enhancer loci. These are processed as the exons are. For "islands", it is for an under development function that identifies de-novo expression loci and count the number of reads overlapping them.
chr.sizes If set to "auto", then the format has to be "bam", in which case the chromosome names and size are extracted from the BAM header
Returns a count table (a matrix of m features x n samples). If the
outputFormat
option has been set, a corresponding object is returned:
a RangedSummarizedExperiment
, a
edgeR:DGEList
or RNAseq
.
Nicolas Delhomme
RNAseq
RangedSummarizedExperiment
edgeR:DGEList
ShortRead:readAligned
## Not run: library(BSgenome.Dmelanogaster.UCSC.dm3) # get the example data tdir <- tutorialData() # get an example annotation file gAnnot.rda <- fetchData("gAnnot.rda") # creating a count table from 4 bam files count.table <- easyRNASeq(filesDirectory="tdir", pattern="[A,C,T,G]{6}\\.bam$", format="bam", readLength=36L, organism="Dmelanogaster", chr.sizes=seqlengths(Dmelanogaster), annotationMethod="rda", annotationFile=gAnnot.rda, count="exons") # an example of a chr.map chr.map <- data.frame(from=c("2L","2R","MT"),to=c("chr2L","chr2R","chrMT")) # an example of a GRangesList annotation grngs <- GRanges(seqnames=c("chr01","chr01","chr02"), ranges=IRanges( start=c(10,30,100), end=c(21,53,123)), strand=c("+","+","-"), transcript=c("trA1","trA2","trB"), gene=c("gA","gA","gB"), exon=c("e1","e2","e3") ) grngsList<-split(grngs,seqnames(grngs)) ## End(Not run)
## Not run: library(BSgenome.Dmelanogaster.UCSC.dm3) # get the example data tdir <- tutorialData() # get an example annotation file gAnnot.rda <- fetchData("gAnnot.rda") # creating a count table from 4 bam files count.table <- easyRNASeq(filesDirectory="tdir", pattern="[A,C,T,G]{6}\\.bam$", format="bam", readLength=36L, organism="Dmelanogaster", chr.sizes=seqlengths(Dmelanogaster), annotationMethod="rda", annotationFile=gAnnot.rda, count="exons") # an example of a chr.map chr.map <- data.frame(from=c("2L","2R","MT"),to=c("chr2L","chr2R","chrMT")) # an example of a GRangesList annotation grngs <- GRanges(seqnames=c("chr01","chr01","chr02"), ranges=IRanges( start=c(10,30,100), end=c(21,53,123)), strand=c("+","+","-"), transcript=c("trA1","trA2","trB"), gene=c("gA","gA","gB"), exon=c("e1","e2","e3") ) grngsList<-split(grngs,seqnames(grngs)) ## End(Not run)
This method extends the edgeR package by offering the functionality to plot the effect of the normalization factor.
## S4 method for signature 'DGEList,character,character' plotNormalizationFactors( obj = DGEList(), cond1 = character(1), cond2 = character(1) )
## S4 method for signature 'DGEList,character,character' plotNormalizationFactors( obj = DGEList(), cond1 = character(1), cond2 = character(1) )
obj |
An object of class |
cond1 |
A character string describing the first condition |
cond2 |
A character string describing the second condition |
none
Nicolas Delhomme
## Not run: ## create the object dgeList <- DGEList(counts,group) ## calculate the sie factors dgeList <- calcNormFactors(dgeList) ## plot them apply(combn(rownames(dgeList$samples),2), 2, function(co,obj){plotNormalizationFactors(obj,co[1],co[2])},dgeList) ## End(Not run)
## Not run: ## create the object dgeList <- DGEList(counts,group) ## calculate the sie factors dgeList <- calcNormFactors(dgeList) ## plot them apply(combn(rownames(dgeList$samples),2), 2, function(co,obj){plotNormalizationFactors(obj,co[1],co[2])},dgeList) ## End(Not run)
Check if the bam file represented by a BamFile
object exists.
## S4 method for signature 'BamFile' file.exists(...)
## S4 method for signature 'BamFile' file.exists(...)
... |
a |
Checkk if the bam file linked to by a
BamFile
object exists.
Another way to access the content of the gff type column.
## S4 method for signature 'Genome_intervals' type(x)
## S4 method for signature 'Genome_intervals' type(x)
x |
An object of class |
The content of the type column, usually a factor or a character vector
Nicolas Delhomme
# library library(genomeIntervals) # fetch the example data gffFilePath <- fetchData("Dmel-mRNA-exon-r5.52.gff3.gz") annot<-readGff3(gffFilePath,quiet=TRUE) type(annot)
# library library(genomeIntervals) # fetch the example data gffFilePath <- fetchData("Dmel-mRNA-exon-r5.52.gff3.gz") annot<-readGff3(gffFilePath,quiet=TRUE) type(annot)
A utility function to create a BamFileList-class
object from a set of filenames. The filenames need to contain the file path if they
are not in the working directory.
## S4 method for signature 'character,character' getBamFileList(filenames = character(0), indexnames = character(0))
## S4 method for signature 'character,character' getBamFileList(filenames = character(0), indexnames = character(0))
filenames |
a character vector containing fully defined BAM file filenames |
indexnames |
a character vector containing fully defined BAM index file filenames |
# tutorial data - store the data in the BiocCache tdir <- tutorialData() # creating a BamFileList using a directory and pattern # using filenames (from the Bioc cache) filenames <- dir(tdir,pattern="[A,C,T,G]{6}\\.bam$",full.names=TRUE) indexnames <- sapply(paste0(sub(".*_","",basename(filenames)),".bai"),fetchData) bfl <- getBamFileList(filenames,indexnames) # get them recursively filenames <- dir(path=tdir,pattern="[A,C,T,G]{6}\\.bam$", full.names=TRUE,recursive=TRUE) indexnames <- sapply(paste0(sub(".*_","",basename(filenames)),".bai"),fetchData) bfl <- getBamFileList(filenames,indexnames)
# tutorial data - store the data in the BiocCache tdir <- tutorialData() # creating a BamFileList using a directory and pattern # using filenames (from the Bioc cache) filenames <- dir(tdir,pattern="[A,C,T,G]{6}\\.bam$",full.names=TRUE) indexnames <- sapply(paste0(sub(".*_","",basename(filenames)),".bai"),fetchData) bfl <- getBamFileList(filenames,indexnames) # get them recursively filenames <- dir(path=tdir,pattern="[A,C,T,G]{6}\\.bam$", full.names=TRUE,recursive=TRUE) indexnames <- sapply(paste0(sub(".*_","",basename(filenames)),".bai"),fetchData) bfl <- getBamFileList(filenames,indexnames)
Return the ranges of the genomic annotation.
## S4 method for signature 'RNAseq' ranges(x)
## S4 method for signature 'RNAseq' ranges(x)
x |
An object of the |
It retrieves the object stored in the genomicAnnotation slot of the RNAseq
object and apply the ranges
function on it.
An IRangesList
object, where the split
is performed by seqnames (e.g. chromosomes).
Nicolas Delhomme
## Not run: library("RnaSeqTutorial") obj <- getAnnotation( AnnotParam( organism="Dmelanogaster", datasource=system.file( "extdata", "Dmel-mRNA-exon-r5.52.gff3", package="RnaSeqTutorial"), type="gff3" )) ranges(obj) ## End(Not run)
## Not run: library("RnaSeqTutorial") obj <- getAnnotation( AnnotParam( organism="Dmelanogaster", datasource=system.file( "extdata", "Dmel-mRNA-exon-r5.52.gff3", package="RnaSeqTutorial"), type="gff3" )) ranges(obj) ## End(Not run)
Functions defined in the easyRNASeq package that enhance the parallel package.
## S4 method for signature 'list,function' parallelize(obj = list(), fun = NULL, nnodes = 1, ...)
## S4 method for signature 'list,function' parallelize(obj = list(), fun = NULL, nnodes = 1, ...)
obj |
the object which processing has to be parallelizes |
fun |
the function to be applied in parallel |
nnodes |
the number of nodes to use |
... |
additional arguments passed to the function |
The parallelize function ease the use of the parallel package. If the number of nodes provided by the user is 1, then a simple 'lapply' is used, otherwise a cluster object is created and the object dispatched for parallelization.
the result of the clusterApply
function.
Nicolas Delhomme
clusterApply
makePSOCKcluster
and stopCluster
in makeCluster
parallelize(list(a<-c(1,2),b<-c(2,1)),sum,nnodes=1)
parallelize(list(a<-c(1,2),b<-c(2,1)),sum,nnodes=1)
Print information about a RNAseq
,
AnnotParam
, BamParam
or
RnaSeqParam
object.
## S4 method for signature 'RNAseq' print(x, verbose = FALSE, ...)
## S4 method for signature 'RNAseq' print(x, verbose = FALSE, ...)
x |
An object from class |
verbose |
A logical to have a verbose or not output. Default to FALSE
For object of the |
... |
Additional arguments, currently unused. |
Print information about the provided object.
Nicolas Delhomme
A class holding all the necessary information and annotation to summarize couts (number of reads) per features (i.e. exons or transcripts or genes) for RNA-Seq experiments.
Objects can be created by calls of the
form new("RNAseq", ...)
.
Nicolas Delhomme
showClass("RNAseq")
showClass("RNAseq")
A class holding all the necessary parameters to process a bam file issued
from an RNA-Seq experiment together with the related annotation to compute
a count-table using the simpleRNASeq function
.
The precision slot is used to determine the count unit:
readsdefault. The standard summarizeOverlaps-methods
function is used to extract the read counts
bpThe easyRNASeq summarization functions
are used to extract the read covered bp counts
Objects can be created by calls of the
form new("RnaSeqParam", ...)
or using the RnaSeqParam constructor.
Nicolas Delhomme
showClass("RnaSeqParam")
showClass("RnaSeqParam")
These are functions extending the ShortRead packages capabilities:
demultiplex(obj,barcodes=c(),barcodes.qty=12,barcode.length=6, edition.dist=2,type=c("independant","within"),index.only=FALSE,mc.cores=1L) barcodePlot(obj,barcodes=c(),type=c("independant","within"), barcode.length=6,show.barcode=20,...) chastityFilter(.name="Illumina Chastity Filter") naPositionFilter(.name="NA Position Filter")
demultiplex(obj,barcodes=c(),barcodes.qty=12,barcode.length=6, edition.dist=2,type=c("independant","within"),index.only=FALSE,mc.cores=1L) barcodePlot(obj,barcodes=c(),type=c("independant","within"), barcode.length=6,show.barcode=20,...) chastityFilter(.name="Illumina Chastity Filter") naPositionFilter(.name="NA Position Filter")
obj |
An object derived from class |
barcodes |
A character vector describing the multiplex (i.e. barcode) sequences used in the experiment. |
barcodes.qty |
An integer describing the number of barcodes |
barcode.length |
An integer describing the barcode length in bp |
edition.dist |
The maximal edition distance (i.e. the number of changes to apply), to accept an incorrectly sequenced barcode. |
type |
The type of barcode used. |
index.only |
simply return the index and not the barcode themselves. |
mc.cores |
A parameter ultimately passed to srdistance to enable parallel processing on mc.cores. On linux and Mac only, windows task remain serially processed. |
.name |
An internal string describing the filter |
show.barcode |
An integer specifying how many barcodes should be displayed in the final output. |
... |
additional graphic parameters |
barcodePlot
Creates a plot showing the barcode
distribution of a multiplexed sequencing library.
chastityFilter
Creates a SRFilter
instance
that filters SolexaExport read according to the chastity filtering value.
demultiplex
Split a single AlignedRead
object into a list of AlignedRead
objects according to
the barcodes provided by the user. It supports multicore processing
but has a default serial behaviour.
naPositionFilter
Creates a
SRFilter
instance that filters SolexaExport read
having an NA position.
When demultiplexing, the function if provided with just the
AlignedRead
will try to find out how many barcodes
were used and what they are. This is unwise to do as many barcodes will get
wrongly sequenced and not always the most frequent ones are the one you
used! It's therefore strongly advised to specify the barcodes' sequences
that were used.
barcodePlot
returns invisibly the barcode
frequencies.
chastityFilter
returns a
SRFilter
instance.
demultiplex
returns a
list of AlignedRead
objects.
naPositionFilter
returns a SRFilter
instance.
Nicolas Delhomme
## Not run: # the barcode barcodes=c("ACACTG","ACTAGC","ATGGCT","TTGCGA") invisible(download.file(paste0("https://github.com/UPSCb/UPSCb/raw/", "master/tutorial/easyRNASeq/multiplex_export.txt.gz"), "multiplex_export.txt.gz")) # the multiplexed data alns <- readAligned(".", pattern="multiplex_export", filter=compose( chastityFilter(), nFilter(2), chromosomeFilter(regex="chr")), type="SolexaExport", withAll=TRUE) # barcode plot barcodePlot(alns, barcodes=barcodes, type="within", barcode.length=6, show.barcode=20, main="All samples", xlim=c(0,0.5)) # demultiplexing dem.alns <- demultiplex(alns, barcodes=barcodes, edition.dist=2, barcodes.qty=4, type="within") # plotting again par(mfrow=c(2,2)) barcode.frequencies <- lapply( names(dem.alns$barcodes), function(barcode,alns){ barcodePlot( alns$barcodes[[barcode]], barcodes=barcode, type="within",barcode.length=6, show.barcode=20, main=paste( "Expected barcode:", barcode)) },dem.alns) ## End(Not run)
## Not run: # the barcode barcodes=c("ACACTG","ACTAGC","ATGGCT","TTGCGA") invisible(download.file(paste0("https://github.com/UPSCb/UPSCb/raw/", "master/tutorial/easyRNASeq/multiplex_export.txt.gz"), "multiplex_export.txt.gz")) # the multiplexed data alns <- readAligned(".", pattern="multiplex_export", filter=compose( chastityFilter(), nFilter(2), chromosomeFilter(regex="chr")), type="SolexaExport", withAll=TRUE) # barcode plot barcodePlot(alns, barcodes=barcodes, type="within", barcode.length=6, show.barcode=20, main="All samples", xlim=c(0,0.5)) # demultiplexing dem.alns <- demultiplex(alns, barcodes=barcodes, edition.dist=2, barcodes.qty=4, type="within") # plotting again par(mfrow=c(2,2)) barcode.frequencies <- lapply( names(dem.alns$barcodes), function(barcode,alns){ barcodePlot( alns$barcodes[[barcode]], barcodes=barcode, type="within",barcode.length=6, show.barcode=20, main=paste( "Expected barcode:", barcode)) },dem.alns) ## End(Not run)
Display the content of a RNAseq
,
AnnotParam
, BamParam
or
RnaSeqParam
object.
## S4 method for signature 'RNAseq' show(object)
## S4 method for signature 'RNAseq' show(object)
object |
An object of the |
Display the values of the different slots of the
RNAseq
object.
The respective object settings.
This function is a wrapper around the more low level functionalities of the
package. It is the simplest way to get a RangedSummarizedExperiment
object from a set of bam files. RangedSummarizedExperiment
are
containers meant to hold any Next-Generation Sequencing experiment results and
metadata. The simpleRNASeq method replaces the
easyRNASeq
function to
simplify the usability. It does the following:
use GenomicAlignments
for reading/pre-processing the BAM files.
get the annotations
depending on the selected parameters
calculate the coverage from the provided file(s)
summarizes
the
read counts according to the selected summarization
returns a RangedSummarizedExperiment
object.
## S4 method for signature 'BamFileList,RnaSeqParam' simpleRNASeq( bamFiles = BamFileList(), param = RnaSeqParam(), nnodes = 1, verbose = TRUE, override = FALSE )
## S4 method for signature 'BamFileList,RnaSeqParam' simpleRNASeq( bamFiles = BamFileList(), param = RnaSeqParam(), nnodes = 1, verbose = TRUE, override = FALSE )
bamFiles |
a |
param |
RnaSeqParam a |
nnodes |
The number of CPU cores to use in parallel |
verbose |
a logical to be report progress or not. |
override |
Should the provided parameters override the detected ones |
returns a RangedSummarizedExperiment
object.
Nicolas Delhomme
For the input:
For the output:
RangedSummarizedExperiment
For related functions:
# the data tdir <- tutorialData() annot <- fetchData("Drosophila_melanogaster.BDGP5.77.with-chr.gtf.gz") # create the BamFileList, get the BAM and BAI index files from the Bioc cache filenames <- dir(tdir,pattern="[A,T].*\\.bam$",full.names=TRUE) indexnames <- sapply(paste0(sub(".*_","",basename(filenames)),".bai"),fetchData) bamFiles <- getBamFileList(filenames,indexnames) # create the AnnotParam annotParam <- AnnotParam(annot,type="gtf") # create the RnaSeqParam rnaSeqParam <- RnaSeqParam(annotParam=annotParam) # get a RangedSummarizedExperiment containing the counts table sexp <- simpleRNASeq( bamFiles=bamFiles, param=rnaSeqParam, verbose=TRUE ) # get the counts assays(sexp)$exons
# the data tdir <- tutorialData() annot <- fetchData("Drosophila_melanogaster.BDGP5.77.with-chr.gtf.gz") # create the BamFileList, get the BAM and BAI index files from the Bioc cache filenames <- dir(tdir,pattern="[A,T].*\\.bam$",full.names=TRUE) indexnames <- sapply(paste0(sub(".*_","",basename(filenames)),".bai"),fetchData) bamFiles <- getBamFileList(filenames,indexnames) # create the AnnotParam annotParam <- AnnotParam(annot,type="gtf") # create the RnaSeqParam rnaSeqParam <- RnaSeqParam(annotParam=annotParam) # get a RangedSummarizedExperiment containing the counts table sexp <- simpleRNASeq( bamFiles=bamFiles, param=rnaSeqParam, verbose=TRUE ) # get the counts assays(sexp)$exons
Describes extensions to the Rsamtools package.
For BamFile
and
BamFileList
objects:
validate
validates a BamFile
or
BamFileList
object.
## S4 method for signature 'BamFile' validate(obj, header = TRUE, cross.validation = TRUE)
## S4 method for signature 'BamFile' validate(obj, header = TRUE, cross.validation = TRUE)
obj |
An object of the |
header |
a boolean to (de)activate the check for a BAM header |
cross.validation |
a boolean - only valid for
|
validate
checks whether the BAM file exists and if a BAI index is present.
validate
returns invisibly a vector of boolean.
Fails anyway if any file is missing.
Nicolas Delhomme
# retrieve the data tdir <- tutorialData() # get the bam file path from the Bioc cache filenames <- dir(tdir,pattern="[A,C,T,G]{6}\\.bam$",full.names=TRUE) # retrieve the index from the Bioc cache too inxnames <- sapply(paste0(sub(".*_","",basename(filenames)),".bai"),fetchData) bfl <-BamFileList(filenames,index=inxnames) validate(bfl)
# retrieve the data tdir <- tutorialData() # get the bam file path from the Bioc cache filenames <- dir(tdir,pattern="[A,C,T,G]{6}\\.bam$",full.names=TRUE) # retrieve the index from the Bioc cache too inxnames <- sapply(paste0(sub(".*_","",basename(filenames)),".bai"),fetchData) bfl <-BamFileList(filenames,index=inxnames) validate(bfl)