Package 'easyRNASeq'

Title: Count summarization and normalization for RNA-Seq data
Description: Calculates the coverage of high-throughput short-reads against a genome of reference and summarizes it per feature of interest (e.g. exon, gene, transcript). The data can be normalized as 'RPKM' or by the 'DESeq' or 'edgeR' package.
Authors: Nicolas Delhomme, Ismael Padioleau, Bastian Schiffthaler, Niklas Maehler
Maintainer: Nicolas Delhomme <[email protected]>
License: Artistic-2.0
Version: 2.43.0
Built: 2024-11-19 04:45:07 UTC
Source: https://github.com/bioc/easyRNASeq

Help Index


Class "AnnotParam"

Description

A class holding all the necessary parameters to retrieve the necessary annotation for processing an RNA-Seq experiment.

Objects from the Class

Objects can be created by calls of the form new("AnnotParamCharacter", ...) or new("AnnotParamObject", ...) (both subject to API changes) or using the AnnotParam constructor (failsafe, prefered). The class AnnotParam in itself is virtual and hence cannot be instantiated.

Author(s)

Nicolas Delhomme

See Also

Examples

showClass("AnnotParam")

Class "BamParam"

Description

A class describing the parameters of a bam file issued from an RNA-Seq experiment.

Objects from the Class

Objects can be created by calls of the form new("BamParam", ...) or using the BamParam constructor.

Slots from the Class

The BamParam class has the following slots:

  • paired

  • stranded

  • strandProtocol

  • yieldSize

all of which can be accessed using the accordingly names accessor.

Author(s)

Nicolas Delhomme

See Also

Examples

showClass("BamParam")

Extend the basename function to display Rsamtools BamFile class basename

Description

Display the basename of the bam file represented by a BamFile object.

Usage

## S4 method for signature 'BamFile'
basename(path)

Arguments

path

an object of class BamFile or BamFileList

Methods

list("signature(object = \"BamFile\")")

Display the basename of the bam file linked to by a BamFile object.


Manages the data necessary for the examples using BiocFileCache

Description

Manages the tutorial, example and vignette data using the BiocFileCache package

Usage

fetchData(fileURL)
tutorialData(...)

Arguments

...

unused for the time being

fileURL

The URL of the file to retrieve. Alternatively, the ID of the file in the BiocFileCache (i.e. the file basename), can be used.

Methods

.get_cache

internal function to set up the cache

fetchData

A function to fetch tutorial data, a file at a time

tutorialData

the function to retrieve all the tutorial data and cache it, if it is not already available

vignetteData

the function to retrieve all the tutorial data and cache it, if it is not already available

See Also

BiocFileCache

Examples

tdir <- tutorialData()
gAnnot.path <- fetchData("gAnnot.rda")
vdir <- vignetteData()
md5.txt <- fetchData("md5.txt")

Methods to create synthetic transcripts

Description

This function create a set of synthetic transcripts from a provided annotation file in "gff3" or "gtf" format. As detailed in http://www.epigenesys.eu/en/protocols/bio-informatics/1283-guidelines-for-rna-seq-data-analysis, one major caveat of estimating gene expression using aligned RNA-Seq reads is that a single read, which originated from a single mRNA molecule, might sometimes align to several features (e.g. transcripts or genes) with alignments of equivalent quality. This, for example, might happen as a result of gene duplication and the presence of repetitive or common domains. To avoid counting unique mRNA fragments multiple times, the stringent approach is to keep only uniquely mapping reads - being aware of potential consequences. Not only can "multiple counting" arise from a biological reason, but also from technical artifacts, introduced mostly by poorly formatted gff3/gtf annotation files. To avoid this, it is best practice to adopt a conservative approach by collapsing all existing transcripts of a single gene locus into a "synthetic" transcript containing every exon of that gene. In the case of overlapping exons, the longest genomic interval is kept, i.e. an artificial exon is created. This process results in a flattened transcript - a gene structure with a one (gene) to one (transcript) relationship.

Usage

## S4 method for signature 'AnnotParamCharacter'
createSyntheticTranscripts(
  obj,
  features = c("mRNA", "miRNA", "tRNA", "transcript"),
  verbose = TRUE
)

## S4 method for signature 'character'
createSyntheticTranscripts(
  obj,
  features = c("mRNA", "miRNA", "tRNA", "transcript"),
  verbose = TRUE,
  output = c("Genome_intervals", "GRanges"),
  input = c("gff3", "gtf")
)

Arguments

obj

a AnnotParamCharacter object or the annotation filename as a character string

features

one or more of 'mRNA', 'miRNA', 'tRNA', 'transcript'

verbose

increase the verbosity (default TRUE)

output

the output type, one of 'Genome_intervals' or 'GRanges'

input

the type of input, one of 'gff3' or 'gtf'

Details

The createSyntheticTranscripts function implements this, taking advantage of the hierarchical structure of the gff3/gtf file. Exon features are related to their transcript (parent), which themselves derives from their gene parents. Using this relationship, exons are combined per gene into a flattened transcript structure. Note that this might not avoid multiple counting if genes overlap on opposing strands. There, only strand specific sequencing data has the power to disentangle these situations.

As gff3/gtf file can contain a large number of feature types, the createSyntheticTranscripts currently only supports: mRNA, miRNA, tRNA and transcript. Please contact me if you need additional features to be considered. Note however, that I will only add features that are part of the sequenceontology.org SOFA (SO_Feature_Annotation) ontology.

Value

Depending on the obj class.

  • AnnotParamCharacter: a AnnotParamObject object

  • a character filename: depending on the selected output value, a Genome_intervals or a GRanges object.

Author(s)

Nicolas Delhomme

See Also

Examples

# get the example file
  Dm.gtf <- fetchData("Drosophila_melanogaster.BDGP5.77.with-chr.gtf.gz")

  # create the AnnotParam
  annotParam <- AnnotParam(
    datasource=Dm.gtf,
    type="gtf")

  # create the synthetic transcripts
  annotParam <- createSyntheticTranscripts(annotParam,verbose=FALSE)

The following function are defunct:

Description

  • The easyRNASeq function is superseded by the simpleRNASeq function to consolidate and prune the overall package. The changes are based on user comments and on the general standardization occuring in the field.

  • The fetchCoverage function only had two parameters deprecated as the consequence of the package consolidation. As the scanBam function is not called directly anymore but through higher level functions (from the GenomicRanges package), the 'what' and 'isUnmappedQuery' parameters were obsolete.


Accessors for RNAseq class

Description

These functions and generics define 'accessors' (to get and set values) for objects in the easyRNASeq package.

Usage

genomicAnnotation(obj)
readCounts(obj,count=c("exons","features","genes","islands","transcripts"),
summarization=c("bestExons","geneModels"),unique=FALSE)
genomicAnnotation(obj) <- value

Arguments

obj

An object derived from class RNAseq.

count

The type of count you want to access, 'genes','features','exons','transcripts' or 'islands'

summarization

If count is set to genes, precise the type of summarization, 'bestExons' or 'geneModels'

unique

For the 'exons' count only. Should the counts returned be unique for their identifier (i.e. the matrix row names)?

value

The replacement value.

Value

Usually, the value of the corresponding slot, or other simple content described on the help page of easyRNASeq.

Author(s)

Nicolas Delhomme

Examples

# This class is deprecated and as such there are no exmples of its use

Get genic annotation from a gff3/gtf file or using biomaRt

Description

The annotation can be retrieved in two ways

  • biomaRtUse biomaRt and Ensembl to get organism specific annotation.

  • gff3/gtfUse a gff3 or gtf local annotation file.

  • When using biomaRt, it is important that the organism argument to AnnotParam is set the prefix of one of the value available using the biomaRt listDatasets function, e.g. "Dmelanogaster".

  • When reading from a gff3/gtf file, a version 3 formatted gff or a gtf (an Ensembl defined gff2 version) is expected. The function genomeIntervals genomeIntervals-readGff3 is used to import the data.

Usage

## S4 method for signature 'AnnotParam'
getAnnotation(obj, verbose = FALSE, ...)

Arguments

obj

An object of class AnnotParam

verbose

a boolean to turn on verbosity

...

See details

Details

... are for additional arguments, passed to the biomaRt getBM function or to the readGffGtf internal function that takes an optional arguments: annotation.type that default to "exon". This is used to select the proper rows of the gff or gtf file.

Value

A GRanges containing the fetched annotations.

Author(s)

Nicolas Delhomme

Examples

## Not run: 
	library("RnaSeqTutorial")
  getAnnotation(
    AnnotParam(
      organism="Dmelanogaster",
      datasource=system.file(
  		  "extdata",
				"Dmel-mRNA-exon-r5.52.gff3",
				package="RnaSeqTutorial"),
  		type="gff3"
  ))

## End(Not run)

Accessors for AnnotParam class

Description

These functions and generics define 'accessors' (to get and set values) for AnnotParam objects within the easyRNASeq package. Implemented are:

  • datasource

  • type

Usage

datasource(object)
## S4 method for signature 'AnnotParam'
type(x)

Arguments

object

An object derived from class AnnotParam.

x

An object derived from class AnnotParam.

Value

The value of the corresponding slot.

Author(s)

Nicolas Delhomme

See Also

The AnnotParam class. The type and organism generics are imported from the BSgenome and Biostrings package, respectively.

Examples

# fetch the example data
Dm.annot <- fetchData("Dmel-mRNA-exon-r5.52.gff3.gz")

annot <- AnnotParam(datasource=Dm.annot)
# get the datasource Parameter
datasource(annot)

AnnotParam constructor

Description

This constructs a AnnotParam object. The datasource parameter (see details) is mandatory, however other parameters, i.e. when the datasource is not a GRanges default to "genes" and gff3", indicating that the datasource is in the gff3 format and that the contained information needs to be grouped by "genes". This representing the most common use case. Hence, it is left to the user to refine the parameters accordingly to the annotation he is providing or whishes to retrieve.

Usage

## S4 method for signature 'character'
AnnotParam(
  datasource = character(0),
  type = c("gff3", "biomaRt", "gtf", "rda")
)

Arguments

datasource

a character or a GRanges object. See details.

type

one of NULL, biomaRt, gff3, gtf or rda. Default to NULL. See details.

Details

Note that calling the constructor without argument fails, as the datasource is a mandatory parameter. Calling the constructor with additional (not all) parameters will affect the value of the selected parameters, leaving the other parameters unaffected. There are three parameters for an AnnotParam object:

  • datasourceIf no type is provided, the datasource should be GRanges object containing the genic information. These can be obtained using the getAnnotation function.

  • typeOne of biomaRt, gff3, gtf or rda. The default is "gff3". In all cases, the datasource is a character describing:

    • For biomaRt, the name of the organism as known by the ensembl Mart, e.g. dmelanogaster or hsapiens.

    • For gff3, gtf or rda, the filename (including the full or relative path).

See Also

Examples

# create an object to retrieve annotation from biomaRt
annotParam <- AnnotParam(datasource="Hsapiens",type="biomaRt")

# get the datasource and type
datasource(annotParam)
type(annotParam)

# create an object to retrieve annotation from an rda object
# fetch the example data
gAnnot.rda <- fetchData("gAnnot.rda")
annotParam <- AnnotParam(datasource=gAnnot.rda,type="rda")

Accessors for BamParam class

Description

These functions and generics define 'accessors' (to get and set values) for BamParam objects within the easyRNASeq package.

Usage

yieldSize(object,...)
paired(object)
stranded(object)
strandProtocol(object)

Arguments

object

An object derived from class BamParam.

...

Additional parameter inherited from the Rsamtools package yieldSize function. Ignored here.

Value

The value of the corresponding slot.

Author(s)

Nicolas Delhomme

See Also

The BamParam class The RnaSeqParam yieldSize accessor

Examples

bp <- BamParam()
## get the yieldSize Parameter
ysize <-yieldSize(bp)

BamParam constructor

Description

This constructs a BamParam object. The default parameters are derived from the currently most common RNA-Seq experimental use-case and are detailed below:

  • paired is TRUE, i.e. paired-end sequencing is expected.

  • stranded is FALSE i.e. stranded sequencing is not expected.

  • yieldSize is set to 1,000,000. This is the amount of reads iteratively processed from the bam file stream. It is a compromise between speed, process-parallelization and memory usage.

Usage

## S4 method for signature 'ANY'
BamParam(
  paired = TRUE,
  stranded = FALSE,
  strandProtocol = c("reverse", "forward"),
  yieldSize = 1000000L
)

Arguments

paired

boolean whether the BAM file contains paired-end data or not

stranded

boolean whether the reads are strand specific

strandProtocol

factor with values 'reverse' and 'forward' specifying the type of strand specificity protocol. 'reverse', the reads are on the opposite strand to the gene; typical for Illumina TRUSEQ strand-specific protocol.

yieldSize

the amount of reads to be streamed at a time. Default to 1M

Details

Calling the constructor without argument result in the default parameter described above to be returned. Calling the constructor with any parameter will affect the value of the selected parameters, leaving the other parameters unaffected.

Examples

# the defaults
BamParam()

# change the default
BamParam(paired=FALSE)
BamParam(stranded=TRUE,yieldSize=1L)
BamParam(stranded=TRUE,strandProtocol="forward",yieldSize=1L)

easyRNASeq count table correction to RPKM

Description

Convert a count table obtained from the easyRNASeq function into an RPKM corrected count table.

Usage

## S4 method for signature 'matrix,ANY,vector,vector'
RPKM(
  obj,
  from = c("exons", "features", "transcripts", "bestExons", "geneModels", "islands"),
  lib.size = numeric(1),
  feature.size = integer(1),
  simplify = TRUE,
  ...
)

Arguments

obj

An object of class RNAseq or a matrix, see details

from

Determine the kind of coverage to use, choice limited to: exons, features, transcripts, bestExons, geneModels or islands.

lib.size

Precise the library size. It should be a named numeric list, i.e. named after the sample names.

feature.size

Precise the feature (e.g. exons, genes) sizes. It should be a named numeric list, named after the feature names.

simplify

If set to TRUE, whenever a feature (exon, feature, ...) is duplicated in the count table, it is only returned once.

...

additional arguments. See details

Details

RPKM accepts two sets of arguments:

  • RNAseq,character the ... are additional arguments to be passed to the readCounts method.

  • matrix,named vectornormalize a count matrix by providing the feature sizes (e.g. gene sizes) as a named vector where the names match the row names of the count matrix and the lib sizes as a named vector where the names match the column names of the count matrix.

Value

A matrix containing RPKM corrected read counts.

Author(s)

Nicolas Delhomme

See Also

readCounts

Examples

## Not run: 
	## get an RNAseq object
	rnaSeq <- easyRNASeq(filesDirectory=
		    			system.file(
					"extdata",
					package="RnaSeqTutorial"),
					pattern="[A,C,T,G]{6}\\.bam$",
				format="bam",
				readLength=36L,
				organism="Dmelanogaster",
				chr.sizes=as.list(seqlengths(Dmelanogaster)),
				annotationMethod="rda",
				annotationFile=system.file(
				                            "data",
							    "gAnnot.rda",
							    package="RnaSeqTutorial"),
				count="exons",
				outputFormat="RNAseq")

	## get the RPKM
	rpkm <- RPKM(rnaSeq,from="exons")
	
	## the same from a count table
	count.table <- readCounts(rnaSeq,count="exons")

	## get the RPKM
	## verify that the feature are sorted as the count.table
	all(.getName(rnaSeq,"exon") == rownames(count.table))
	feature.size <- unlist(width(ranges(rnaSeq)))

	## verify that the samples are ordered in the same way
	all(names(librarySize(rnaSeq)) == colnames(count.table))

	## get the RPKM
	rpkm <- RPKM(count.table,
			feature.size=feature.size,
			lib.size=librarySize(rnaSeq))

## End(Not run)

Compute the coverage from a Short Read Alignment file

Description

Computes the genomic reads' coverage from a read file in bam format or any format supported by ShortRead.

Usage

## S4 method for signature 'RNAseq'
fetchCoverage(
  obj,
  format = c("aln", "bam"),
  filename = character(1),
  filter = srFilter(),
  type = "SolexaExport",
  chr.sel = c(),
  validity.check = TRUE,
  chr.map = data.frame(),
  ignoreWarnings = FALSE,
  gapped = TRUE,
  paired = FALSE,
  stranded = FALSE,
  bp.coverage = FALSE,
  ...
)

Arguments

obj

An RNAseq object

format

The format of the reads, one of "aln","bam". If not "bam", all the types supported by the ShortRead package are supported too.

filename

The full path of the file to use

filter

The filter to be applied when loading the data using the "aln" format

type

The type of data when using the "aln" format. See the ShortRead package.

chr.sel

A vector of chromosome names to subset the final results.

validity.check

Shall UCSC chromosome name convention be enforced

chr.map

A data.frame describing the mapping of original chromosome names towards wished chromosome names. See details.

ignoreWarnings

set to TRUE (bad idea! they have a good reason to be there) if you do not want warning messages.

gapped

Is the bam file provided containing gapped alignments?

paired

Is the bam file containing PE reads?

stranded

Is the bam file from a strand specific protocol?

bp.coverage

a boolean that default to FALSE to decide whether coverage is to be calculated and stored by bp

...

additional arguments. See details

Details

... for fetchCoverage: Can be used for readAligned method from package ShortRead. The use of the dots for the scanBamFlag method from package Rsamtools has been deprecated, as were the 'what' and 'isUnmappedQuery' argument to the function

Value

An RNAseq object. The slot readCoverage contains a SimpleRleList object representing a list of coverage vectors, one per chromosome.

Author(s)

Nicolas Delhomme

See Also

Rle ShortRead:readAligned

Examples

## Not run: 
	library("RnaSeqTutorial")
	library(BSgenome.Dmelanogaster.UCSC.dm3)

	obj <- new('RNAseq',
		organismName="Dmelanogaster",
		readLength=36L,
		chrSize=as.list(seqlengths(Dmelanogaster))
		)

	obj <- fetchCoverage(
			obj,
			format="bam",
                        filename=system.file(
				"extdata",
				"ACACTG.bam",
                            	package="RnaSeqTutorial")
			)
	
## End(Not run)

Defunct annotation function

Description

The fetchAnnotation and knownOrganisms function are now defunct. The fetchAnnotation function has been replaced by the getAnnotation method.

Author(s)

Nicolas Delhomme


Extension of the GenomicRanges package

Description

Describes extensions to the GenomicRanges package. For GRanges and GRangesList objects:

  • colnames returns the column name of a GRanges or GRangesList object.

  • unsafeAppend appends two GAlignments object together bypassing most sanity checks. Faster than the standard c or append function.

Usage

colnames(x, do.NULL = TRUE, prefix = "col")
unsafeAppend(obj1,obj2)

Arguments

x

An object of the GRanges or GRangesList class

do.NULL

see row_colnames for details

prefix

see row_colnames for details

obj1

A GAlignments object

obj2

A GAlignments object

Details

  • colnames returns the actual column names of the elementMetadata slot of the GRanges or GRangesList object. The elementMetadata contains a DataFrame object used to store additional information provided by the user, such as exon ID in our case.

  • unsafeAppend appends two GAlignments objects.

Value

  • colnames: A vector of column names.

  • unsafeAppend: A GAlignments object

Author(s)

Nicolas Delhomme

See Also

Examples

# an example of annotation
	grngs <- GRanges(seqnames=c("chr01","chr01","chr02"),
                     ranges=IRanges(
                             start=c(10,30,100),
                             end=c(21,53,123)),
                          strand=c("+","+","-"),
                          transcripts=c("trA1","trA2","trB"),
                          gene=c("gA","gA","gB"),
                          exon=c("e1","e2","e3")
                          )

	# accessing the colnames
	colnames(grngs)

	# creating a GRangesList
	grngsList<-split(grngs,seqnames(grngs))

	# accessing the colnames
	colnames(grngsList)

# For unsafeAppend
library(GenomicAlignments)
unsafeAppend(GAlignments(),GAlignments())

Identify expressed regions de-novo

Description

Process the coverage to locate regions with a minimum coverage (min.cov). If regions are separated by a gap shorter than a maximum length (max.gap), they are unified. Only islands longer than min.length are returned. These functions are now outdated and would need to be actualized.

Usage

## S4 method for signature 'RNAseq'
findIslands(
  obj,
  max.gap = integer(1),
  min.cov = 1L,
  min.length = integer(1),
  plot = TRUE,
  ...
)

Arguments

obj

An object of class RNAseq

max.gap

Maximum gap between two peaks to build an island

min.cov

Minimum coverage for an island to be returned

min.length

Minimum size of an island to be returned

plot

If TRUE, draw plots of coverage distribution. Help the user to select an appropriate value for the minimum coverage.

...

See details

Details

... are for providing additional options to the hist plot function.

Value

An RNAseq object with the readIsland slot set with a GRanges containing the selected islands and the readCount slot actualized with a list containing the count table per island.

Author(s)

Nicolas Delhomme

Examples

## Not run: 
	# NOTE that this function might need to be actualized
	obj <- new('RNAseq',
		organismName="Dmelanogaster",
		readLength=36L,
		chrSize=as.list(seqlengths(Dmelanogaster))
		)

 # fetch the example data
 bamFilePath <- fetchData("ACACTG.bam")

	obj <- fetchCoverage(obj,format="bam",filename=bamFilePath)

	obj <- findIslands(
			obj,
			max.gap=10L,
			min.cov=10L,
			min.length=200L)
	
## End(Not run)

Accessors for RnaSeqParam class

Description

These functions and generics define 'accessors' (to get and set values) for RnaSeqParam objects within the easyRNASeq package. Implemented are:

  • annotParam

  • bamParam

  • countBy

  • datasource

  • paired

  • precision

  • stranded

  • strandProtocol

  • yieldSize

Usage

## S4 method for signature 'RnaSeqParam'
yieldSize(object)

Arguments

object

An object derived from class RnaSeqParam.

Value

The value of the corresponding slot.

Author(s)

Nicolas Delhomme

See Also

The BamParam yieldSize accessor

Examples

## create the necessary AnnotParam
annotParam <- AnnotParam(
                datasource=system.file(
                                "extdata",
                                "Dmel-mRNA-exon-r5.52.gff3",
                                package="RnaSeqTutorial"))

## create the RnaSeqParam
rsp <- RnaSeqParam(annotParam=annotParam)
## get the yieldSize Parameter
ysize <-yieldSize(rsp)

RnaSeqParam constructor

Description

This constructs a RnaSeqParam object, that combines all the necessary parameters for the analysis of RNA-Seq data. As much as possible, these parameters are determined automa-gi/ti-cally. It describes three sets of parameters:

  • parameters describing the annotation

  • parameters describing the BAM files, i.e. the type of sequencing that was conducted.

  • parameters describing how the counting should be done.

The first two are provided through sepcific objects: AnnotParam and BamParam respectively. The third one is a set constituted of:

  • countBy: the feature per which the counts should be summarized ( exon, transcript or gene. A forth possibility - feature - can be used to define arbitrary genomic loci)

  • precision: the precision at which the counts should be performed: bp or reads. bp used to be the default in the easyRNASeq package, whereas now reads is, following the Bioconductor main stream development.

The default parameters for the BamParam parameter are derived from the currently most common RNA-Seq experimental use-case: strand-specific paired-end Illumina sequencing. See the respective manual pages of AnnotParam and BamParam for more details.

Usage

## S4 method for signature 'ANY'
RnaSeqParam(
  annotParam = AnnotParam(),
  bamParam = BamParam(),
  countBy = c("exons", "features", "genes", "transcripts"),
  precision = c("read", "bp")
)

Arguments

annotParam

An object derived from class AnnotParam.

bamParam

An object derived from class BamParam.

countBy

TODO

precision

A character value, either 'read' or 'bp' that defines the precision at which counting is done, either per read or per covered bp. 'read' is the default.

Examples

annotParam <- AnnotParam(
                datasource=system.file(
                                "extdata",
                                "Dmel-mRNA-exon-r5.52.gff3",
                                package="RnaSeqTutorial"))

## create the RnaSeqParam
rsp <- RnaSeqParam(annotParam=annotParam)

## change some defaults
RnaSeqParam(countBy="features",annotParam=annotParam)
RnaSeqParam(bamParam=BamParam(stranded=TRUE,yieldSize=1L),annotParam=annotParam)

Count methods for RNAseq object

Description

Summarize the read counts per exon, feature, gene, transcript or island.

  • exonCounts: for that summarization, reads are summarized per exons. An "exon" field is necessary in the annotation object for this to work. See easyRNASeq annotation methods for more details on the annotation object.

  • featureCounts is similar to the 'exons' one. This is just a wrapper to summarize count for genomic features that are not exon related. I.e. one could use it to measure eRNAs. Again, a "feature" field is necessary in the annotation object for this to work.

  • geneCounts sums the counts per either bestExons or geneModels. In either case, the annotation object needs to contain both an "exon" and a "gene" field.

  • islandCounts sums the counts per computed islands.

  • transcriptCounts sums the counts obtained by exons into their respective transcripts. Note that this often result in counting some reads several times. For this function to work you need both an "exon" and a "transcript" field in your annotation object. To avoid this, one could create transcript specific synthetic exons, i.e. features that would be unique to a transcript. To offer this possibility, transcripts count can be summarized from "features", in which case the annotation object need to have both the "feature" and "transcript" fields defined.

Usage

exonCounts(obj)
featureCounts(obj)
transcriptCounts(obj,from="exons")
geneCounts(obj,summarization=c("bestExons","geneModels"),...)
islandCounts(obj,force=FALSE,...)

Arguments

obj

An object derived from class RNAseq,can be a matrix for RPKM, see details

force

For islandCount, force RNAseq to redo findIsland

from

either "exons" or "features" can be used to summarize per transcript

summarization

Method use for summarize genes

...

See details

Details

... for

Value

A numeric vector containing count per exon, feature, gene or transcript.

Author(s)

Nicolas Delhomme

See Also

easyRNASeq annotation methods .geneModelSummarization findIslands

Examples

## Not run: 
library(BSgenome.Dmelanogaster.UCSC.dm3)

 # get the example data files
 tdir <- tutorialData()

 # get an example annotation file - we retrieve it from GitHub using curl
 gAnnot.rda <- fetchData("gAnnot.rda")

	# create an RNAseq object
	# summarizing 2 bam files by exons
	rnaSeq <- easyRNASeq(tdir,
                     organism="Dmelanogaster",
                     chr.sizes=seqlengths(Dmelanogaster),
                     readLength=36L,
                     annotationMethod="rda",
                     annotationFile=gAnnot.rda,
                     format="bam",
                     count="exons",
                     pattern="[A,C,T,G]{6}\\.bam$",
                     outputFormat="RNAseq")
	# summing up the exons by transcript
	rnaSeq <- transcriptCounts(rnaSeq)

## End(Not run)

Dataset included in the package

Description

The package contains a dataset from the Robinson, Delhomme et al., 2014 publication.

  • RobinsonDelhomme2014a normalised expression count table. This dataset was generated from 17 Populus tremula - Eurasian aspen - trees used to assess the sexual dimorphism of this dioecious species. This count matrix has been generating following published pre-processing guidelines - see http://www.epigenesys.eu/en/protocols/bio-informatics/1283-guidelines-for-rna-seq-data-analysis - and the resulting HTSeq files have been collated and the obtained raw count matrix submitted to a variance stabilising transformation. Subsequently, the values have been transformed so that the minimal vst values - that corresponds to an absence of expression - is 0. Hence the counts in the matrix are library-size normalized, variance stabilised expression values, with a minimal value of 0.


easyRNASeq method

Description

This function is a wrapper around the more low level functionalities of the package. Is the easiest way to get a count matrix from a set of read files. It does the following:

Usage

## S4 method for signature 'character'
easyRNASeq(
  filesDirectory = getwd(),
  organism = character(1),
  chr.sizes = c("auto"),
  readLength = integer(1),
  annotationMethod = c("biomaRt", "env", "gff", "gtf", "rda"),
  annotationFile = character(1),
  annotationObject = GRangesList(),
  format = c("bam", "aln"),
  gapped = FALSE,
  count = c("exons", "features", "genes", "islands", "transcripts"),
  outputFormat = c("matrix", "SummarizedExperiment", "edgeR", "RNAseq"),
  pattern = character(1),
  filenames = character(0),
  nbCore = 1,
  filter = srFilter(),
  type = "SolexaExport",
  chr.sel = c(),
  summarization = c("bestExons", "geneModels"),
  normalize = FALSE,
  max.gap = integer(1),
  min.cov = 1L,
  min.length = integer(1),
  plot = TRUE,
  conditions = c(),
  validity.check = TRUE,
  chr.map = data.frame(),
  ignoreWarnings = FALSE,
  silent = FALSE,
  ...
)

Arguments

filesDirectory

The directory where the files to be used are located. Defaults to the current directory.

organism

A character string describing the organism

chr.sizes

A vector or a list containing the chromosomes' size of the selected organism or simply the string "auto". See details.

readLength

The read length in bp

annotationMethod

The method to fetch the annotation, one of "biomaRt","env","gff","gtf" or "rda". All methods but "biomaRt" and "env" require the annotationFile to be set. The "env" method requires the annotationObject to be set.

annotationFile

The location (full path) of the annotation file

annotationObject

A GRangesList object containing the annotation.

format

The format of the reads, one of "aln","bam". If not "bam", all the types supported by the ShortRead package are supported too. As of version 1.3.5, it defaults to bam.

gapped

Is the bam file provided containing gapped alignments?

count

The feature used to summarize the reads. One of 'exons','features','genes','islands' or 'transcripts'. See details.

outputFormat

By default, easyRNASeq returns a matrix. If one of edgeR, RNAseq or SummarizedExperiment is provided then the respective object is returned.

pattern

For easyRNASeq, the pattern of file to look for, e.g. "bam$"

filenames

The name, not the path, of the files to use

nbCore

defines how many CPU core to use when computing the geneModels. Use the default parallel library

filter

The filter to be applied when loading the data using the "aln" format

type

The type of data when using the "aln" format. See the ShortRead library.

chr.sel

A vector of chromosome names to subset the final results.

summarization

A character defining which method to use when summarizing reads by genes. So far, only "geneModels" is available.

normalize

A boolean to convert the returned counts in RPKM. Valid when the outputFormat is left undefined (i.e. when a matrix is returned) and when it is edgeR. Note that you should not normalize the data prior to using edgeR!

max.gap

When computing read islands, the maximal gap size allowed between two islands to merge them

min.cov

When computing read islands, the minimal coverage to take into account for calling an island

min.length

The minimal size an island should have to be kept

plot

Whether or not to plot assessment graphs.

conditions

A vector of descriptor, each sample must have a descriptor if you use outputFormat edgeR. The size of this list must be equal to the number of sample. In addition the vector should be named with the filename of the corresponding samples.

validity.check

Shall UCSC chromosome name convention be enforced? This is only supported for a set of organisms, which are Dmelanogaster, Hsapiens, Mmusculus and Rnorvegicus; otherwise the argument 'chr.map' can be used to complement it.

chr.map

A data.frame describing the mapping of original chromosome names towards wished chromosome names. See details.

ignoreWarnings

set to TRUE (bad idea! they have a good reason to be there) if you do not want warning messages.

silent

set to TRUE if you do not want messages to be printed out.

...

additional arguments. See details

Details

  • ... Additional arguments for different functions:

    • For the biomaRt getBM function

    • For the readGffGtf internal function that takes an optional arguments: annotation.type that default to "exon" (used to select the proper rows of the gff or gtf file)

    • For to the list.files function used to locate the read files.

  • the annotationObject When the annotationMethods is set to env or rda, a properly formatted GRangesList object need to be provided. Check the vignette or the examples at the bottom of this page for examples. The data.frame-like structure of these objects is where easyRNASeq will look for the exon, feature, transcript, or gene identifier. Depending on the count method selected, it is essential that the akin column name is present in the annotationObject. E.g. when counting "features", the annotationObject has to contain a "feature" field.

  • the chr.map The chr.map argument for the easyRNASeq function only works for an "organismName" of value 'custom' with the "validity.check" parameter set to 'TRUE'. This data.frame should contain two columns named 'from' and 'to'. The row should represent the chromosome name in your original data and the wished name in the output of the function.

  • count The count can be summarized by exons, features, genes, islands or transcripts. While exons, genes and transcripts are obvious, "features" describes any features provided by the user, e.g. enhancer loci. These are processed as the exons are. For "islands", it is for an under development function that identifies de-novo expression loci and count the number of reads overlapping them.

  • chr.sizes If set to "auto", then the format has to be "bam", in which case the chromosome names and size are extracted from the BAM header

Value

Returns a count table (a matrix of m features x n samples). If the outputFormat option has been set, a corresponding object is returned: a RangedSummarizedExperiment, a edgeR:DGEList or RNAseq.

Author(s)

Nicolas Delhomme

See Also

RNAseq RangedSummarizedExperiment edgeR:DGEList ShortRead:readAligned

Examples

## Not run: 
	library(BSgenome.Dmelanogaster.UCSC.dm3)

 # get the example data
 tdir <- tutorialData()


 # get an example annotation file
 gAnnot.rda <- fetchData("gAnnot.rda")

	# creating a count table from 4 bam files
	count.table <- easyRNASeq(filesDirectory="tdir",
					pattern="[A,C,T,G]{6}\\.bam$",
				format="bam",
				readLength=36L,
				organism="Dmelanogaster",
				chr.sizes=seqlengths(Dmelanogaster),
				annotationMethod="rda",
				annotationFile=gAnnot.rda,
				count="exons")

	# an example of a chr.map
	chr.map <- data.frame(from=c("2L","2R","MT"),to=c("chr2L","chr2R","chrMT"))

	# an example of a GRangesList annotation
	grngs <- GRanges(seqnames=c("chr01","chr01","chr02"),
                     ranges=IRanges(
                             start=c(10,30,100),
                             end=c(21,53,123)),
                          strand=c("+","+","-"),
                          transcript=c("trA1","trA2","trB"),
                          gene=c("gA","gA","gB"),
                          exon=c("e1","e2","e3")
                          )

	grngsList<-split(grngs,seqnames(grngs))

## End(Not run)

Extension for the edgeR package

Description

This method extends the edgeR package by offering the functionality to plot the effect of the normalization factor.

Usage

## S4 method for signature 'DGEList,character,character'
plotNormalizationFactors(
  obj = DGEList(),
  cond1 = character(1),
  cond2 = character(1)
)

Arguments

obj

An object of class DGEList

cond1

A character string describing the first condition

cond2

A character string describing the second condition

Value

none

Author(s)

Nicolas Delhomme

Examples

## Not run: 
	## create the object
	dgeList <- DGEList(counts,group)
	## calculate the sie factors
	dgeList <- calcNormFactors(dgeList)
	## plot them
	apply(combn(rownames(dgeList$samples),2),
		2,
		function(co,obj){plotNormalizationFactors(obj,co[1],co[2])},dgeList)
	
## End(Not run)

Extend the file.exists function to check the path slot of a Rsamtools BamFile class for existence

Description

Check if the bam file represented by a BamFile object exists.

Usage

## S4 method for signature 'BamFile'
file.exists(...)

Arguments

...

a BamFile object

Methods

list("signature(object = \"BamFile\")")

Checkk if the bam file linked to by a BamFile object exists.


Extension for the genomeIntervals package

Description

type

Another way to access the content of the gff type column.

Usage

## S4 method for signature 'Genome_intervals'
type(x)

Arguments

x

An object of class Genome_intervals

Value

type

The content of the type column, usually a factor or a character vector

Author(s)

Nicolas Delhomme

See Also

Examples

# library
library(genomeIntervals)

# fetch the example data
gffFilePath <- fetchData("Dmel-mRNA-exon-r5.52.gff3.gz")
annot<-readGff3(gffFilePath,quiet=TRUE)
type(annot)

Get a BamFileList from a list of filenames

Description

A utility function to create a BamFileList-class object from a set of filenames. The filenames need to contain the file path if they are not in the working directory.

Usage

## S4 method for signature 'character,character'
getBamFileList(filenames = character(0), indexnames = character(0))

Arguments

filenames

a character vector containing fully defined BAM file filenames

indexnames

a character vector containing fully defined BAM index file filenames

Value

a BamFileList-class

See Also

BamFileList-class dir

Examples

# tutorial data - store the data in the BiocCache
tdir <- tutorialData()

  # creating a BamFileList using a directory and pattern
  # using filenames (from the Bioc cache)
  filenames <- dir(tdir,pattern="[A,C,T,G]{6}\\.bam$",full.names=TRUE)
  indexnames <- sapply(paste0(sub(".*_","",basename(filenames)),".bai"),fetchData)
  bfl <- getBamFileList(filenames,indexnames)

  # get them recursively
  filenames <- dir(path=tdir,pattern="[A,C,T,G]{6}\\.bam$",
                   full.names=TRUE,recursive=TRUE)
  indexnames <- sapply(paste0(sub(".*_","",basename(filenames)),".bai"),fetchData)
  bfl <- getBamFileList(filenames,indexnames)

Extension of the IRanges package

Description

Return the ranges of the genomic annotation.

Usage

## S4 method for signature 'RNAseq'
ranges(x)

Arguments

x

An object of the RNAseq class

Details

It retrieves the object stored in the genomicAnnotation slot of the RNAseq object and apply the ranges function on it.

Value

An IRangesList object, where the split is performed by seqnames (e.g. chromosomes).

Author(s)

Nicolas Delhomme

Examples

## Not run: 
	library("RnaSeqTutorial")

	obj <- getAnnotation(
            AnnotParam(
              organism="Dmelanogaster",
              datasource=system.file(
    	          "extdata",
				        "Dmel-mRNA-exon-r5.52.gff3",
				        package="RnaSeqTutorial"),
  		        type="gff3"
  ))

	ranges(obj)
	
## End(Not run)

parallel additional methods

Description

Functions defined in the easyRNASeq package that enhance the parallel package.

Usage

## S4 method for signature 'list,function'
parallelize(obj = list(), fun = NULL, nnodes = 1, ...)

Arguments

obj

the object which processing has to be parallelizes

fun

the function to be applied in parallel

nnodes

the number of nodes to use

...

additional arguments passed to the function fun

Details

The parallelize function ease the use of the parallel package. If the number of nodes provided by the user is 1, then a simple 'lapply' is used, otherwise a cluster object is created and the object dispatched for parallelization.

Value

the result of the clusterApply function.

Author(s)

Nicolas Delhomme

See Also

clusterApply makePSOCKcluster and stopCluster in makeCluster

Examples

parallelize(list(a<-c(1,2),b<-c(2,1)),sum,nnodes=1)

Class "RNAseq"

Description

A class holding all the necessary information and annotation to summarize couts (number of reads) per features (i.e. exons or transcripts or genes) for RNA-Seq experiments.

Objects from the Class

Objects can be created by calls of the form new("RNAseq", ...).

Author(s)

Nicolas Delhomme

See Also

Examples

showClass("RNAseq")

Class "RnaSeqParam"

Description

A class holding all the necessary parameters to process a bam file issued from an RNA-Seq experiment together with the related annotation to compute a count-table using the simpleRNASeq function. The precision slot is used to determine the count unit:

Objects from the Class

Objects can be created by calls of the form new("RnaSeqParam", ...) or using the RnaSeqParam constructor.

Author(s)

Nicolas Delhomme

See Also

Examples

showClass("RnaSeqParam")

Methods extending the ShortRead package functionalities

Description

These are functions extending the ShortRead packages capabilities:

Usage

demultiplex(obj,barcodes=c(),barcodes.qty=12,barcode.length=6,
edition.dist=2,type=c("independant","within"),index.only=FALSE,mc.cores=1L)
barcodePlot(obj,barcodes=c(),type=c("independant","within"),
barcode.length=6,show.barcode=20,...)
chastityFilter(.name="Illumina Chastity Filter")
naPositionFilter(.name="NA Position Filter")

Arguments

obj

An object derived from class AlignedRead

barcodes

A character vector describing the multiplex (i.e. barcode) sequences used in the experiment.

barcodes.qty

An integer describing the number of barcodes

barcode.length

An integer describing the barcode length in bp

edition.dist

The maximal edition distance (i.e. the number of changes to apply), to accept an incorrectly sequenced barcode.

type

The type of barcode used. independent represents barcodes generated by the illumina protocol; i.e. a separate additional sequencing step performed once the first mate has been sequenced. within represents barcodes that are part of the sequenced reads as established by Lefrancois P et al., BMC Genomics, 2009

index.only

simply return the index and not the barcode themselves.

mc.cores

A parameter ultimately passed to srdistance to enable parallel processing on mc.cores. On linux and Mac only, windows task remain serially processed.

.name

An internal string describing the filter

show.barcode

An integer specifying how many barcodes should be displayed in the final output.

...

additional graphic parameters

Details

  • barcodePlot Creates a plot showing the barcode distribution of a multiplexed sequencing library.

  • chastityFilter Creates a SRFilter instance that filters SolexaExport read according to the chastity filtering value.

  • demultiplex Split a single AlignedRead object into a list of AlignedRead objects according to the barcodes provided by the user. It supports multicore processing but has a default serial behaviour.

  • naPositionFilter Creates a SRFilter instance that filters SolexaExport read having an NA position.

When demultiplexing, the function if provided with just the AlignedRead will try to find out how many barcodes were used and what they are. This is unwise to do as many barcodes will get wrongly sequenced and not always the most frequent ones are the one you used! It's therefore strongly advised to specify the barcodes' sequences that were used.

Value

  • barcodePlot returns invisibly the barcode frequencies.

  • chastityFilter returns a SRFilter instance.

  • demultiplex returns a list of AlignedRead objects.

  • naPositionFilter returns a SRFilter instance.

Author(s)

Nicolas Delhomme

See Also

SRFilter AlignedRead

Examples

## Not run: 
	# the barcode
	barcodes=c("ACACTG","ACTAGC","ATGGCT","TTGCGA")

invisible(download.file(paste0("https://github.com/UPSCb/UPSCb/raw/",
       "master/tutorial/easyRNASeq/multiplex_export.txt.gz"),
       "multiplex_export.txt.gz"))

	# the multiplexed data
	alns <- readAligned(".",
                    pattern="multiplex_export",
                    filter=compose(
                      chastityFilter(),
                      nFilter(2),
                      chromosomeFilter(regex="chr")),
                    type="SolexaExport",
                    withAll=TRUE)

	# barcode plot
	barcodePlot(alns,
            barcodes=barcodes,
            type="within",
            barcode.length=6,
            show.barcode=20,
            main="All samples",
            xlim=c(0,0.5))

	# demultiplexing
	dem.alns <- demultiplex(alns,
                        barcodes=barcodes,
                        edition.dist=2,
                        barcodes.qty=4,
                        type="within")

	# plotting again
	par(mfrow=c(2,2))
	barcode.frequencies <- lapply(
                              names(dem.alns$barcodes),
                              function(barcode,alns){
                                barcodePlot(
                                            alns$barcodes[[barcode]],
                                            barcodes=barcode,
                                            type="within",barcode.length=6,
                                            show.barcode=20,
                                            main=paste(
                                              "Expected barcode:",
                                              barcode))
                              },dem.alns)
	
## End(Not run)

Display the content of classes from the easyRNASeq package.

Description

Display the content of a RNAseq, AnnotParam, BamParam or RnaSeqParam object.

Usage

## S4 method for signature 'RNAseq'
show(object)

Arguments

object

An object of the AnnotParam, BamParam, RnaSeqParam or RNAseq class

Methods

list("signature(object = \"RNAseq\")")

Display the values of the different slots of the RNAseq object.

Annot/Bam/RnaSeqParam

The respective object settings.


simpleRNASeq method

Description

This function is a wrapper around the more low level functionalities of the package. It is the simplest way to get a RangedSummarizedExperiment object from a set of bam files. RangedSummarizedExperiment are containers meant to hold any Next-Generation Sequencing experiment results and metadata. The simpleRNASeq method replaces the easyRNASeq function to simplify the usability. It does the following:

Usage

## S4 method for signature 'BamFileList,RnaSeqParam'
simpleRNASeq(
  bamFiles = BamFileList(),
  param = RnaSeqParam(),
  nnodes = 1,
  verbose = TRUE,
  override = FALSE
)

Arguments

bamFiles

a BamFileList object

param

RnaSeqParam a RnaSeqParam object that describes the RNA-Seq experimental setup.

nnodes

The number of CPU cores to use in parallel

verbose

a logical to be report progress or not.

override

Should the provided parameters override the detected ones

Value

returns a RangedSummarizedExperiment object.

Author(s)

Nicolas Delhomme

See Also

Examples

# the data
tdir <- tutorialData()
annot <- fetchData("Drosophila_melanogaster.BDGP5.77.with-chr.gtf.gz")

 # create the BamFileList, get the BAM and BAI index files from the Bioc cache
 filenames <- dir(tdir,pattern="[A,T].*\\.bam$",full.names=TRUE)
 indexnames <- sapply(paste0(sub(".*_","",basename(filenames)),".bai"),fetchData)
 bamFiles <- getBamFileList(filenames,indexnames)

  # create the AnnotParam
  annotParam <- AnnotParam(annot,type="gtf")

  # create the RnaSeqParam
  rnaSeqParam <- RnaSeqParam(annotParam=annotParam)

  # get a RangedSummarizedExperiment containing the counts table
  sexp <- simpleRNASeq(
    bamFiles=bamFiles,
    param=rnaSeqParam,
    verbose=TRUE
  )

  # get the counts
  assays(sexp)$exons

Extension of the Rsamtools package

Description

Describes extensions to the Rsamtools package.

Usage

## S4 method for signature 'BamFile'
validate(obj, header = TRUE, cross.validation = TRUE)

Arguments

obj

An object of the BamFile or BamFileList class

header

a boolean to (de)activate the check for a BAM header

cross.validation

a boolean - only valid for BamFileList objects - to (de)activate the cross validation of all the BAM files header

Details

validate checks whether the BAM file exists and if a BAI index is present.

Value

validate returns invisibly a vector of boolean. Fails anyway if any file is missing.

Author(s)

Nicolas Delhomme

See Also

Examples

# retrieve the data
tdir <- tutorialData()

# get the bam file path from the Bioc cache
filenames <- dir(tdir,pattern="[A,C,T,G]{6}\\.bam$",full.names=TRUE)

# retrieve the index from the Bioc cache too
inxnames <- sapply(paste0(sub(".*_","",basename(filenames)),".bai"),fetchData)

bfl <-BamFileList(filenames,index=inxnames)

validate(bfl)