Title: | Characterization of Alternative Splicing based on Paired-End Reads |
---|---|
Description: | Infer alternative splicing from paired-end RNA-seq data. The model is based on counting paths across exons, rather than pairwise exon connections, and estimates the fragment size and start distributions non-parametrically, which improves estimation precision. |
Authors: | David Rossell, Camille Stephan-Otto, Manuel Kroiss, Miranda Stobbe, Victor Pena |
Maintainer: | David Rossell <[email protected]> |
License: | GPL (>=2) |
Version: | 2.41.0 |
Built: | 2024-12-22 06:21:30 UTC |
Source: | https://github.com/bioc/casper |
The annotatedGenome
class stores info about transcripts,
usually created with procGenome
from TxDb
objects
or user-provided .gtf files.
Objects are typically created with a call to
procGenome
(for known genomes) or to
createDenovoGenome
(for de novo genomes).
GRangesList
object with elements corresponding
to gene islands. It indicates the
start/end/name of each exon contained in the island
Each element in the list corresponds to a gene island. It indicates the exons contained in each known variant.
data.frame
indicating the chromosome,
start and end of each exon, and its corresponding gene island.
GRanges
indicating the chromosome,
start/end and id of each exon
data.frame
indicating the aliases for each
known transcript, i.e. transcripts having the exact same sequence
of exons.
Character indicating the genome version from which the object was build, e.g. "hg19"
Character indicating the date when the object was created. UCSC genomes chance from time to time, so that an "hg19" genome from Jan 2012 may not be exactly the same as in Dec 2012.
Logical variable. FALSE
indicates that the
object was created using available annotation only. TRUE
indicates that new exons/islands were added based on the data
observed in a particular RNA-seq experiment.
Numeric vector storing transcript lengths.
List where each element corresponds to an island, and contains a character vector with names of isoforms that should be considered as known (i.e. always included in the model)
signature(object = "annotatedGenome")
: Displays general information about the object.
Camille Stephan-Otto Attolini
procGenome
and createDenovoGenome
to
create annotatedGenome
objects.
showClass("annotatedGenome")
showClass("annotatedGenome")
Produces a boxplot for the asymmetry coefficients for each row in the input matrix. Normal observations are simulated using the observed sample means and variances, and their asymmetry coefficients are added to the plot.
asymmetryCheck(x, ...)
asymmetryCheck(x, ...)
x |
|
... |
Other arguments to be passed on to codeplot |
Boxplot with asymmetry coefficients for observed and simulated Normal data
David Rossell
mu <- rnorm(100) x <- matrix(rnorm(100*5,mu),ncol=5) asymmetryCheck(x)
mu <- rnorm(100) x <- matrix(rnorm(100*5,mu),ncol=5) asymmetryCheck(x)
calcDenovo
estimates expression of gene splicing variants,
considering both known variants and variants that have not been
previously described.
calcDenovo(distrs, targetGenomeDB, knownGenomeDB=targetGenomeDB, pc, readLength, islandid, priorq=3, mprior, minpp=0.001, selectBest=FALSE, searchMethod="submodels", niter, exactMarginal=TRUE, integrateMethod="plugin", verbose=TRUE, mc.cores=1)
calcDenovo(distrs, targetGenomeDB, knownGenomeDB=targetGenomeDB, pc, readLength, islandid, priorq=3, mprior, minpp=0.001, selectBest=FALSE, searchMethod="submodels", niter, exactMarginal=TRUE, integrateMethod="plugin", verbose=TRUE, mc.cores=1)
distrs |
List of fragment distributions as generated by the |
targetGenomeDB |
|
knownGenomeDB |
|
pc |
Named vector of exon path counts as returned by |
readLength |
Read length in bp, e.g. in a paired-end experiment where
75bp are sequenced on each end one would set |
islandid |
Name of the gene island to be analyzed. If not specified, all gene islands are analyzed. |
priorq |
Parameter of the prior distribution on the proportion of
reads coming from each variant. The prior is Dirichlet with prior
sample size for each variant equal to priorq.
We recommend |
mprior |
Prior on the model space returned by
|
minpp |
Models (i.e. splicing configurations) with posterior probability less than |
selectBest |
If set to |
searchMethod |
Method used to perform the model search.
|
niter |
Number of MCMC iterations. |
exactMarginal |
Set to |
integrateMethod |
Method to compute integrated likelihoods. The default
( |
verbose |
Set to |
mc.cores |
Number of processors to be used for parallel
computation. Can only be used if the package |
calcDenovo
explores which subset of the isoforms indicated in
targetGenomeDB
are truly expressed.
It also adds new isoforms when some reads follow an exon path that
is not possible under any of the isoforms in targetGenomeDB
.
calcDenovo
the posterior probability of each model
(i.e. configuration of expressed variants) via Bayes theorem.
P(model|y) "proportional to" m(y|model) P(model)
where m(y|model) is the integrated likelihood and P(model) is the
prior probability of the model.
For example, a gene with 20 predicted isoforms in targetGenome
gives rise 2^20 - 1 possible models (configurations of expressed isoforms).
Importantly, P(model) can be set by analyzing available genome
annotations in knownGenomeDB
.
For instance, genes with 20 exons have isoforms that tend
to use most of the 20 exons. They also tend to express more
isoforms than genes with 5 exons. The function modelPrior
analyzes knownGenomeDB
to set reasonable values for P(model).
An exhaustive enumeration of all possible models is
not feasible unless the gene is very short (e.g. around 5 exons).
For longer genes we use computational strategies to search a subset of
"interesting" models. This is controlled by the argument searchMethod
(see above).
In order to compute P(model|y) one can either use the computed
m(y|model) P(model) (option exactMarginal==TRUE
) or the
proportion of MCMC visits (option exactMarginal==FALSE
). Unless
niter
is large the former option typically provides more
precise estimates.
A denovoGenomeExpr object.
Camille Stephan-Otto Attolini, Manuel Kroiss, David Rossell
Rossell D, Stephan-Otto Attolini C, Kroiss M, Stocker A. Quantifying Alternative Splicing from Paired-End RNA-sequencing data. Annals of Applied Statistics, 8(1):309-330.
denovoExpr
to obtain expression estimates from the
calcDenovo
output.
plotExpr
to produce a plot with splicing variants and estimated expression.
## See help(denovoExpr)
## See help(denovoExpr)
Estimate expression of gene splicing variants,
assuming that the set of variants is known.
When rpkm
is set to TRUE
, fragments per kilobase
per million are returned. Otherwise relative expression estimates are returned.
calcExp(distrs, genomeDB, pc, readLength, islandid, rpkm=TRUE, priorq=2, priorqGeneExpr=2, citype="none", niter=10^3, burnin=100, mc.cores=1, verbose=FALSE)
calcExp(distrs, genomeDB, pc, readLength, islandid, rpkm=TRUE, priorq=2, priorqGeneExpr=2, citype="none", niter=10^3, burnin=100, mc.cores=1, verbose=FALSE)
distrs |
List of fragment distributions as generated by the |
genomeDB |
|
pc |
Named vector of exon path counts as returned by |
readLength |
Read length in bp, e.g. in a paired-end experiment where
75bp are sequenced on each end one would set |
islandid |
Name of the gene island to be analyzed. If not specified, all gene islands are analyzed. |
rpkm |
Set to |
priorq |
Parameter of the prior distribution on the proportion of reads coming from each variant. The prior is Dirichlet with prior sample size for each variant equal to priorq.
We recommend |
priorqGeneExpr |
Parameter for prior distribution on overall gene expression. Defaults to 2, which ensures non-zero estimates for all genes |
citype |
Set to |
niter |
Number of Monte Carlo iterations. Only used when |
burnin |
Number of burnin Monte Carlo iterations. Only used when |
mc.cores |
Number of processors to be used for parallel computation. Can only be used if the package |
verbose |
Set to |
Expression set with expression estimates.
featureNames
identify each transcript via
RefSeq ids, and the featureData
contains further information.
If citype
was set to a value other than "none"
, the featureData
also contains the 95% credibility intervals
(i.e. intervals that contain the true parameter value with 95% posterior probability).
Camille Stephan-Otto Attolini, Manuel Kroiss, David Rossell
Rossell D, Stephan-Otto Attolini C, Kroiss M, Stocker A. Quantifying Alternative Splicing from Paired-End RNA-sequencing data. Annals of Applied Statistics, 8(1):309-330.
relexprByGene
data(K562.r1l1) data(hg19DB) #Pre-process bam0 <- rmShortInserts(K562.r1l1, isizeMin=100) pbam0 <- procBam(bam0) head(getReads(pbam0)) #Estimate distributions, get path counts distrs <- getDistrs(hg19DB,bam=bam0,readLength=75) pc <- pathCounts(pbam0, DB=hg19DB) #Get estimates eset <- calcExp(distrs=distrs, genomeDB=hg19DB, pc=pc, readLength=75, rpkm=FALSE) head(exprs(eset)) head(fData(eset)) #Re-normalize relative expression to add up to 1 within gene_id rather # than island_id eset <- relexprByGene(eset) #Add fake sample by permuting and combine eset2 <- eset[sample(1:nrow(eset),replace=FALSE),] sampleNames(eset2) <- '2' #must have a different name esetall <- mergeExp(eset,eset2) #After merge samples are correctly matched head(exprs(esetall)) head(fData(esetall))
data(K562.r1l1) data(hg19DB) #Pre-process bam0 <- rmShortInserts(K562.r1l1, isizeMin=100) pbam0 <- procBam(bam0) head(getReads(pbam0)) #Estimate distributions, get path counts distrs <- getDistrs(hg19DB,bam=bam0,readLength=75) pc <- pathCounts(pbam0, DB=hg19DB) #Get estimates eset <- calcExp(distrs=distrs, genomeDB=hg19DB, pc=pc, readLength=75, rpkm=FALSE) head(exprs(eset)) head(fData(eset)) #Re-normalize relative expression to add up to 1 within gene_id rather # than island_id eset <- relexprByGene(eset) #Add fake sample by permuting and combine eset2 <- eset[sample(1:nrow(eset),replace=FALSE),] sampleNames(eset2) <- '2' #must have a different name esetall <- mergeExp(eset,eset2) #After merge samples are correctly matched head(exprs(esetall)) head(fData(esetall))
Obtains expression estimates from denovoGenomeExpr
objects, as
returned by calcDenovo
.
When rpkm
is set to TRUE
, fragments per kilobase
per million are returned. Otherwise relative expression estimates are
returned.
The estimates can be obtained by Bayesian model averaging (default) or by selecting the model with highest posterior probability. See details.
denovoExpr(x, pc, rpkm = TRUE, summarize = "modelAvg", minProbExpr = 0.5, minExpr = 0.05)
denovoExpr(x, pc, rpkm = TRUE, summarize = "modelAvg", minProbExpr = 0.5, minExpr = 0.05)
x |
|
pc |
Named vector of exon path counts as returned by |
rpkm |
Set to |
summarize |
Set to |
minProbExpr |
Variants with (marginal posterior) probability of being
expressed below |
minExpr |
Variants with relative expression |
Expression set with expression estimates.
The featureData
indicates the gene island id, posterior probability
that each variant is expressed (column "probExpressed"
) and the
number of aligned reads per gene island (column "explCnts"
).
David Rossell
Rossell D, Stephan-Otto Attolini C, Kroiss M, Stocker A. Quantifying Alternative Splicing from Paired-End RNA-sequencing data. Annals of Applied Statistics, 8(1):309-330.
## NOTE: toy example with few reads & genes to illustrate code usage ## Results with complete data are much more interesting! data(K562.r1l1) data(hg19DB) #Pre-process bam0 <- rmShortInserts(K562.r1l1, isizeMin=100) pbam0 <- procBam(bam0) #Estimate distributions, get path counts distrs <- getDistrs(hg19DB,bam=bam0,readLength=75) pc <- pathCounts(pbam0, DB=hg19DB) #Set prior distrib on model space mprior <- modelPrior(hg19DB, maxExons=40, smooth=FALSE) #Fit model denovo <- calcDenovo(distrs,targetGenomeDB=hg19DB,pc=pc,readLength=75,priorq=3,mprior=mprior,minpp=0) head(names(denovo)) denovo[['6499']] posprob(denovo[['6499']]) #Get estimates eset <- denovoExpr(denovo, pc=pc, rpkm=TRUE, minProbExpr=0.5) head(exprs(eset)) head(fData(eset))
## NOTE: toy example with few reads & genes to illustrate code usage ## Results with complete data are much more interesting! data(K562.r1l1) data(hg19DB) #Pre-process bam0 <- rmShortInserts(K562.r1l1, isizeMin=100) pbam0 <- procBam(bam0) #Estimate distributions, get path counts distrs <- getDistrs(hg19DB,bam=bam0,readLength=75) pc <- pathCounts(pbam0, DB=hg19DB) #Set prior distrib on model space mprior <- modelPrior(hg19DB, maxExons=40, smooth=FALSE) #Fit model denovo <- calcDenovo(distrs,targetGenomeDB=hg19DB,pc=pc,readLength=75,priorq=3,mprior=mprior,minpp=0) head(names(denovo)) denovo[['6499']] posprob(denovo[['6499']]) #Get estimates eset <- denovoExpr(denovo, pc=pc, rpkm=TRUE, minProbExpr=0.5) head(exprs(eset)) head(fData(eset))
denovoGeneExpr
stores inferred expression for de novo splicing
variants for a single gene.
denovoGenomeExpr
stores the information for several genes
(typically, the whole genome).
Objects are returned by calcDenovo
. When running
calcDenovo
on multiple genes results are returned in a
denovoGenomeExpr
object. Results for a single gene can be
retrieved using the [[ operator as usual, which returns a
denovoGeneExpr
object.
data.frame
containing the posterior
probability of each model
data.frame
with the estimated expression of
each variant under each model
matrix
indicating the exons contained in
each variant.
Sum of the log(integrated likelihood) + log(model prior probability) across all considered models.
Number of paths that had 0 probability under all considered variants and had to be excluded for model fitting purposes.
Input parameter to calcDenovo
Length of transcripts in bp (including new isoforms found by casper)
signature(object = "denovoGeneExpr")
: Displays
general information about the object.
Show names (island ids)
Selects a subset of genes
Selects a single gene
Accesses the posterior probabilities of each model (slot posprob)
Accesses the variant names and their respective exons
Replaces the value of the slot variants (can be useful for renaming variants, for instance)
David Rossell
calcDenovo
to create objects from the class.
denovoExpr
to obtain expression estimates from
denovoGenomeExpr
objects.
showClass("denovoGeneExpr")
showClass("denovoGeneExpr")
denovoGeneExpr
stores inferred expression for de novo splicing
variants for a single gene.
denovoGenomeExpr
stores the information for several genes
(typically, the whole genome).
Objects are returned by calcDenovo
.
A list of denovoGeneExpr
objects, with each
element containing results for an individual gene.
signature(object = "denovoGenomeExpr")
: Displays
general information about the object.
Coerces the object to a list
Selects a subset of genes
Selects a single gene
Camille Stephan-Otto Attolini
procGenome
and createDenovoGenome
to
create denovoGenomeExpr
objects.
showClass("denovoGeneExpr") showClass("denovoGenomeExpr")
showClass("denovoGeneExpr") showClass("denovoGenomeExpr")
We downloaded the fastq files, aligned with TopHat and processed with
wrapKnown
to obtain the estimated distributions for each of the 6
samples. distrsGSE37704
is a list with the 6 corresponding elements.
The estimated distributions for HiSeq data were very similar,
hence these distributions can be used as defaults for Illumina MiSeq and
HiSeq experiments.
data(distrsGSE37704)
data(distrsGSE37704)
An list
with 6 elements of class readDistrs
.
See help(getDistrs)
and help(readDistrs-class)
for details.
data(distrsGSE37704) distrsGSE37704 plot(distrsGSE37704[[1]],'readSt') lines(distrsGSE37704[[2]], 'readSt', col=2) plot(distrsGSE37704[[1]],'fragLength')
data(distrsGSE37704) distrsGSE37704 plot(distrsGSE37704[[1]],'readSt') lines(distrsGSE37704[[2]], 'readSt', col=2) plot(distrsGSE37704[[1]],'fragLength')
Plot exon structure for each transcript of a given gene. Optionally, aligned reads can be added to the plot.
genePlot(generanges, islandid, genomeDB, reads, exp, names.arg, xlab='', ylab='', xlim, cex=1, yaxt='n', col, ...)
genePlot(generanges, islandid, genomeDB, reads, exp, names.arg, xlab='', ylab='', xlim, cex=1, yaxt='n', col, ...)
generanges |
Object containing the ranges with start/end of each exon. |
islandid |
If |
genomeDB |
Annotated genome produced with the "procGenome" function |
reads |
|
exp |
|
names.arg |
Optionally, indicate the names of each transcript. |
xlab |
x-axis label |
ylab |
y-axis label |
xlim |
x-axis limits, defaults to start of 1st exon and end of last exon |
cex |
Character expansion |
yaxt |
The y-axis in the plot has no interpretation, hence by default it is not displayed. |
col |
Either single color or vector of colors to be used to draw each transcript. Defaults to rainbow colors. |
... |
Other arguments to be passed on to |
A plot is produced.
signature(generanges="CompressedIRangesList", islandid="ANY", genomeDB="ANY", reads="ANY", exp="ANY")
Plots a set of transcripts. Each element in the generanges
corresponds to a
transcript. Each transcript should contain exon start/end positions.
signature(generanges="IRanges", islandid="ANY", genomeDB="ANY", reads="ANY", exp="ANY")
Plots a single transcript. Each range indicates the start/end of a single exon.
signature(generanges="IRangesList", islandid="ANY", genomeDB="ANY", reads="ANY", exp="ANY")
Plots a set of transcripts. Each element in the generanges
corresponds to a
transcript. Each transcript should contain exon start/end positions.
signature(generanges="GRangesList", islandid="ANY",
genomeDB="ANY", reads="ANY", exp="ANY")
Plots a set of transcripts. Each element in the
generanges
corresponds to a
transcript. Each transcript should contain exon start/end
positions.
signature(generanges="GRanges", islandid="ANY", genomeDB="ANY", reads="ANY", exp="ANY")
Plots a set of transcripts. Each space in generanges
corresponds to a
transcript. Each transcript should contain exon start/end positions.
signature(generanges="missing", islandid="character", genomeDB="annotatedGenome", reads="GRanges", exp="ExpressionSet")
Plots all transcripts stored in genomeDB
for island with
identifier islandid
. Individual reads are added to the plot
(reads
contains start/end of individual read fragments).
signature(generanges="missing", islandid="character", genomeDB="annotatedGenome", reads="missing", exp="missing")
Plots all transcripts stored in genomeDB
for island with identifier islandid
.
signature(generanges="missing", islandid="character", genomeDB="annotatedGenome", reads="procBam", exp="missing")
Plots all transcripts stored in genomeDB
for island with
identifier islandid
. Individual reads are added to the plot
(reads
contains start/end of individual read fragments).
signature(generanges="missing", islandid="character", genomeDB="annotatedGenome", reads="procBam", exp="ExpressionSet")
Plots all transcripts stored in genomeDB
for island with
identifier islandid
. Individual reads and estimated
expression are added to the plot
(reads
contains start/end of individual read fragments).
Camille Stephan-Otto Attolini, David Rossell
data(hg19DB) #Plot an IRangesList txs <- transcripts(txid="NM_005158",genomeDB=hg19DB) genePlot(txs) #Equivalently, indicate islandid islandid <- getIsland(txid="NM_005158",genomeDB=hg19DB) genePlot(islandid=islandid, genomeDB=hg19DB)
data(hg19DB) #Plot an IRangesList txs <- transcripts(txid="NM_005158",genomeDB=hg19DB) genePlot(txs) #Equivalently, indicate islandid islandid <- getIsland(txid="NM_005158",genomeDB=hg19DB) genePlot(islandid=islandid, genomeDB=hg19DB)
Compute fragment start distributions by using reads aligned to genes with only one annotated variant. Estimate fragment length distribution using fragments aligned to long exons (>1000nt). Fragment length is defined as the distance between the start of the left-end read and the end of the right-end read.
getDistrs(DB, bam, pbam, islandid=NULL, verbose=FALSE, nreads=4*10^6, readLength, min.gt.freq = NULL, tgroups=5, mc.cores=1)
getDistrs(DB, bam, pbam, islandid=NULL, verbose=FALSE, nreads=4*10^6, readLength, min.gt.freq = NULL, tgroups=5, mc.cores=1)
DB |
Annotated genome. Object of class |
bam |
Aligned reads, as returned by |
pbam |
Processed BAM object of class |
islandid |
Island IDs of islands to be used in the read start distribution calculations (defaults to genes with only one annotated variant) |
verbose |
Set to |
nreads |
To speed up computations, only the first |
readLength |
Read length in bp, e.g. in a paired-end experiment where
75bp are sequenced on each end one would set |
min.gt.freq |
The target distributions cannot be estimated with
precision for gene types that are very unfrequent.
Gene types with relative frequency below |
tgroups |
As an alternative to |
mc.cores |
Number of cores to use for parallel processing |
An object of class readDistrs with slots:
lenDis |
Table with number of fragments with a given length |
stDis |
Cumulative distribution function (object of type closure) for relative start position |
Camille Stephan-Otto Attolini, David Rossell
data(K562.r1l1) data(hg19DB) bam0 <- rmShortInserts(K562.r1l1, isizeMin=100) distrs <- getDistrs(hg19DB,bam=bam0,readLength=75) #Fragment length distribution plot(distrs,'fragLength') #Fragment start distribution (relative to transcript length) plot(distrs,'readSt')
data(K562.r1l1) data(hg19DB) bam0 <- rmShortInserts(K562.r1l1, isizeMin=100) distrs <- getDistrs(hg19DB,bam=bam0,readLength=75) #Fragment length distribution plot(distrs,'fragLength') #Fragment start distribution (relative to transcript length) plot(distrs,'readSt')
annotatedGenome objects store information regarding genes and transcripts. When there's an overlap in exons between several genes, these genes are grouped into gene islands. getIsland retrieves the island to which each gene or transcript was assigned, while getChr indicates the chromosome.
getIsland(entrezid, txid, genomeDB) getChr(entrezid, txid, islandid, genomeDB)
getIsland(entrezid, txid, genomeDB) getChr(entrezid, txid, islandid, genomeDB)
entrezid |
Character indicating single Entrez identifier. Can be left missing and specify another identifier instead. |
txid |
Character indicating a single RefSeq transcript identifier. Can be left missing and specify another identifier instead. |
islandid |
Character indicating the gene island indentifier. Can be left missing and specify another identifier instead. |
genomeDB |
Object of class |
Character with island identifier
signature(entrezid='character',txid='missing',genomeDB='annotatedGenome')
Return island id for given Entrez identifier
signature(entrezid='missing',txid='character',genomeDB='annotatedGenome')
Return island id for given transcript identifier (RefSeq)
signature(entrezid='character',txid='missing',islandid='missing',genomeDB='annotatedGenome')
Return chromosome for given Entrez identifier (RefSeq)
signature(entrezid='missing',txid='character',islandid='missing',genomeDB='annotatedGenome')
Return chromosome for given transcript identifier (RefSeq)
signature(entrezid='missing',txid='missing',islandid='character',genomeDB='annotatedGenome')
Return chromosome for given island identifier
signature(entrezid='character',txid='missing',islandid='missing')
Return chromosome for given Entrez identifier
signature(entrezid='missing',txid='character',islandid='missing')
Return chromosome for given transcript identifier (RefSeq)
signature(entrezid='missing',txid='character',islandid='missing')
Return chromosome for given island identifier
data(hg19DB) getIsland(entrezid="27",genomeDB=hg19DB) getIsland(txid="NM_005158",genomeDB=hg19DB) getChr(entrezid="27",genomeDB=hg19DB) getChr(txid="NM_005158",genomeDB=hg19DB)
data(hg19DB) getIsland(entrezid="27",genomeDB=hg19DB) getIsland(txid="NM_005158",genomeDB=hg19DB) getChr(entrezid="27",genomeDB=hg19DB) getChr(txid="NM_005158",genomeDB=hg19DB)
getNreads returns a numeric vector with the total number of path counts in each island from a pathCounts object.
getNreads(pc)
getNreads(pc)
pc |
pathCounts object generated by pathCounts() |
Numeric vector with total number of path counts in each island of pc.
signature(pathCounts='pathCounts')
Returns numeric vector with total number of path counts for each island in the pathCounts object.
Camille Stephan-Otto Attolini
##---- Should be DIRECTLY executable !! ---- ##-- ==> Define data, use random, ##-- or do help(data=index) for the standard data sets.
##---- Should be DIRECTLY executable !! ---- ##-- ==> Define data, use random, ##-- or do help(data=index) for the standard data sets.
procBam
object.
procBam
objects store reads that have been split according to
their CIGAR codes. getReads
accesses these reads.
getReads(x)
getReads(x)
x |
Object of class |
RangedData
object with reads stored in x
.
signature(x='procBam')
Return reads stored in x
.
#See example in calcExp
#See example in calcExp
getRoc
compares simulation truth and data analysis results to
determine False Positives (FP), False Negatives (FP), True Positives
(TP), True Negatives (TN), Positives (FP+TP), False Discovery
Proportion (FP/P) and Power (TP/(TP+FN)).
getRoc(simTruth, decision)
getRoc(simTruth, decision)
simTruth |
Binary vector or matrix indicating simulation truth
( |
decision |
Binary vector or matrix with differential expression calls based on some data analysis. |
data.frame
with TP, FP, TN, FN, P, FDR and Power.
signature(simTruth='logical',decision='logical')
Operating characteristics are computed for a single simulation
signature(simTruth='numeric',decision='numeric')
Operating characteristics are computed for a single simulation
signature(simTruth='matrix',decision='matrix')
simTruth
and decision
contain truth and calls for
several simulations (in columns).
getRoc
returns a data.frame with operating characteristics
in each simulation.
David Rossell
## See help(probNonEquiv) for an example
## See help(probNonEquiv) for an example
We downloaded the human genome hg19 via procGenome
and selected a few genes from chromosome 1
to use as a toy data for the vignette and examples.
data(hg19DB)
data(hg19DB)
An annotatedGenome
object. See help(procGenome)
and
help(annotatedGenome-class)
for details.
data(hg19DB) hg19DB slotNames(hg19DB)
data(hg19DB) hg19DB slotNames(hg19DB)
The paired-end RNA-seq data is from the RGASP project
sample K562_2x75 (replicate 1, lane 1)
and was obtained at
ftp://ftp.sanger.ac.uk/pub/gencode/rgasp/RGASP1/inputdata/human_fastq.
Reads were aligned against hg19 with tophat 2.0.2 and bowtie 0.12.5, setting the insert size at -r 200,
and imported into R using scanBam
from package Rsamtools
.
For illustration purposes, we selected reads mapping to a few genes only
(namely, the genes that were also selected for the toy genome annotation in data(hg19DB)
.
data(K562.r1l1)
data(K562.r1l1)
A list indicating read id, chromosome, start and end locations and the position of the pair, as returned by scanBam
.
ftp://ftp.sanger.ac.uk/pub/gencode/rgasp/RGASP1/inputdata/human_fastq
C Trapnell, L Pachter, SL Salzberg. TopHat: discovering splice junctions with RNA-Seq. Bioinformatics, 2009, 25, 1105-1111. doi=10.1093/bioinformatics/btp120.
B Langmead, C Trapnell, M Pop, SL Salzberg. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology, 2009, 10:R25.
data(K562.r1l1) names(K562.r1l1)
data(K562.r1l1) names(K562.r1l1)
mergeBatches
combines x
and y
into an
ExpressionSet
,
performs quantile normalization and adjusts for batch effects by
subtracting the mean expression in each batch (and then adding the
grand mean so that the mean expression per gene is unaltered).
mergeBatches(x, y, mc.cores=1)
mergeBatches(x, y, mc.cores=1)
x |
|
y |
Either |
mc.cores |
Number of processors to be used (ignored when |
When y
is an ExpressionSet
, mergeBatches
returns
an ExpressionSet
with combined expressions.
Its featureData
contains a variable "batch"
indicating
the batch that each sample corresponded to.
When y
is a simulatedSamples
object, mergeBatches
is applied to combine x
with each dataset in y
and a
list of ExpressionSet
objects is returned.
David Rossell
#Fake data from 2 batches x <- matrix(rnorm(6),nrow=2) colnames(x) <- paste('x',1:3,sep='') y <- matrix(1+rnorm(6),nrow=2) colnames(y) <- paste('y',1:3,sep='') x <- new("ExpressionSet",exprs=x) y <- new("ExpressionSet",exprs=y) exprs(x) exprs(y) #Merge & adjust z <- mergeBatches(x,y) exprs(z)
#Fake data from 2 batches x <- matrix(rnorm(6),nrow=2) colnames(x) <- paste('x',1:3,sep='') y <- matrix(1+rnorm(6),nrow=2) colnames(y) <- paste('y',1:3,sep='') x <- new("ExpressionSet",exprs=x) y <- new("ExpressionSet",exprs=y) exprs(x) exprs(y) #Merge & adjust z <- mergeBatches(x,y) exprs(z)
mergeExp
combines the output of calcExp
from multiple
samples, i.e. multiple ExpressionSet
objects, into a single ExpressionSet
mergeExp(..., sampleNames, keep=c('transcript','gene_id','island_id'))
mergeExp(..., sampleNames, keep=c('transcript','gene_id','island_id'))
... |
|
sampleNames |
Character vector indicating the name of each sample. Defaults to 'Sample1', 'Sample2', etc. |
keep |
Variables in the featureData of each individual
|
mergeExp
runs some checks to ensure that object can be combined
(e.g. making sure that measurements are obtained on same set of
genes), then sorts and formats each input
ExpressionSet
.
A label with the sample name is appended to variables in the featureData that appear in multiple samples, e.g. variable 'se' reporting standard errors (obtained by setting citype='asymp' in calcExp).
Object of class ExpressionSet
combining the input
ExpressionSets. Its featureData
contains the columns indicated
in the keep argument, plus a column readCount
with the total
number of reads mapped to each gene (or gene island, when multiple
genes have overlapping exons).
David Rossell
calcExp
to obtain an ExpressionSet
for an individual sample.
#See example in calcExp
#See example in calcExp
Set prior on expressed splicing variants using the genome annotation
contained in a knownGenome
object.
The prior probability of variants V1,...,Vn being expressed depends on n, on the number of exons in each variant V1,...,Vn and the number of exons in the gene. See the details section.
modelPrior(genomeDB, maxExons=40, smooth=TRUE, verbose=TRUE)
modelPrior(genomeDB, maxExons=40, smooth=TRUE, verbose=TRUE)
genomeDB |
Object of class |
maxExons |
The prior distribution is estimated for genes with 1
up to |
smooth |
If set to |
verbose |
Set to |
The goal is to set a prior that takes into account the number of annotated variants for genes with E exons, as well as the number of exons in each variant.
Suppose we have a gene with E exons. Let V_1,...,V_n be n variants of interest and let |V_1|,...,|V_n| be the corresponding number of exons in each variant. The prior probability of variants V_1,...,V_n being expressed is modeled as
P(V_1,...,V_n|E)= P(n|E) P(|V_1| |E) ... P(|V_n| |E)
where P(n|E)= NegBinom(n; k_E, r_E) I(0 < n < 2^E) and P(|V_i| |E)= BetaBinomial(|V_i|-1; E-1, alpha_E, beta_E).
The parameters k_E, r_E, alpha_E, beta_E depend on E (the number of exons
in the gene) and are estimated from the available annotation via
maximum likelihood.
Parameters are estimated jointly for all genes with E>=
maxExons
in order to improve the precision.
For smooth==TRUE
, alpha_E and beta_E are modeled as a smooth
function of E by calling gam
and setting the smoothing
parameter via cross-validation. Estimates for genes with E>=10 are
substituted by their smooth versions, which typically helps improve
stability in the estimates.
List with 2 components.
nvarPrior |
List with prior distribution on the number of expressed variants for genes with 1,2,3... exons. Each element contains the truncated Negative Binomial parameters, observed and predicted frequencies (counting the number of genes with a given number of variants). |
nexonPrior |
List with prior distribution on the number of exons in a variant for genes with 1,2,3... exons. Each element contains the Beta-Binomial parameters, observed and predicted frequencies (counting the number of variants with a given number of exons) |
David Rossell, Camille Stephan-Otto Attolini
data(hg19DB) mprior <- modelPrior(hg19DB, maxExons=10) ##Prior on number of expressed variants ##Genes with 2 exons ##mprior$nvarPrior[['2']] ##Genes with 3 exons ##mprior$nvarPrior[['3']] ##Prior on the number of exons in an expressed variant ##Genes with 2 exons ##mprior$nexonPrior[['2']] ##Genes with 3 exons ##mprior$nexonPrior[['3']]
data(hg19DB) mprior <- modelPrior(hg19DB, maxExons=10) ##Prior on number of expressed variants ##Genes with 2 exons ##mprior$nvarPrior[['2']] ##Genes with 3 exons ##mprior$nvarPrior[['3']] ##Prior on the number of exons in an expressed variant ##Genes with 2 exons ##mprior$nexonPrior[['2']] ##Genes with 3 exons ##mprior$nexonPrior[['3']]
modelPriorAS
stores parameters for the prior distribution on
all possible alternative splicing configuration (i.e. prior on model
space). This information is used for de novo reconstruction of
splicing variants.
Objects are created by function modelPrior
.
Prior on the number of variants per gene. A list with components "nbpar"
containing
the parameters of the Negative Binomial distribution, "obs"
containing the observed counts and "pred"
the Negative
Binomial predicted counts.
Prior on the number of exons in an expressed
variant. A list with components "bbpar"
containing
Beta-Binomial parameters, "obs"
containing the observed
counts and "pred"
the Beta-Binomial predicted counts.
signature(object = "modelPriorAS")
: Displays
general information about the object.
Selects prior parameters for genes with the specified number of exons
Selects a single gene
David Rossell
procGenome
and createDenovoGenome
to
create modelPriorAS
objects.
showClass("modelPriorAS")
showClass("modelPriorAS")
Compute counts for exon paths visited by aligned reads
pathCounts(reads, DB, mc.cores = 1, verbose=FALSE)
pathCounts(reads, DB, mc.cores = 1, verbose=FALSE)
reads |
Object of class |
DB |
Object of class |
mc.cores |
Number of processors to be used for parallel computing. Requires
having package |
verbose |
Set to |
Named integer vector with counts of exon paths. Names are character strings built as ".exon1.exon2-exon3.exon4.", with dashes making the split between exons visited by left and right-end reads correspondingly.
signature(reads='list')
Computes counts for exon paths from a list of procBam objects (usually reads processed and split by chromosome).
signature(reads='procBam')
Compute counts for exon paths from a procBam object of processed reads.
Camille Stephan-Otto Attolini
procGenome
to create an annotated genome object,
createDenovoGenome
to create a de novo annotated genome.
See help(getNreads) to get number of fragments mapping to each island.
##---- Should be DIRECTLY executable !! ---- ##-- ==> Define data, use random, ##-- or do help(data=index) for the standard data sets.
##---- Should be DIRECTLY executable !! ---- ##-- ==> Define data, use random, ##-- or do help(data=index) for the standard data sets.
Stores exon path counts.
Objects are created with a call to pathCounts
.
List with one element per gene island. For each island, it contains a named vector with exon path counts. The names indicate the visited exons.
For instance, consider that for gene '1' with 2 exons we observe 10
reads in which the left end falls completely in exon 1 and the right
end in exon 2. Suppose that for 5 reads the left end bridges exons 1-2
and the right end falls in exon 2. Then pc[['1']]
would contain
c(10,5)
and names(pc[['1']])
would contain c(".1-2.",".1.2-2.")
Logical variable. FALSE
indicates that the
counts correspond to a known genome (i.e. created with procGenome
), and TRUE
to a de novo
annotated genome (i.e. created with createDenovoGenome
).
Logical variable. TRUE
indicates that the
path counts were obtained from and RNA-seq experiment where strand
information was preserved.
signature(object = "pathCounts")
: Displays general information about the object.
Camille Stephan-Otto Attolini
showClass("pathCounts")
showClass("pathCounts")
Plots the estimated fragment length (insert size) distribution and the relative read start distribution (0 indicating transcription start, 1 transcription end). The former checks that the insert size distribution matches that described in the experimental protocol. The latter checks the extent to which reads are non-uniformly distributed (note: casper does NOT assume reads to be uniformly distributed, so a lack of uniformity is not a problem per se).
x |
Object of type |
y |
Set to |
... |
Further arguments to be passed on to plot. |
signature(x = "readDistrs", y = "ANY")
x
is an object of type readDistrs
, as returned by
getDistrs
. The plot allows to visualize the fragment
length and read start distributions in a given sample.
signature(x = "readDistrs")
x
is an object of type readDistrs
, as returned by
getDistrs
. The plot allows to visualize the fragment
length and read start distributions in a given sample.
signature(x = "readDistrsList", y = "ANY")
x
is an object of type readDistrsList
storing fragment
length and read start distributions for multiple samples.
signature(x = "readDistrsList")
x
is an object of type readDistrsList
storing fragment
length and read start distributions for multiple samples.
#See getDistrs examples
#See getDistrs examples
Plots variants with sufficiently large posterior probability of being expressed along with their (marginal) estimated expression.
plotExpr(gene, minProbExpr = 0.5, minExpr = 0.1, xlab = "(kb)", ylab = "", xlim, cex = 1, yaxt = "n", col, ...)
plotExpr(gene, minProbExpr = 0.5, minExpr = 0.1, xlab = "(kb)", ylab = "", xlim, cex = 1, yaxt = "n", col, ...)
gene |
|
minProbExpr |
Variants with marginal posterior probability of
expression below |
minExpr |
Variants with (marginal) estimated expression below
|
xlab |
x-axis label, passed on to |
ylab |
y-axis label, passed on to |
xlim |
x-axis limits, passed on to |
cex |
Character expansion, passed on to |
yaxt |
Type of y-axis, passed on to |
col |
Colors for each variant, defaults to rainbow colors. It is possible to specify a single color. |
... |
Other arguments to be passed on to |
The marginal posterior probability that a variant is expressed is the sum of the posterior probabilities of all models containing that variant.
The marginal estimated expression is the average expression across all models (including those where the variant has 0 expression) weighted by the posterior probability of each model.
signature(gene = "denovoGeneExpr")
gene
contains the results from a de novo isoform expression
analysis for a single gene, as returned by calcDenovo
. When
calcDenovo
is run on multiple genes simultaneously, the
desired gene can be selected using the "[[" operator as usual.
#See calcDenovo examples
#See calcDenovo examples
Plots the prior distribution on the number of expressed variants and the
number of exons per variant in genes with exons
exons
(as returned by function modelPrior
).
The prior distribution is compared to the observed frequencies to check
that the assumed distributional forms are reasonable.
plotPriorAS(object, type="nbVariants", exons=1:9, xlab, ylab="Probability", col=c("red","blue"))
plotPriorAS(object, type="nbVariants", exons=1:9, xlab, ylab="Probability", col=c("red","blue"))
object |
|
type |
Set to |
exons |
Vector with integers. The plot is only produced with
number of exons indicated in |
xlab |
x-axis label, passed on to |
ylab |
y-axis label, passed on to |
col |
Colors for bars showing prior probabilities and frequencies in the known genome |
signature(object = "modelPriorAS")
object
contains the prior distribution on the model space, as
returned by function modelPrior
#See modelPrior examples
#See modelPrior examples
probNonEquiv
performs a Bayesian hypothesis test for equivalence between group means.
It returns the posterior probability that |mu1-mu2|>logfc.
pvalTreat
is a wrapper to treat
in package limma
,
which returns P-values for the same hypothesis test.
probNonEquiv
computes v_i=P(|theta_i| > logfc | data), where theta_i is
the difference between group means for gene i. This posterior
probability is based on the NNGCV model from package EBarrays, which
has a formulation similar to limma in an empirical Bayes framework.
Notice that the null hypothesis here is that |theta_i|<logfc,
e.g. isoforms with small fold changes are regarded as uninteresting.
Subsequent differential expression calls are based on selecting large v_i. For instance, selecting v_i >= 0.95 guarantees that the posterior expected false discovery proportion (a Bayesian FDR analog) is below 0.05.
probNonEquiv(x, groups, logfc = log(2), minCount, method = "plugin", mc.cores=1) pvalTreat(x, groups, logfc = log(2), minCount, p.adjust.method='none', mc.cores = 1)
probNonEquiv(x, groups, logfc = log(2), minCount, method = "plugin", mc.cores=1) pvalTreat(x, groups, logfc = log(2), minCount, p.adjust.method='none', mc.cores = 1)
x |
ExpressionSet containing expression levels, or list of ExpressionSets |
groups |
Variable in fData(x) indicating the two groups to compare (the case with more than 2 groups is not implemented). |
logfc |
Biologically relevant threshold for the log fold change, i.e. difference between groups means in log-scale |
minCount |
If specified, probabilities are only computed for rows with |
method |
Set to |
mc.cores |
Number of parallel processors to use. Ignored unless
|
p.adjust.method |
P-value adjustment method, passed on to |
If x
is a single ExpressionSet
, probNonEquiv
returns a vector with posterior probabilities
(NA for rows with less than minCount
reads).
pvalTreat
returns TREAT P-values instead.
If x
is a list of ExpressionSet
, the function is applied
to each element separately and results are returned as columns in the
output matrix.
Victor Pena, David Rossell
Rossell D, Stephan-Otto Attolini C, Kroiss M, Stocker A. Quantifying Alternative Splicing from Paired-End RNA-sequencing data. Annals of Applied Statistics, 8(1):309-330
McCarthy DJ, Smyth GK. Testing significance relative to a fold-change threshold is a TREAT. Bioinformatics, 25(6):765-771
treat
in package limma
, p.adjust
#Simulate toy data p <- 50; n <- 10 x <- matrix(rnorm(p*2*n),nrow=p) x[(p-10):p,1:n] <- x[(p-10):p,1:n] + 1.5 x <- new("ExpressionSet",exprs=x) x$group <- rep(c('group1','group2'),each=n) #Posterior probabilities pp <- probNonEquiv(x, groups='group', logfc=0.5) d <- rowMeans(exprs(x[,1:n])) - rowMeans(exprs(x[,-1:-n])) plot(d,pp,xlab='Observed log-FC') abline(v=c(-.5,.5)) #Check false positives truth <- rep(c(FALSE,TRUE),c(p-11,11)) getRoc(truth, pp>.9) getRoc(truth, pp>.5)
#Simulate toy data p <- 50; n <- 10 x <- matrix(rnorm(p*2*n),nrow=p) x[(p-10):p,1:n] <- x[(p-10):p,1:n] + 1.5 x <- new("ExpressionSet",exprs=x) x$group <- rep(c('group1','group2'),each=n) #Posterior probabilities pp <- probNonEquiv(x, groups='group', logfc=0.5) d <- rowMeans(exprs(x[,1:n])) - rowMeans(exprs(x[,-1:-n])) plot(d,pp,xlab='Observed log-FC') abline(v=c(-.5,.5)) #Check false positives truth <- rep(c(FALSE,TRUE),c(p-11,11)) getRoc(truth, pp>.9) getRoc(truth, pp>.5)
Process paired-end data stored in BAM object generated by scanBam. Outputs GRanges objects for reads and junctions.
procBam(bam, stranded=FALSE, seed=as.integer(1), verbose=FALSE, rname='null', keep.junx=FALSE, keep.flag=FALSE, ispaired=TRUE,...)
procBam(bam, stranded=FALSE, seed=as.integer(1), verbose=FALSE, rname='null', keep.junx=FALSE, keep.flag=FALSE, ispaired=TRUE,...)
bam |
BAM object generated by |
stranded |
Set to |
seed |
Seed for random number generator |
verbose |
Set to |
rname |
Chromosome to process be combined with the |
keep.junx |
Option to store junction information. Only useful for finding denovo exons and transcripts. |
keep.flag |
Option to store aligment flag information. |
ispaired |
Set to |
... |
Other arguments |
In case of multihits with same start position for both reads but different insertions/deletions patterns only one alignment is chosen at random.
An object of class procBam
containing
reads with both ends
correctly aligned and split according to the corresponding CIGAR.
Unique identifiers by fragment are stored. Junctions spanned by reads
are also stored in GRanges object if the argument \'keep.junx\' is set
to TRUE.
signature(bam='list',stranded='logical',seed='integer',verbose='logical',
rname='character',keep.junx='logical',keep.flag='logical')
Process paired-end data stored in BAM object generated by scanBam. Outputs GRanges objects for reads and (optionally) junctions.
Camille Stephan-Otto Attolini
scanBam
from package Rsamtools
,
help("procBam-class")
, getReads
.
##See example in calcExp
##See example in calcExp
Stores processed bam files in a RangedData
format. Each read is
split into disjoint ranges according to its cigar code.
Objects are created with a call to procBam
.
GRanges
indicating chromosome, start and end
of each disjoint range. The pair id and read id within the pair
are also stored.
GRanges
indicating chromosome, start and end
of junctions spanned by reads.
Logical variable. TRUE
indicates that the
reads were obtained from and RNA-seq experiment where strand
information was preserved.
In the case of stranded experiments:
GRanges
indicating chromosome, start and end of
each disjoint range for fragments originated from the positive
strand. The pair id and read id within the pair are also stored.
GRanges
indicating chromosome, start and end
of each disjoint range for fragments originated from the
negative strand. The pair id and read id within the pair are
also stored.
GRanges
indicating chromosome, start and end
of junctions spanned by reads originated from the positive strand.
GRanges
indicating chromosome, start and end
of junctions spanned by reads originated from the negative
strand.
signature(object = "procBam")
: Displays general
information about the object.
signature(x = "procBam")
: Extracts the
aligned reads stored in x
.
Camille Stephan-Otto Attolini, David Rossell
getReads
showClass("procBam")
showClass("procBam")
procGenome
processes annotations for a given transcriptome,
either from a TxDb
object created by GenomicFeatures
package
(e.g. from UCSC) or from a user-provided GRanges
object (e.g. by
importing a gtf file).
createDenovoGenome
creates a de novo annotated genome by
combining UCSC annotations and observed RNA-seq data.
procGenome(genDB, genome, mc.cores=1, verbose=TRUE) createDenovoGenome(reads, DB, minLinks=2, maxLinkDist=1e+05, maxDist=1000, minConn=2, minJunx=3, minLen=12, mc.cores=1)
procGenome(genDB, genome, mc.cores=1, verbose=TRUE) createDenovoGenome(reads, DB, minLinks=2, maxLinkDist=1e+05, maxDist=1000, minConn=2, minJunx=3, minLen=12, mc.cores=1)
genDB |
Either a |
genome |
Character indicating genome version (e.g. "hg19", "dm3") |
mc.cores |
Number of cores to use in parallel processing (multicore package required) |
verbose |
Set to |
DB |
|
minLinks |
Minimum number of reads joining two exons to merge their corresponding genes |
maxLinkDist |
Maximum distance between two exons to merge
their correspondin genes. A value of |
maxDist |
Maximum distance between two exons with reads joining them to merge their corresponding genes. |
minConn |
Minimum number of fragments connecting a new exon to an annotated one to add to denovo genome. |
minJunx |
Minimum number of junctions needed to redefine an annotated exon's end or start. |
minLen |
Minimum length of a junction to consider as a putative intron. |
reads |
Processed reads stored in a |
These functions create the annotation objects that are needed for subsequent functions. Typically these objects are created only once for a set of samples.
If interested in quantifying expression for known transcripts
only, one would typically use procGenome
with a
TxDb
from the usual Bioconductor annotations,
e.g. genDB<-makeTxDbFromUCSC(genome="hg19",tablename="refGene"),
or imported from a gtf file
e.g. genDB<-makeTxDbFromGFF('transcripts.gft',format='gtf').
GRanges
object (e.g. genDB <- import('transcripts.gtf')).
Package GenomicFeatures contains more info about how to create
TxDb
objects.
Alternatively, one can provide annotations as a GRanges
object
whith is returned when importing a gtf file with
function import
(package rtracklayer
).
The output from procGenome
can be used in combination with
wrapKnown
, which quantifies expression for a set of known transcripts,
or wrapDenovo
, which uses Bayesian model selection methods to
assess which transcripts are truly expressed.
When using wrapDenovo
, you should create a single annotatedGenome
object that combines information from all samples
(e.g. from a gtf file produced by running your favorite isoform
prediction software jointly on all samples),
as this increases the power to detect new exons and isoforms.
Object of class annotatedGenome
.
signature(genDB = "transcriptDb")
genDB
is usually obtained with a call to
makeTxDbFromUCSC
(package GenomicFeatures
),
e.g. genDB<-makeTxDbFromUCSC(genome="hg19", tablename="refGene")
signature(genDB = "GRanges")
genDB
stores information about all transcripts and their
respective exons. Chromosome, start, end and strand are stored as
usual in GRanges objects. genDB
must have a column named
"type"
taking the value "transcript"
for rows
corresponding to transcript and "exon"
for rows corresponding
to exons. It must also store transcript and gene ids. For instance, Cufflinks RABT
module creates a gtf file with information formatted in this manner
for known and de novo predicted isoforms.
Camille Stephan-Otto Attolini
See annotatedGenome-class
for a description of the class.
See methods transcripts
to extract exons in each transcript,
getIsland
to obtain the island id corresponding to a given transcript id
See splitGenomeByLength
for splitting an annotatedGenome
according to gene length.
## Known transcripts from Bioconductor annotations ## library(TxDb.Hsapiens.UCSC.hg19.knownGene) ## hg19DB <- procGenome(TxDb.Hsapiens.UCSC.hg19.knownGene, genome='hg19') ## Alternative using makeTxDbFromUCSC ## genDB<-makeTxDbFromUCSC(genome="hg19", tablename="refGene") ## hg19DB <- procGenome(genDB, "hg19") ## Alternative importing .gtf file ## genDB.Cuff <- import('transcripts.gtf') ## hg19DB.Cuff <- procGenome(genDB.Cuff, genome='hg19')
## Known transcripts from Bioconductor annotations ## library(TxDb.Hsapiens.UCSC.hg19.knownGene) ## hg19DB <- procGenome(TxDb.Hsapiens.UCSC.hg19.knownGene, genome='hg19') ## Alternative using makeTxDbFromUCSC ## genDB<-makeTxDbFromUCSC(genome="hg19", tablename="refGene") ## hg19DB <- procGenome(genDB, "hg19") ## Alternative importing .gtf file ## genDB.Cuff <- import('transcripts.gtf') ## hg19DB.Cuff <- procGenome(genDB.Cuff, genome='hg19')
qqnormGenomeWide overlays quantile-quantile normal plots (qqnorm) for a series of genes (rows in the input matrix), to provide an overall assessment of Normality. Similarly, qqgammaGenomeWide overlays quantile-quantile gamma plots.
Note that the theoretical quantiles for z-scores under a Normal are the same for all genes, but the gamma theoretical quantiles depend on the Gamma parameter estimates for each gene and hence the theoretical quantiles are different for each gene (resulting in different x-values in each qq-plot)
qqnormGenomeWide(x, ngenes=min(1000, nrow(x)), ...) qqgammaGenomeWide(x, ngenes=min(1000, nrow(x)), ...)
qqnormGenomeWide(x, ngenes=min(1000, nrow(x)), ...) qqgammaGenomeWide(x, ngenes=min(1000, nrow(x)), ...)
x |
|
ngenes |
A qqnorm plot is produced for the first |
... |
Other arguments to be passed on to codeplot |
Produces a figure overlaying qq-normal or qq-gamma plots for ngenes
comparing observed vs. theoretical quantiles
David Rossell
mu <- rnorm(100) x <- matrix(rnorm(100*5,mu),ncol=5) qqnormGenomeWide(x) qqgammaGenomeWide(exp(x))
mu <- rnorm(100) x <- matrix(rnorm(100*5,mu),ncol=5) qqnormGenomeWide(x) qqgammaGenomeWide(exp(x))
Perform quantile normalization on the columns of a matrix or ExpressionSet
quantileNorm(x)
quantileNorm(x)
x |
|
Returns x
with quantile normalized columns
David Rossell
x <- cbind(rnorm(1000),rnorm(1000,2,4)) boxplot(x) xnorm <- quantileNorm(x) boxplot(xnorm)
x <- cbind(rnorm(1000),rnorm(1000,2,4)) boxplot(x) xnorm <- quantileNorm(x) boxplot(xnorm)
Transforms relative expressions that add up to 1 within each gene island (the default output of casper) to relative expressions that add up to 1 per gene.
relexprByGene(x, normbylength=FALSE, genomeDB)
relexprByGene(x, normbylength=FALSE, genomeDB)
x |
|
normbylength |
If set to |
genomeDB |
If |
ExpressionSet
with relative expressions adding up to one for
each gene_id
.
David Rossell
#See help(calcExp)
#See help(calcExp)
In paired-end experiments short inserts (i.e. the 2 ends being very close to each other),
may indicate RNA degradation or that a short RNA (e.g. miRNA) is being
sequenced.
Typically the goal is not to study alternative splicing for such
short/degraded RNA; in this case it is recommendable to remove such short
inserts to avoid biasing the insert size distribution.
Requiring a minimum insert size can also result in significantly
faster computations when quantifying alternative splicing via calc
or calcDenovo
.
rmShortInserts(bam, isizeMin=100)
rmShortInserts(bam, isizeMin=100)
bam |
Object with aligned reads, as returned by |
isizeMin |
Reads with insert size smaller than |
Named list, in the same format as that returned by scanBam
.
The insert size is stored in objects imported with scanBam
in the element named isize
.
David Rossell
##---- Should be DIRECTLY executable !! ---- ##-- ==> Define data, use random, ##-- or do help(data=index) for the standard data sets.
##---- Should be DIRECTLY executable !! ---- ##-- ==> Define data, use random, ##-- or do help(data=index) for the standard data sets.
Simulate several future RNA-seq data under various experimental settings (sequencing depth, read length, insert sizes), estimate isoform expression and assess the MAE incurred in the estimation process. The function is a wrapper combining functions simReads and calcExp.
simMAE(nsim, islandid, nreads, readLength, fragLength, burnin=1000, pc, distr, readLength.pilot=readLength, eset.pilot, usePilot=FALSE, retTxsError=FALSE, genomeDB, mc.cores=1, mc.cores.int=1, verbose=FALSE, writeBam=FALSE, bamFile=NULL)
simMAE(nsim, islandid, nreads, readLength, fragLength, burnin=1000, pc, distr, readLength.pilot=readLength, eset.pilot, usePilot=FALSE, retTxsError=FALSE, genomeDB, mc.cores=1, mc.cores.int=1, verbose=FALSE, writeBam=FALSE, bamFile=NULL)
nsim |
Number of RNA-seq datasets to generate (often as little as
|
islandid |
When specified this argument indicates to run the
simulations only for gene islands with identifiers in
|
nreads |
Vector indicating the target number of read pairs for each
experimental setting. The actual number of reads differs from
|
readLength |
Vector indicating the read length in each experimental setting |
fragLength |
Vector indicating the mean insert size in each experimental setting |
burnin |
Number of MCMC burn-in samples (passed on to |
pc |
Observed path counts in pilot data. When not specified, these
are simulated from |
distr |
Estimated read start and insert size distributions in pilot data |
readLength.pilot |
Read length in pilot data |
eset.pilot |
ExpressionSet with pilot data expression in
log2-RPKM, used to simulate |
usePilot |
By default |
retTxsError |
If |
genomeDB |
|
mc.cores |
Number of cores to use in the expression estimation step, passed on to |
mc.cores.int |
Number of cores to simulate RNA-seq datasets in parallel |
verbose |
Set |
writeBam |
Set to |
bamFile |
Name of the .bam file |
simMAE
simulates nsim
datasets under each experimental
setting defined by nreads
, readLength
,
fragLength
. For each dataset the following steps are performed:
1. The number of reads is nreads * readYield * pmapped, where readYield= runif(1,0.8,1.2) accounts for deviations in read yield and pmapped= runif(1,0.6,0.9)*pmappable is the proportion of mapped reads (60%-90% of the mappable reads according to the piecewise-linear power law of Li et al (2014))
2. True expression levels pi
are generated from their posterior
distribution given the pilot data.
3. Conditional on pi
, RNA-seq data are generated and expression
estimates pihat
are obtained using calcExp
4. The mean absolute estimation error sum(abs(pihat-pi))
across
all isoforms is computed
Ideally simMAE
should use pilot data from a relevant related experiment
to simulate what future data may look like for the current experiment of
interest.
The recommened way to do this is to download a .bam file from such a
related experiment and processing it in casper with function
wrapKnown
, as then both gene and isoform expression can be
estimated accurately.
The object output by wrapKnown
is a list with elements
named 'pc'
, 'distr'
which can be given as input to
simMAE
.
As an alternative to specifying pc
,
simMAE
allows setting eset.pilot
as
pilot data. Gene and isoform expression are then simulated as follows:
1. The number of reads per gene is generated from a Multinomial
distribution with success probabilities proportional to
2^exprs{eset.pilot}
.
2. Relative isoform expression within each gene are generated from a symmetric Dirichlet distribution with parameter 1/Ig, where Ig is the number of isoforms in gene g.
We emphasize that relative isoform expressions are not trained from the
pilot data, and that while the distribution of gene expression levels
resembles that in eset.pilot
, no attempt is made to match gene
identifiers and hence the results for individual genes should not be
trusted
(hence this option is only available when retTxsError==FALSE
.
If retTxsError==TRUE
, simMAE
returns posterior expected MAE
for each individual isoform.
Else the output is a data.frame
with overall MAE across all isoforms
Stephan-Otto Attolini C., Pena V., Rossell D. Bayesian designs for personalized alternative splicing RNA-seq studies (2014)
Li, W. and Freudenberg, J. and Miramontes, P. Diminishing return for increased Mappability with longer sequencing reads: implications of the k-mer distributions in the human genome. BMC Bioinformatics, 15, 2 (2014)
wrapKnown,simReads,calcExp
## maybe str(simMAE) ; plot(simMAE) ...
## maybe str(simMAE) ; plot(simMAE) ...
Simulates RNA-seq data under the same experimental setting as in the observed data, and compares the observed vector of number of reads per gene with the simulations.
simMAEcheck(nsim, islandid, burnin=1000, pc, distr, readLength.pilot, eset.pilot, usePilot=FALSE, retTxsError=FALSE, genomeDB, mc.cores=1, mc.cores.int=1, verbose=FALSE)
simMAEcheck(nsim, islandid, burnin=1000, pc, distr, readLength.pilot, eset.pilot, usePilot=FALSE, retTxsError=FALSE, genomeDB, mc.cores=1, mc.cores.int=1, verbose=FALSE)
nsim |
Number of RNA-seq datasets to generate (often as little as
|
islandid |
When specified this argument indicates to run the
simulations only for gene islands with identifiers in
|
burnin |
Number of MCMC burn-in samples (passed on to |
pc |
Observed path counts in pilot data. When not specified, these
are simulated from |
distr |
Estimated read start and insert size distributions in pilot data |
readLength.pilot |
Read length in pilot data |
eset.pilot |
ExpressionSet with pilot data expression in
log2-RPKM, used to simulate |
usePilot |
By default |
retTxsError |
If |
genomeDB |
|
mc.cores |
Number of cores to use in the expression estimation step, passed on to |
mc.cores.int |
Number of cores to simulate RNA-seq datasets in parallel |
verbose |
Set |
simMAEcheck
simulates nsim
datasets under the same experimental setting
as in the observed data. For more details, please check the documentation for
simMAE
, which is the basis of this function.
The output is a list with 2 entries. The first entry is a data.frame
with overall MAE across all isoforms in the simulations (see simMAE
for details).
The second entry contains the expected number of genes for which the number of
reads in the data lies in the range of the posterior predictive simulations (under the hypothesis that they have the same
distribution) and the actual number of genes for which the condition is satisfied.
Stephan-Otto Attolini C., Pena V., Rossell D. Bayesian designs for personalized alternative splicing RNA-seq studies (2014)
Li, W. and Freudenberg, J. and Miramontes, P. Diminishing return for increased Mappability with longer sequencing reads: implications of the k-mer distributions in the human genome. BMC Bioinformatics, 15, 2 (2014)
wrapKnown,simReads,calcExp
#Run casperDesign() to see full manual with examples
#Run casperDesign() to see full manual with examples
Simulate true expression levels and observed data (casper expression estimates) for future samples within each group.
These simulations serve as the basis for sample size calculation:
if one were to sequence nsamples
new RNA-seq samples, what data
would we expect to see? The simulation is posterior predictive,
i.e. based on the current available data x
.
simMultSamples(nsim, nsamples, nreads, readLength, fragLength, x, groups='group', distrs, genomeDB, model='LNNMV', verbose=TRUE, mc.cores=1)
simMultSamples(nsim, nsamples, nreads, readLength, fragLength, x, groups='group', distrs, genomeDB, model='LNNMV', verbose=TRUE, mc.cores=1)
nsim |
Number of simulations to obtain |
nsamples |
Vector indicating number of future samples per group,
e.g. |
nreads |
Desired number of paired-end reads per sample. The actual number of aligned reads for any given sample differs from this amount, see details. |
readLength |
Read length, i.e. in an experiment with paired reads
at 100bp each, |
fragLength |
Desired average insert size (size of RNA molecules
after fragmentation). If missing, insert sizes are as obtained from |
x |
|
groups |
Name of column in |
distrs |
Fragment start and length distributions. It can be either an object or a list of objects of class readDistrs. In the latter case, an element is chosen at random for each individual sample to consider uncertainty in these distributions. If not specified, it defaults to data(distrsGSE37704). |
genomeDB |
annotatedGenome object |
model |
Set to |
verbose |
Set to |
mc.cores |
Number of cores to use in function.
|
The posterior predictive simulations is based on four steps: (1) simulate true expression for each group (mean and SD), (2) simulate true expression for future samples, (3) simulate paired reads for each future sample, (4) estimate expression from the reads via Casper. Below are some more details.
1. Simulate true mean expression in each group and residual variance
for each gene. If model=='LNNMV' this is based on the log-normal
normal with modified variance model in package EBarrays
(Yuan & Kendziorski 2006), if model=='GaGa' this is based on the GaGa
model (Rossell, 2009).
adapted to take into account that the expression estimates in the
pilot data x
are noisy (which is why simMultSamples
requires the
SE / posterior SD associated to exprs(x)
).
The simulated values are returned in component "simTruth"
of
the simMultSamples
output.
2. Simulate true isoform expression for each of the future samples. These are independent Normal draws with mean and variance generated in step 1. True gene expression is derived from the isoform expressions.
3. Determine the number of reads to be simulated for each gene based on its true expression (generated in step 2) and a Multinomial sampling model. For each sample:
- The number of reads yielded by the experiment is Unif(.8*nreads,1.2*nreads) - A proportion of non-mappable reads is discarded using the power law in Li et al (2014) - Amongst remaining reads, we assume that a proportion Unif(0.6,0.9) were aligned (consistenly with reports from ENCODE project)
The final number of simulated reads is reported in component "simExpr"
of the simMultSamples
output.
4. Obtain expression estimates from the path counts produced in step 3
via calcExp
. These are reported in component "simExpr"
of the simMultSamples
output.
Object of class simulatedSamples
, which extends a list
of length nsim
. See the class documentation for some helpful
methods (e.g. coef, exprs, mergeBatches).
Each element is itself a list containing an individual simulation.
simTruth |
|
simExpr |
|
Victor Pena, David Rossell
Rossell D. (2009) GaGa: a Parsimonious and Flexible Model for Differential Expression Analysis. Annals of Applied Statistics, 3, 1035-1051.
Stephan-Otto Attolini C., Pena V., Rossell D. Bayesian designs for personalized alternative splicing RNA-seq studies (2015)
Yuan, M. and Kendziorski, C. (2006). A unified approach for simultaneous gene clustering and differential expression identification. Biometrics, 62, 1089-1098.
#Run casperDesign() to see full manual with examples
#Run casperDesign() to see full manual with examples
This function generates path counts and bam files with simulated paired end reads according to given read start distribution, fragment length distribution and gene and variant expressions.
simReads(islandid, nSimReads, pis, rl, seed, writeBam, distrs, genomeDB, repSims=FALSE, bamFile=NULL, stranded=FALSE, verbose=TRUE, chr=NULL, mc.cores=1)
simReads(islandid, nSimReads, pis, rl, seed, writeBam, distrs, genomeDB, repSims=FALSE, bamFile=NULL, stranded=FALSE, verbose=TRUE, chr=NULL, mc.cores=1)
islandid |
Island ID's from the genomeDB object to simulate reads |
nSimReads |
Named numeric vector with number of fragments to simulate in each island. |
pis |
Named numeric vector with relative expression of transcripts. Expressions add up to one for each island to simulate. |
rl |
Read length |
seed |
Seed of the random numbers generator |
writeBam |
Set to 1 to generate bam files with the simulated reads |
distrs |
Object of class 'readDistrs' with read start and fragment length distributions |
genomeDB |
Object of class 'annotatedGenome' with the genome to genererate reads from |
repSims |
Set to TRUE to return relative read starts and fragment lengths from the simulation |
bamFile |
Name of the bam file to write reads to. Must end with '.bam' |
stranded |
Set to TRUE to preserve gene strand when generating reads. The 'XS' tag will be added to reads in the bam file and the returned 'pc' object will be stranded |
verbose |
Set to |
chr |
Characters vector with chromosomes to simulate. Defaults to whole genome simulations. |
mc.cores |
Number of cores to use in function |
Nsim |
Numerical vector with the number of reads simulated for each island. |
pc |
Object of class 'pathCounts' with simulated path counts |
sims |
Only if 'repSims' is set to TRUE. List with vectors of length 'n' with the following elements: -'varl': Length of variant for corresponding read -'st' Start of fragment relative to variant start (not in genomic coordinates) -len:Fragment length -'strand':Strand of gene for simulated read |
Camille Stephan-Otto Attolini
data(hg19DB) data(K562.r1l1) distrs <- getDistrs(hg19DB,bam=K562.r1l1,readLength=75) islandid <- c('10319','463') txs <- unlist(lapply(hg19DB@transcripts[islandid], names)) pis <- vector(mode='numeric', length=length(txs)) npis <- sapply(hg19DB@transcripts[islandid],length) pis[1:npis[1]] <- rep(1/npis[1],npis[1]) pis[-1:-npis[1]] <- rep(1/npis[2],npis[2]) names(pis) <- txs nSimReads <- c(100, 100) names(nSimReads) <- islandid simpc <- simReads(islandid=islandid, nSimReads=nSimReads, pis=pis, rl=75, repSims=TRUE, seed=1, writeBam=FALSE, distrs=distrs,genomeDB=hg19DB)
data(hg19DB) data(K562.r1l1) distrs <- getDistrs(hg19DB,bam=K562.r1l1,readLength=75) islandid <- c('10319','463') txs <- unlist(lapply(hg19DB@transcripts[islandid], names)) pis <- vector(mode='numeric', length=length(txs)) npis <- sapply(hg19DB@transcripts[islandid],length) pis[1:npis[1]] <- rep(1/npis[1],npis[1]) pis[-1:-npis[1]] <- rep(1/npis[2],npis[2]) names(pis) <- txs nSimReads <- c(100, 100) names(nSimReads) <- islandid simpc <- simReads(islandid=islandid, nSimReads=nSimReads, pis=pis, rl=75, repSims=TRUE, seed=1, writeBam=FALSE, distrs=distrs,genomeDB=hg19DB)
simulatedSamples
stores multiple simulated isoform expression
datasets. Each dataset contains the (simulation) true mean expression
in each group and residual variance, as well as the estimated
expression in each individual sample.
Objects are returned by simMultSamples
.
The class extends a list directly.
A list, each element containing a different simulated dataset
signature(object = "simulatedSamples")
: Displays
general information about the object.
signature(object = "simulatedSamples")
:
Returns a matrix with difference between group means (simulation
truth) in all simulated datasets
signature(object = "simulatedSamples")
: Returns
a list of ExpressionSets containing the estimated expressions in
each simulation.
signature(x="ExpressionSet",y="simulatedSamples")
:
Combines x
with each element in exprs
in y
, and
returns a list. See help(mergeBatches)
for more details.
x[i]
selects a subset of simulations, x[,j]
a subset of the samples in each simulation
David Rossell
showClass("simulatedSamples")
showClass("simulatedSamples")
splitGenomeByLength
splits an annotatedGenome according to gene
length (bp), which allows estimating the fragment start and length
distribution for each subset separately.
splitGenomeByLength(DB, breaks=c(0,3000,5000,Inf))
splitGenomeByLength(DB, breaks=c(0,3000,5000,Inf))
DB |
Object containing annotated genome. Must be of class
|
breaks |
Breakpoints to define gene subgroups. |
By default groups are <3000bp, 3000-5000bp, >5000bp, which work well for the human genome. Further sub-dividisions may result in unstable estimates of fragment start and length distributions.
List where each component is of class annotatedGenome
.
David Rossell
procGenome
and createDenovoGenome
for creating
annotatedGenome
objects.
getDistrs
for estimating fragment start and length distribution.
##Not run ## genDB<-makeTranscriptDbFromUCSC(genome="hg19", tablename="refGene") ## hg19DB <- procGenome(genDB, "hg19") ## hg19split <- splitGenomeByLength(hg19DB)
##Not run ## genDB<-makeTranscriptDbFromUCSC(genome="hg19", tablename="refGene") ## hg19DB <- procGenome(genDB, "hg19") ## hg19split <- splitGenomeByLength(hg19DB)
~~ Methods for function subsetGenome
in package casper ~~
Subset an annotatedGenome object by islands or chromosomes.
subsetGenome(islands, chr, genomeDB)
subsetGenome(islands, chr, genomeDB)
islands |
Vector of characters with the island IDs to retrieve from genome. |
chr |
Vector of characters with the names of chromosomes to retrieve from genome. |
genomeDB |
annotatedGenome object with genome to subset. |
signature(islands = "character", chr = "missing", genomeDB = "annotatedGenome")
Subset annotatedGenome object by a set of island IDs.
signature(islands = "missing", chr = "character", genomeDB = "annotatedGenome")
Subset annotatedGenome object by chromosomes.
annotatedGenome
object,
either for all transcripts or only those corresponding
to a given island or transcript.
annotatedGenome objects store information regarding genes and transcripts. When there's an overlap in exons between several genes, these genes are grouped into gene islands.
transcripts
retrieves all stored transcripts for a given
transcript or island.
matchTranscripts
finds transcripts in queryDB matching a
transcript in subjectDB. The best match for each transcript
in subjectDB is returned, unless difference in bp is >maxbp
transcripts(genomeDB, txid, islandid) matchTranscripts(queryDB, subjectDB, maxbp=10)
transcripts(genomeDB, txid, islandid) matchTranscripts(queryDB, subjectDB, maxbp=10)
genomeDB |
Object of class |
txid |
Character indicating transcript identifier (optional) |
islandid |
Character indicating island identifier (optional) |
queryDB |
|
subjectDB |
|
maxbp |
Maximum difference in bp for transcripts to be matched |
IRangesList
where each element in the list corresponds to a
different transcript.
signature(genomeDB = "annotatedGenome", txid="missing", islandid="missing")
Return exons for all transcripts in genomeDB
signature(genomeDB = "annotatedGenome", txid="character", islandid="missing")
Return exons for transcript txid
signature(genomeDB = "annotatedGenome", txid="missing", islandid="character")
Return exons for all transcripts in island islandid
genePlot
to plot the resulting transcripts
data(hg19DB) txs <- transcripts(txid="NM_005158",genomeDB=hg19DB) txs
data(hg19DB) txs <- transcripts(txid="NM_005158",genomeDB=hg19DB) txs
txLength
in Package casper ~~~~ Methods for function txLength
in package casper ~~
Function to retrieve transcript lengths from annotated genome (class genomeDB).
txLength(islandid, txid, genomeDB)
txLength(islandid, txid, genomeDB)
islandid |
Retrieve length for transcripts in island |
txid |
Retrieve length for |
genomeDB |
Annotated genome of class |
When called for the first time lengths are calculated and stored in
the object genomeDB
. Subsequent calls refer to these computed
values.
Named numeric vector with transcript lengths.
signature(islandid = "character", txid = "missing", genomeDB = "annotatedGenome")
Retrieve lengths from genomeDB for transcripts in islandid
islands.
signature(islandid = "missing", txid = "character", genomeDB = "annotatedGenome")
Retrieve lengths from genomeDB
for txid
transcripts.
signature(islandid = "missing", txid = "missing", genomeDB = "annotatedGenome")
Retrieve or calculate lengths for all transcripts in the annotated
genome genomeDB
.
Function to analyze bam files to generate an ExpressionSet with expression estimates for all samples, read start and fragment length distributions, path counts and optinally processed reads.
wrapDenovo(bamFile, output_wrapKnown, knownGenomeDB, targetGenomeDB, readLength, rpkm=TRUE, keep.multihits=TRUE, searchMethod="submodels", exactMarginal=TRUE, integrateMethod = "plugin", maxExons=40, islandid, chroms=NULL, keep.pbam=FALSE, keepPbamInMemory=FALSE, niter=10^3, priorq=3, priorqGeneExpr=2, mc.cores.int=1, mc.cores=1, verbose=TRUE, seed=1)
wrapDenovo(bamFile, output_wrapKnown, knownGenomeDB, targetGenomeDB, readLength, rpkm=TRUE, keep.multihits=TRUE, searchMethod="submodels", exactMarginal=TRUE, integrateMethod = "plugin", maxExons=40, islandid, chroms=NULL, keep.pbam=FALSE, keepPbamInMemory=FALSE, niter=10^3, priorq=3, priorqGeneExpr=2, mc.cores.int=1, mc.cores=1, verbose=TRUE, seed=1)
bamFile |
Names of bam files with the sample to analyze. These must sorted and indexed, and the index must be in the same directory. |
output_wrapKnown |
Optional argument containing the output of an
earlier call to |
knownGenomeDB |
|
targetGenomeDB |
|
readLength |
Read length in bp, e.g. in a paired-end experiment where
75bp are sequenced on each end one would set |
rpkm |
Set to |
keep.multihits |
Set to |
searchMethod |
Method used to perform the model search.
|
exactMarginal |
Set to |
integrateMethod |
Method to compute integrated likelihoods. The default
( |
maxExons |
Prior probabilities of isoform expression are
estimated for genes with 1 up to |
islandid |
Names of the gene island to be analyzed. If missing all gene islands are analyzed |
chroms |
Names of the chromosomes to be analyzed. If missing all chromosomes are analyzed. |
keep.pbam |
Set to |
keepPbamInMemory |
Set to |
niter |
Number of MCMC iterations in the model search algorithm. |
priorq |
Parameter of the Dirichlet prior for the proportion of
reads coming from each variant. We recommend |
priorqGeneExpr |
Parameter of the Dirichlet prior distribution on overall gene expression. Defaults to 2 to ensure non-zero estimates. |
mc.cores |
Number of cores to use in expression estimation. |
mc.cores.int |
Number of cores to use when loading bam files. Be careful as this is a memory intensive step. |
verbose |
Set to |
seed |
Set seed of random number generator. |
The function executes the functions procBam
, getDistrs
,
pathCounts
calcDenovo
and denovoExpr
and formats the output nicely.
Running wrapDenovo
is much more efficient in cpu
speed and memory usage than running these functions separately.
When rpkm
is false the function returns the estimated
proportion of reads arising from each isoform within a gene island.
See the details in help("wrapKnown")
for more information on this.
denovoGenomeDB |
|
.
exp |
Object of class |
distr |
Object of class |
pbam |
List of objects of class |
Miranda Stobbe, David Rossell
Rossell D, Stephan-Otto Attolini C, Kroiss M, Stocker A. Quantifying Alternative Splicing from Paired-End RNA-sequencing data. Annals of Applied Statistics, 8(1):309-330.
calcDenovo
, wrapKnown
, relexprByGene
## not run ## Known isoforms ## library(TxDb.Hsapiens.UCSC.hg19.knownGene) ## hg19DB <- procGenome(TxDb.Hsapiens.UCSC.hg19.knownGene), genome='hg19') ## gtf with known & de novo predictions ## mygtf <- import('hg19_denovo.gtf') ## hg19denovoDB <- procGenome(mygtf, genome='hg19') ## bamFile="/path_to_bam/sorted.bam" ## ans <- wrapDenovo(bamFile=bamFile, targetGenomeDB=hg19denovoDB, knownGenomeDB=hg19DB, readLength=101) ## Estimated expression via BMA ## head(exprs(ans[['exp']])) ## Posterior probability that each isoform is expressed ## head(fData(ans[['exp']]))
## not run ## Known isoforms ## library(TxDb.Hsapiens.UCSC.hg19.knownGene) ## hg19DB <- procGenome(TxDb.Hsapiens.UCSC.hg19.knownGene), genome='hg19') ## gtf with known & de novo predictions ## mygtf <- import('hg19_denovo.gtf') ## hg19denovoDB <- procGenome(mygtf, genome='hg19') ## bamFile="/path_to_bam/sorted.bam" ## ans <- wrapDenovo(bamFile=bamFile, targetGenomeDB=hg19denovoDB, knownGenomeDB=hg19DB, readLength=101) ## Estimated expression via BMA ## head(exprs(ans[['exp']])) ## Posterior probability that each isoform is expressed ## head(fData(ans[['exp']]))
Function to analyze bam files to generate an ExpressionSet with expression estimates for all samples, read start and fragment length distributions, path counts and optinally processed reads.
wrapKnown(bamFile, verbose=FALSE, seed=1, mc.cores.int=1, mc.cores=1, genomeDB, readLength, rpkm=TRUE, priorq=2, priorqGeneExpr=2, citype='none', niter=10^3, burnin=100, keep.pbam=FALSE, keep.multihits=TRUE, chroms=NULL)
wrapKnown(bamFile, verbose=FALSE, seed=1, mc.cores.int=1, mc.cores=1, genomeDB, readLength, rpkm=TRUE, priorq=2, priorqGeneExpr=2, citype='none', niter=10^3, burnin=100, keep.pbam=FALSE, keep.multihits=TRUE, chroms=NULL)
bamFile |
Names of bam files with the sample to analyze. These must sorted and indexed, and the index must be in the same directory. |
verbose |
Set to |
seed |
Set seed of random number generator. |
mc.cores.int |
Number of cores to use when loading bam files. This is a memory intensive step, therefore number of cores must be chosen according to available RAM memory. |
mc.cores |
Number of cores to use in expression estimation. |
genomeDB |
|
readLength |
Read length in bp, e.g. in a paired-end experiment where
75bp are sequenced on each end one would set |
rpkm |
Set to |
priorq |
Parameter of the prior distribution on the proportion of reads
coming from each variant. The prior is Dirichlet with prior sample
size for each variant equal to priorq.
We recommend |
priorqGeneExpr |
Parameter for prior distribution on overall gene expression. Defaults to 2, which ensures non-zero estimates for all genes |
citype |
Set to |
niter |
Number of Monte Carlo iterations. Only used when |
burnin |
Number of burnin Monte Carlo iterations. Only used when |
keep.pbam |
Set to |
keep.multihits |
Set to |
chroms |
Manually set chromosomes to be processed. By default only main chromosomes are considered (except 'chrM') |
The function executes the functions procBam
, getDistrs
and pathCounts
in parallel for each chromosome, but is much more efficient in cpu
speed and memory usage than running these functions separately.
Data from multiple samples are then combined using mergeExp
.
Note that further normalization (e.g. quantileNorm
)
may be needed preliminary to actual data analysis.
When rpkm
is false the function returns the estimated
proportion of reads arising from each isoform within a gene island.
casper groups two or more genes into a gene island whenever these
genes share an exon (or part of an exon). Because exons are shared,
isoform quantification must be done simultaneously for all those
genes.
That is, the output from wrapKnown
when rpkm
is FALSE
are proportions that add up
to 1 within each island. If you would like to re-normalize these
expressions so that they add up to 1 within each gene, see the help
for function relexprByGene
.
One last remark: casper returns the estimated proportion of reads
generated by each isoform, which is not the same as relative
isoform expressions. Longer isoforms tend to
produce more reads than shorter isoforms. This is easily accounted for
by dividing relative expressions by isoform length, see relexprByGene
.
distr |
Object of class |
pbam |
List of objects of class |
pc |
Object of class |
exp |
Object of class |
Camille Stephan-Otto Attolini, David Rossell
Rossell D, Stephan-Otto Attolini C, Kroiss M, Stocker A. Quantifying Alternative Splicing from Paired-End RNA-sequencing data. Annals of Applied Statistics, 8(1):309-330.
procGenome
, relexprByGene
, quantileNorm
## genDB<-makeTranscriptDbFromUCSC(genome="hg19", tablename="refGene") ## hg19DB <- procGenome(genDB, "hg19") ## bamFile="/path_to_bam/sorted.bam" ## ans <- wrapKnown(bamFile=bamFile, mc.cores.int=4, mc.cores=3, genomeDB=hg19DB, readLength=101) ## names(ans) ## head(exprs(ans\$exp))
## genDB<-makeTranscriptDbFromUCSC(genome="hg19", tablename="refGene") ## hg19DB <- procGenome(genDB, "hg19") ## bamFile="/path_to_bam/sorted.bam" ## ans <- wrapKnown(bamFile=bamFile, mc.cores.int=4, mc.cores=3, genomeDB=hg19DB, readLength=101) ## names(ans) ## head(exprs(ans\$exp))