Title: | Flexible, isoform-level differential expression analysis |
---|---|
Description: | Tools for statistical analysis of assembled transcriptomes, including flexible differential expression analysis, visualization of transcript structures, and matching of assembled transcripts to annotation. |
Authors: | Jack Fu [aut], Alyssa C. Frazee [aut, cre], Leonardo Collado-Torres [aut], Andrew E. Jaffe [aut], Jeffrey T. Leek [aut, ths] |
Maintainer: | Jack Fu <[email protected]> |
License: | Artistic-2.0 |
Version: | 2.39.0 |
Built: | 2024-10-30 03:36:02 UTC |
Source: | https://github.com/bioc/ballgown |
Super awesome transcript-level expression analysis
match assembled transcripts to annotated transcripts
annotate_assembly(assembled, annotated)
annotate_assembly(assembled, annotated)
assembled |
|
annotated |
|
If gown
is a ballgown
object, assembled
can be
structure(gown)$trans
(or any subset). You can generate a
GRangesList
object
containing annotated transcripts from a gtf file using the
gffReadGR
function and
setting splitByTranscripts=TRUE
.
data frame, where each row contains assembledInd
and
annotatedInd
(indexes of overlapping transcripts in assembled
and annotated
), and the percent overlap between the two transcripts.
Alyssa Frazee
data(bg) gtfPath = system.file('extdata', 'annot.gtf.gz', package='ballgown') annot = gffReadGR(gtfPath, splitByTranscript=TRUE) info = annotate_assembly(assembled=structure(bg)$trans, annotated=annot)
data(bg) gtfPath = system.file('extdata', 'annot.gtf.gz', package='ballgown') annot = gffReadGR(gtfPath, splitByTranscript=TRUE) info = annotate_assembly(assembled=structure(bg)$trans, annotated=annot)
S4 class for storing and manipulating expression data from assembled transcriptomes
expr
tables containing expression data for genomic features (introns, exons, transcripts)
structure
genomic locations of features and their relationships to one another
indexes
tables connecting components of the assembly and providing other experimental information (e.g., phenotype data and locations of read alignment files)
dirs
directories holding data created by tablemaker
mergedDate
date the ballgown object was created
meas
which expression measurement(s) the object contains in its data slot. Vector of one or more of "rcount", "ucount", "mrcount", "cov", "cov_sd", "mcov", "mcov_sd", or "FPKM", if Tablemaker output is used, or one of "TPM" or "FPKM" if RSEM output is used. Can also be "all" for all measurements. See vignette for details.
RSEM
TRUE if object was made from RSEM output, FALSE if object was made from Tablemaker/Cufflinks output.
Alyssa Frazee, Leonardo Collado-Torres, Jeff Leek
data(bg) class(bg) #"ballgown" dim(bg@expr$exon) bg@structure$exon head(bg@indexes$t2g) head(bg@dirs) bg@mergedDate bg@meas bg@RSEM
data(bg) class(bg) #"ballgown" dim(bg@expr$exon) bg@structure$exon head(bg@indexes$t2g) head(bg@dirs) bg@mergedDate bg@meas bg@RSEM
constructor function for ballgown objects
ballgown( samples = NULL, dataDir = NULL, samplePattern = NULL, bamfiles = NULL, pData = NULL, verbose = TRUE, meas = "all" )
ballgown( samples = NULL, dataDir = NULL, samplePattern = NULL, bamfiles = NULL, pData = NULL, verbose = TRUE, meas = "all" )
samples |
vector of file paths to folders containing sample-specific
ballgown data (generated by |
dataDir |
file path to top-level directory containing sample-specific
folders with ballgown data in them. Only used if |
samplePattern |
regular expression identifying the subdirectories of\
|
bamfiles |
optional vector of file paths to read alignment files for
each sample. If provided, make sure to sort properly (e.g., in the same
order as |
pData |
optional |
verbose |
if |
meas |
character vector containing either "all" or one or more of: "rcount", "ucount", "mrcount", "cov", "cov_sd", "mcov", "mcov_sd", or "FPKM". The resulting ballgown object will only contain the specified expression measurements, for the appropriate features. See vignette for which expression measurements are available for which features. "all" creates the full object. |
Because experimental data is recorded so variably, it is the user's
responsibility to format pData
correctly. In particular, it's
really important that the rows of pData
(corresponding to samples)
are ordered the same way as samples
or the
dataDir
/samplePattern
combo. You can run
list.files(path = dataDir, pattern = samplePattern)
to see the sample
order if samples
was not used.
If you are creating a ballgown object for a large experiment, this function may run slowly and use a large amount of RAM. We recommend running this constructor as a batch job and saving the resulting ballgown object as an rda file. The rda file usually has reasonable size on disk, and the object in it shouldn't take up too much RAM when loaded, so the time and memory use in creating the object is a one-time cost.
an object of class ballgown
Leonardo Collado-Torres, Alyssa Frazee
ballgownrsem
, for loading RSEM output into a ballgown
object
bg = ballgown(dataDir=system.file('extdata', package='ballgown'), samplePattern='sample') pData(bg) = data.frame(id=sampleNames(bg), group=rep(c(1,0), each=10))
bg = ballgown(dataDir=system.file('extdata', package='ballgown'), samplePattern='sample') pData(bg) = data.frame(id=sampleNames(bg), group=rep(c(1,0), each=10))
Loads results of rsem-calculate-expression into a ballgown object for easy visualization, processing, and statistical testing
ballgownrsem( dir = "", samples, gtf, UCSC = TRUE, tfield = "transcript_id", attrsep = "; ", bamout = "transcript", pData = NULL, verbose = TRUE, meas = "all", zipped = FALSE )
ballgownrsem( dir = "", samples, gtf, UCSC = TRUE, tfield = "transcript_id", attrsep = "; ", bamout = "transcript", pData = NULL, verbose = TRUE, meas = "all", zipped = FALSE )
dir |
output directory containing RSEM output for all samples (i.e. for each run of rsem-calculate-expression) |
samples |
vector of sample names (i.e., of the |
gtf |
path to GTF file of genes/transcripts used in your RSEM reference.
(where the reference location was denoted by the |
UCSC |
set to TRUE if |
tfield |
What keyword identifies transcripts in the "attributes" field
of |
attrsep |
How are attributes separated in the "attributes" field of
|
bamout |
set to |
pData |
data frame of phenotype data, with rows corresponding to
|
verbose |
If TRUE (as by default), status messages are printed during data loading. |
meas |
character vector containing either "all" or one of "FPKM" or "TPM". The resulting ballgown object will only contain the specified expression measurement for the transcripts. "all" creates the full object. |
zipped |
set to TRUE if all RSEM results files have been gzipped (end) in ".gz"). |
Currently exon- and intron-level measurements are not available for RSEM-generated ballgown objects, but development is ongoing.
a ballgown object with the specified expression measurements and structure specified by GTF.
ballgown
for reading Cufflinks/Tablemaker output
dataDir = system.file('extdata', package='ballgown') gtf = file.path(dataDir, 'hg19_genes_small.gtf.gz') rsemobj = ballgownrsem(dir=dataDir, samples=c('tiny', 'tiny2'), gtf=gtf, bamout='none', zipped=TRUE) rsemobj
dataDir = system.file('extdata', package='ballgown') gtf = file.path(dataDir, 'hg19_genes_small.gtf.gz') rsemobj = ballgownrsem(dir=dataDir, samples=c('tiny', 'tiny2'), gtf=gtf, bamout='none', zipped=TRUE) rsemobj
Small ballgown object created with simulated toy data, for demonstration purposes
a ballgown
object: 100 transcripts, 633 exons, 536 introns
Alyssa Frazee
data(bg) bg # ballgown instance with 100 transcripts and 20 samples
data(bg) bg # ballgown instance with 100 transcripts and 20 samples
plot annotated and assembled transcripts together
checkAssembledTx( assembled, annotated, ind = 1, main = "Assembled and Annotated Transcripts", customCol = NULL )
checkAssembledTx( assembled, annotated, ind = 1, main = "Assembled and Annotated Transcripts", customCol = NULL )
assembled |
a GRangesList object where the GRanges objects in the list represent sets of exons comprising assembled transcripts |
annotated |
a GRangesList object where the GRanges objects in the list represent sets of exons comprising annotated transcripts |
ind |
integer; index of |
main |
optional character string giving the title for the resulting plot. Default: "Assembled and Annotated Transcripts" |
customCol |
optional vector of custom colors for the annotated transcripts. If not the same length as the number of annotated transcripts in the plot, recycling or truncation might occur. |
Plots annotated transcripts on the bottom panel (shaded in gray) and assembled transcripts on the top panel (shaded with diagonal lines).
Alyssa Frazee
gtfPath = system.file('extdata', 'annot.gtf.gz', package='ballgown') annot = gffReadGR(gtfPath, splitByTranscript=TRUE) data(bg) checkAssembledTx(annotated=annot, assembled=structure(bg)$trans, ind=4)
gtfPath = system.file('extdata', 'annot.gtf.gz', package='ballgown') annot = gffReadGR(gtfPath, splitByTranscript=TRUE) data(bg) checkAssembledTx(annotated=annot, assembled=structure(bg)$trans, ind=4)
group a gene's assembled transcripts into clusters
clusterTranscripts(gene, gown, k = NULL, method = c("hclust", "kmeans"))
clusterTranscripts(gene, gown, k = NULL, method = c("hclust", "kmeans"))
gene |
name of gene whose transcripts will be clustered. When using
Cufflinks output, usually of the form |
gown |
ballgown object containing experimental data |
k |
number of clusters to use |
method |
clustering method to use. Must be one of |
list with elements clusters
and pctvar
.
clusters
contains columns "cluster" and "t_id", and denotes which
transcripts belong to which clusters.
pctvar
is only non-NULL when using k-means clustering and is the
percentage of variation explained by these clusters, defined as the ratio
of the between-cluster sum of squares to the total sum of squares.
Alyssa Frazee
hclust
, kmeans
,
plotLatentTranscripts
for visualizing the transcript clusters
data(bg) clusterTranscripts('XLOC_000454', bg, k=2, method='kmeans') # transcripts 1294 and 1301 cluster together, 91% variation explained.
data(bg) clusterTranscripts('XLOC_000454', bg, k=2, method='kmeans') # transcripts 1294 and 1301 cluster together, 91% variation explained.
cluster a gene's transcripts and calculate cluster-level expression
collapseTranscripts( gene, gown, meas = "FPKM", method = c("hclust", "kmeans"), k = NULL )
collapseTranscripts( gene, gown, meas = "FPKM", method = c("hclust", "kmeans"), k = NULL )
gene |
which gene's transcripts should be clustered |
gown |
ballgown object |
meas |
which transcript-level expression measurement to use
( |
method |
which clustering method to use: |
k |
how many clusters to use. |
list with two elements:
tab
, a cluster-by-sample table of expression measurements
(meas
, either cov or FPKM), where the expression measurement for
each cluster is the mean (for 'cov'
) or aggregate (for
'FPKM'
, as in gexpr
) expression measurement for all
the transcripts in that cluster. This table can be used as the
gowntable
argument to stattest
, if differential
expression results for transcript *clusters* are desired.
cl
output from clusterTranscripts
that was run
to produce tab
, for reference. Cluster IDs in the cluster
component correspond to row names of tab
Alyssa Frazee
hclust
, kmeans
,
clusterTranscripts
, plotLatentTranscripts
data(bg) collapseTranscripts(bg, gene='XLOC_000454', meas='FPKM', method='kmeans')
data(bg) collapseTranscripts(bg, gene='XLOC_000454', meas='FPKM', method='kmeans')
determine if one set of GRanges fully contains any of another set of GRanges
contains(transcripts, cds)
contains(transcripts, cds)
transcripts |
|
cds |
|
If gown
is a ballgown
object, transcripts
can
be structure(gown)$trans
(or any subset).
vector with length equal to length(transcripts)
, where each
entry is TRUE
if the corresponding transcript contains a coding
sequence (i.e., is a superset of at least one entry of cds
).
Alyssa Frazee
## pretend this annotation is coding sequence: gtfPath = system.file('extdata', 'annot.gtf.gz', package='ballgown') annot = gffReadGR(gtfPath, splitByTranscript=TRUE) data(bg) results = contains(structure(bg)$trans, annot) # results is a boolean vector sum(results) #61
## pretend this annotation is coding sequence: gtfPath = system.file('extdata', 'annot.gtf.gz', package='ballgown') annot = gffReadGR(gtfPath, splitByTranscript=TRUE) data(bg) results = contains(structure(bg)$trans, annot) # results is a boolean vector sum(results) #61
extract paths to tablemaker output
dirs(x) ## S4 method for signature 'ballgown' dirs(x)
dirs(x) ## S4 method for signature 'ballgown' dirs(x)
x |
a ballgown object |
data(bg) dirs(bg)
data(bg) dirs(bg)
extract exon-level expression measurements from ballgown objects
eexpr(x, meas = "rcount") ## S4 method for signature 'ballgown' eexpr(x, meas = "rcount")
eexpr(x, meas = "rcount") ## S4 method for signature 'ballgown' eexpr(x, meas = "rcount")
x |
a ballgown object |
meas |
type of measurement to extract. Can be "rcount", "ucount", "mrcount", "cov", "mcov", or "all". Default "rcount". |
exon-by-sample matrix containing exon-level expression values
(measured by meas
). If meas
is "all"
, or
x@RSEM
is TRUE, a data frame is returned, containing all
measurements and location information.
data(bg) exon_rcount_matrix = eexpr(bg) exon_ucount_matrix = eexpr(bg, 'ucount') exon_data_frame = eexpr(bg, 'all')
data(bg) exon_rcount_matrix = eexpr(bg) exon_ucount_matrix = eexpr(bg, 'ucount') exon_data_frame = eexpr(bg, 'all')
extract expression components from ballgown objects
expr(x) ## S4 method for signature 'ballgown' expr(x)
expr(x) ## S4 method for signature 'ballgown' expr(x)
x |
a ballgown object |
list containing elements intron
, exon
, and
trans
, which are feature-by-sample data frames of expression data.
data(bg) names(expr(bg)) class(expr(bg)) dim(expr(bg)$exon)
data(bg) names(expr(bg)) class(expr(bg)) dim(expr(bg)$exon)
Replacement method for expr slot in ballgown objects
expr(x) <- value ## S4 replacement method for signature 'ballgown' expr(x) <- value
expr(x) <- value ## S4 replacement method for signature 'ballgown' expr(x) <- value
x |
a ballgown object |
value |
the updated value for |
data(bg) n = ncol(bg@expr$trans) #multiply all transcript expression measurements by 10: bg@expr$trans[,11:n] = 10*bg@expr$trans[11:n]
data(bg) n = ncol(bg@expr$trans) #multiply all transcript expression measurements by 10: bg@expr$trans[,11:n] = 10*bg@expr$trans[11:n]
Create a new ballgown object containing only transcripts passing a mean expression filter
exprfilter(gown, cutoff, meas = "FPKM")
exprfilter(gown, cutoff, meas = "FPKM")
gown |
a ballgown object |
cutoff |
transcripts must have mean expression across samples above this value to be included in the return |
meas |
how should transcript expression be measured? Default FPKM, but
can also be |
A new ballgown object derived from gown
, but only containing
transcripts (and associated exons/introns) with mean meas
greater
than cutoff
across all samples.
data(bg) # make a ballgown object containing only transcripts with mean FPKM > 100: over100 = exprfilter(bg, cutoff=100)
data(bg) # make a ballgown object containing only transcripts with mean FPKM > 100: over100 = exprfilter(bg, cutoff=100)
get gene IDs from a ballgown object
geneIDs(x) ## S4 method for signature 'ballgown' geneIDs(x)
geneIDs(x) ## S4 method for signature 'ballgown' geneIDs(x)
x |
a ballgown object |
This vector differs from that produced by geneNames in that geneIDs produces names of loci created during the assembly process, not necessarily annotated genes.
named vector of gene IDs included in the ballgown object. If object was created using Tablemaker, these gene IDs will be of the form "XLOC_*". Vector is named and ordered by corresponding numeric transcript ID.
data(bg) geneIDs(bg)
data(bg) geneIDs(bg)
get gene names from a ballgown object
geneNames(x) ## S4 method for signature 'ballgown' geneNames(x)
geneNames(x) ## S4 method for signature 'ballgown' geneNames(x)
x |
a ballgown object |
This vector differs from that produced by geneIDs in that geneNames produces
*annotated* gene names that correspond to assembled transcripts. The
return will be empty/blank/NA if the transcriptome assembly is de novo
(i.e., was not compared to an annotation before the ballgown object was
created). See getGenes
for matching transcripts to gene names.
Some entries of this vector will be empty/blank/NA if the corresponding
transcript did not overlap any annotated genes.
named vector of gene names included in the ballgown object, named and ordered by corresponding numeric transcript ID.
data(bg) # this is a de novo assembly, so it does not contain gene info as it stands # but we can add it: annot = system.file('extdata', 'annot.gtf.gz', package='ballgown') gnames = getGenes(annot, structure(bg)$trans, UCSC=FALSE) gnames_first = lapply(gnames, function(x) x[1]) #just take 1 overlapping gene expr(bg)$trans$gene_name = gnames_first # now we can extract these gene names: geneNames(bg)
data(bg) # this is a de novo assembly, so it does not contain gene info as it stands # but we can add it: annot = system.file('extdata', 'annot.gtf.gz', package='ballgown') gnames = getGenes(annot, structure(bg)$trans, UCSC=FALSE) gnames_first = lapply(gnames, function(x) x[1]) #just take 1 overlapping gene expr(bg)$trans$gene_name = gnames_first # now we can extract these gene names: geneNames(bg)
extract a specific field of the "attributes" column of a data frame created from a GTF/GFF file
getAttributeField(x, field, attrsep = "; ")
getAttributeField(x, field, attrsep = "; ")
x |
vector representing the "attributes" column of GTF/GFF file |
field |
name of the field you want to extract from the "attributes" column |
attrsep |
separator for the fields in the attributes column. Defaults to '; ', the separator for GTF files outputted by Cufflinks. |
vector of nucleotide positions included in the transcript
Wolfgang Huber, in the davidTiling
R package (LGPL license)
gffRead
for creating a data frame from a GTF/GFF file,
and http://useast.ensembl.org/info/website/upload/gff.html for
specifics of the GFF/GTF file format.
gtfPath = system.file('extdata', 'annot.gtf.gz', package='ballgown') gffdata = gffRead(gtfPath) gffdata$transcriptID = getAttributeField(gffdata$attributes, field = "transcript_id")
gtfPath = system.file('extdata', 'annot.gtf.gz', package='ballgown') gffdata = gffRead(gtfPath) gffdata$transcriptID = getAttributeField(gffdata$attributes, field = "transcript_id")
label assembled transcripts with gene names
getGenes(gtf, assembled, UCSC = TRUE, attribute = "gene_id")
getGenes(gtf, assembled, UCSC = TRUE, attribute = "gene_id")
gtf |
path to a GTF file containing locations of annotated transcripts |
assembled |
GRangesList object, with each set of ranges representing exons of an assembled transcript. |
UCSC |
set to |
attribute |
set to attribute name in |
chromosome labels in gtf
and assembled
should match.
(i.e., you should provide the path to a gtf corrsponding to the same
annotation you used when constructing assembled
)
an IRanges CharacterList
of the same length as
assembled
, providing the name(s) of the gene(s) that overlaps each
transcript in assembled
.
Alyssa Frazee, Andrew Jaffe
data(bg) gtfPath = system.file('extdata', 'annot.gtf.gz', package='ballgown') geneoverlaps = getGenes(gtfPath, structure(bg)$trans, UCSC=FALSE)
data(bg) gtfPath = system.file('extdata', 'annot.gtf.gz', package='ballgown') geneoverlaps = getGenes(gtfPath, structure(bg)$trans, UCSC=FALSE)
For objects created with Cufflinks/Tablemaker, gene-level measurements are calculated by appropriately combining FPKMs from the transcripts comprising the gene. For objects created with RSEM, gene-level measurements are extracted directly from the RSEM output.
gexpr(x) ## S4 method for signature 'ballgown' gexpr(x)
gexpr(x) ## S4 method for signature 'ballgown' gexpr(x)
x |
a ballgown object |
gene-by-sample matrix containing per-sample gene measurements.
data(bg) gene_matrix = gexpr(bg)
data(bg) gene_matrix = gexpr(bg)
read in GTF/GFF file as a data frame
gffRead(gffFile, nrows = -1, verbose = FALSE)
gffRead(gffFile, nrows = -1, verbose = FALSE)
gffFile |
name of GTF/GFF on disk |
nrows |
number of rows to read in (default -1, which means read all rows) |
verbose |
if TRUE, print status info at beginning and end of file read. Default FALSE. |
data frame representing the GTF/GFF file
Kasper Hansen
getAttributeField
to extract data from "attributes"
column; http://useast.ensembl.org/info/website/upload/gff.html for
more information on the GTF/GFF file format.
gtfPath = system.file('extdata', 'annot.gtf.gz', package='ballgown') annot = gffRead(gtfPath)
gtfPath = system.file('extdata', 'annot.gtf.gz', package='ballgown') annot = gffRead(gtfPath)
(very) light wrapper for rtracklayer::import
gffReadGR( gtf, splitByTranscript = FALSE, identifier = "transcript_id", sep = "; " )
gffReadGR( gtf, splitByTranscript = FALSE, identifier = "transcript_id", sep = "; " )
gtf |
name of GTF/GFF file on disk |
splitByTranscript |
if |
identifier |
name of transcript identifier column of |
sep |
field separator in the |
if splitByTranscript
is FALSE
, an object of class
GRanges
representing the genomic features in gtf
. If
splitByTranscript
is TRUE, an object of class GRangesList
,
where each element is a GRanges
object corresponding to an
annotated transcript (designated in names
).
Alyssa Frazee
gffRead
for reading in a GTF file as a data frame
rather than a GRanges
/GRangesList
object.
gtfPath = system.file('extdata', 'annot.gtf.gz', package='ballgown') # read in exons as GRanges: annotgr = gffReadGR(gtfPath) # read in groups of exons as transcripts, in GRangesList: transcripts_grl = gffReadGR(gtfPath, splitByTranscript=TRUE)
gtfPath = system.file('extdata', 'annot.gtf.gz', package='ballgown') # read in exons as GRanges: annotgr = gffReadGR(gtfPath) # read in groups of exons as transcripts, in GRangesList: transcripts_grl = gffReadGR(gtfPath, splitByTranscript=TRUE)
extract transcript-level expression measurements from ballgown objects
iexpr(x, meas = "rcount") ## S4 method for signature 'ballgown' iexpr(x, meas = "rcount")
iexpr(x, meas = "rcount") ## S4 method for signature 'ballgown' iexpr(x, meas = "rcount")
x |
a ballgown object |
meas |
type of measurement to extract. Can be "rcount", "ucount", "mrcount", or "all". Default "rcount". |
intron-by-sample matrix containing the number of reads (measured as
specified by meas
) supporting each intron, in each sample. If
meas
is "all"
, a data frame is returned, containing all
measurements and location information.
data(bg) intron_rcount_matrix = iexpr(bg) intron_data_frame = iexpr(bg, 'all')
data(bg) intron_rcount_matrix = iexpr(bg) intron_data_frame = iexpr(bg, 'all')
extract the indexes from ballgown objects
indexes(x) ## S4 method for signature 'ballgown' indexes(x)
indexes(x) ## S4 method for signature 'ballgown' indexes(x)
x |
a ballgown object |
list containing elements e2t
, i2t
, t2g
,
bamfiles
, and pData
, where e2t
and i2t
are data
frames linking exons and introns (respectively) to transcripts, t2g
is a data frame linking transcripts to genes, and bamfiles
and
pData
are described in ?ballgown
.
data(bg) names(indexes(bg)) class(indexes(bg)) head(indexes(bg)$t2g)
data(bg) names(indexes(bg)) class(indexes(bg)) head(indexes(bg)$t2g)
Replace method for indexes slot in ballgown objects
indexes(x) <- value ## S4 replacement method for signature 'ballgown' indexes(x) <- value
indexes(x) <- value ## S4 replacement method for signature 'ballgown' indexes(x) <- value
x |
a ballgown object |
value |
the updated value for |
data(bg) indexes(bg)$bamfiles = paste0('/path/to/bamfolder/', sampleNames(bg), '_accepted_hits.bam')
data(bg) indexes(bg)$bamfiles = paste0('/path/to/bamfolder/', sampleNames(bg), '_accepted_hits.bam')
get the last element
last(x)
last(x)
x |
anything you can call |
this function is made of several thousand lines of complex code, so be sure to read it carefully.
the last element of x
Alyssa Frazee
last(c('h', 'e', 'l', 'l', 'o'))
last(c('h', 'e', 'l', 'l', 'o'))
extract package version & creation date from ballgown object
mergedDate(x) ## S4 method for signature 'ballgown' mergedDate(x)
mergedDate(x) ## S4 method for signature 'ballgown' mergedDate(x)
x |
a ballgown object |
data(bg) mergedDate(bg)
data(bg) mergedDate(bg)
calculate percent overlap between two GRanges objects
pctOverlap(tx1, tx2)
pctOverlap(tx1, tx2)
tx1 |
GRanges object |
tx2 |
GRanges object |
In the ballgown context, tx1
and tx2
are two
transcripts, each represented by GRanges objects whose ranges represent the
exons comprising the transcripts. The percent overlap is the number of
nucleotides falling within both transcripts divided by the number of
nucleotides falling within either transcript. Useful as a measure of
transcript closeness (as it is essentially Jaccard distance).
percent overlap between tx1
and tx2
, as defined by the
ratio of the intersection of tx1
and tx2
to the union of
tx1
and tx2
.
Alyssa Frazee
data(bg) gtfPath = system.file('extdata', 'annot.gtf.gz', package='ballgown') annot_grl = gffReadGR(gtfPath, splitByTranscript=TRUE) pctOverlap(structure(bg)$trans[[2]], annot_grl[[369]]) #79.9%
data(bg) gtfPath = system.file('extdata', 'annot.gtf.gz', package='ballgown') annot_grl = gffReadGR(gtfPath, splitByTranscript=TRUE) pctOverlap(structure(bg)$trans[[2]], annot_grl[[369]]) #79.9%
extract phenotype data from a ballgown object
pData(object) ## S4 method for signature 'ballgown' pData(object)
pData(object) ## S4 method for signature 'ballgown' pData(object)
object |
a ballgown object |
sample-by-phenotype data frame
data(bg) pData(bg)
data(bg) pData(bg)
Replacement method for pData slot in ballgown objects
pData(object) <- value ## S4 replacement method for signature 'ballgown,ANY' pData(object) <- value
pData(object) <- value ## S4 replacement method for signature 'ballgown,ANY' pData(object) <- value
object |
a ballgown object |
value |
the updated value for |
# add "timepoint" covariate to ballgown object: data(bg) # already contains pData pData(bg) = data.frame(pData(bg), timepoint=rep(1:10, 2)) head(pData(bg))
# add "timepoint" covariate to ballgown object: data(bg) # already contains pData pData(bg) = data.frame(pData(bg), timepoint=rep(1:10, 2)) head(pData(bg))
This is an experimental, first-pass function that clusters assembled transcripts based on their overlap percentage, then plots and colors the transcript clusters.
plotLatentTranscripts( gene, gown, method = c("hclust", "kmeans"), k = NULL, choosek = c("var90", "thumb"), returncluster = TRUE, labelTranscripts = TRUE, ... )
plotLatentTranscripts( gene, gown, method = c("hclust", "kmeans"), k = NULL, choosek = c("var90", "thumb"), returncluster = TRUE, labelTranscripts = TRUE, ... )
gene |
string, name of gene whose transcripts should be clustered (e.g., "XLOC_000001") |
gown |
object of class |
method |
clustering method to use. Currently can choose from
hierarchical clustering ( |
k |
number of transcripts clusters to use. By default, |
choosek |
if |
returncluster |
if TRUE (as it is by default), return the results of the
call to |
labelTranscripts |
if TRUE (as it is by default), print transcript IDs on the y-axis |
... |
other arguments to pass to plotTranscripts |
if returncluster
is TRUE, the transcript clusters are returned
as described in clusterTranscripts
. A plot of the transcript
clusters is also produced, in the style of plotTranscripts
.
Alyssa Frazee
clusterTranscripts
, plotTranscripts
data(bg) plotLatentTranscripts('XLOC_000454', bg, method='kmeans', k=2)
data(bg) plotLatentTranscripts('XLOC_000454', bg, method='kmeans', k=2)
visualize transcript abundance by group
plotMeans( gene, gown, overall = FALSE, groupvar, groupname = "all", meas = c("cov", "FPKM", "rcount", "ucount", "mrcount", "mcov"), colorby = c("transcript", "exon"), legend = TRUE, labelTranscripts = FALSE )
plotMeans( gene, gown, overall = FALSE, groupvar, groupname = "all", meas = c("cov", "FPKM", "rcount", "ucount", "mrcount", "mcov"), colorby = c("transcript", "exon"), legend = TRUE, labelTranscripts = FALSE )
gene |
name of gene whose transcripts will be plotted. When using
Cufflinks/Tablemaker output, usually of the form |
gown |
ballgown object containing experimental and phenotype data |
overall |
if |
groupvar |
string representing the name of the variable denoting which
sample belongs to which group. Can be |
groupname |
string representing which group's expression means you want
to plot. Can be |
meas |
type of expression measurement to plot. One of "cov", "FPKM", "rcount", "ucount", "mrcount", or "mcov". Not all types are valid for all features. (See description of tablemaker output for more information). |
colorby |
one of |
legend |
if |
labelTranscripts |
if |
produces a plot of the transcript structure for the specified gene in the current graphics device, colored by study-wide or group-specific mean expression level.
Alyssa Frazee
data(bg) plotMeans('XLOC_000454', bg, groupvar='group', meas='FPKM', colorby='transcript')
data(bg) plotMeans('XLOC_000454', bg, groupvar='group', meas='FPKM', colorby='transcript')
visualize structure of assembled transcripts
plotTranscripts( gene, gown, samples = NULL, colorby = "transcript", meas = "FPKM", legend = TRUE, labelTranscripts = FALSE, main = NULL, blackBorders = TRUE, log = FALSE, logbase = 2, customCol = NULL, customOrder = NULL )
plotTranscripts( gene, gown, samples = NULL, colorby = "transcript", meas = "FPKM", legend = TRUE, labelTranscripts = FALSE, main = NULL, blackBorders = TRUE, log = FALSE, logbase = 2, customCol = NULL, customOrder = NULL )
gene |
name of gene whose transcripts will be plotted. When using
Cufflinks output, usually of the form |
gown |
ballgown object containing experimental and phenotype data |
samples |
vector of sample(s) to plot. Can be |
colorby |
one of |
meas |
which expression measurement to color features by, if any. Must match an available measurement for whatever feature you're plotting. |
legend |
if |
labelTranscripts |
if |
main |
optional string giving the desired plot title. |
blackBorders |
if |
log |
if |
logbase |
log base to use if |
customCol |
an optional vector of custom colors to color transcripts by. There must be the same number of colors as transcripts in the gene being plotted. |
customOrder |
an optional vector of transcript ids (matching ids in
|
produces a plot of the transcript structure for the specified gene in the current graphics device.
Alyssa Frazee
plotMeans
, plotLatentTranscripts
data(bg) # plot one gene for one sample: plotTranscripts(gene='XLOC_000454', gown=bg, samples='sample12', meas='FPKM', colorby='transcript', main='transcripts from gene XLOC_000454: sample 12, FPKM') # plot one gene for many samples: plotTranscripts('XLOC_000454', bg, samples=c('sample01', 'sample06', 'sample12', 'sample19'), meas='FPKM', colorby='transcript')
data(bg) # plot one gene for one sample: plotTranscripts(gene='XLOC_000454', gown=bg, samples='sample12', meas='FPKM', colorby='transcript', main='transcripts from gene XLOC_000454: sample 12, FPKM') # plot one gene for many samples: plotTranscripts('XLOC_000454', bg, samples=c('sample01', 'sample06', 'sample12', 'sample19'), meas='FPKM', colorby='transcript')
get names of samples in a ballgown objects
sampleNames(object) ## S4 method for signature 'ballgown' sampleNames(object)
sampleNames(object) ## S4 method for signature 'ballgown' sampleNames(object)
object |
a ballgown object |
vector of sample IDs for x
. If pData
exists, samples in
its rows correspond to samples in sampleNames(x)
(in order).
data(bg) sampleNames(bg)
data(bg) sampleNames(bg)
get sequence (chromosome) names from ballgown object
seqnames(x) ## S4 method for signature 'ballgown' seqnames(x)
seqnames(x) ## S4 method for signature 'ballgown' seqnames(x)
x |
a ballgown object |
vector of sequence (i.e., chromosome) names included in the ballgown object
data(bg) seqnames(bg)
data(bg) seqnames(bg)
Test each transcript, gene, exon, or intron in a ballgown object for differential expression, using comparisons of linear models.
stattest( gown = NULL, gowntable = NULL, pData = NULL, mod = NULL, mod0 = NULL, feature = c("gene", "exon", "intron", "transcript"), meas = c("cov", "FPKM", "rcount", "ucount", "mrcount", "mcov"), timecourse = FALSE, covariate = NULL, adjustvars = NULL, gexpr = NULL, df = 4, getFC = FALSE, libadjust = NULL, log = TRUE )
stattest( gown = NULL, gowntable = NULL, pData = NULL, mod = NULL, mod0 = NULL, feature = c("gene", "exon", "intron", "transcript"), meas = c("cov", "FPKM", "rcount", "ucount", "mrcount", "mcov"), timecourse = FALSE, covariate = NULL, adjustvars = NULL, gexpr = NULL, df = 4, getFC = FALSE, libadjust = NULL, log = TRUE )
gown |
name of an object of class |
gowntable |
matrix or matrix-like object with |
pData |
Required if |
mod |
object of class |
mod0 |
object of class |
feature |
the type of genomic feature to be tested for differential
expression. If |
meas |
the expression measurement to use for statistical tests. Must be
one of |
timecourse |
if |
covariate |
string representing the name of the covariate of interest
for the differential expression tests. Must correspond to the name of a
column of |
adjustvars |
optional vector of strings representing the names of
potential confounders. Must correspond to names of columns of
|
gexpr |
optional data frame that is the result of calling
|
df |
degrees of freedom used for modeling expression over time with
natural cubic splines. Default 4. Only used if |
getFC |
if |
libadjust |
library-size adjustment to use in linear models. By default,
the adjustment is defined as the sum of the sample's log expression
measurements below the 75th percentile of those measurements. To use
a different library-size adjustment, provide a numeric vector of each
sample's adjustment value. Entries of this vector correspond to samples in
in rows of |
log |
if |
At minimum, you need to provide a ballgown object or count table, the type of feature you want to test (gene, transcript, exon, or intron), the expression measurement you want to use (FPKM, cov, rcount, etc.), and the covariate of interest, which must be the name of one of the columns of the 'pData' component of your ballgown object (or provided pData). This covariate is automatically converted to a factor during model fitting in non-timecourse experiments.
By default, models are fit using log2(meas + 1)
as the outcome for
each feature. To disable the log transformation, provide 'log = FALSE' as
an argument to 'stattest'. You can use the gowntable
option if you'd
like to to use a different transformation.
Library size adjustment is performed by default by using the sum of the log
nonzero expression measurements for each sample, up to the 75th percentile
of those measurements. This adjustment can be disabled by setting
libadjust=FALSE
. You can use mod
and mod0
to specify
alternative library size adjustments.
mod
and mod0
are optional arguments. If mod
is
specified, you must also specify mod0
. If neither is specified,
mod0
defaults to the design matrix for a model including only a
library-size adjustment, and mod
defaults to the design matrix for a
model including a library-size adjustment and covariate
. Note that
if you supply mod
and mod0
, covariate
,
timecourse
, adjustvars
, and df
are ignored, so make
sure your covariate of interest and all appropriate confounder
adjustments, including library size, are specified in mod
and
mod0
. By default, the library-size adjustment is the sum of all
counts below the 75th percentile of nonzero counts, on the log scale
(log2 + 1).
Full model details are described in the supplement of http://biorxiv.org/content/early/2014/03/30/003665.
data frame containing the columns feature
, id
representing feature id, pval
representing the p-value for testing
whether this feature was differentially expressed according to
covariate
, and qval
, the estimated false discovery rate
using this feature's signal strength as a significance cutoff. An
additional column, fc
, is included if getFC
is TRUE
.
Jeff Leek, Alyssa Frazee
http://biorxiv.org/content/early/2014/03/30/003665
data(bg) # two-group comparison: stat_results = stattest(bg, feature='transcript', meas='FPKM', covariate='group') # timecourse test: pData(bg) = data.frame(pData(bg), time=rep(1:10, 2)) #dummy time covariate timecourse_results = stattest(bg, feature='transcript', meas='FPKM', covariate='time', timecourse=TRUE) # timecourse test, adjusting for group: group_adj_timecourse_results = stattest(bg, feature='transcript', meas='FPKM', covariate='time', timecourse=TRUE, adjustvars='group') # custom model matrices: ### create example data: set.seed(43) sex = sample(c('M','F'), size=nrow(pData(bg)), replace=TRUE) age = sample(21:52, size=nrow(pData(bg)), replace=TRUE) ### create design matrices: mod = model.matrix(~ sex + age + pData(bg)$group + pData(bg)$time) mod0 = model.matrix(~ pData(bg)$group + pData(bg)$time) ### build model: adjusted_results = stattest(bg, feature='transcript', meas='FPKM', mod0=mod0, mod=mod)
data(bg) # two-group comparison: stat_results = stattest(bg, feature='transcript', meas='FPKM', covariate='group') # timecourse test: pData(bg) = data.frame(pData(bg), time=rep(1:10, 2)) #dummy time covariate timecourse_results = stattest(bg, feature='transcript', meas='FPKM', covariate='time', timecourse=TRUE) # timecourse test, adjusting for group: group_adj_timecourse_results = stattest(bg, feature='transcript', meas='FPKM', covariate='time', timecourse=TRUE, adjustvars='group') # custom model matrices: ### create example data: set.seed(43) sex = sample(c('M','F'), size=nrow(pData(bg)), replace=TRUE) age = sample(21:52, size=nrow(pData(bg)), replace=TRUE) ### create design matrices: mod = model.matrix(~ sex + age + pData(bg)$group + pData(bg)$time) mod0 = model.matrix(~ pData(bg)$group + pData(bg)$time) ### build model: adjusted_results = stattest(bg, feature='transcript', meas='FPKM', mod0=mod0, mod=mod)
extract structure components from ballgown objects
structure(x) ## S4 method for signature 'ballgown' structure(x)
structure(x) ## S4 method for signature 'ballgown' structure(x)
x |
a ballgown object |
list containing elements intron
, exon
, and
trans
. exon
and intron
are GRanges
objects,
where each range is an exon or intron, and trans
is a
GRangesList
object, where each GRanges
element is a set of
exons representing a transcript.
data(bg) names(structure(bg)) class(structure(bg)) structure(bg)$exon
data(bg) names(structure(bg)) class(structure(bg)) structure(bg)$exon
subset ballgown objects to specific samples or genomic locations
subset(x, ...) ## S4 method for signature 'ballgown' subset(x, cond, genomesubset = TRUE)
subset(x, ...) ## S4 method for signature 'ballgown' subset(x, cond, genomesubset = TRUE)
x |
a ballgown object |
... |
further arguments to generic subset |
cond |
Condition on which to subset. See details. |
genomesubset |
if TRUE, subset |
To use subset
, you must provide the cond
argument as a
string representing a logical expression specifying your desired subset.
The subset expression can either involve column names of
texpr(x, "all")
(if genomesubset
is TRUE
) or of
pData(x)
(if genomesubset
is FALSE
). For example, if
you wanted a ballgown object for only chromosome 22, you might call
subset(x, "chr == 'chr22'")
. (Be sure to handle quotes within
character strings appropriately).
a subsetted ballgown object, containing only the regions or samples
satisfying cond
.
Alyssa Frazee
data(bg) bg_twogenes = subset(bg, "gene_id=='XLOC_000454' | gene_id=='XLOC_000024'") bg_twogenes # ballgown instance with 4 assembled transcripts and 20 samples bg_group0 = subset(bg, "group == 0", genomesubset=FALSE) bg_group0 # ballgown instance with 100 assembled transcripts and 10 samples
data(bg) bg_twogenes = subset(bg, "gene_id=='XLOC_000454' | gene_id=='XLOC_000024'") bg_twogenes # ballgown instance with 4 assembled transcripts and 20 samples bg_group0 = subset(bg, "group == 0", genomesubset=FALSE) bg_group0 # ballgown instance with 100 assembled transcripts and 10 samples
extract transcript-level expression measurements from ballgown objects
texpr(x, meas = "FPKM") ## S4 method for signature 'ballgown' texpr(x, meas = "FPKM")
texpr(x, meas = "FPKM") ## S4 method for signature 'ballgown' texpr(x, meas = "FPKM")
x |
a ballgown object |
meas |
type of measurement to extract. Can be "cov", "FPKM", or "all". Default "FPKM". |
transcript-by-sample matrix containing expression values (measured by
meas
). If meas
is "all"
, a data frame is returned,
containing all measurements and location information.
data(bg) transcript_fpkm_matrix = texpr(bg) transcript_data_frame = texpr(bg, 'all')
data(bg) transcript_fpkm_matrix = texpr(bg) transcript_data_frame = texpr(bg, 'all')
find the gene to which a transcript belongs
tGene(bg, transcript, tid = TRUE, gid = TRUE, warnme = TRUE)
tGene(bg, transcript, tid = TRUE, gid = TRUE, warnme = TRUE)
bg |
ballgown object |
transcript |
transcript identifier |
tid |
set to |
gid |
if |
warnme |
if |
data(bg) tGene(bg, 10) tGene(bg, 'TCONS_00000010', tid=FALSE) tGene(bg, 10, gid=FALSE) #empty: no gene names included in bg.
data(bg) tGene(bg, 10) tGene(bg, 'TCONS_00000010', tid=FALSE) tGene(bg, 10, gid=FALSE) #empty: no gene names included in bg.
get numeric transcript IDs from a ballgown object
transcriptIDs(x) ## S4 method for signature 'ballgown' transcriptIDs(x)
transcriptIDs(x) ## S4 method for signature 'ballgown' transcriptIDs(x)
x |
a ballgown object |
vector of numeric transcript IDs included in the ballgown object
data(bg) transcriptIDs(bg)
data(bg) transcriptIDs(bg)
get transcript names from a ballgown object
transcriptNames(x) ## S4 method for signature 'ballgown' transcriptNames(x)
transcriptNames(x) ## S4 method for signature 'ballgown' transcriptNames(x)
x |
a ballgown object |
vector of transcript names included in the ballgown object. If object was created using Cufflinks/Tablemaker, these transcript names will be of the form "TCONS_*". Return vector is named and ordered by corresponding numeric transcript ID.
data(bg) transcriptNames(bg)
data(bg) transcriptNames(bg)
create tablemaker-like files on disk from a ballgown object
writeFiles(gown, dataDir)
writeFiles(gown, dataDir)
gown |
ballgown object |
dataDir |
top-level directory for sample-specific folders |
data(bg) writeFiles(bg, dataDir=getwd())
data(bg) writeFiles(bg, dataDir=getwd())