Title: | Turn Bioconductor objects into tidy data frames |
---|---|
Description: | This package contains methods for converting standard objects constructed by bioinformatics packages, especially those in Bioconductor, to tidy data. It thus serves as a complement to the broom package, and follows the same tidy, augment, glance division of tidying methods. Tidying data makes it easy to recombine, reshape and visualize bioinformatics analyses. |
Authors: | Andrew J. Bass, David G. Robinson, Steve Lianoglou, Emily Nelson, John D. Storey, with contributions from Laurent Gatto |
Maintainer: | John D. Storey <[email protected]> and Andrew J. Bass <[email protected]> |
License: | LGPL |
Version: | 1.39.0 |
Built: | 2024-10-31 05:57:53 UTC |
Source: | https://github.com/bioc/biobroom |
These are methods for turning an sva list, from the sva package, into a tidy data frame. tidy returns a data.frame of the estimated surrogate variables, augment returns a data.frame of the posterior probabilities, and glance returns a data.frame with only the number of surrogate variables.
augment_sva(x, data, ...)
tidy_sva(x, addVar = NULL, ...)
glance_sva(x, ...)
x |
sva list |
data |
Original data |
... |
extra arguments (not used) |
addVar |
add additional coefficients to the estimated surrogate variables |
All tidying methods return a data.frame without rownames. The structure depends on the method chosen.
augment returns one row per gene. It always contains the columns
pprob.gam |
Posterior probability that each gene is affected by heterogeneity |
pprob.b |
Posterior probability that each gene is affected by the model |
tidy returns the estimated surrogate variables.
glance returns the number of surrogate variables.
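As a rough illustration of the three outputs described above, the following base R sketch builds them by hand from a mock list that has the same element names as the list returned by sva (sv, pprob.gam, pprob.b, n.sv). The mock values, the variable names and the surrogate-variable column names are assumptions made for illustration; this is not biobroom's internal code.

```r
# Mock object with the same element names as the list returned by sva::sva().
# The numbers are random placeholders, not real estimates.
set.seed(1)
n_genes <- 5
n_samples <- 4
sva_like <- list(
  sv        = matrix(rnorm(n_samples * 2), nrow = n_samples, ncol = 2),
  pprob.gam = runif(n_genes),
  pprob.b   = runif(n_genes),
  n.sv      = 2
)

# tidy-like: one row per sample, one column per surrogate variable
# (the column names sv1, sv2 are hypothetical)
tidy_df <- as.data.frame(sva_like$sv)
names(tidy_df) <- paste0("sv", seq_len(ncol(tidy_df)))

# augment-like: one row per gene with the two posterior probabilities
augment_df <- data.frame(pprob.gam = sva_like$pprob.gam,
                         pprob.b   = sva_like$pprob.b)

# glance-like: a single row holding the number of surrogate variables
glance_df <- data.frame(n.sv = sva_like$n.sv)
```

The point of the split is that each data frame has a different unit of observation: samples for tidy, genes for augment, and the whole fit for glance.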
This package contains methods for converting standard objects constructed by bioinformatics packages, especially those in Bioconductor, to tidy data. It thus serves as a complement to the broom package, and follows the same tidy, augment, glance division of tidying methods. Tidying data makes it easy to recombine, reshape and visualize bioinformatics analyses.
This reshapes a DESeq2 object (a DESeqDataSet or DESeqResults) into a tidy format. If the dataset contains hypothesis test results (p-values and estimates), the output has one row per gene per possible contrast.
## S3 method for class 'DESeqDataSet'
tidy(x, colData = FALSE, intercept = FALSE, ...)

## S3 method for class 'DESeqResults'
tidy(x, ...)
x |
DESeqDataSet object |
colData |
whether colData should be included in the tidied output for those in the DESeqDataSet object. If dataset includes hypothesis test results, this is ignored |
intercept |
whether to include hypothesis test results from the (Intercept) term. If dataset does not include hypothesis testing, this is ignored |
... |
extra arguments (not used) |
colData=TRUE adds covariates from colData to the data frame.
If the dataset contains results (p-values and log2 fold changes), the result is a data frame with the columns
term |
The contrast being tested, as given to
|
gene |
gene ID |
baseMean |
mean abundance level |
estimate |
estimated log2 fold change |
stderror |
standard error in log2 fold change estimate |
statistic |
test statistic |
p.value |
p-value |
p.adjusted |
adjusted p-value |
If the dataset does not contain results (DESeq has not been run on it), tidy defaults to tidying the counts in the dataset:
gene |
gene ID |
sample |
sample ID |
count |
number of reads in this gene in this sample |
If colData = TRUE, it also merges this with the columns present in colData(x).
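The count-tidying behaviour described above can be sketched in base R. The toy matrix, the sample_info table and all variable names are assumptions for illustration; the sketch mimics the described output shape (gene, sample, count, plus merged sample covariates) and is not how biobroom itself is implemented.

```r
# Toy count matrix: genes in rows, samples in columns (made-up values)
counts <- matrix(c(10L, 0L, 5L, 7L, 3L, 8L),
                 nrow = 3,
                 dimnames = list(c("g1", "g2", "g3"), c("s1", "s2")))

# Per-sample covariates, playing the role of colData
sample_info <- data.frame(sample = c("s1", "s2"),
                          condition = c("A", "B"))

# One row per gene-sample combination; as.vector() walks the matrix
# column by column, so genes cycle fastest and samples slowest
tidy_counts <- data.frame(
  gene   = rep(rownames(counts), times = ncol(counts)),
  sample = rep(colnames(counts), each = nrow(counts)),
  count  = as.vector(counts)
)

# Analogue of colData = TRUE: attach the covariates to every row
tidy_with_info <- merge(tidy_counts, sample_info, by = "sample")
```

The merged covariates are repeated for every gene of a sample, which is redundant but convenient for plotting and grouping.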
# From DESeq2 documentation
if (require("DESeq2")) {
    dds <- makeExampleDESeqDataSet(betaSD = 1)

    tidy(dds)
    # With design included
    tidy(dds, colData=TRUE)

    # add a noise confounding effect
    colData(dds)$noise <- rnorm(nrow(colData(dds)))
    design(dds) <- (~ condition + noise)

    # perform differential expression tests
    ddsres <- DESeq(dds, test = "Wald")
    # now results are per-gene, per-term
    tidied <- tidy(ddsres)
    tidied

    if (require("ggplot2")) {
        ggplot(tidied, aes(p.value)) + geom_histogram(binwidth = .05) +
            facet_wrap(~ term, scales = "free_y")
    }
}
Tidy, augment and glance methods for turning edgeR objects into tidy data frames, where each row represents one observation and each column represents one variable.
## S3 method for class 'DGEExact'
tidy(x, ...)

## S3 method for class 'DGEList'
tidy(x, addSamples = FALSE, ...)

## S3 method for class 'DGEList'
augment(x, data = NULL, ...)

## S3 method for class 'DGEExact'
glance(x, alpha = 0.05, p.adjust.method = "fdr", ...)
x |
DGEExact, DGEList object |
... |
extra arguments (not used) |
addSamples |
Merge information from samples. Default is FALSE. |
data |
merge data to augment. This is particularly useful when merging gene names or other per-gene information. Default is NULL. |
alpha |
Significance threshold applied to the adjusted p-values |
p.adjust.method |
Method for adjusting p-values to determine significance; can be any in p.adjust.methods |
tidy defaults to tidying the counts in the dataset:
gene |
gene ID |
sample |
sample ID |
count |
number of reads in this gene in this sample |
If addSamples = TRUE, it also merges this with the sample information present in x$samples.
augment returns per-gene information (DGEList only).
glance returns one row with the columns (DGEExact only)
significant |
number of significant genes using the desired adjustment method and significance level |
comparison |
The pair of groups compared by edgeR, delimited by / |
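The `significant` count can be illustrated with base R's p.adjust. This is a minimal sketch with made-up p-values and hypothetical group names, assuming the count is simply the number of adjusted p-values below alpha; it is not taken from biobroom's source.

```r
# Made-up raw p-values, one per gene
p_values <- c(0.0001, 0.003, 0.02, 0.04, 0.2, 0.5, 0.8, 0.9)
alpha <- 0.05

# "fdr" is an alias for the Benjamini-Hochberg method in p.adjust
adjusted <- p.adjust(p_values, method = "fdr")

# Number of genes called significant after adjustment
n_significant <- sum(adjusted < alpha)   # 2 (four raw p-values are below 0.05)

# glance-like one-row summary; the group names "B" and "A" are hypothetical
glance_like <- data.frame(significant = n_significant, comparison = "B/A")
```

Any method listed in p.adjust.methods can be substituted for "fdr"; more conservative methods such as "bonferroni" will generally give a smaller count.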
if (require("edgeR")) {
    library(Biobase)
    data(hammer)
    hammer.counts <- exprs(hammer)[, 1:4]
    hammer.treatment <- phenoData(hammer)$protocol[1:4]

    y <- DGEList(counts=hammer.counts,group=hammer.treatment)
    y <- calcNormFactors(y)
    y <- estimateCommonDisp(y)
    y <- estimateTagwiseDisp(y)
    et <- exactTest(y)

    head(tidy(et))
    head(glance(et))
}
Tidying methods for Biobase's ExpressionSet objects
## S3 method for class 'ExpressionSet'
tidy(x, addPheno = FALSE,
     assay = Biobase::assayDataElementNames(x)[1L], ...)
x |
ExpressionSet object |
addPheno |
whether columns should be included in the tidied output for those in the ExpressionSet's phenoData |
assay |
The name of the |
... |
extra arguments (not used) |
addPheno=TRUE adds columns that are redundant (since they add per-sample information to a per-sample-per-gene data frame), but that are useful for some kinds of graphs and analyses.
tidy returns a data frame with one row per gene-sample combination, with columns
gene |
gene name |
sample |
sample name (from column names) |
value |
expressions on log2 scale |
library(Biobase)
# import ExpressionSet object
data(hammer)

# Use tidy to extract genes, sample ids and measured value
tidy(hammer)
# add phenoType data
tidy(hammer, addPheno=TRUE)
An ExpressionSet containing the results of the RNA-Seq study on the nervous system of rats by Hammer et al. (2010).
This was downloaded from the ReCount database of analysis-ready RNA-Seq datasets (Frazee et al 2011).
Hammer, P., Banck, M. S., Amberg, R., Wang, C., Petznick, G., Luo, S., Khrebtukova, I., Schroth, G. P., Beyerlein, P., and Beutler, A. S. (2010). mRNA-seq with agnostic splice site discovery for nervous system transcriptomics tested in chronic pain. Genome research, 20(6), 847-860. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2877581/
Frazee, A. C., Langmead, B., and Leek, J. T. (2011). ReCount: a multi-experiment resource of analysis-ready RNA-seq gene count datasets. BMC Bioinformatics, 12, 449. http://bowtie-bio.sourceforge.net/recount/
hammer
An object of class ExpressionSet with 29516 rows and 8 columns.
ExpressionSet
Tidy, augment, and glance methods for MArrayLM objects, which contain the results of fitting gene-wise linear models to microarray datasets. This class is the output of the lmFit and eBayes functions.
## S3 method for class 'MArrayLM'
tidy(x, intercept = FALSE, ...)

## S3 method for class 'MArrayLM'
augment(x, data, ...)

## S3 method for class 'MArrayLM'
glance(x, ...)

## S3 method for class 'MAList'
tidy(x, ...)

## S3 method for class 'EList'
tidy(x, addTargets = FALSE, ...)
x |
|
intercept |
whether the |
... |
extra arguments, not used |
data |
original expression matrix; if missing, |
addTargets |
Add sample level information. Default is FALSE. |
Tidying this fit computes one row per coefficient per gene, while augmenting returns one row per gene, with per-gene statistics included. (This is thus a rare case where the augment output has fewer rows than the tidy output. This is a side effect of the fact that the input to limma is not tidy but rather a one-row-per-gene matrix.)
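The difference in row counts between the tidied form (one row per gene per coefficient) and the augmented form (one row per gene) can be checked on a toy example. The coefficient matrix, the sigma values and the variable names below are made up for illustration; no limma or biobroom code is involved.

```r
# Coefficients for 3 genes and 2 terms (made-up values)
coefs <- matrix(c(0.5, -1.2, 0.1, 2.0, 0.3, -0.7),
                nrow = 3,
                dimnames = list(c("g1", "g2", "g3"),
                                c("treatmentb", "confounding")))

# Per-gene residual standard deviations (made-up values)
sigma <- c(g1 = 1.1, g2 = 0.9, g3 = 1.4)

# tidy-like: one row per gene per coefficient
tidy_like <- data.frame(
  gene     = rep(rownames(coefs), times = ncol(coefs)),
  term     = rep(colnames(coefs), each = nrow(coefs)),
  estimate = as.vector(coefs)
)

# augment-like: one row per gene with per-gene statistics
augment_like <- data.frame(.gene = names(sigma), .sigma = unname(sigma))

nrow(tidy_like)     # 6: 3 genes x 2 coefficients
nrow(augment_like)  # 3: one row per gene
```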
The output of tidying functions is always a data frame without rownames.
tidy returns one row per gene per coefficient. It always contains the columns
gene |
The name of the gene (extracted from the rownames of the input matrix) |
term |
The coefficient being estimated |
estimate |
The estimate of each per-gene coefficient |
Depending on whether the object comes from eBayes, it may also contain
statistic |
Empirical Bayes t-statistic |
p.value |
p-value computed from t-statistic |
lod |
log-of-odds score |
augment returns one row per gene, containing the original gene expression matrix if provided. It then adds columns containing the per-gene statistics included in the MArrayLM object, each prepended with a .:
.gene |
gene ID, obtained from the rownames of the input |
.sigma |
per-gene residual standard deviation |
.df.residual |
per-gene residual degrees of freedom |
The following columns may also be included, depending on which have been added by lmFit and eBayes:
.AMean |
average log-intensity of each probe across arrays |
.statistic |
moderated F-statistic |
.p.value |
p-value generated from moderated F-statistic |
.df.total |
total degrees of freedom per gene |
.df.residual |
residual degrees of freedom per gene |
.s2.prior |
prior estimate of residual variance |
.s2.post |
posterior estimate of residual variance |
glance returns one row, containing
rank |
rank of design matrix |
df.prior |
empirical Bayesian prior degrees of freedom |
s2.prior |
empirical Bayesian prior residual variance |
For MAList objects, tidy returns a data frame with one row per gene-sample combination, with columns
gene |
gene name |
sample |
sample name (from column names) |
value |
expressions on log2 scale |
For EList objects, tidy returns a data frame with one row per gene-sample combination, with columns
gene |
gene name |
sample |
sample name (from column names) |
value |
expressions on log2 scale |
weight |
present if |
other columns |
if present and if |
if (require("limma")) {
    # create random data and design
    set.seed(2014)
    dat <- matrix(rnorm(1000), ncol=4)
    dat[, 1:2] <- dat[, 1:2] + .5  # add an effect
    rownames(dat) <- paste0("g", 1:nrow(dat))
    des <- data.frame(treatment = c("a", "a", "b", "b"),
                      confounding = rnorm(4))

    lfit <- lmFit(dat, model.matrix(~ treatment + confounding, des))
    eb <- eBayes(lfit)
    head(tidy(lfit))
    head(tidy(eb))

    if (require("ggplot2")) {
        # the tidied form puts it in an ideal form for plotting
        ggplot(tidy(lfit), aes(estimate)) + geom_histogram(binwidth=1) +
            facet_wrap(~ term)
        ggplot(tidy(eb), aes(p.value)) + geom_histogram(binwidth=.2) +
            facet_wrap(~ term)
    }
}
This method handles the return values of functions that return lists rather than S3 objects, such as sva, and therefore cannot be handled by S3 dispatch.
## S3 method for class 'list'
tidy(x, ...)

## S3 method for class 'list'
glance(x, ...)
x |
list object |
... |
extra arguments, passed to the tidying function |
The tidiers themselves are implemented as unexported functions of the form tidy_<function>.
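One way such a list method could route a plain list to a tidy_<function> helper is to inspect the element names. The sketch below is purely hypothetical: the function names tidy_list_sketch and tidy_sva_sketch, and the idea of recognising an sva result by its element names, are assumptions for illustration and may differ from what biobroom actually does.

```r
# Hypothetical tidier for sva-style lists (stand-in for an unexported helper)
tidy_sva_sketch <- function(x, ...) {
  as.data.frame(x$sv)
}

# Hypothetical list handler: recognise the list by its element names,
# since a plain list carries no class that S3 dispatch could use
tidy_list_sketch <- function(x, ...) {
  sva_names <- c("sv", "pprob.gam", "pprob.b", "n.sv")
  if (all(sva_names %in% names(x))) {
    return(tidy_sva_sketch(x, ...))
  }
  stop("no tidying method recognised for this list")
}

# Mock sva-like list with placeholder values
sva_like <- list(sv = matrix(0, nrow = 4, ncol = 2),
                 pprob.gam = rep(0.5, 3),
                 pprob.b = rep(0.5, 3),
                 n.sv = 2)
res <- tidy_list_sketch(sva_like)
```

A list that does not match any known signature raises an error rather than returning a partially tidied result.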
Tidying methods for MSnbase's MSnSet objects
## S3 method for class 'MSnSet'
tidy(x, addPheno = FALSE, ...)
x |
MSnSet object |
addPheno |
whether columns should be included in the tidied output for those in the MSnSet's phenoData |
... |
extra arguments (not used) |
addPheno=TRUE adds columns that are redundant (since they add per-sample information to a per-sample-per-protein data frame), but that are useful for some kinds of graphs and analyses.
tidy returns a data frame with one row per protein-sample combination, with columns
protein |
protein name |
sample |
sample name (from column names) |
value |
protein quantitation data |
if (require("MSnbase")) {
    library(MSnbase)
    # import MSnSet object
    data(msnset)

    # Use tidy to extract genes, sample ids and measured value
    tidy(msnset)
    # add phenoType data
    tidy(msnset, addPheno=TRUE)
}
These are methods for turning a qvalue object, from the qvalue package for false discovery rate control, into a tidy data frame. augment returns a data.frame of the original p-values combined with the computed q-values and local false discovery rates, tidy constructs a table showing how the estimate of pi0 (the proportion of true nulls) depends on the choice of the tuning parameter lambda, and glance returns a data.frame with only the chosen pi0 value.
## S3 method for class 'qvalue'
tidy(x, ...)

## S3 method for class 'qvalue'
augment(x, data, ...)

## S3 method for class 'qvalue'
glance(x, ...)
x |
qvalue object |
... |
extra arguments (not used) |
data |
Original data |
All tidying methods return a data.frame without rownames. The structure depends on the method chosen.
tidy returns one row for each choice of the tuning parameter lambda that was considered (argument lambda to qvalue), containing
lambda |
the tuning parameter |
pi0 |
corresponding estimate of pi0 |
smoothed |
whether the estimate has been spline-smoothed |
If pi0.method="smooth", the pi0 estimates and smoothed values both appear in the table. If pi0.method="bootstrap", smoothed is FALSE for all entries.
augment returns a data.frame with
p.value |
the original p-values given to |
q.value |
the computed q-values |
lfdr |
the local false discovery rate |
glance returns a one-row data.frame containing
pi0 |
the estimated pi0 (proportion of nulls) |
lambda |
lambda used to compute pi0. Note that if pi0 is 1, this may be NA since it can be ambiguous which lambda was used |
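To make the lambda/pi0 table more concrete, the sketch below computes Storey's pi0 estimate over a grid of lambda values in base R and stacks the raw and spline-smoothed estimates in the same lambda, pi0, smoothed layout. The simulated p-values, the lambda grid and the spline degrees of freedom are assumptions chosen for illustration; the qvalue package's own computation may differ in detail.

```r
set.seed(2014)
# Mixture of null (uniform) and non-null (concentrated near 0) p-values
p <- c(runif(700), rbeta(300, 0.2, 5))

# Grid of tuning parameter values
lambda <- seq(0.05, 0.95, by = 0.05)

# Storey's estimator: fraction of p-values above lambda, rescaled by 1 - lambda
pi0_raw <- sapply(lambda, function(l) mean(p > l) / (1 - l))

# Cubic smoothing spline of the raw estimates as a function of lambda
fit <- smooth.spline(lambda, pi0_raw, df = 3)
pi0_smooth <- predict(fit, x = lambda)$y

# Long table: raw and smoothed estimates distinguished by a logical column
pi0_table <- rbind(
  data.frame(lambda = lambda, pi0 = pi0_raw,    smoothed = FALSE),
  data.frame(lambda = lambda, pi0 = pi0_smooth, smoothed = TRUE)
)
```

Keeping both series in one long table is what allows a single ggplot call with color = smoothed to overlay the two curves.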
library(ggplot2)
if (require("qvalue")) {
    set.seed(2014)

    # generate p-values from many one sample t-tests: half of them null
    oracle <- rep(c(0, .5), each=1000)
    pvals <- sapply(oracle, function(mu) t.test(rnorm(15, mu))$p.value)
    ggplot(data.frame(p.value = pvals), aes(p.value)) +
        geom_histogram(bins = 30)

    q <- qvalue(pvals)

    tidy(q)
    head(augment(q))
    glance(q)

    # use augmented data to compare p-values to q-values
    ggplot(augment(q), aes(p.value, q.value)) + geom_point()

    # use tidy to see how pi0 estimate changes with lambda, comparing
    # to smoothed version
    g <- ggplot(tidy(q), aes(lambda, pi0, color=smoothed)) + geom_line()
    g

    # show the chosen value
    g + geom_hline(yintercept=q$pi0, lty=2)
}
Tidying methods for SummarizedExperiment objects
## S3 method for class 'RangedSummarizedExperiment'
tidy(x, addPheno = FALSE,
     assay = SummarizedExperiment::assayNames(x)[1L], ...)
x |
SummarizedExperiment object |
addPheno |
whether columns should be included in the tidied output for those in the SummarizedExperiment colData |
assay |
Which assay to return as the |
... |
extra arguments (not used) |
addPheno=TRUE adds columns that are redundant (since they add per-sample information to a per-sample-per-gene data frame), but that are useful for some kinds of graphs and analyses.
tidy returns a data frame with one row per gene-sample combination, with columns
gene |
gene name |
sample |
sample name (from column names) |
value |
expressions |
If addPheno is TRUE then information from colData is added.
if (require("SummarizedExperiment") && require("airway")) {
    data(airway)

    se <- airway
    tidy(se)
}
Tidying methods for edge's deSet object
## S3 method for class 'deSet'
tidy(x, addPheno = FALSE, ...)

## S3 method for class 'deSet'
augment(x, data, ...)

## S3 method for class 'deSet'
glance(x, ...)
x |
deSet object |
addPheno |
whether columns should be included in the tidied output for those in the ExpressionSet's phenoData |
... |
extra arguments (not used) |
data |
Original data can be added. Default is NULL. |
addPheno=TRUE adds columns that are redundant (since they add per-sample information to a per-sample-per-gene data frame), but that are useful for some kinds of graphs and analyses.
tidy returns a data frame with one row per gene-sample combination, with columns
gene |
gene name |
sample |
sample name (from column names) |
value |
expressions on log2 scale |
augment returns a data.frame with
p.value |
the original p-values given to |
q.value |
the computed q-values |
lfdr |
the local false discovery rate |
glance returns a data.frame with the model fits.
Tidying methods for GRanges and GRangesList objects.
## S3 method for class 'GRanges'
tidy(x, ...)

## S3 method for class 'GRangesList'
tidy(x, ...)

## S3 method for class 'GRanges'
glance(x, ...)

## S3 method for class 'GRangesList'
glance(x, ...)
x |
GRanges or GRangesList object |
... |
Not used. |
All tidying methods return a data.frame without rownames. tidy returns one row for each range, which contains
start of the range
end of the range
width (or length) of the range
names of the range
strand
seqname Name of the sequence from which the range comes (usually the chromosome)
metadata Any included metadata (e.g., score, GC content)
For GRangesList, there will also be a column representing which group the ranges come from.
glance returns a data.frame with the number of ranges, the number of sequences, and the number of groups (if applicable).
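The shape of these outputs can be mimicked with a plain data frame. In the sketch below the ranges, the group labels and the glance-like column names (ranges, sequences, groups) are all hypothetical, and the width is computed assuming closed intervals (end - start + 1); it only illustrates the layout and does not use GenomicRanges.

```r
# Three made-up ranges in two groups, one row per range
ranges_df <- data.frame(
  seqname = c("chr1", "chr1", "chr2"),
  start   = c(100L, 500L, 20L),
  end     = c(199L, 549L, 119L),
  strand  = c("+", "-", "+"),
  group   = c("geneA", "geneA", "geneB")
)

# Width of a closed interval [start, end]
ranges_df$width <- ranges_df$end - ranges_df$start + 1L

# glance-like one-row summary (column names are hypothetical)
glance_like <- data.frame(
  ranges    = nrow(ranges_df),
  sequences = length(unique(ranges_df$seqname)),
  groups    = length(unique(ranges_df$group))
)
```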
if (require("GenomicRanges") && require("airway")) {
    data(airway)

    # GRangesList object
    air_gr <- rowRanges(airway)
    tidy(air_gr)
    glance(air_gr)

    # GRanges object
    air_gr <- rowRanges(airway)@unlistData
    tidy(air_gr)
    glance(air_gr)
}