Title: | Differential gene expression analysis based on the negative binomial distribution |
---|---|
Description: | Estimate variance-mean dependence in count data from high-throughput sequencing assays and test for differential expression based on a model using the negative binomial distribution. |
Authors: | Michael Love [aut, cre], Constantin Ahlmann-Eltze [ctb], Kwame Forbes [ctb], Simon Anders [aut, ctb], Wolfgang Huber [aut, ctb], RADIANT EU FP7 [fnd], NIH NHGRI [fnd], CZI [fnd] |
Maintainer: | Michael Love <[email protected]> |
License: | LGPL (>= 3) |
Version: | 1.47.1 |
Built: | 2024-11-15 03:09:54 UTC |
Source: | https://github.com/bioc/DESeq2 |
The DESeq2 package is designed for normalization, visualization, and differential analysis of high-dimensional count data. It makes use of empirical Bayes techniques to estimate priors for log fold change and dispersion, and to calculate posterior estimates for these quantities.
The main functions are:
DESeqDataSet
- build the dataset, see tximeta & tximport packages for preparing input
DESeq
- perform differential analysis
results
- build a results table
lfcShrink
- estimate shrunken LFC (posterior estimates) using apeglm & ashr pakges
vst
- apply variance stabilizing transformation, e.g. for PCA or sample clustering
Plots, e.g.: plotPCA
, plotMA
, plotCounts
For detailed information on usage, see the package vignette, by typing
vignette("DESeq2")
, or the workflow linked to on the first page
of the vignette.
All software-related questions should be posted to the Bioconductor Support Site:
https://support.bioconductor.org
The code can be viewed at the GitHub repository, which also lists the contributor code of conduct:
https://github.com/mikelove/tximport
Michael Love, Wolfgang Huber, Simon Anders
Love, M.I., Huber, W., Anders, S. (2014) Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology, 15:550. https://doi.org/10.1186/s13059-014-0550-8
Useful links:
Note: results tables with log2 fold change, p-values, adjusted p-values, etc.
for each gene are best generated using the results
function. The coef
function is designed for advanced users who wish to inspect all model coefficients at once.
## S3 method for class 'DESeqDataSet' coef(object, SE = FALSE, ...)
## S3 method for class 'DESeqDataSet' coef(object, SE = FALSE, ...)
object |
a DESeqDataSet returned by |
SE |
whether to give the standard errors instead of coefficients. defaults to FALSE so that the coefficients are given. |
... |
additional arguments |
Estimated model coefficients or estimated standard errors are provided in a matrix
form, number of genes by number of parameters, on the log2 scale.
The columns correspond to columns of the model matrix for final GLM fitting, i.e.,
attr(dds, "modelMatrix")
.
Michael Love
dds <- makeExampleDESeqDataSet(m=4) dds <- DESeq(dds) coef(dds)[1,] coef(dds, SE=TRUE)[1,]
dds <- makeExampleDESeqDataSet(m=4) dds <- DESeq(dds) coef(dds)[1,] coef(dds, SE=TRUE)[1,]
Collapses the columns in object
by summing within levels
of a grouping factor groupby
. The purpose of this function
is to sum up read counts from technical replicates to create an object
with a single column of read counts for each sample.
This function will issue a warning if there are other assays other than
"counts"
, see details below in 'Value'.
collapseReplicates(object, groupby, run, renameCols = TRUE)
collapseReplicates(object, groupby, run, renameCols = TRUE)
object |
A |
groupby |
a grouping factor, as long as the columns of object |
run |
optional, the names of each unique column in object. if provided,
a new column |
renameCols |
whether to rename the columns of the returned object using the levels of the grouping factor |
Note: by "technical replicates", we mean multiple sequencing runs of the same library, in constrast to "biological replicates" in which multiple libraries are prepared from separate biological units. Optionally renames the columns of returned object with the levels of the grouping factor. Note: this function is written very simply and can be easily altered to produce other behavior by examining the source code.
the object
with as many columns as levels in groupby
.
This object has "counts"
data which is summed from the various
columns which are grouped together, and the colData
is subset using
the first column for each group in groupby
.
Other assays are dropped, as it is not unambiguous the correct
form of combination, and a warning is printed if they are present, so
the user is aware they should take care of such assays manually.
dds <- makeExampleDESeqDataSet(m=12) # make data with two technical replicates for three samples dds$sample <- factor(sample(paste0("sample",rep(1:9, c(2,1,1,2,1,1,2,1,1))))) dds$run <- paste0("run",1:12) ddsColl <- collapseReplicates(dds, dds$sample, dds$run) # examine the colData and column names of the collapsed data colData(ddsColl) colnames(ddsColl) # check that the sum of the counts for "sample1" is the same # as the counts in the "sample1" column in ddsColl matchFirstLevel <- dds$sample == levels(dds$sample)[1] stopifnot(all(rowSums(counts(dds[,matchFirstLevel])) == counts(ddsColl[,1])))
dds <- makeExampleDESeqDataSet(m=12) # make data with two technical replicates for three samples dds$sample <- factor(sample(paste0("sample",rep(1:9, c(2,1,1,2,1,1,2,1,1))))) dds$run <- paste0("run",1:12) ddsColl <- collapseReplicates(dds, dds$sample, dds$run) # examine the colData and column names of the collapsed data colData(ddsColl) colnames(ddsColl) # check that the sum of the counts for "sample1" is the same # as the counts in the "sample1" column in ddsColl matchFirstLevel <- dds$sample == levels(dds$sample)[1] stopifnot(all(rowSums(counts(dds[,matchFirstLevel])) == counts(ddsColl[,1])))
The counts slot holds the count data as a matrix of non-negative integer count values, one row for each observational unit (gene or the like), and one column for each sample.
## S4 method for signature 'DESeqDataSet' counts(object, normalized = FALSE, replaced = FALSE) ## S4 replacement method for signature 'DESeqDataSet,matrix' counts(object) <- value
## S4 method for signature 'DESeqDataSet' counts(object, normalized = FALSE, replaced = FALSE) ## S4 replacement method for signature 'DESeqDataSet,matrix' counts(object) <- value
object |
a |
normalized |
logical indicating whether or not to divide the counts by the size factors or normalization factors before returning (normalization factors always preempt size factors) |
replaced |
after a |
value |
an integer matrix |
Simon Anders
sizeFactors
, normalizationFactors
dds <- makeExampleDESeqDataSet(m=4) head(counts(dds)) dds <- estimateSizeFactors(dds) # run this or DESeq() first head(counts(dds, normalized=TRUE))
dds <- makeExampleDESeqDataSet(m=4) head(counts(dds)) dds <- estimateSizeFactors(dds) # run this or DESeq() first head(counts(dds, normalized=TRUE))
This function performs a default analysis through the steps:
estimation of size factors: estimateSizeFactors
estimation of dispersion: estimateDispersions
Negative Binomial GLM fitting and Wald statistics: nbinomWaldTest
For complete details on each step, see the manual pages of the respective
functions. After the DESeq
function returns a DESeqDataSet object,
results tables (log2 fold changes and p-values) can be generated
using the results
function.
Shrunken LFC can then be generated using the lfcShrink
function.
All support questions should be posted to the Bioconductor
support site: http://support.bioconductor.org.
DESeq( object, test = c("Wald", "LRT"), fitType = c("parametric", "local", "mean", "glmGamPoi"), sfType = c("ratio", "poscounts", "iterate"), betaPrior, full = design(object), reduced, quiet = FALSE, minReplicatesForReplace = 7, modelMatrixType, useT = FALSE, minmu = if (fitType == "glmGamPoi") 1e-06 else 0.5, parallel = FALSE, BPPARAM = bpparam() )
DESeq( object, test = c("Wald", "LRT"), fitType = c("parametric", "local", "mean", "glmGamPoi"), sfType = c("ratio", "poscounts", "iterate"), betaPrior, full = design(object), reduced, quiet = FALSE, minReplicatesForReplace = 7, modelMatrixType, useT = FALSE, minmu = if (fitType == "glmGamPoi") 1e-06 else 0.5, parallel = FALSE, BPPARAM = bpparam() )
object |
a DESeqDataSet object, see the constructor functions
|
test |
either "Wald" or "LRT", which will then use either
Wald significance tests (defined by |
fitType |
either "parametric", "local", "mean", or "glmGamPoi"
for the type of fitting of dispersions to the mean intensity.
See |
sfType |
either "ratio", "poscounts", or "iterate"
for the type of size factor estimation. See
|
betaPrior |
whether or not to put a zero-mean normal prior on
the non-intercept coefficients
See |
full |
for |
reduced |
for |
quiet |
whether to print messages at each step |
minReplicatesForReplace |
the minimum number of replicates required
in order to use |
modelMatrixType |
either "standard" or "expanded", which describe
how the model matrix, X of the GLM formula is formed.
"standard" is as created by |
useT |
logical, passed to |
minmu |
lower bound on the estimated count for fitting gene-wise dispersion
and for use with |
parallel |
if FALSE, no parallelization. if TRUE, parallel
execution using |
BPPARAM |
an optional parameter object passed internally
to |
The differential expression analysis uses a generalized linear model of the form:
where counts for gene i, sample j are modeled using
a Negative Binomial distribution with fitted mean
and a gene-specific dispersion parameter
.
The fitted mean is composed of a sample-specific size factor
and a parameter
proportional to the
expected true concentration of fragments for sample j.
The coefficients
give the log2 fold changes for gene i for each
column of the model matrix
.
The sample-specific size factors can be replaced by
gene-specific normalization factors for each sample using
normalizationFactors
.
For details on the fitting of the log2 fold changes and calculation of p-values,
see nbinomWaldTest
if using test="Wald"
,
or nbinomLRT
if using test="LRT"
.
Experiments without replicates do not allow for estimation of the dispersion of counts around the expected value for each group, which is critical for differential expression analysis. Analysis without replicates was deprecated in v1.20 and is no longer supported since v1.22.
The argument minReplicatesForReplace
is used to decide which samples
are eligible for automatic replacement in the case of extreme Cook's distance.
By default, DESeq
will replace outliers if the Cook's distance is
large for a sample which has 7 or more replicates (including itself).
Outlier replacement is turned off entirely for fitType="glmGamPoi"
.
This replacement is performed by the replaceOutliers
function. This default behavior helps to prevent filtering genes
based on Cook's distance when there are many degrees of freedom.
See results
for more information about filtering using
Cook's distance, and the 'Dealing with outliers' section of the vignette.
Unlike the behavior of replaceOutliers
, here original counts are
kept in the matrix returned by counts
, original Cook's
distances are kept in assays(dds)[["cooks"]]
, and the replacement
counts used for fitting are kept in assays(dds)[["replaceCounts"]]
.
Note that if a log2 fold change prior is used (betaPrior=TRUE)
then expanded model matrices will be used in fitting. These are
described in nbinomWaldTest
and in the vignette. The
contrast
argument of results
should be used for
generating results tables.
a DESeqDataSet
object with results stored as
metadata columns. These results should accessed by calling the results
function. By default this will return the log2 fold changes and p-values for the last
variable in the design formula. See results
for how to access results
for other variables.
Michael Love
Love, M.I., Huber, W., Anders, S. (2014) Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology, 15:550. https://doi.org/10.1186/s13059-014-0550-8
For fitType="glmGamPoi"
:
Ahlmann-Eltze, C., Huber, W. (2020) glmGamPoi: Fitting Gamma-Poisson Generalized Linear Models on Single Cell Count Data. Bioinformatics. https://doi.org/10.1093/bioinformatics/btaa1009
link{results}
, lfcShrink
, nbinomWaldTest
, nbinomLRT
# see vignette for suggestions on generating # count tables from RNA-Seq data cnts <- matrix(rnbinom(n=1000, mu=100, size=1/0.5), ncol=10) cond <- factor(rep(1:2, each=5)) # object construction dds <- DESeqDataSetFromMatrix(cnts, DataFrame(cond), ~ cond) # standard analysis dds <- DESeq(dds) res <- results(dds) # moderated log2 fold changes resultsNames(dds) resLFC <- lfcShrink(dds, coef=2, type="apeglm") # an alternate analysis: likelihood ratio test ddsLRT <- DESeq(dds, test="LRT", reduced= ~ 1) resLRT <- results(ddsLRT)
# see vignette for suggestions on generating # count tables from RNA-Seq data cnts <- matrix(rnbinom(n=1000, mu=100, size=1/0.5), ncol=10) cond <- factor(rep(1:2, each=5)) # object construction dds <- DESeqDataSetFromMatrix(cnts, DataFrame(cond), ~ cond) # standard analysis dds <- DESeq(dds) res <- results(dds) # moderated log2 fold changes resultsNames(dds) resLFC <- lfcShrink(dds, coef=2, type="apeglm") # an alternate analysis: likelihood ratio test ddsLRT <- DESeq(dds, test="LRT", reduced= ~ 1) resLRT <- results(ddsLRT)
DESeqDataSet
is a subclass of RangedSummarizedExperiment
,
used to store the input values, intermediate calculations and results of an
analysis of differential expression. The DESeqDataSet
class
enforces non-negative integer values in the "counts" matrix stored as
the first element in the assay list.
In addition, a formula which specifies the design of the experiment must be provided.
The constructor functions create a DESeqDataSet object
from various types of input:
a RangedSummarizedExperiment, a matrix, count files generated by
the python package HTSeq, or a list from the tximport function in the
tximport package.
See the vignette for examples of construction from different types.
DESeqDataSet(se, design, ignoreRank = FALSE, skipIntegerMode = FALSE) DESeqDataSetFromMatrix( countData, colData, design, tidy = FALSE, ignoreRank = FALSE, ... ) DESeqDataSetFromHTSeqCount( sampleTable, directory = ".", design, ignoreRank = FALSE, ... ) DESeqDataSetFromTximport(txi, colData, design, ...)
DESeqDataSet(se, design, ignoreRank = FALSE, skipIntegerMode = FALSE) DESeqDataSetFromMatrix( countData, colData, design, tidy = FALSE, ignoreRank = FALSE, ... ) DESeqDataSetFromHTSeqCount( sampleTable, directory = ".", design, ignoreRank = FALSE, ... ) DESeqDataSetFromTximport(txi, colData, design, ...)
se |
a |
design |
a |
ignoreRank |
use of this argument is reserved for DEXSeq developers only. Users will immediately encounter an error upon trying to estimate dispersion using a design with a model matrix which is not full rank. |
skipIntegerMode |
do not convert counts to integer mode (glmGamPoi allows non-integer counts) |
countData |
for matrix input: a matrix of non-negative integers |
colData |
for matrix input: a |
tidy |
for matrix input: whether the first column of countData is the rownames for the count matrix |
... |
arguments provided to |
sampleTable |
for htseq-count: a |
directory |
for htseq-count: the directory relative to which the filenames are specified. defaults to current directory |
txi |
for tximport: the simple list output of the |
Note on the error message "assay colnames() must be NULL or equal colData rownames()":
this means that the colnames of countData are different than the rownames of colData.
Fix this with: colnames(countData) <- NULL
A DESeqDataSet object.
See http://www-huber.embl.de/users/anders/HTSeq for htseq-count
countData <- matrix(1:100,ncol=4) condition <- factor(c("A","A","B","B")) dds <- DESeqDataSetFromMatrix(countData, DataFrame(condition), ~ condition)
countData <- matrix(1:100,ncol=4) condition <- factor(c("A","A","B","B")) dds <- DESeqDataSetFromMatrix(countData, DataFrame(condition), ~ condition)
This constructor function would not typically be used by "end users".
This simple class indirectly extends the DataFrame class defined in the
S4Vectors package to allow other packages to write methods for results
objects from the DESeq2 package. It is used by results
to wrap up the results table.
DESeqResults(DataFrame, priorInfo = list())
DESeqResults(DataFrame, priorInfo = list())
DataFrame |
a DataFrame of results, standard column names are: baseMean, log2FoldChange, lfcSE, stat, pvalue, padj. |
priorInfo |
a list giving information on the log fold change prior |
a DESeqResults object
This constructor function would not typically be used by "end users".
This simple class extends the RangedSummarizedExperiment class of the
SummarizedExperiment package.
It is used by rlog
and
varianceStabilizingTransformation
to wrap up the results into a class for downstream methods,
such as plotPCA
.
DESeqTransform(SummarizedExperiment)
DESeqTransform(SummarizedExperiment)
SummarizedExperiment |
a RangedSummarizedExperiment |
a DESeqTransform object
The design holds the R formula
which expresses how the
counts depend on the variables in colData
.
See DESeqDataSet
for details.
## S4 method for signature 'DESeqDataSet' design(object) ## S4 replacement method for signature 'DESeqDataSet,formula' design(object) <- value ## S4 replacement method for signature 'DESeqDataSet,matrix' design(object) <- value
## S4 method for signature 'DESeqDataSet' design(object) ## S4 replacement method for signature 'DESeqDataSet,formula' design(object) <- value ## S4 replacement method for signature 'DESeqDataSet,matrix' design(object) <- value
object |
a |
value |
a |
dds <- makeExampleDESeqDataSet(m=4) design(dds) <- formula(~ 1)
dds <- makeExampleDESeqDataSet(m=4) design(dds) <- formula(~ 1)
The dispersion function is calculated by estimateDispersions
and
used by varianceStabilizingTransformation
. Parametric dispersion
fits store the coefficients of the fit as attributes in this slot.
dispersionFunction(object, ...) dispersionFunction(object, ...) <- value ## S4 method for signature 'DESeqDataSet' dispersionFunction(object) ## S4 replacement method for signature 'DESeqDataSet,function' dispersionFunction(object) <- value
dispersionFunction(object, ...) dispersionFunction(object, ...) <- value ## S4 method for signature 'DESeqDataSet' dispersionFunction(object) ## S4 replacement method for signature 'DESeqDataSet,function' dispersionFunction(object) <- value
object |
a |
... |
additional arguments |
value |
a |
Using this setter function will also overwrite mcols(object)$dispFit
and the estimate of the variance of dispersion residuals.
dds <- makeExampleDESeqDataSet(m=4) dds <- estimateSizeFactors(dds) dds <- estimateDispersions(dds) dispersionFunction(dds)
dds <- makeExampleDESeqDataSet(m=4) dds <- estimateSizeFactors(dds) dds <- estimateDispersions(dds) dispersionFunction(dds)
The dispersions for each row of the DESeqDataSet. Generally,
these are set by estimateDispersions
.
dispersions(object, ...) dispersions(object, ...) <- value ## S4 method for signature 'DESeqDataSet' dispersions(object) ## S4 replacement method for signature 'DESeqDataSet,numeric' dispersions(object) <- value
dispersions(object, ...) dispersions(object, ...) <- value ## S4 method for signature 'DESeqDataSet' dispersions(object) ## S4 replacement method for signature 'DESeqDataSet,numeric' dispersions(object) <- value
object |
a |
... |
additional arguments |
value |
the dispersions to use for the Negative Binomial modeling |
Simon Anders
These lower-level functions are called within DESeq
or nbinomWaldTest
.
End users should use those higher-level function instead.
NOTE: estimateBetaPriorVar
returns a numeric vector, not a DESEqDataSet!
For advanced users: to use these functions, first run estimateMLEForBetaPriorVar
and then run estimateBetaPriorVar
.
estimateBetaPriorVar( object, betaPriorMethod = c("weighted", "quantile"), upperQuantile = 0.05, modelMatrix = NULL ) estimateMLEForBetaPriorVar( object, maxit = 100, useOptim = TRUE, useQR = TRUE, modelMatrixType = NULL )
estimateBetaPriorVar( object, betaPriorMethod = c("weighted", "quantile"), upperQuantile = 0.05, modelMatrix = NULL ) estimateMLEForBetaPriorVar( object, maxit = 100, useOptim = TRUE, useQR = TRUE, modelMatrixType = NULL )
object |
a DESeqDataSet |
betaPriorMethod |
the method for calculating the beta prior variance, either "quanitle" or "weighted": "quantile" matches a normal distribution using the upper quantile of the finite MLE betas. "weighted" matches a normal distribution using the upper quantile, but weighting by the variance of the MLE betas. |
upperQuantile |
the upper quantile to be used for the "quantile" or "weighted" method of beta prior variance estimation |
modelMatrix |
an optional matrix, typically this is set to NULL and created within the function |
maxit |
as defined in |
useOptim |
as defined in |
useQR |
as defined in |
modelMatrixType |
an optional override for the type which is set internally |
for estimateMLEForBetaPriorVar
, a DESeqDataSet, with the
necessary information stored in order to calculate the prior variance.
for estimateBetaPriorVar
, the vector of variances for the prior
on the betas in the DESeq
GLM
This function obtains dispersion estimates for Negative Binomial distributed data.
## S4 method for signature 'DESeqDataSet' estimateDispersions( object, fitType = c("parametric", "local", "mean", "glmGamPoi"), maxit = 100, useCR = TRUE, weightThreshold = 0.01, quiet = FALSE, modelMatrix = NULL, minmu = if (fitType == "glmGamPoi") 1e-06 else 0.5 )
## S4 method for signature 'DESeqDataSet' estimateDispersions( object, fitType = c("parametric", "local", "mean", "glmGamPoi"), maxit = 100, useCR = TRUE, weightThreshold = 0.01, quiet = FALSE, modelMatrix = NULL, minmu = if (fitType == "glmGamPoi") 1e-06 else 0.5 )
object |
a DESeqDataSet |
fitType |
either "parametric", "local", "mean", or "glmGamPoi" for the type of fitting of dispersions to the mean intensity.
|
maxit |
control parameter: maximum number of iterations to allow for convergence |
useCR |
whether to use Cox-Reid correction - see McCarthy et al (2012) |
weightThreshold |
threshold for subsetting the design matrix and GLM weights for calculating the Cox-Reid correction |
quiet |
whether to print messages at each step |
modelMatrix |
an optional matrix which will be used for fitting the expected counts.
by default, the model matrix is constructed from |
minmu |
lower bound on the estimated count for fitting gene-wise dispersion |
Typically the function is called with the idiom:
dds <- estimateDispersions(dds)
The fitting proceeds as follows: for each gene, an estimate of the dispersion
is found which maximizes the Cox Reid-adjusted profile likelihood
(the methods of Cox Reid-adjusted profile likelihood maximization for
estimation of dispersion in RNA-Seq data were developed by McCarthy,
et al. (2012), first implemented in the edgeR package in 2010);
a trend line capturing the dispersion-mean relationship is fit to the maximum likelihood estimates;
a normal prior is determined for the log dispersion estimates centered
on the predicted value from the trended fit
with variance equal to the difference between the observed variance of the
log dispersion estimates and the expected sampling variance;
finally maximum a posteriori dispersion estimates are returned.
This final dispersion parameter is used in subsequent tests.
The final dispersion estimates can be accessed from an object using dispersions
.
The fitted dispersion-mean relationship is also used in
varianceStabilizingTransformation
.
All of the intermediate values (gene-wise dispersion estimates, fitted dispersion
estimates from the trended fit, etc.) are stored in mcols(dds)
, with
information about these columns in mcols(mcols(dds))
.
The log normal prior on the dispersion parameter has been proposed by Wu, et al. (2012) and is also implemented in the DSS package.
In DESeq2, the dispersion estimation procedure described above replaces the different methods of dispersion from the previous version of the DESeq package.
Since version 1.29, DESeq2 can call the glmGamPoi package, which can speed up the inference
and is optimized for fitting many samles with very small counts (for example single cell
RNA-seq data). To call functions from the glmGamPoi package, make sure that it is installed
and set fitType = "glmGamPoi"
. In addition, to the gene estimates, the trend and the MAP,
the glmGamPoi package calculates the corresponding quasi-likelihood estimates. Those can be
used with the nbinomLRT()
test to get more precise p-value estimates.
The lower-level functions called by estimateDispersions
are:
estimateDispersionsGeneEst
,
estimateDispersionsFit
, and
estimateDispersionsMAP
.
The DESeqDataSet passed as parameters, with the dispersion information
filled in as metadata columns, accessible via mcols
, or the final dispersions
accessible via dispersions
.
Simon Anders, Wolfgang Huber: Differential expression analysis for sequence count data. Genome Biology 11 (2010) R106, http://dx.doi.org/10.1186/gb-2010-11-10-r106
McCarthy, DJ, Chen, Y, Smyth, GK: Differential expression analysis of multifactor RNA-Seq experiments with respect to biological variation. Nucleic Acids Research 40 (2012), 4288-4297, http://dx.doi.org/10.1093/nar/gks042
Wu, H., Wang, C. & Wu, Z. A new shrinkage estimator for dispersion improves differential expression detection in RNA-seq data. Biostatistics (2012). http://dx.doi.org/10.1093/biostatistics/kxs033
Ahlmann-Eltze, C., Huber, W. glmGamPoi: Fitting Gamma-Poisson Generalized Linear Models on Single Cell Count Data. Bioinformatics (2020). https://doi.org/10.1093/bioinformatics/btaa1009
dds <- makeExampleDESeqDataSet() dds <- estimateSizeFactors(dds) dds <- estimateDispersions(dds) head(dispersions(dds))
dds <- makeExampleDESeqDataSet() dds <- estimateSizeFactors(dds) dds <- estimateDispersions(dds) head(dispersions(dds))
Normal users should instead use estimateDispersions
.
These low-level functions are called by estimateDispersions
,
but are exported and documented for non-standard usage.
For instance, it is possible to replace fitted values with a custom fit and continue
with the maximum a posteriori dispersion estimation, as demonstrated in the
examples below.
estimateDispersionsGeneEst( object, minDisp = 1e-08, kappa_0 = 1, dispTol = 1e-06, maxit = 100, useCR = TRUE, weightThreshold = 0.01, quiet = FALSE, modelMatrix = NULL, niter = 1, linearMu = NULL, minmu = if (type == "glmGamPoi") 1e-06 else 0.5, alphaInit = NULL, type = c("DESeq2", "glmGamPoi") ) estimateDispersionsFit( object, fitType = c("parametric", "local", "mean", "glmGamPoi"), minDisp = 1e-08, quiet = FALSE ) estimateDispersionsMAP( object, outlierSD = 2, dispPriorVar, minDisp = 1e-08, kappa_0 = 1, dispTol = 1e-06, maxit = 100, useCR = TRUE, weightThreshold = 0.01, modelMatrix = NULL, type = c("DESeq2", "glmGamPoi"), quiet = FALSE ) estimateDispersionsPriorVar(object, minDisp = 1e-08, modelMatrix = NULL)
estimateDispersionsGeneEst( object, minDisp = 1e-08, kappa_0 = 1, dispTol = 1e-06, maxit = 100, useCR = TRUE, weightThreshold = 0.01, quiet = FALSE, modelMatrix = NULL, niter = 1, linearMu = NULL, minmu = if (type == "glmGamPoi") 1e-06 else 0.5, alphaInit = NULL, type = c("DESeq2", "glmGamPoi") ) estimateDispersionsFit( object, fitType = c("parametric", "local", "mean", "glmGamPoi"), minDisp = 1e-08, quiet = FALSE ) estimateDispersionsMAP( object, outlierSD = 2, dispPriorVar, minDisp = 1e-08, kappa_0 = 1, dispTol = 1e-06, maxit = 100, useCR = TRUE, weightThreshold = 0.01, modelMatrix = NULL, type = c("DESeq2", "glmGamPoi"), quiet = FALSE ) estimateDispersionsPriorVar(object, minDisp = 1e-08, modelMatrix = NULL)
object |
a DESeqDataSet |
minDisp |
small value for the minimum dispersion, to allow for calculations in log scale, one order of magnitude above this value is used as a test for inclusion in mean-dispersion fitting |
kappa_0 |
control parameter used in setting the initial proposal in backtracking search, higher kappa_0 results in larger steps |
dispTol |
control parameter to test for convergence of log dispersion, stop when increase in log posterior is less than dispTol |
maxit |
control parameter: maximum number of iterations to allow for convergence |
useCR |
whether to use Cox-Reid correction |
weightThreshold |
threshold for subsetting the design matrix and GLM weights for calculating the Cox-Reid correction |
quiet |
whether to print messages at each step |
modelMatrix |
for advanced use only, a substitute model matrix for gene-wise and MAP dispersion estimation |
niter |
number of times to iterate between estimation of means and estimation of dispersion |
linearMu |
estimate the expected counts matrix using a linear model, default is NULL, in which case a lienar model is used if the number of groups defined by the model matrix is equal to the number of columns of the model matrix |
minmu |
lower bound on the estimated count for fitting gene-wise dispersion |
alphaInit |
initial guess for the dispersion estimate, alpha |
type |
can either be "DESeq2" or "glmGamPoi". Specifies if the glmGamPoi package is used to calculate the dispersion. This can be significantly faster if there are many replicates with small counts. |
fitType |
either "parametric", "local", "mean", or "glmGamPoi"
for the type of fitting of dispersions to the mean intensity.
See |
outlierSD |
the number of standard deviations of log gene-wise estimates above the prior mean (fitted value), above which dispersion estimates will be labelled outliers. Outliers will keep their original value and not be shrunk using the prior. |
dispPriorVar |
the variance of the normal prior on the log dispersions. If not supplied, this is calculated as the difference between the mean squared residuals of gene-wise estimates to the fitted dispersion and the expected sampling variance of the log dispersion |
a DESeqDataSet with gene-wise, fitted, or final MAP dispersion estimates in the metadata columns of the object.
estimateDispersionsPriorVar
is called inside of estimateDispersionsMAP
and stores the dispersion prior variance as an attribute of
dispersionFunction(dds)
, which can be manually provided to
estimateDispersionsMAP
for parallel execution.
dds <- makeExampleDESeqDataSet() dds <- estimateSizeFactors(dds) dds <- estimateDispersionsGeneEst(dds) dds <- estimateDispersionsFit(dds) dds <- estimateDispersionsMAP(dds) plotDispEsts(dds) # after having run estimateDispersionsFit() # the dispersion prior variance over all genes # can be obtained like so: dispPriorVar <- estimateDispersionsPriorVar(dds)
dds <- makeExampleDESeqDataSet() dds <- estimateSizeFactors(dds) dds <- estimateDispersionsGeneEst(dds) dds <- estimateDispersionsFit(dds) dds <- estimateDispersionsMAP(dds) plotDispEsts(dds) # after having run estimateDispersionsFit() # the dispersion prior variance over all genes # can be obtained like so: dispPriorVar <- estimateDispersionsPriorVar(dds)
DESeqDataSet
This function estimates the size factors using the
"median ratio method" described by Equation 5 in Anders and Huber (2010).
The estimated size factors can be accessed using the accessor function sizeFactors
.
Alternative library size estimators can also be supplied
using the assignment function sizeFactors<-
.
## S4 method for signature 'DESeqDataSet' estimateSizeFactors( object, type = c("ratio", "poscounts", "iterate"), locfunc = stats::median, geoMeans, controlGenes, normMatrix, quiet = FALSE )
## S4 method for signature 'DESeqDataSet' estimateSizeFactors( object, type = c("ratio", "poscounts", "iterate"), locfunc = stats::median, geoMeans, controlGenes, normMatrix, quiet = FALSE )
object |
a DESeqDataSet |
type |
Method for estimation: either "ratio", "poscounts", or "iterate". "ratio" uses the standard median ratio method introduced in DESeq. The size factor is the median ratio of the sample over a "pseudosample": for each gene, the geometric mean of all samples. "poscounts" and "iterate" offer alternative estimators, which can be used even when all genes contain a sample with a zero (a problem for the default method, as the geometric mean becomes zero, and the ratio undefined). The "poscounts" estimator deals with a gene with some zeros, by calculating a modified geometric mean by taking the n-th root of the product of the non-zero counts. This evolved out of use cases with Paul McMurdie's phyloseq package for metagenomic samples. The "iterate" estimator iterates between estimating the dispersion with a design of ~1, and finding a size factor vector by numerically optimizing the likelihood of the ~1 model. |
locfunc |
a function to compute a location for a sample. By default, the
median is used. However, especially for low counts, the
|
geoMeans |
by default this is not provided and the
geometric means of the counts are calculated within the function.
A vector of geometric means from another count matrix can be provided
for a "frozen" size factor calculation. The size factors will be
scaled to have a geometric mean of 1 when supplying |
controlGenes |
optional, numeric or logical index vector specifying those genes to use for size factor estimation (e.g. housekeeping or spike-in genes) |
normMatrix |
optional, a matrix of normalization factors which do not yet
control for library size. Note that this argument should not be used (and
will be ignored) if the |
quiet |
whether to print messages |
Typically, the function is called with the idiom:
dds <- estimateSizeFactors(dds)
See DESeq
for a description of the use of size factors in the GLM.
One should call this function after DESeqDataSet
unless size factors are manually specified with sizeFactors
.
Alternatively, gene-specific normalization factors for each sample can be provided using
normalizationFactors
which will always preempt sizeFactors
in calculations.
Internally, the function calls estimateSizeFactorsForMatrix
,
which provides more details on the calculation.
The DESeqDataSet passed as parameters, with the size factors filled in.
Simon Anders
Reference for the median ratio method:
Simon Anders, Wolfgang Huber: Differential expression analysis for sequence count data. Genome Biology 2010, 11:106. http://dx.doi.org/10.1186/gb-2010-11-10-r106
dds <- makeExampleDESeqDataSet(n=1000, m=4) dds <- estimateSizeFactors(dds) sizeFactors(dds) dds <- estimateSizeFactors(dds, controlGenes=1:200) m <- matrix(runif(1000 * 4, .5, 1.5), ncol=4) dds <- estimateSizeFactors(dds, normMatrix=m) normalizationFactors(dds)[1:3,] geoMeans <- exp(rowMeans(log(counts(dds)))) dds <- estimateSizeFactors(dds,geoMeans=geoMeans) sizeFactors(dds)
dds <- makeExampleDESeqDataSet(n=1000, m=4) dds <- estimateSizeFactors(dds) sizeFactors(dds) dds <- estimateSizeFactors(dds, controlGenes=1:200) m <- matrix(runif(1000 * 4, .5, 1.5), ncol=4) dds <- estimateSizeFactors(dds, normMatrix=m) normalizationFactors(dds)[1:3,] geoMeans <- exp(rowMeans(log(counts(dds)))) dds <- estimateSizeFactors(dds,geoMeans=geoMeans) sizeFactors(dds)
Given a matrix or data frame of count data, this function estimates the size
factors as follows: Each column is divided by the geometric means of the
rows. The median (or, if requested, another location estimator) of these
ratios (skipping the genes with a geometric mean of zero) is used as the size
factor for this column. Typically, one will not call this function directly, but use
estimateSizeFactors
.
estimateSizeFactorsForMatrix( counts, locfunc = stats::median, geoMeans, controlGenes, type = c("ratio", "poscounts") )
estimateSizeFactorsForMatrix( counts, locfunc = stats::median, geoMeans, controlGenes, type = c("ratio", "poscounts") )
counts |
a matrix or data frame of counts, i.e., non-negative integer values |
locfunc |
a function to compute a location for a sample. By default, the
median is used. However, especially for low counts, the
|
geoMeans |
by default this is not provided, and the geometric means of the counts are calculated within the function. A vector of geometric means from another count matrix can be provided for a "frozen" size factor calculation |
controlGenes |
optional, numeric or logical index vector specifying those genes to use for size factor estimation (e.g. housekeeping or spike-in genes) |
type |
standard median ratio ( |
a vector with the estimates size factors, one element per column
Simon Anders
dds <- makeExampleDESeqDataSet() estimateSizeFactorsForMatrix(counts(dds)) geoMeans <- exp(rowMeans(log(counts(dds)))) estimateSizeFactorsForMatrix(counts(dds),geoMeans=geoMeans)
dds <- makeExampleDESeqDataSet() estimateSizeFactorsForMatrix(counts(dds)) geoMeans <- exp(rowMeans(log(counts(dds)))) estimateSizeFactorsForMatrix(counts(dds),geoMeans=geoMeans)
The following function returns fragment counts normalized
per kilobase of feature length per million mapped fragments
(by default using a robust estimate of the library size,
as in estimateSizeFactors
).
fpkm(object, robust = TRUE)
fpkm(object, robust = TRUE)
object |
a |
robust |
whether to use size factors to normalize
rather than taking the column sums of the raw counts,
using the |
The length of the features (e.g. genes) is calculated one of two ways:
(1) If there is a matrix named "avgTxLength" in assays(dds)
,
this will take precedence in the length normalization.
This occurs when using the tximport-DESeq2 pipeline.
(2) Otherwise, feature length is calculated
from the rowRanges
of the dds object,
if a column basepairs
is not present in mcols(dds)
.
The calculated length is the number of basepairs in the union of all GRanges
assigned to a given row of object
, e.g.,
the union of all basepairs of exons of a given gene.
Note that the second approach over-estimates the gene length
(average transcript length, weighted by abundance is a more appropriate
normalization for gene counts), and so the FPKM will be an underestimate of the true value.
Note that, when the read/fragment counting has inter-feature dependencies, a strict normalization would not incorporate the basepairs of a feature which overlap another feature. This inter-feature dependence is not taken into consideration in the internal union basepair calculation.
a matrix which is normalized per kilobase of the
union of basepairs in the GRangesList
or GRanges
of the mcols(object), and per million of mapped fragments,
either using the robust median ratio method (robust=TRUE, default)
or using raw counts (robust=FALSE).
Defining a column mcols(object)$basepairs
takes
precedence over internal calculation of the kilobases for each row.
# create a matrix with 1 million counts for the # 2nd and 3rd column, the 1st and 4th have # half and double the counts, respectively. m <- matrix(1e6 * rep(c(.125, .25, .25, .5), each=4), ncol=4, dimnames=list(1:4,1:4)) mode(m) <- "integer" se <- SummarizedExperiment(list(counts=m), colData=DataFrame(sample=1:4)) dds <- DESeqDataSet(se, ~ 1) # create 4 GRanges with lengths: 1, 1, 2, 2.5 Kb gr1 <- GRanges("chr1",IRanges(1,1000)) # 1kb gr2 <- GRanges("chr1",IRanges(c(1,1001),c( 500,1500))) # 1kb gr3 <- GRanges("chr1",IRanges(c(1,1001),c(1000,2000))) # 2kb gr4 <- GRanges("chr1",IRanges(c(1,1001),c(200,1300))) # 500bp rowRanges(dds) <- GRangesList(gr1,gr2,gr3,gr4) # the raw counts counts(dds) # the FPM values fpm(dds) # the FPKM values fpkm(dds)
# create a matrix with 1 million counts for the # 2nd and 3rd column, the 1st and 4th have # half and double the counts, respectively. m <- matrix(1e6 * rep(c(.125, .25, .25, .5), each=4), ncol=4, dimnames=list(1:4,1:4)) mode(m) <- "integer" se <- SummarizedExperiment(list(counts=m), colData=DataFrame(sample=1:4)) dds <- DESeqDataSet(se, ~ 1) # create 4 GRanges with lengths: 1, 1, 2, 2.5 Kb gr1 <- GRanges("chr1",IRanges(1,1000)) # 1kb gr2 <- GRanges("chr1",IRanges(c(1,1001),c( 500,1500))) # 1kb gr3 <- GRanges("chr1",IRanges(c(1,1001),c(1000,2000))) # 2kb gr4 <- GRanges("chr1",IRanges(c(1,1001),c(200,1300))) # 500bp rowRanges(dds) <- GRangesList(gr1,gr2,gr3,gr4) # the raw counts counts(dds) # the FPM values fpm(dds) # the FPKM values fpkm(dds)
Calculates either a robust version (default) or the traditional matrix of fragments/counts per million mapped fragments (FPM/CPM). Note: this function is written very simply and can be easily altered to produce other behavior by examining the source code.
fpm(object, robust = TRUE)
fpm(object, robust = TRUE)
object |
a |
robust |
whether to use size factors to normalize rather than taking the column sums of the raw counts. If TRUE, the size factors and the geometric mean of column sums are multiplied to create a robust library size estimate. Robust normalization is not used if average transcript lengths are present. |
a matrix which is normalized per million of mapped fragments, either using the robust median ratio method (robust=TRUE, default) or using raw counts (robust=FALSE).
# generate a dataset with size factors: .5, 1, 1, 2 dds <- makeExampleDESeqDataSet(m = 4, n = 1000, interceptMean=log2(1e3), interceptSD=0, sizeFactors=c(.5,1,1,2), dispMeanRel=function(x) .01) # add a few rows with very high count counts(dds)[4:10,] <- 2e5L # in this robust version, the counts are comparable across samples round(head(fpm(dds), 3)) # in this column sum version, the counts are still skewed: # sample1 < sample2 & 3 < sample 4 round(head(fpm(dds, robust=FALSE), 3)) # the column sums of the robust version # are not equal to 1e6, but the # column sums of the non-robust version # are equal to 1e6 by definition colSums(fpm(dds))/1e6 colSums(fpm(dds, robust=FALSE))/1e6
# generate a dataset with size factors: .5, 1, 1, 2 dds <- makeExampleDESeqDataSet(m = 4, n = 1000, interceptMean=log2(1e3), interceptSD=0, sizeFactors=c(.5,1,1,2), dispMeanRel=function(x) .01) # add a few rows with very high count counts(dds)[4:10,] <- 2e5L # in this robust version, the counts are comparable across samples round(head(fpm(dds), 3)) # in this column sum version, the counts are still skewed: # sample1 < sample2 & 3 < sample 4 round(head(fpm(dds, robust=FALSE), 3)) # the column sums of the robust version # are not equal to 1e6, but the # column sums of the non-robust version # are equal to 1e6 by definition colSums(fpm(dds))/1e6 colSums(fpm(dds, robust=FALSE))/1e6
A function that assists with integration of bulk DE results tables with pre-processed scRNA-seq datasets available on Bioconductor, for downstream visualization tasks. The user is prompted to pick a scRNA-seq dataset from a menu. The output of the function is a list with the original results table, bulk gene counts, and the SingleCellExperiment object selected by the user.
integrateWithSingleCell(res, dds, ...)
integrateWithSingleCell(res, dds, ...)
res |
a results table, as produced via |
dds |
a DESeqDataSet with the bulk gene expression data (should contain gene-level counts) |
... |
additional arguments passed to the dataset-accessing function |
This function assists the user in choosing a datset from a menu of options
that are selected based on the organism of the current dataset.
Currently only human and mouse bulk and single-cell RNA-seq datasets
are supported, and it is assumed that the bulk DE dataset has GENCODE
or Ensembl gene identifiers. Following the selection of the scRNA-seq
dataset, visualization can be performed with a package vizWithSCE
,
which can be installed with install_github("KwameForbes/vizWithSCE")
.
list containing: res, dds, and a SingleCellExperiment as selected by the user
Kwame Forbes
## Not run: # involves interactive menu selection... dds <- makeExampleDESeqDataSet() rownames(dds) <- paste0("ENSG",1:nrow(dds)) dds <- DESeq(dds) res <- results(dds) dat <- integrateWithSingleCell(res, dds) ## End(Not run)
## Not run: # involves interactive menu selection... dds <- makeExampleDESeqDataSet() rownames(dds) <- paste0("ENSG",1:nrow(dds)) dds <- DESeq(dds) res <- results(dds) dat <- integrateWithSingleCell(res, dds) ## End(Not run)
Adds shrunken log2 fold changes (LFC) and SE to a
results table from DESeq
run without LFC shrinkage.
For consistency with results
, the column name lfcSE
is used here although what is returned is a posterior SD.
Three shrinkage estimators for LFC are available via type
(see the vignette for more details on the estimators).
The apeglm publication demonstrates that 'apeglm' and 'ashr' outperform
the original 'normal' shrinkage estimator.
lfcShrink( dds, coef, contrast, res, type = c("apeglm", "ashr", "normal"), lfcThreshold = 0, svalue = FALSE, returnList = FALSE, format = c("DataFrame", "GRanges", "GRangesList"), saveCols = NULL, apeAdapt = TRUE, apeMethod = "nbinomCR", parallel = FALSE, BPPARAM = bpparam(), quiet = FALSE, ... )
lfcShrink( dds, coef, contrast, res, type = c("apeglm", "ashr", "normal"), lfcThreshold = 0, svalue = FALSE, returnList = FALSE, format = c("DataFrame", "GRanges", "GRangesList"), saveCols = NULL, apeAdapt = TRUE, apeMethod = "nbinomCR", parallel = FALSE, BPPARAM = bpparam(), quiet = FALSE, ... )
dds |
a DESeqDataSet object, after running |
coef |
the name or number of the coefficient (LFC) to shrink,
consult |
contrast |
see argument description in |
res |
a DESeqResults object. Results table produced by the
default pipeline, i.e. |
type |
|
lfcThreshold |
a non-negative value which specifies a log2 fold change
threshold (as in |
svalue |
logical, should p-values and adjusted p-values be replaced
with s-values when using |
returnList |
logical, should |
format |
same as defined in |
saveCols |
same as defined in |
apeAdapt |
logical, should |
apeMethod |
what |
parallel |
if FALSE, no parallelization. if TRUE, parallel
execution using |
BPPARAM |
see same argument of |
quiet |
whether to print messages |
... |
arguments passed to |
See vignette for a comparison of shrinkage estimators on an example dataset.
For all shrinkage methods, details on the prior is included in
priorInfo(res)
, including the fitted_g
mixture for ashr.
For type="apeglm":
Specifying apeglm
passes along DESeq2 MLE log2
fold changes and standard errors to the apeglm
function
in the apeglm package, and re-estimates posterior LFCs for
the coefficient specified by coef
.
For type="ashr":
Specifying ashr
passes along DESeq2 MLE log2
fold changes and standard errors to the ash
function
in the ashr package,
with arguments mixcompdist="normal"
and method="shrink"
.
For type="normal":
For design as a formula, shrinkage cannot be applied
to coefficients in a model with interaction terms.
For user-supplied model matrices, shrinkage is only
supported via coef
. coef
will make use
of standard model matrices, while contrast
will make use of expanded model matrices, and for the
latter, a results object should be provided to
res
. With numeric- or list-style contrasts,
it is possible to use lfcShrink
, but likely easier to use
DESeq
with betaPrior=TRUE
followed by results
,
because the numeric or list should reference the coefficients
from the expanded model matrix. These coefficients will be printed
to console if 'contrast' is used.
a DESeqResults object with the log2FoldChange
and lfcSE
columns replaced with shrunken LFC and SE.
For consistency with results
(and similar to the output of bayesglm
)
the column name lfcSE
is used here, although what is returned is a posterior SD.
For normal
and for apeglm
the estimate is the posterior mode,
for ashr
it is the posterior mean.
priorInfo(res)
contains information about the shrinkage procedure,
relevant to the various methods specified by type
.
Publications for the following shrinkage estimators:
type="apeglm"
:
Zhu, A., Ibrahim, J.G., Love, M.I. (2018) Heavy-tailed prior distributions for sequence count data: removing the noise and preserving large differences. Bioinformatics. https://doi.org/10.1093/bioinformatics/bty895
type="ashr"
:
Stephens, M. (2016) False discovery rates: a new deal. Biostatistics, 18:2. https://doi.org/10.1093/biostatistics/kxw041
type="normal"
:
Love, M.I., Huber, W., Anders, S. (2014) Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology, 15:550. https://doi.org/10.1186/s13059-014-0550-8
Related work, the bayesglm
function in the arm
package:
Gelman, A., Jakulin, A., Pittau, M.G. and Su, Y.-S. (2009). A Weakly Informative Default Prior Distribution For Logistic And Other Regression Models. The Annals of Applied Statistics 2:4. http://www.stat.columbia.edu/~gelman/research/published/ priors11.pdf
set.seed(1) dds <- makeExampleDESeqDataSet(n=500,betaSD=1) dds <- DESeq(dds) res <- results(dds) # these are the coefficients from the model # we can specify them using 'coef' by name or number below resultsNames(dds) res.ape <- lfcShrink(dds=dds, coef=2, type="apeglm") res.ash <- lfcShrink(dds=dds, coef=2, type="ashr") res.norm <- lfcShrink(dds=dds, coef=2, type="normal")
set.seed(1) dds <- makeExampleDESeqDataSet(n=500,betaSD=1) dds <- DESeq(dds) res <- results(dds) # these are the coefficients from the model # we can specify them using 'coef' by name or number below resultsNames(dds) res.ape <- lfcShrink(dds=dds, coef=2, type="apeglm") res.ash <- lfcShrink(dds=dds, coef=2, type="ashr") res.norm <- lfcShrink(dds=dds, coef=2, type="normal")
Constructs a simulated dataset of Negative Binomial data from
two conditions. By default, there are no fold changes between
the two conditions, but this can be adjusted with the betaSD
argument.
makeExampleDESeqDataSet( n = 1000, m = 12, betaSD = 0, interceptMean = 4, interceptSD = 2, dispMeanRel = function(x) 4/x + 0.1, sizeFactors = rep(1, m) )
makeExampleDESeqDataSet( n = 1000, m = 12, betaSD = 0, interceptMean = 4, interceptSD = 2, dispMeanRel = function(x) 4/x + 0.1, sizeFactors = rep(1, m) )
n |
number of rows |
m |
number of columns |
betaSD |
the standard deviation for non-intercept betas, i.e. beta ~ N(0,betaSD) |
interceptMean |
the mean of the intercept betas (log2 scale) |
interceptSD |
the standard deviation of the intercept betas (log2 scale) |
dispMeanRel |
a function specifying the relationship of the dispersions on
|
sizeFactors |
multiplicative factors for each sample |
a DESeqDataSet
with true dispersion,
intercept and beta values in the metadata columns. Note that the true
betas are provided on the log2 scale.
dds <- makeExampleDESeqDataSet() dds
dds <- makeExampleDESeqDataSet() dds
This function tests for significance of change in deviance between a
full and reduced model which are provided as formula
.
Fitting uses previously calculated sizeFactors
(or normalizationFactors
)
and dispersion estimates.
nbinomLRT( object, full = design(object), reduced, betaTol = 1e-08, maxit = 100, useOptim = TRUE, quiet = FALSE, useQR = TRUE, minmu = if (type == "glmGamPoi") 1e-06 else 0.5, type = c("DESeq2", "glmGamPoi") )
nbinomLRT( object, full = design(object), reduced, betaTol = 1e-08, maxit = 100, useOptim = TRUE, quiet = FALSE, useQR = TRUE, minmu = if (type == "glmGamPoi") 1e-06 else 0.5, type = c("DESeq2", "glmGamPoi") )
object |
a DESeqDataSet |
full |
the full model formula, this should be the formula in
|
reduced |
a reduced formula to compare against, e.g. the full model with a term or terms of interest removed. alternatively, can be a matrix |
betaTol |
control parameter defining convergence |
maxit |
the maximum number of iterations to allow for convergence of the coefficient vector |
useOptim |
whether to use the native optim function on rows which do not converge within maxit |
quiet |
whether to print messages at each step |
useQR |
whether to use the QR decomposition on the design matrix X while fitting the GLM |
minmu |
lower bound on the estimated count while fitting the GLM |
type |
either "DESeq2" or "glmGamPoi". If |
The difference in deviance is compared to a chi-squared distribution
with df = (reduced residual degrees of freedom - full residual degrees of freedom).
This function is comparable to the nbinomGLMTest
of the previous version of DESeq
and an alternative to the default nbinomWaldTest
.
a DESeqDataSet with new results columns accessible
with the results
function. The coefficients and standard errors are
reported on a log2 scale.
dds <- makeExampleDESeqDataSet() dds <- estimateSizeFactors(dds) dds <- estimateDispersions(dds) dds <- nbinomLRT(dds, reduced = ~ 1) res <- results(dds)
dds <- makeExampleDESeqDataSet() dds <- estimateSizeFactors(dds) dds <- estimateDispersions(dds) dds <- nbinomLRT(dds, reduced = ~ 1) res <- results(dds)
This function tests for significance of coefficients in a Negative
Binomial GLM, using previously calculated sizeFactors
(or normalizationFactors
)
and dispersion estimates. See DESeq
for the GLM formula.
nbinomWaldTest( object, betaPrior = FALSE, betaPriorVar, modelMatrix = NULL, modelMatrixType, betaTol = 1e-08, maxit = 100, useOptim = TRUE, quiet = FALSE, useT = FALSE, df, useQR = TRUE, minmu = 0.5 )
nbinomWaldTest( object, betaPrior = FALSE, betaPriorVar, modelMatrix = NULL, modelMatrixType, betaTol = 1e-08, maxit = 100, useOptim = TRUE, quiet = FALSE, useT = FALSE, df, useQR = TRUE, minmu = 0.5 )
object |
a DESeqDataSet |
betaPrior |
whether or not to put a zero-mean normal prior on the non-intercept coefficients |
betaPriorVar |
a vector with length equal to the number of model terms including the intercept. betaPriorVar gives the variance of the prior on the sample betas on the log2 scale. if missing (default) this is estimated from the data |
modelMatrix |
an optional matrix, typically this is set to NULL and created within the function |
modelMatrixType |
either "standard" or "expanded", which describe
how the model matrix, X of the formula in |
betaTol |
control parameter defining convergence |
maxit |
the maximum number of iterations to allow for convergence of the coefficient vector |
useOptim |
whether to use the native optim function on rows which do not converge within maxit |
quiet |
whether to print messages at each step |
useT |
whether to use a t-distribution as a null distribution,
for significance testing of the Wald statistics.
If FALSE, a standard normal null distribution is used.
See next argument |
df |
the degrees of freedom for the t-distribution.
This can be of length 1 or the number of rows of |
useQR |
whether to use the QR decomposition on the design matrix X while fitting the GLM |
minmu |
lower bound on the estimated count while fitting the GLM |
The fitting proceeds as follows: standard maximum likelihood estimates
for GLM coefficients (synonymous with "beta", "log2 fold change", "effect size")
are calculated.
Then, optionally, a zero-centered Normal prior distribution
(betaPrior
) is assumed for the coefficients other than the intercept.
Note that this posterior log2 fold change
estimation is now not the default setting for nbinomWaldTest
,
as the standard workflow for coefficient shrinkage has moved to
an additional function link{lfcShrink}
.
For calculating Wald test p-values, the coefficients are scaled by their
standard errors and then compared to a standard Normal distribution.
The results
function without any arguments will automatically perform a contrast of the
last level of the last variable in the design formula over the first level.
The contrast
argument of the results
function can be used
to generate other comparisons.
The Wald test can be replaced with the nbinomLRT
for an alternative test of significance.
Notes on the log2 fold change prior:
The variance of the prior distribution for each non-intercept coefficient is calculated using the observed distribution of the maximum likelihood coefficients. The final coefficients are then maximum a posteriori estimates using this prior (Tikhonov/ridge regularization). See below for details on the prior variance and the Methods section of the DESeq2 manuscript for more detail. The use of a prior has little effect on genes with high counts and helps to moderate the large spread in coefficients for genes with low counts.
The prior variance is calculated by matching the 0.05 upper quantile
of the observed MLE coefficients to a zero-centered Normal distribution.
In a change of methods since the 2014 paper,
the weighted upper quantile is calculated using the
wtd.quantile
function from the Hmisc package
(function has been copied into DESeq2 to avoid extra dependencies).
The weights are the inverse of the expected variance of log counts, so the inverse of
using the mean of
normalized counts and the trended dispersion fit. The weighting ensures
that noisy estimates of log fold changes from small count genes do not
overly influence the calculation of the prior variance.
See
estimateBetaPriorVar
.
The final prior variance for a factor level is the average of the
estimated prior variance over all contrasts of all levels of the factor.
When a log2 fold change prior is used (betaPrior=TRUE),
then nbinomWaldTest
will by default use expanded model matrices,
as described in the modelMatrixType
argument, unless this argument
is used to override the default behavior.
This ensures that log2 fold changes will be independent of the choice
of reference level. In this case, the beta prior variance for each factor
is calculated as the average of the mean squared maximum likelihood
estimates for each level and every possible contrast.
a DESeqDataSet with results columns accessible
with the results
function. The coefficients and standard errors are
reported on a log2 scale.
dds <- makeExampleDESeqDataSet() dds <- estimateSizeFactors(dds) dds <- estimateDispersions(dds) dds <- nbinomWaldTest(dds) res <- results(dds)
dds <- makeExampleDESeqDataSet() dds <- estimateSizeFactors(dds) dds <- estimateDispersions(dds) dds <- nbinomWaldTest(dds) res <- results(dds)
Gene-specific normalization factors for each sample can be provided as a matrix,
which will preempt sizeFactors
. In some experiments, counts for each
sample have varying dependence on covariates, e.g. on GC-content for sequencing
data run on different days, and in this case it makes sense to provide
gene-specific factors for each sample rather than a single size factor.
normalizationFactors(object, ...) normalizationFactors(object, ...) <- value ## S4 method for signature 'DESeqDataSet' normalizationFactors(object) ## S4 replacement method for signature 'DESeqDataSet,matrix' normalizationFactors(object) <- value
normalizationFactors(object, ...) normalizationFactors(object, ...) <- value ## S4 method for signature 'DESeqDataSet' normalizationFactors(object) ## S4 replacement method for signature 'DESeqDataSet,matrix' normalizationFactors(object) <- value
object |
a |
... |
additional arguments |
value |
the matrix of normalization factors |
Normalization factors alter the model of DESeq
in the following way, for
counts and normalization factors
for gene i and sample j:
Normalization factors are on the scale of the counts (similar to sizeFactors
)
and unlike offsets, which are typically on the scale of the predictors (in this case, log counts).
Normalization factors should include library size normalization. They should have
row-wise geometric mean near 1, as is the case with size factors, such that the mean of normalized
counts is close to the mean of unnormalized counts. See example code below.
dds <- makeExampleDESeqDataSet(n=100, m=4) normFactors <- matrix(runif(nrow(dds)*ncol(dds),0.5,1.5), ncol=ncol(dds),nrow=nrow(dds), dimnames=list(1:nrow(dds),1:ncol(dds))) # the normalization factors matrix should not have 0's in it # it should have geometric mean near 1 for each row normFactors <- normFactors / exp(rowMeans(log(normFactors))) normalizationFactors(dds) <- normFactors dds <- DESeq(dds)
dds <- makeExampleDESeqDataSet(n=100, m=4) normFactors <- matrix(runif(nrow(dds)*ncol(dds),0.5,1.5), ncol=ncol(dds),nrow=nrow(dds), dimnames=list(1:nrow(dds),1:ncol(dds))) # the normalization factors matrix should not have 0's in it # it should have geometric mean near 1 for each row normFactors <- normFactors / exp(rowMeans(log(normFactors))) normalizationFactors(dds) <- normFactors dds <- DESeq(dds)
Normalize for gene length using the output of transcript abundance estimators
normalizeGeneLength(...)
normalizeGeneLength(...)
... |
... |
This function is deprecated and moved to a new general purpose package, tximport, which will be added to Bioconductor.
A simple function for creating a DESeqTransform
object after applying: f(count(dds,normalized=TRUE) + pc)
.
normTransform(object, f = log2, pc = 1)
normTransform(object, f = log2, pc = 1)
object |
a DESeqDataSet object |
f |
a function to apply to normalized counts |
pc |
a pseudocount to add to normalized counts |
varianceStabilizingTransformation
, rlog
Normalized counts plus a pseudocount of 0.5 are shown by default.
plotCounts( dds, gene, intgroup = "condition", normalized = TRUE, transform = TRUE, main, xlab = "group", returnData = FALSE, replaced = FALSE, pc, ... )
plotCounts( dds, gene, intgroup = "condition", normalized = TRUE, transform = TRUE, main, xlab = "group", returnData = FALSE, replaced = FALSE, pc, ... )
dds |
a |
gene |
a character, specifying the name of the gene to plot |
intgroup |
interesting groups: a character vector of names in |
normalized |
whether the counts should be normalized by size factor (default is TRUE) |
transform |
whether to have log scale y-axis or not. defaults to TRUE |
main |
as in 'plot' |
xlab |
as in 'plot' |
returnData |
should the function only return the data.frame of counts and covariates for custom plotting (default is FALSE) |
replaced |
use the outlier-replaced counts if they exist |
pc |
pseudocount for log transform |
... |
arguments passed to plot |
dds <- makeExampleDESeqDataSet() plotCounts(dds, "gene1")
dds <- makeExampleDESeqDataSet() plotCounts(dds, "gene1")
A simple helper function that plots the per-gene dispersion estimates together with the fitted mean-dispersion relationship.
## S4 method for signature 'DESeqDataSet' plotDispEsts( object, ymin, CV = FALSE, genecol = "black", fitcol = "red", finalcol = "dodgerblue", legend = TRUE, xlab, ylab, log = "xy", cex = 0.45, ... )
## S4 method for signature 'DESeqDataSet' plotDispEsts( object, ymin, CV = FALSE, genecol = "black", fitcol = "red", finalcol = "dodgerblue", legend = TRUE, xlab, ylab, log = "xy", cex = 0.45, ... )
object |
a DESeqDataSet, with dispersions estimated |
ymin |
the lower bound for points on the plot, points beyond this are drawn as triangles at ymin |
CV |
logical, whether to plot the asymptotic or biological
coefficient of variation (the square root of dispersion) on the y-axis.
As the mean grows to infinity, the square root of dispersion gives
the coefficient of variation for the counts. Default is |
genecol |
the color for gene-wise dispersion estimates |
fitcol |
the color of the fitted estimates |
finalcol |
the color of the final estimates used for testing |
legend |
logical, whether to draw a legend |
xlab |
xlab |
ylab |
ylab |
log |
log |
cex |
cex |
... |
further arguments to |
Simon Anders
dds <- makeExampleDESeqDataSet() dds <- estimateSizeFactors(dds) dds <- estimateDispersions(dds) plotDispEsts(dds)
dds <- makeExampleDESeqDataSet() dds <- estimateSizeFactors(dds) dds <- estimateDispersions(dds) plotDispEsts(dds)
A simple helper function that makes a so-called "MA-plot", i.e. a scatter plot of log2 fold changes (on the y-axis) versus the mean of normalized counts (on the x-axis).
## S4 method for signature 'DESeqDataSet' plotMA( object, alpha = 0.1, main = "", xlab = "mean of normalized counts", ylim, colNonSig = "gray60", colSig = "blue", colLine = "grey40", returnData = FALSE, MLE = FALSE, ... ) ## S4 method for signature 'DESeqResults' plotMA( object, alpha, main = "", xlab = "mean of normalized counts", ylim, colNonSig = "gray60", colSig = "blue", colLine = "grey40", returnData = FALSE, MLE = FALSE, ... )
## S4 method for signature 'DESeqDataSet' plotMA( object, alpha = 0.1, main = "", xlab = "mean of normalized counts", ylim, colNonSig = "gray60", colSig = "blue", colLine = "grey40", returnData = FALSE, MLE = FALSE, ... ) ## S4 method for signature 'DESeqResults' plotMA( object, alpha, main = "", xlab = "mean of normalized counts", ylim, colNonSig = "gray60", colSig = "blue", colLine = "grey40", returnData = FALSE, MLE = FALSE, ... )
object |
a |
alpha |
the significance level for thresholding adjusted p-values |
main |
optional title for the plot |
xlab |
optional defaults to "mean of normalized counts" |
ylim |
optional y limits |
colNonSig |
color to use for non-significant data points |
colSig |
color to use for significant data points |
colLine |
color to use for the horizontal (y=0) line |
returnData |
logical, whether to return the data.frame used for plotting |
MLE |
if |
... |
further arguments passed to |
This function is essentially two lines of code: building a
data.frame
and passing this to the plotMA
method
for data.frame
, now copied from the geneplotter package.
The code was modified in version 1.28 to change from red to blue points
for better visibility for users with color-blindness. The original plots
can still be made via the use of returnData=TRUE
and passing the
resulting data.frame directly to geneplotter::plotMA
.
The code of this function can be seen with:
getMethod("plotMA","DESeqDataSet")
If the object
contains a column svalue
then these
will be used for coloring the points (with a default alpha=0.005
).
Michael Love
dds <- makeExampleDESeqDataSet() dds <- DESeq(dds) plotMA(dds) res <- results(dds) plotMA(res)
dds <- makeExampleDESeqDataSet() dds <- DESeq(dds) plotMA(dds) res <- results(dds) plotMA(res)
This plot helps to check for batch effects and the like.
## S4 method for signature 'DESeqTransform' plotPCA( object, intgroup = "condition", ntop = 500, returnData = FALSE, pcsToUse = 1:2 )
## S4 method for signature 'DESeqTransform' plotPCA( object, intgroup = "condition", ntop = 500, returnData = FALSE, pcsToUse = 1:2 )
object |
a |
intgroup |
interesting groups: a character vector of
names in |
ntop |
number of top genes to use for principal components, selected by highest row variance |
returnData |
should the function only return the data.frame of PC1 and PC2 with intgroup covariates for custom plotting (default is FALSE) |
pcsToUse |
numeric of length 2, which PCs to plot |
An object created by ggplot
, which can be assigned and further customized.
See the vignette for an example of variance stabilization and PCA plots.
Note that the source code of plotPCA
is very simple.
The source can be found by typing DESeq2:::plotPCA.DESeqTransform
or getMethod("plotPCA","DESeqTransform")
, or
browsed on github at https://github.com/mikelove/DESeq2/blob/master/R/plots.R
Users should find it easy to customize this function.
Wolfgang Huber
# using rlog transformed data: dds <- makeExampleDESeqDataSet(betaSD=1) vsd <- vst(dds, nsub=500) plotPCA(vsd) # also possible to perform custom transformation: dds <- estimateSizeFactors(dds) # shifted log of normalized counts se <- SummarizedExperiment(log2(counts(dds, normalized=TRUE) + 1), colData=colData(dds)) # the call to DESeqTransform() is needed to # trigger our plotPCA method. plotPCA( DESeqTransform( se ) )
# using rlog transformed data: dds <- makeExampleDESeqDataSet(betaSD=1) vsd <- vst(dds, nsub=500) plotPCA(vsd) # also possible to perform custom transformation: dds <- estimateSizeFactors(dds) # shifted log of normalized counts se <- SummarizedExperiment(log2(counts(dds, normalized=TRUE) + 1), colData=colData(dds)) # the call to DESeqTransform() is needed to # trigger our plotPCA method. plotPCA( DESeqTransform( se ) )
A simple plot of the concentration of counts in a single sample over the sum of counts per gene. Not technically the same as "sparsity", but this plot is useful diagnostic for datasets which might not fit a negative binomial assumption: genes with many zeros and individual very large counts are difficult to model with the negative binomial distribution.
plotSparsity(x, normalized = TRUE, ...)
plotSparsity(x, normalized = TRUE, ...)
x |
a matrix or DESeqDataSet |
normalized |
whether to normalize the counts from a DESeqDataSEt |
... |
passed to |
dds <- makeExampleDESeqDataSet(n=1000,m=4,dispMeanRel=function(x) .5) dds <- estimateSizeFactors(dds) plotSparsity(dds)
dds <- makeExampleDESeqDataSet(n=1000,m=4,dispMeanRel=function(x) .5) dds <- estimateSizeFactors(dds) plotSparsity(dds)
The priorInfo slot contains details about the prior on log fold changes
priorInfo(object, ...) priorInfo(object, ...) <- value ## S4 method for signature 'DESeqResults' priorInfo(object) ## S4 replacement method for signature 'DESeqResults,list' priorInfo(object) <- value
priorInfo(object, ...) priorInfo(object, ...) <- value ## S4 method for signature 'DESeqResults' priorInfo(object) ## S4 replacement method for signature 'DESeqResults,list' priorInfo(object) <- value
object |
a |
... |
additional arguments |
value |
a |
Note that this function is called within DESeq
, so is not
necessary to call on top of a DESeq
call. See the minReplicatesForReplace
argument documented in link{DESeq}
.
replaceOutliers( object, trim = 0.2, cooksCutoff, minReplicates = 7, whichSamples ) replaceOutliersWithTrimmedMean( object, trim = 0.2, cooksCutoff, minReplicates = 7, whichSamples )
replaceOutliers( object, trim = 0.2, cooksCutoff, minReplicates = 7, whichSamples ) replaceOutliersWithTrimmedMean( object, trim = 0.2, cooksCutoff, minReplicates = 7, whichSamples )
object |
a DESeqDataSet object, which has already been processed by
either DESeq, nbinomWaldTest or nbinomLRT, and therefore contains a matrix
contained in |
trim |
the fraction (0 to 0.5) of observations to be trimmed from each end of the normalized counts for a gene before the mean is computed |
cooksCutoff |
the threshold for defining an outlier to be replaced. Defaults to the .99 quantile of the F(p, m - p) distribution, where p is the number of parameters and m is the number of samples. |
minReplicates |
the minimum number of replicate samples necessary to consider a sample eligible for replacement (including itself). Outlier counts will not be replaced if the sample is in a cell which has less than minReplicates replicates. |
whichSamples |
optional, a numeric or logical index to specify which samples should have outliers replaced. if missing, this is determined using minReplicates. |
This function replaces outlier counts flagged by extreme Cook's distances,
as calculated by DESeq
, nbinomWaldTest
or nbinomLRT
, with values predicted by the trimmed mean
over all samples (and adjusted by size factor or normalization factor).
This function replaces the counts in the matrix returned by counts(dds)
and the Cook's distances in assays(dds)[["cooks"]]
. Original counts are
preserved in assays(dds)[["originalCounts"]]
.
The DESeq
function calculates a diagnostic measure called
Cook's distance for every gene and every sample. The results
function then sets the p-values to NA
for genes which contain
an outlying count as defined by a Cook's distance above a threshold.
With many degrees of freedom, i.e. many more samples than number of parameters to
be estimated– it might be undesirable to remove entire genes from the analysis
just because their data include a single count outlier.
An alternate strategy is to replace the outlier counts
with the trimmed mean over all samples, adjusted by the size factor or normalization
factor for that sample. The following simple function performs this replacement
for the user, for samples which have at least minReplicates
number
of replicates (including that sample).
For more information on Cook's distance, please see the two
sections of the vignette: 'Dealing with count outliers' and 'Count outlier detection'.
a DESeqDataSet with replaced counts in the slot returned by
counts
and the original counts preserved in
assays(dds)[["originalCounts"]]
results
extracts a result table from a DESeq analysis giving base means across samples,
log2 fold changes, standard errors, test statistics, p-values and adjusted p-values;
resultsNames
returns the names of the estimated effects (coefficents) of the model;
removeResults
returns a DESeqDataSet
object with results columns removed.
results( object, contrast, name, lfcThreshold = 0, altHypothesis = c("greaterAbs", "lessAbs", "greater", "less", "greaterAbs2014"), listValues = c(1, -1), cooksCutoff, independentFiltering = TRUE, alpha = 0.1, filter, theta, pAdjustMethod = "BH", filterFun, format = c("DataFrame", "GRanges", "GRangesList"), saveCols = NULL, test = c("Wald", "LRT"), addMLE = FALSE, tidy = FALSE, parallel = FALSE, BPPARAM = bpparam(), minmu = 0.5 ) resultsNames(object) removeResults(object)
results( object, contrast, name, lfcThreshold = 0, altHypothesis = c("greaterAbs", "lessAbs", "greater", "less", "greaterAbs2014"), listValues = c(1, -1), cooksCutoff, independentFiltering = TRUE, alpha = 0.1, filter, theta, pAdjustMethod = "BH", filterFun, format = c("DataFrame", "GRanges", "GRangesList"), saveCols = NULL, test = c("Wald", "LRT"), addMLE = FALSE, tidy = FALSE, parallel = FALSE, BPPARAM = bpparam(), minmu = 0.5 ) resultsNames(object) removeResults(object)
object |
a DESeqDataSet, on which one
of the following functions has already been called:
|
contrast |
this argument specifies what comparison to extract from
the
If specified, the |
name |
the name of the individual effect (coefficient) for
building a results table. Use this argument rather than |
lfcThreshold |
a non-negative value which specifies a log2 fold change
threshold. The default value is 0, corresponding to a test that
the log2 fold changes are equal to zero. The user can
specify the alternative hypothesis using the |
altHypothesis |
character which specifies the alternative hypothesis,
i.e. those values of log2 fold change which the user is interested in
finding. The complement of this set of values is the null hypothesis which
will be tested. If the log2 fold change specified by
|
listValues |
only used if a list is provided to |
cooksCutoff |
theshold on Cook's distance, such that if one or more
samples for a row have a distance higher, the p-value for the row is
set to NA. The default cutoff is the .99 quantile of the F(p, m-p) distribution,
where p is the number of coefficients being fitted and m is the number of samples.
Set to |
independentFiltering |
logical, whether independent filtering should be applied automatically |
alpha |
the significance cutoff used for optimizing the independent
filtering (by default 0.1). If the adjusted p-value cutoff (FDR) will be a
value other than 0.1, |
filter |
the vector of filter statistics over which the independent filtering will be optimized. By default the mean of normalized counts is used. |
theta |
the quantiles at which to assess the number of rejections from independent filtering |
pAdjustMethod |
the method to use for adjusting p-values, see |
filterFun |
an optional custom function for performing independent filtering
and p-value adjustment, with arguments |
format |
character, either |
saveCols |
character or numeric vector, the columns of
|
test |
this is automatically detected internally if not provided.
the one exception is after |
addMLE |
if |
tidy |
whether to output the results table with rownames as a first column 'row'.
the table will also be coerced to |
parallel |
if FALSE, no parallelization. if TRUE, parallel
execution using |
BPPARAM |
an optional parameter object passed internally
to |
minmu |
lower bound on the estimated count (used when calculating contrasts) |
The results table when printed will provide the information about
the comparison, e.g. "log2 fold change (MAP): condition treated vs untreated", meaning
that the estimates are of log2(treated / untreated), as would be returned by
contrast=c("condition","treated","untreated")
.
Multiple results can be returned for analyses beyond a simple two group comparison,
so results
takes arguments contrast
and name
to help
the user pick out the comparisons of interest for printing a results table.
The use of the contrast
argument is recommended for exact specification
of the levels which should be compared and their order.
If results
is run without specifying contrast
or name
,
it will return the comparison of the last level of the last variable in the
design formula over the first level of this variable. For example, for a simple two-group
comparison, this would return the log2 fold changes of the second group over the
first group (the reference level). Please see examples below and in the vignette.
The argument contrast
can be used to generate results tables for
any comparison of interest, for example, the log2 fold change between
two levels of a factor, and its usage is described below. It can also
accomodate more complicated numeric comparisons.
Note that contrast
will set to 0 the estimated LFC in a
comparison of two groups, where all of the counts in the two groups
are equal to 0 (while other groups have positive counts), while
name
will not automatically set these LFC to 0.
The test statistic used for a contrast is:
The argument name
can be used to generate results tables for
individual effects, which must be individual elements of resultsNames(object)
.
These individual effects could represent continuous covariates, effects
for individual levels, or individual interaction effects.
Information on the comparison which was used to build the results table,
and the statistical test which was used for p-values (Wald test or likelihood ratio test)
is stored within the object returned by results
. This information is in
the metadata columns of the results table, which is accessible by calling mcols
on the DESeqResults
object returned by results
.
On p-values:
By default, independent filtering is performed to select a set of genes
for multiple test correction which maximizes the number of adjusted
p-values less than a given critical value alpha
(by default 0.1).
See the reference in this man page for details on independent filtering.
The filter used for maximizing the number of rejections is the mean
of normalized counts for all samples in the dataset.
Several arguments from the filtered_p
function of
the genefilter package (used within the results
function)
are provided here to control the independent filtering behavior.
(Note filtered_p
R code is now copied into DESeq2
package to avoid gfortran requirements.)
In DESeq2 version >= 1.10, the threshold that is chosen is
the lowest quantile of the filter for which the
number of rejections is close to the peak of a curve fit
to the number of rejections over the filter quantiles.
'Close to' is defined as within 1 residual standard deviation.
The adjusted p-values for the genes which do not pass the filter threshold
are set to NA
.
By default, results
assigns a p-value of NA
to genes containing count outliers, as identified using Cook's distance.
See the cooksCutoff
argument for control of this behavior.
Cook's distances for each sample are accessible as a matrix "cooks"
stored in the assays()
list. This measure is useful for identifying rows where the
observed counts might not fit to a Negative Binomial distribution.
For analyses using the likelihood ratio test (using nbinomLRT
),
the p-values are determined solely by the difference in deviance between
the full and reduced model formula. A single log2 fold change is printed
in the results table for consistency with other results table outputs,
however the test statistic and p-values may nevertheless involve
the testing of one or more log2 fold changes.
Which log2 fold change is printed in the results table can be controlled
using the name
argument, or by default this will be the estimated
coefficient for the last element of resultsNames(object)
.
If useT=TRUE
was specified when running DESeq
or nbinomWaldTest
,
then the p-value generated by results
will also make use of the
t distribution for the Wald statistic, using the degrees of freedom
in mcols(object)$tDegreesFreedom
.
For results
: a DESeqResults
object, which is
a simple subclass of DataFrame. This object contains the results columns:
baseMean
, log2FoldChange
, lfcSE
, stat
,
pvalue
and padj
,
and also includes metadata columns of variable information.
The lfcSE
gives the standard error of the log2FoldChange
.
For the Wald test, stat
is the Wald statistic: the log2FoldChange
divided by lfcSE
, which is compared to a standard Normal distribution
to generate a two-tailed pvalue
. For the likelihood ratio test (LRT),
stat
is the difference in deviance between the reduced model and the full model,
which is compared to a chi-squared distribution to generate a pvalue
.
For resultsNames
: the names of the columns available as results,
usually a combination of the variable name and a level
For removeResults
: the original DESeqDataSet
with results metadata columns removed
Richard Bourgon, Robert Gentleman, Wolfgang Huber: Independent filtering increases detection power for high-throughput experiments. PNAS (2010), http://dx.doi.org/10.1073/pnas.0914005107
## Example 1: two-group comparison dds <- makeExampleDESeqDataSet(m=4) dds <- DESeq(dds) res <- results(dds, contrast=c("condition","B","A")) # with more than two groups, the call would look similar, e.g.: # results(dds, contrast=c("condition","C","A")) # etc. ## Example 2: two conditions, two genotypes, with an interaction term dds <- makeExampleDESeqDataSet(n=100,m=12) dds$genotype <- factor(rep(rep(c("I","II"),each=3),2)) design(dds) <- ~ genotype + condition + genotype:condition dds <- DESeq(dds) resultsNames(dds) # the condition effect for genotype I (the main effect) results(dds, contrast=c("condition","B","A")) # the condition effect for genotype II # this is, by definition, the main effect *plus* the interaction term # (the extra condition effect in genotype II compared to genotype I). results(dds, list( c("condition_B_vs_A","genotypeII.conditionB") )) # the interaction term, answering: is the condition effect *different* across genotypes? results(dds, name="genotypeII.conditionB") ## Example 3: two conditions, three genotypes # ~~~ Using interaction terms ~~~ dds <- makeExampleDESeqDataSet(n=100,m=18) dds$genotype <- factor(rep(rep(c("I","II","III"),each=3),2)) design(dds) <- ~ genotype + condition + genotype:condition dds <- DESeq(dds) resultsNames(dds) # the condition effect for genotype I (the main effect) results(dds, contrast=c("condition","B","A")) # the condition effect for genotype III. # this is the main effect *plus* the interaction term # (the extra condition effect in genotype III compared to genotype I). results(dds, contrast=list( c("condition_B_vs_A","genotypeIII.conditionB") )) # the interaction term for condition effect in genotype III vs genotype I. # this tests if the condition effect is different in III compared to I results(dds, name="genotypeIII.conditionB") # the interaction term for condition effect in genotype III vs genotype II. # this tests if the condition effect is different in III compared to II results(dds, contrast=list("genotypeIII.conditionB", "genotypeII.conditionB")) # Note that a likelihood ratio could be used to test if there are any # differences in the condition effect between the three genotypes. # ~~~ Using a grouping variable ~~~ # This is a useful construction when users just want to compare # specific groups which are combinations of variables. dds$group <- factor(paste0(dds$genotype, dds$condition)) design(dds) <- ~ group dds <- DESeq(dds) resultsNames(dds) # the condition effect for genotypeIII results(dds, contrast=c("group", "IIIB", "IIIA"))
## Example 1: two-group comparison dds <- makeExampleDESeqDataSet(m=4) dds <- DESeq(dds) res <- results(dds, contrast=c("condition","B","A")) # with more than two groups, the call would look similar, e.g.: # results(dds, contrast=c("condition","C","A")) # etc. ## Example 2: two conditions, two genotypes, with an interaction term dds <- makeExampleDESeqDataSet(n=100,m=12) dds$genotype <- factor(rep(rep(c("I","II"),each=3),2)) design(dds) <- ~ genotype + condition + genotype:condition dds <- DESeq(dds) resultsNames(dds) # the condition effect for genotype I (the main effect) results(dds, contrast=c("condition","B","A")) # the condition effect for genotype II # this is, by definition, the main effect *plus* the interaction term # (the extra condition effect in genotype II compared to genotype I). results(dds, list( c("condition_B_vs_A","genotypeII.conditionB") )) # the interaction term, answering: is the condition effect *different* across genotypes? results(dds, name="genotypeII.conditionB") ## Example 3: two conditions, three genotypes # ~~~ Using interaction terms ~~~ dds <- makeExampleDESeqDataSet(n=100,m=18) dds$genotype <- factor(rep(rep(c("I","II","III"),each=3),2)) design(dds) <- ~ genotype + condition + genotype:condition dds <- DESeq(dds) resultsNames(dds) # the condition effect for genotype I (the main effect) results(dds, contrast=c("condition","B","A")) # the condition effect for genotype III. # this is the main effect *plus* the interaction term # (the extra condition effect in genotype III compared to genotype I). results(dds, contrast=list( c("condition_B_vs_A","genotypeIII.conditionB") )) # the interaction term for condition effect in genotype III vs genotype I. # this tests if the condition effect is different in III compared to I results(dds, name="genotypeIII.conditionB") # the interaction term for condition effect in genotype III vs genotype II. # this tests if the condition effect is different in III compared to II results(dds, contrast=list("genotypeIII.conditionB", "genotypeII.conditionB")) # Note that a likelihood ratio could be used to test if there are any # differences in the condition effect between the three genotypes. # ~~~ Using a grouping variable ~~~ # This is a useful construction when users just want to compare # specific groups which are combinations of variables. dds$group <- factor(paste0(dds$genotype, dds$condition)) design(dds) <- ~ group dds <- DESeq(dds) resultsNames(dds) # the condition effect for genotypeIII results(dds, contrast=c("group", "IIIB", "IIIA"))
This function transforms the count data to the log2 scale in a way
which minimizes differences between samples for rows with small counts,
and which normalizes with respect to library size.
The rlog transformation produces a similar variance stabilizing effect as
varianceStabilizingTransformation
,
though rlog
is more robust in the
case when the size factors vary widely.
The transformation is useful when checking for outliers
or as input for machine learning techniques
such as clustering or linear discriminant analysis.
rlog
takes as input a DESeqDataSet
and returns a
RangedSummarizedExperiment
object.
rlog(object, blind = TRUE, intercept, betaPriorVar, fitType = "parametric") rlogTransformation( object, blind = TRUE, intercept, betaPriorVar, fitType = "parametric" )
rlog(object, blind = TRUE, intercept, betaPriorVar, fitType = "parametric") rlogTransformation( object, blind = TRUE, intercept, betaPriorVar, fitType = "parametric" )
object |
a DESeqDataSet, or matrix of counts |
blind |
logical, whether to blind the transformation to the experimental design. blind=TRUE should be used for comparing samples in an manner unbiased by prior information on samples, for example to perform sample QA (quality assurance). blind=FALSE should be used for transforming data for downstream analysis, where the full use of the design information should be made. blind=FALSE will skip re-estimation of the dispersion trend, if this has already been calculated. If many of genes have large differences in counts due to the experimental design, it is important to set blind=FALSE for downstream analysis. |
intercept |
by default, this is not provided and calculated automatically.
if provided, this should be a vector as long as the number of rows of object,
which is log2 of the mean normalized counts from a previous dataset.
this will enforce the intercept for the GLM, allowing for a "frozen" rlog
transformation based on a previous dataset.
You will also need to provide |
betaPriorVar |
a single value, the variance of the prior on the sample betas, which if missing is estimated from the data |
fitType |
in case dispersions have not yet been estimated for |
Note that neither rlog transformation nor the VST are used by the
differential expression estimation in DESeq
, which always
occurs on the raw count data, through generalized linear modeling which
incorporates knowledge of the variance-mean dependence. The rlog transformation
and VST are offered as separate functionality which can be used for visualization,
clustering or other machine learning tasks. See the transformation section of the
vignette for more details, including a statement on timing. If rlog
is run on data with number of samples in [30-49] it will print a message
that it may take a few minutes, if the number of samples is 50 or larger, it
will print a message that it may take a "long time", and in both cases, it
will mention that the vst
is a much faster transformation.
The transformation does not require that one has already estimated size factors and dispersions.
The regularization is on the log fold changes of the count for each sample
over an intercept, for each gene. As nearby count values for low counts genes
are almost as likely as the observed count, the rlog shrinkage is greater for low counts.
For high counts, the rlog shrinkage has a much weaker effect.
The fitted dispersions are used rather than the MAP dispersions
(so similar to the varianceStabilizingTransformation
).
The prior variance for the shrinkag of log fold changes is calculated as follows:
a matrix is constructed of the logarithm of the counts plus a pseudocount of 0.5,
the log of the row means is then subtracted, leaving an estimate of
the log fold changes per sample over the fitted value using only an intercept.
The prior variance is then calculated by matching the upper quantiles of the observed
log fold change estimates with an upper quantile of the normal distribution.
A GLM fit is then calculated using this prior. It is also possible to supply the variance of the prior.
See the vignette for an example of the use and a comparison with varianceStabilizingTransformation
.
The transformed values, rlog(K), are equal to
,
with formula terms defined in
DESeq
.
The parameters of the rlog transformation from a previous dataset
can be frozen and reapplied to new samples. See the 'Data quality assessment'
section of the vignette for strategies to see if new samples are
sufficiently similar to previous datasets.
The frozen rlog is accomplished by saving the dispersion function,
beta prior variance and the intercept from a previous dataset,
and running rlog
with 'blind' set to FALSE
(see example below).
a DESeqTransform
if a DESeqDataSet
was provided,
or a matrix if a count matrix was provided as input.
Note that for DESeqTransform
output, the matrix of
transformed values is stored in assay(rld)
.
To avoid returning matrices with NA values, in the case of a row
of all zeros, the rlog transformation returns zeros
(essentially adding a pseudocount of 1 only to these rows).
Reference for regularized logarithm (rlog):
Michael I Love, Wolfgang Huber, Simon Anders: Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology 2014, 15:550. http://dx.doi.org/10.1186/s13059-014-0550-8
plotPCA
, varianceStabilizingTransformation
, normTransform
dds <- makeExampleDESeqDataSet(m=6,betaSD=1) rld <- rlog(dds) dists <- dist(t(assay(rld))) # plot(hclust(dists))
dds <- makeExampleDESeqDataSet(m=6,betaSD=1) rld <- rlog(dds) dists <- dist(t(assay(rld))) # plot(hclust(dists))
Prints out the information from the metadata columns of the results object regarding the log2 fold changes and p-values, then shows the DataFrame using the standard method.
## S4 method for signature 'DESeqResults' show(object)
## S4 method for signature 'DESeqResults' show(object)
object |
a DESeqResults object |
Michael Love
The sizeFactors vector assigns to each column of the count matrix a value, the
size factor, such that count values in the columns can be brought to a common
scale by dividing by the corresponding size factor (as performed by
counts(dds, normalized=TRUE)
).
See DESeq
for a description of the use of size factors.
If gene-specific normalization
is desired for each sample, use normalizationFactors
.
## S4 method for signature 'DESeqDataSet' sizeFactors(object) ## S4 replacement method for signature 'DESeqDataSet,numeric' sizeFactors(object) <- value
## S4 method for signature 'DESeqDataSet' sizeFactors(object) ## S4 replacement method for signature 'DESeqDataSet,numeric' sizeFactors(object) <- value
object |
a |
value |
a numeric vector, one size factor for each column in the count data. |
Simon Anders
Print a summary of the results from a DESeq analysis.
## S4 method for signature 'DESeqResults' summary(object, alpha, ...)
## S4 method for signature 'DESeqResults' summary(object, alpha, ...)
object |
a |
alpha |
the adjusted p-value cutoff. If not set, this
defaults to the |
... |
additional arguments |
Michael Love
dds <- makeExampleDESeqDataSet(m=4) dds <- DESeq(dds) res <- results(dds) summary(res)
dds <- makeExampleDESeqDataSet(m=4) dds <- DESeq(dds) res <- results(dds) summary(res)
Unmixes samples in x
according to pure
components,
using numerical optimization. The components in pure
are added on the scale of gene expression (either normalized counts, or TPMs).
The loss function when comparing fitted expression to the
samples in x
occurs in a variance stabilized space.
This task is sometimes referred to as "deconvolution",
and can be used, for example, to identify contributions from
various tissues.
Note: some groups have found that the mixing contributions
may be more accurate if very lowly expressed genes across x
and pure
are first removed. We have not explored this fully.
Note: if the pbapply
package is installed a progress bar
will be displayed while mixing components are fit.
unmix(x, pure, alpha, shift, power = 1, format = "matrix", quiet = FALSE)
unmix(x, pure, alpha, shift, power = 1, format = "matrix", quiet = FALSE)
x |
normalized counts or TPMs of the samples to be unmixed |
pure |
normalized counts or TPMs of the "pure" samples |
alpha |
for normalized counts, the dispersion of the data
when a negative binomial model is fit. this can be found by examining
the asymptotic value of |
shift |
for TPMs, the shift which approximately stabilizes the variance
of log shifted TPMs. Can be assessed with |
power |
either 1 (for L1) or 2 (for squared) loss function. Default is 1. |
format |
|
quiet |
suppress progress bar. default is FALSE, show progress bar if pbapply is installed. |
a matrix, the mixture components for each sample in x
(rows).
The "pure" samples make up the columns, and so each row sums to 1.
If colnames existed on the input matrices they will be propagated to the output matrix.
If format="list"
, then a list, containing as elements:
(1) the matrix of mixture components,
(2) the correlations in the variance stabilized space of the fitted samples
to the samples in x
, and
(3) the fitted samples as a matrix with the same dimension as x
.
# some artificial data cts <- matrix(c(80,50,1,100, 1,1,60,100, 0,50,60,100), ncol=4, byrow=TRUE) # make a DESeqDataSet dds <- DESeqDataSetFromMatrix(cts, data.frame(row.names=seq_len(ncol(cts))), ~1) colnames(dds) <- paste0("sample",1:4) # note! here you would instead use # estimateSizeFactors() to do actual normalization sizeFactors(dds) <- rep(1, ncol(dds)) norm.cts <- counts(dds, normalized=TRUE) # 'pure' should also have normalized counts... pure <- matrix(c(10,0,0, 0,0,10, 0,10,0), ncol=3, byrow=TRUE) colnames(pure) <- letters[1:3] # for real data, you need to find alpha after fitting estimateDispersions() mix <- unmix(norm.cts, pure, alpha=0.01)
# some artificial data cts <- matrix(c(80,50,1,100, 1,1,60,100, 0,50,60,100), ncol=4, byrow=TRUE) # make a DESeqDataSet dds <- DESeqDataSetFromMatrix(cts, data.frame(row.names=seq_len(ncol(cts))), ~1) colnames(dds) <- paste0("sample",1:4) # note! here you would instead use # estimateSizeFactors() to do actual normalization sizeFactors(dds) <- rep(1, ncol(dds)) norm.cts <- counts(dds, normalized=TRUE) # 'pure' should also have normalized counts... pure <- matrix(c(10,0,0, 0,0,10, 0,10,0), ncol=3, byrow=TRUE) colnames(pure) <- letters[1:3] # for real data, you need to find alpha after fitting estimateDispersions() mix <- unmix(norm.cts, pure, alpha=0.01)
This function calculates a variance stabilizing transformation (VST) from the
fitted dispersion-mean relation(s) and then transforms the count data (normalized
by division by the size factors or normalization factors), yielding a matrix
of values which are now approximately homoskedastic (having constant variance along the range
of mean values). The transformation also normalizes with respect to library size.
The rlog
is less sensitive
to size factors, which can be an issue when size factors vary widely.
These transformations are useful when checking for outliers or as input for
machine learning techniques such as clustering or linear discriminant analysis.
varianceStabilizingTransformation(object, blind = TRUE, fitType = "parametric") getVarianceStabilizedData(object)
varianceStabilizingTransformation(object, blind = TRUE, fitType = "parametric") getVarianceStabilizedData(object)
object |
a DESeqDataSet or matrix of counts |
blind |
logical, whether to blind the transformation to the experimental design. blind=TRUE should be used for comparing samples in a manner unbiased by prior information on samples, for example to perform sample QA (quality assurance). blind=FALSE should be used for transforming data for downstream analysis, where the full use of the design information should be made. blind=FALSE will skip re-estimation of the dispersion trend, if this has already been calculated. If many of genes have large differences in counts due to the experimental design, it is important to set blind=FALSE for downstream analysis. |
fitType |
in case dispersions have not yet been estimated for |
For each sample (i.e., column of counts(dds)
), the full variance function
is calculated from the raw variance (by scaling according to the size factor and adding
the shot noise). We recommend a blind estimation of the variance function, i.e.,
one ignoring conditions. This is performed by default, and can be modified using the
'blind' argument.
Note that neither rlog transformation nor the VST are used by the
differential expression estimation in DESeq
, which always
occurs on the raw count data, through generalized linear modeling which
incorporates knowledge of the variance-mean dependence. The rlog transformation
and VST are offered as separate functionality which can be used for visualization,
clustering or other machine learning tasks. See the transformation section of the
vignette for more details.
The transformation does not require that one has already estimated size factors and dispersions.
A typical workflow is shown in Section Variance stabilizing transformation in the package vignette.
If estimateDispersions
was called with:
fitType="parametric"
,
a closed-form expression for the variance stabilizing
transformation is used on the normalized
count data. The expression can be found in the file ‘vst.pdf’
which is distributed with the vignette.
fitType="local"
,
the reciprocal of the square root of the variance of the normalized counts, as derived
from the dispersion fit, is then numerically
integrated, and the integral (approximated by a spline function) is evaluated for each
count value in the column, yielding a transformed value.
fitType="mean"
, a VST is applied for Negative Binomial distributed counts, 'k',
with a fixed dispersion, 'a': ( 2 asinh(sqrt(a k)) - log(a) - log(4) )/log(2).
In all cases, the transformation is scaled such that for large counts, it becomes asymptotically (for large values) equal to the logarithm to base 2 of normalized counts.
The variance stabilizing transformation from a previous dataset
can be "frozen" and reapplied to new samples.
The frozen VST is accomplished by saving the dispersion function
accessible with dispersionFunction
, assigning this
to the DESeqDataSet
with the new samples, and running
varianceStabilizingTransformation with 'blind' set to FALSE.
Then the dispersion function from the previous dataset will be used
to transform the new sample(s).
Limitations: In order to preserve normalization, the same
transformation has to be used for all samples. This results in the
variance stabilizition to be only approximate. The more the size
factors differ, the more residual dependence of the variance on the
mean will be found in the transformed data. rlog
is a
transformation which can perform better in these cases.
As shown in the vignette, the function meanSdPlot
from the package vsn can be used to see whether this is a problem.
varianceStabilizingTransformation
returns a
DESeqTransform
if a DESeqDataSet
was provided,
or returns a a matrix if a count matrix was provided.
Note that for DESeqTransform
output, the matrix of
transformed values is stored in assay(vsd)
.
getVarianceStabilizedData
also returns a matrix.
Simon Anders
Reference for the variance stabilizing transformation for counts with a dispersion trend:
Simon Anders, Wolfgang Huber: Differential expression analysis for sequence count data. Genome Biology 2010, 11:106. http://dx.doi.org/10.1186/gb-2010-11-10-r106
dds <- makeExampleDESeqDataSet(m=6) vsd <- varianceStabilizingTransformation(dds) dists <- dist(t(assay(vsd))) # plot(hclust(dists))
dds <- makeExampleDESeqDataSet(m=6) vsd <- varianceStabilizingTransformation(dds) dists <- dist(t(assay(vsd))) # plot(hclust(dists))
This is a wrapper for the varianceStabilizingTransformation
(VST)
that provides much faster estimation of the dispersion trend used to determine
the formula for the VST. The speed-up is accomplished by
subsetting to a smaller number of genes in order to estimate this dispersion trend.
The subset of genes is chosen deterministically, to span the range
of genes' mean normalized count.
vst(object, blind = TRUE, nsub = 1000, fitType = "parametric")
vst(object, blind = TRUE, nsub = 1000, fitType = "parametric")
object |
a DESeqDataSet or a matrix of counts |
blind |
logical, whether to blind the transformation to the experimental
design (see |
nsub |
the number of genes to subset to (default 1000) |
fitType |
for estimation of dispersions: this parameter
is passed on to |
a DESeqTranform object or a matrix of transformed, normalized counts
dds <- makeExampleDESeqDataSet(n=2000, m=20) vsd <- vst(dds)
dds <- makeExampleDESeqDataSet(n=2000, m=20) vsd <- vst(dds)