Title: | Fit a Gamma-Poisson Generalized Linear Model |
---|---|
Description: | Fit linear models to overdispersed count data. The package can estimate the overdispersion and fit repeated models for matrix input. It is designed to handle large input datasets as they typically occur in single cell RNA-seq experiments. |
Authors: | Constantin Ahlmann-Eltze [aut, cre] , Nathan Lubock [ctb] , Michael Love [ctb] |
Maintainer: | Constantin Ahlmann-Eltze <[email protected]> |
License: | GPL-3 |
Version: | 1.19.0 |
Built: | 2024-10-30 08:13:14 UTC |
Source: | https://github.com/bioc/glmGamPoi |
Convert glmGamPoi object to a list
## S3 method for class 'glmGamPoi' as.list(x, ...)
## S3 method for class 'glmGamPoi' as.list(x, ...)
x |
an object with class glmGamPoi |
... |
not used |
The method returns a list with the following elements:
Beta
a matrix with dimensions nrow(data) x n_coefficients
where n_coefficients
is
based on the design
argument. It contains the estimated coefficients for each gene.
overdispersions
a vector with length nrow(data)
. The overdispersion parameter for each
gene. It describes how much more the counts vary than one would expect according to the Poisson
model.
Mu
a matrix with the same dimensions as dim(data)
. If the calculation happened on
disk, than Mu
is a HDF5Matrix
. It contains the estimated mean value for each gene and
sample.
size_factors
a vector with length ncol(data)
. The size factors are the inferred
correction factors for different sizes of each sample. They are also sometimes called the
exposure factor.
model_matrix
a matrix with dimensions ncol(data) x n_coefficients
. It is build based
on the design
argument.
This function provides a simple to use interface to fit Gamma-Poisson generalized linear models. It works equally well for small scale (a single model) and large scale data (e.g. thousands of rows and columns, potentially stored on disk). The function automatically determines the appropriate size factors for each sample and efficiently finds the best overdispersion parameter for each gene.
glm_gp( data, design = ~1, col_data = NULL, reference_level = NULL, offset = 0, size_factors = c("normed_sum", "deconvolution", "poscounts", "ratio"), overdispersion = TRUE, overdispersion_shrinkage = TRUE, ridge_penalty = 0, do_cox_reid_adjustment = TRUE, subsample = FALSE, on_disk = NULL, use_assay = NULL, verbose = FALSE )
glm_gp( data, design = ~1, col_data = NULL, reference_level = NULL, offset = 0, size_factors = c("normed_sum", "deconvolution", "poscounts", "ratio"), overdispersion = TRUE, overdispersion_shrinkage = TRUE, ridge_penalty = 0, do_cox_reid_adjustment = TRUE, subsample = FALSE, on_disk = NULL, use_assay = NULL, verbose = FALSE )
data |
any matrix-like object (e.g. matrix, DelayedArray, HDF5Matrix) or
anything that can be cast to a SummarizedExperiment (e.g. |
design |
a specification of the experimental design used to fit the Gamma-Poisson GLM.
It can be a |
col_data |
a dataframe with one row for each sample in |
reference_level |
a single string that specifies which level is used as reference
when the model matrix is created. The reference level becomes the intercept and all
other coefficients are calculated with respect to the |
offset |
Constant offset in the model in addition to |
size_factors |
in large scale experiments, each sample is typically of different size
(for example different sequencing depths). A size factor is an internal mechanism of GLMs to
correct for this effect. |
overdispersion |
the simplest count model is the Poisson model. However, the Poisson model
assumes that
Note that |
overdispersion_shrinkage |
the overdispersion can be difficult to estimate with few replicates. To
improve the overdispersion estimates, we can share information across genes and shrink each individual
overdispersion estimate towards a global overdispersion estimate. Empirical studies show however that
the overdispersion varies based on the mean expression level (lower expression level => higher
dispersion). If |
ridge_penalty |
to avoid overfitting, we can penalize fits with large coefficient estimates. Instead
of directly minimizing the deviance per gene (
Default: |
do_cox_reid_adjustment |
the classical maximum likelihood estimator of the |
subsample |
the estimation of the overdispersion is the slowest step when fitting
a Gamma-Poisson GLM. For datasets with many samples, the estimation can be considerably sped up
without loosing much precision by fitting the overdispersion only on a random subset of the samples.
Default: |
on_disk |
a boolean that indicates if the dataset is loaded into memory or if it is kept on disk
to reduce the memory usage. Processing in memory can be significantly faster than on disk.
Default: |
use_assay |
Specify which assay to use. Default: |
verbose |
a boolean that indicates if information about the individual steps are printed
while fitting the GLM. Default: |
The method follows the following steps:
The size factors are estimated.
If size_factors = "normed_sum"
, the column-sum for each cell is calculated and the resulting
size factors are normalized so that their geometric mean is 1
. If size_factors = "poscounts"
,
a slightly adapted version of the procedure proposed by Anders and Huber (2010) in
equation (5) is used. To handle the large number of zeros the geometric means are calculated for
and ignored during the calculation of the median. Columns with all zeros get a
default size factor of
. If
size_factors = "deconvolution"
, the method
scran::calculateSumFactors()
is called. If size_factors = "ratio"
, the unmodified procedure
from Anders and Huber (2010) in equation (5) is used.
The dispersion estimates are initialized based on the moments of each row of .
The coefficients of the model are estimated.
If all samples belong to the same condition (i.e. design = ~ 1
), the betas are estimated using
a quick Newton-Raphson algorithm. This is similar to the behavior of edgeR
. For more complex
designs, the general Fisher-scoring algorithm is used. Here, the code is based on a fork of the
internal function fitBeta()
from DESeq2
. It does however contain some modification to make
the fit more robust and faster.
The mean for each gene and sample is calculate.
Note that this step can be very IO intensive if data
is or contains a DelayedArray.
The overdispersion is estimated.
The classical method for estimating the overdispersion for each gene is to maximize the
Gamma-Poisson log-likelihood by iterating over each count and summing the the corresponding
log-likelihood. It is however, much more efficient
for genes with many small counts to work on the contingency table of the counts. Originally, this
approach had already been used by Anscombe (1950). In this package, I have implemented an
extension of their method that can handle general offsets.
See also overdispersion_mle()
.
The beta coefficients are estimated once more with the updated overdispersion estimates
The mean for each gene and sample is calculated again.
This method can handle not just in memory data, but also data stored on disk. This is essential for
large scale datasets with thousands of samples, as they sometimes encountered in modern single-cell
RNA-seq analysis. glmGamPoi
relies on the DelayedArray
and beachmat
package to efficiently
implement the access to the on-disk data.
The method returns a list with the following elements:
Beta
a matrix with dimensions nrow(data) x n_coefficients
where n_coefficients
is
based on the design
argument. It contains the estimated coefficients for each gene.
overdispersions
a vector with length nrow(data)
. The overdispersion parameter for each
gene. It describes how much more the counts vary than one would expect according to the Poisson
model.
overdispersion_shrinkage_list
a list with additional information from the quasi-likelihood
shrinkage. For details see overdispersion_shrinkage()
.
deviances
a vector with the deviance of the fit for each row. The deviance is a measure how well the data is fit by the model. A smaller deviance means a better fit.
Mu
a matrix with the same dimensions as dim(data)
. If the calculation happened on
disk, than Mu
is a HDF5Matrix
. It contains the estimated mean value for each gene and
sample.
size_factors
a vector with length ncol(data)
. The size factors are the inferred
correction factors for different sizes of each sample. They are also sometimes called the
exposure factor.
Offset
a matrix with the same dimensions as dim(data)
. If the calculation happened on
disk, than Offset
is a HDF5Matrix
. It contains the log(size_factors) + offset
from the
function call.
data
a SummarizedExperiment
that contains the input counts and the col_data
model_matrix
a matrix with dimensions ncol(data) x n_coefficients
. It is build based
on the design
argument.
design_formula
the formula that used to fit the model, or NULL
otherwise
ridge_penalty
a vector with the specification of the ridge penalty.
McCarthy, D. J., Chen, Y., & Smyth, G. K. (2012). Differential expression analysis of multifactor RNA-Seq experiments with respect to biological variation. Nucleic Acids Research, 40(10), 4288–4297. https://doi.org/10.1093/nar/gks042.
Anders Simon, & Huber Wolfgang. (2010). Differential expression analysis for sequence count data. Genome Biology. https://doi.org/10.1016/j.jcf.2018.05.006.
Love, M. I., Huber, W., & Anders, S. (2014). Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology, 15(12), 550. https://doi.org/10.1186/s13059-014-0550-8.
Robinson, M. D., McCarthy, D. J., & Smyth, G. K. (2009). edgeR: A Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics, 26(1), 139–140. https://doi.org/10.1093/bioinformatics/btp616.
Lun ATL, Pagès H, Smith ML (2018). “beachmat: A Bioconductor C++ API for accessing high-throughput biological data from a variety of R matrix types.” PLoS Comput. Biol., 14(5), e1006135. doi: 10.1371/journal.pcbi.1006135..
Lund, S. P., Nettleton, D., McCarthy, D. J., & Smyth, G. K. (2012). Detecting differential expression in RNA-sequence data using quasi-likelihood with shrunken dispersion estimates. Statistical Applications in Genetics and Molecular Biology, 11(5). https://doi.org/10.1515/1544-6115.1826.
Lun ATL, Bach K and Marioni JC (2016). Pooling across cells to normalize single-cell RNA sequencing data with many zero counts. Genome Biol. 17:75 https://doi.org/10.1186/s13059-016-0947-7
overdispersion_mle()
and overdispersion_shrinkage()
for the internal functions that do the
work. For differential expression analysis, see test_de()
.
set.seed(1) # The simplest example y <- rnbinom(n = 10, mu = 3, size = 1/2.4) c(glm_gp(y, size_factors = FALSE)) # Fitting a whole matrix model_matrix <- cbind(1, rnorm(5)) true_Beta <- cbind(rnorm(n = 30), rnorm(n = 30, mean = 3)) sf <- exp(rnorm(n = 5, mean = 0.7)) model_matrix Y <- matrix(rnbinom(n = 30 * 5, mu = sf * exp(true_Beta %*% t(model_matrix)), size = 1/2.4), nrow = 30, ncol = 5) fit <- glm_gp(Y, design = model_matrix, size_factors = sf, verbose = TRUE) summary(fit) # Fitting a model with covariates data <- data.frame(fav_food = sample(c("apple", "banana", "cherry"), size = 50, replace = TRUE), city = sample(c("heidelberg", "paris", "new york"), size = 50, replace = TRUE), age = rnorm(n = 50, mean = 40, sd = 15)) Y <- matrix(rnbinom(n = 100 * 50, mu = 3, size = 1/3.1), nrow = 100, ncol = 50) fit <- glm_gp(Y, design = ~ fav_food + city + age, col_data = data) summary(fit) # Specify 'ridge_penalty' to penalize extreme Beta coefficients fit_reg <- glm_gp(Y, design = ~ fav_food + city + age, col_data = data, ridge_penalty = 1.5) summary(fit_reg)
set.seed(1) # The simplest example y <- rnbinom(n = 10, mu = 3, size = 1/2.4) c(glm_gp(y, size_factors = FALSE)) # Fitting a whole matrix model_matrix <- cbind(1, rnorm(5)) true_Beta <- cbind(rnorm(n = 30), rnorm(n = 30, mean = 3)) sf <- exp(rnorm(n = 5, mean = 0.7)) model_matrix Y <- matrix(rnbinom(n = 30 * 5, mu = sf * exp(true_Beta %*% t(model_matrix)), size = 1/2.4), nrow = 30, ncol = 5) fit <- glm_gp(Y, design = model_matrix, size_factors = sf, verbose = TRUE) summary(fit) # Fitting a model with covariates data <- data.frame(fav_food = sample(c("apple", "banana", "cherry"), size = 50, replace = TRUE), city = sample(c("heidelberg", "paris", "new york"), size = 50, replace = TRUE), age = rnorm(n = 50, mean = 40, sd = 15)) Y <- matrix(rnbinom(n = 100 * 50, mu = 3, size = 1/3.1), nrow = 100, ncol = 50) fit <- glm_gp(Y, design = ~ fav_food + city + age, col_data = data) summary(fit) # Specify 'ridge_penalty' to penalize extreme Beta coefficients fit_reg <- glm_gp(Y, design = ~ fav_food + city + age, col_data = data, ridge_penalty = 1.5) summary(fit_reg)
This function fits y based on x through a (weighted) median using
the npoints/2
neighborhood.
loc_median_fit( x, y, fraction = 0.1, npoints = max(1, round(length(x) * fraction)), weighted = TRUE, ignore_zeros = FALSE, sample_fraction = 1 )
loc_median_fit( x, y, fraction = 0.1, npoints = max(1, round(length(x) * fraction)), weighted = TRUE, ignore_zeros = FALSE, sample_fraction = 1 )
x , y
|
the x and y coordinates of the points. |
fraction , npoints
|
the fraction / number of the points that are
considered for each fit. |
weighted |
a boolean that indicates if a weighted median is calculated. |
ignore_zeros |
should the zeros be excluded from the fit |
sample_fraction |
use a fraction of the data to estimate the local median. Useful for extremely large datasets where the trend is well-sampled |
This function is low-level implementation detail and should usually not be called by the user.
locfit
: a package dedicated to local regression.
x <- runif(n = 1000, max = 4) y <- rpois(n = 1000, lambda = x * 10) plot(x, y) fit <- loc_median_fit(x, y, fraction = 0.1) fit2 <- loc_median_fit(x, y, fraction = 0.1, sample_fraction = 0.75) points(x, fit, col = "red") points(x, fit2, col = "blue")
x <- runif(n = 1000, max = 4) y <- rpois(n = 1000, lambda = x * 10) plot(x, y) fit <- loc_median_fit(x, y, fraction = 0.1) fit2 <- loc_median_fit(x, y, fraction = 0.1, sample_fraction = 0.75) points(x, fit, col = "red") points(x, fit2, col = "blue")
Estimate the Overdispersion for a Vector of Counts
overdispersion_mle( y, mean, model_matrix = NULL, do_cox_reid_adjustment = !is.null(model_matrix), global_estimate = FALSE, subsample = FALSE, max_iter = 200, verbose = FALSE )
overdispersion_mle( y, mean, model_matrix = NULL, do_cox_reid_adjustment = !is.null(model_matrix), global_estimate = FALSE, subsample = FALSE, max_iter = 200, verbose = FALSE )
y |
a numeric or integer vector or matrix with the counts for which the overdispersion is estimated |
mean |
a numeric vector of either length 1 or |
model_matrix |
a numeric matrix that specifies the experimental
design. It can be produced using |
do_cox_reid_adjustment |
the classical maximum likelihood estimator of the |
global_estimate |
flag to decide if a single overdispersion for a whole matrix is calculated
instead of one estimate per row. This parameter has no affect if |
subsample |
the estimation of the overdispersion is the slowest step when fitting
a Gamma-Poisson GLM. For datasets with many samples, the estimation can be considerably sped up
without loosing much precision by fitting the overdispersion only on a random subset of the samples.
Default: |
max_iter |
the maximum number of iterations for each gene |
verbose |
a boolean that indicates if information about the individual steps are printed
while fitting the GLM. Default: |
The function is optimized to be fast on many small counts. To
achieve this, the frequency table of the counts is calculated and
used to avoid repetitive calculations. If
there are probably many unique counts the optimization is skipped.
Currently the heuristic is to skip if more than half of the counts
are expected to be unique. The estimation is based on the largest observed
count in y
.
An earlier version of this package (< 1.1.1) used a separate set of functions for the case of many small counts based on a paper by Bandara et al. (2019). However, this didn't bring a sufficient performance increase and meant an additional maintenance burden.
The function returns a list with the following elements:
estimate
the numerical estimate of the overdispersion.
iterations
the number of iterations it took to calculate the result.
message
additional information about the fitting process.
set.seed(1) # true overdispersion = 2.4 y <- rnbinom(n = 10, mu = 3, size = 1/2.4) # estimate = 1.7 overdispersion_mle(y) # true overdispersion = 0 y <- rpois(n = 10, lambda = 3) # estimate = 0 overdispersion_mle(y) # with different mu, overdispersion estimate changes overdispersion_mle(y, mean = 15) # Cox-Reid adjustment changes the result overdispersion_mle(y, mean = 15, do_cox_reid_adjustment = FALSE) # Many very small counts, true overdispersion = 50 y <- rnbinom(n = 1000, mu = 0.01, size = 1/50) summary(y) # estimate = 31 overdispersion_mle(y, do_cox_reid_adjustment = TRUE) # Function can also handle matrix input Y <- matrix(rnbinom(n = 10 * 3, mu = 4, size = 1/2.2), nrow = 10, ncol = 3) Y as.data.frame(overdispersion_mle(Y))
set.seed(1) # true overdispersion = 2.4 y <- rnbinom(n = 10, mu = 3, size = 1/2.4) # estimate = 1.7 overdispersion_mle(y) # true overdispersion = 0 y <- rpois(n = 10, lambda = 3) # estimate = 0 overdispersion_mle(y) # with different mu, overdispersion estimate changes overdispersion_mle(y, mean = 15) # Cox-Reid adjustment changes the result overdispersion_mle(y, mean = 15, do_cox_reid_adjustment = FALSE) # Many very small counts, true overdispersion = 50 y <- rnbinom(n = 1000, mu = 0.01, size = 1/50) summary(y) # estimate = 31 overdispersion_mle(y, do_cox_reid_adjustment = TRUE) # Function can also handle matrix input Y <- matrix(rnbinom(n = 10 * 3, mu = 4, size = 1/2.2), nrow = 10, ncol = 3) Y as.data.frame(overdispersion_mle(Y))
Low-level function to shrink a set of overdispersion estimates following the quasi-likelihood and Empirical Bayesian framework.
overdispersion_shrinkage( disp_est, gene_means, df, disp_trend = TRUE, ql_disp_trend = NULL, ..., verbose = FALSE )
overdispersion_shrinkage( disp_est, gene_means, df, disp_trend = TRUE, ql_disp_trend = NULL, ..., verbose = FALSE )
disp_est |
vector of overdispersion estimates |
gene_means |
vector of average gene expression values. Used
to fit |
df |
degrees of freedom for estimating the Empirical Bayesian
variance prior. Can be length 1 or same length as |
disp_trend |
vector with the dispersion trend. If |
ql_disp_trend |
a logical to indicate if a second abundance trend
using splines is fitted for the quasi-likelihood dispersions.
Default: |
... |
additional parameters for the |
verbose |
a boolean that indicates if information about the individual steps are printed
while fitting the GLM. Default: |
The function goes through the following steps
Fit trend between overdispersion MLE's and the average
gene expression. Per default it uses the loc_median_fit()
function.
Convert the overdispersion MLE's to quasi-likelihood
dispersion estimates by fixing the trended dispersion as
the "true" dispersion value:
Shrink the quasi-likelihood dispersion estimates using Empirical Bayesian variance shrinkage (see Smyth 2004).
the function returns a list with the following elements
the dispersion trend provided by disp_trend
or the
local median fit.
the quasi-likelihood dispersion estimates based on
the dispersion trend, disp_est
, and gene_means
the ql_disp_estimate
still might show a trend with
respect to gene_means
. If ql_disp_trend = TRUE
a spline is used to
remove this secondary trend. If ql_disp_trend = FALSE
it corresponds
directly to the dispersion prior
the shrunken quasi-likelihood dispersion estimates.
They are shrunken towards ql_disp_trend
.
the degrees of freedom of the empirical Bayesian shrinkage.
They correspond to spread of the ql_disp_estimate
's
Lund, S. P., Nettleton, D., McCarthy, D. J., & Smyth, G. K. (2012). Detecting differential expression in RNA-sequence data using quasi-likelihood with shrunken dispersion estimates. Statistical Applications in Genetics and Molecular Biology, 11(5). https://doi.org/10.1515/1544-6115.1826.
Smyth, G. K. (2004). Linear models and empirical bayes methods for assessing differential expression in microarray experiments. Statistical Applications in Genetics and Molecular Biology, 3(1). https://doi.org/10.2202/1544-6115.1027
limma::squeezeVar()
Y <- matrix(rnbinom(n = 300 * 4, mu = 6, size = 1/4.2), nrow = 300, ncol = 4) disps <- sapply(seq_len(nrow(Y)), function(idx){ overdispersion_mle(Y[idx, ])$estimate }) shrink_list <- overdispersion_shrinkage(disps, rowMeans(Y), df = ncol(Y) - 1, disp_trend = FALSE, ql_disp_trend = FALSE) plot(rowMeans(Y), shrink_list$ql_disp_estimate) lines(sort(rowMeans(Y)), shrink_list$ql_disp_trend[order(rowMeans(Y))], col = "red") points(rowMeans(Y), shrink_list$ql_disp_shrunken, col = "blue", pch = 16, cex = 0.5)
Y <- matrix(rnbinom(n = 300 * 4, mu = 6, size = 1/4.2), nrow = 300, ncol = 4) disps <- sapply(seq_len(nrow(Y)), function(idx){ overdispersion_mle(Y[idx, ])$estimate }) shrink_list <- overdispersion_shrinkage(disps, rowMeans(Y), df = ncol(Y) - 1, disp_trend = FALSE, ql_disp_trend = FALSE) plot(rowMeans(Y), shrink_list$ql_disp_estimate) lines(sort(rowMeans(Y)), shrink_list$ql_disp_trend[order(rowMeans(Y))], col = "red") points(rowMeans(Y), shrink_list$ql_disp_shrunken, col = "blue", pch = 16, cex = 0.5)
Predict mu
(i.e., type = "response"
) or log(mu)
(i.e., type = "link"
)
from a 'glmGamPoi' fit (created by glm_gp(...)
) with the corresponding
estimate of the standard error. If newdata
is NULL
, mu
is returned
for the original input data.
## S3 method for class 'glmGamPoi' predict( object, newdata = NULL, type = c("link", "response"), se.fit = FALSE, offset = mean(object$Offset), on_disk = NULL, verbose = FALSE, ... )
## S3 method for class 'glmGamPoi' predict( object, newdata = NULL, type = c("link", "response"), se.fit = FALSE, offset = mean(object$Offset), on_disk = NULL, verbose = FALSE, ... )
object |
a |
newdata |
a specification of the new data for which the
expression for each gene is predicted.
|
type |
either 'link' or 'response'. The default is 'link', which returns the predicted values
before the link function ( |
se.fit |
boolean that indicates if in addition to the mean the standard error of the mean is returned. |
offset |
count models (in particular for sequencing experiments) usually have a sample
specific size factor ( |
on_disk |
a boolean that indicates if the results are |
verbose |
a boolean that indicates if information about the individual steps are
printed while predicting. Default: |
... |
currently ignored. |
For se.fit = TRUE
, the function sticks very close to the behavior of stats::predict.glm()
for
fits from MASS::glm.nb()
.
Note: If type = "link"
, the results are computed using the natural logarithm as the
link function. This differs from the lfc
estimate provided by test_de
, which are on the
log2 scale.
If se.fit == FALSE
, a matrix with dimensions nrow(object$data) x nrow(newdata)
.
If se.fit == TRUE
, a list with three entries
the predicted values as a matrix with dimensions nrow(object$data) x nrow(newdata)
.
This is what would be returned if se.fit == FALSE
.
the associated standard errors for each fit
. Also a matrix with
dimensions nrow(object$data) x nrow(newdata)
.
Currently fixed to 1. In the future, this might become the values from
object$overdispersion_shrinkage_list$ql_disp_shrunken
.
stats::predict.lm()
and stats::predict.glm()
set.seed(1) # The simplest example y <- rnbinom(n = 10, mu = 3, size = 1/2.4) fit <- glm_gp(y, size_factors = FALSE) predict(fit, type = "response") predict(fit, type = "link", se.fit = TRUE) # Fitting a whole matrix model_matrix <- cbind(1, rnorm(5)) true_Beta <- cbind(rnorm(n = 30), rnorm(n = 30, mean = 3)) sf <- exp(rnorm(n = 5, mean = 0.7)) model_matrix Y <- matrix(rnbinom(n = 30 * 5, mu = sf * exp(true_Beta %*% t(model_matrix)), size = 1/2.4), nrow = 30, ncol = 5) fit <- glm_gp(Y, design = model_matrix, size_factors = sf, verbose = TRUE) head(predict(fit, type = "response")) pred <- predict(fit, type = "link", se.fit = TRUE, verbose = TRUE) head(pred$fit) head(pred$se.fit) # Fitting a model with covariates data <- data.frame(fav_food = sample(c("apple", "banana", "cherry"), size = 50, replace = TRUE), city = sample(c("heidelberg", "paris", "new york"), size = 50, replace = TRUE), age = rnorm(n = 50, mean = 40, sd = 15)) Y <- matrix(rnbinom(n = 4 * 50, mu = 3, size = 1/3.1), nrow = 4, ncol = 50) fit <- glm_gp(Y, design = ~ fav_food + city + age, col_data = data) predict(fit)[, 1:3] nd <- data.frame(fav_food = "banana", city = "paris", age = 29) predict(fit, newdata = nd) nd <- data.frame(fav_food = "banana", city = "paris", age = 29:40) predict(fit, newdata = nd, se.fit = TRUE, type = "response")
set.seed(1) # The simplest example y <- rnbinom(n = 10, mu = 3, size = 1/2.4) fit <- glm_gp(y, size_factors = FALSE) predict(fit, type = "response") predict(fit, type = "link", se.fit = TRUE) # Fitting a whole matrix model_matrix <- cbind(1, rnorm(5)) true_Beta <- cbind(rnorm(n = 30), rnorm(n = 30, mean = 3)) sf <- exp(rnorm(n = 5, mean = 0.7)) model_matrix Y <- matrix(rnbinom(n = 30 * 5, mu = sf * exp(true_Beta %*% t(model_matrix)), size = 1/2.4), nrow = 30, ncol = 5) fit <- glm_gp(Y, design = model_matrix, size_factors = sf, verbose = TRUE) head(predict(fit, type = "response")) pred <- predict(fit, type = "link", se.fit = TRUE, verbose = TRUE) head(pred$fit) head(pred$se.fit) # Fitting a model with covariates data <- data.frame(fav_food = sample(c("apple", "banana", "cherry"), size = 50, replace = TRUE), city = sample(c("heidelberg", "paris", "new york"), size = 50, replace = TRUE), age = rnorm(n = 50, mean = 40, sd = 15)) Y <- matrix(rnbinom(n = 4 * 50, mu = 3, size = 1/3.1), nrow = 4, ncol = 50) fit <- glm_gp(Y, design = ~ fav_food + city + age, col_data = data) predict(fit)[, 1:3] nd <- data.frame(fav_food = "banana", city = "paris", age = 29) predict(fit, newdata = nd) nd <- data.frame(fav_food = "banana", city = "paris", age = 29:40) predict(fit, newdata = nd, se.fit = TRUE, type = "response")
Pretty print the result from glm_gp()
## S3 method for class 'glmGamPoi' print(x, ...) ## S3 method for class 'glmGamPoi' format(x, ...) ## S3 method for class 'glmGamPoi' summary(object, ...) ## S3 method for class 'summary.glmGamPoi' print(x, ...) ## S3 method for class 'summary.glmGamPoi' format(x, ...)
## S3 method for class 'glmGamPoi' print(x, ...) ## S3 method for class 'glmGamPoi' format(x, ...) ## S3 method for class 'glmGamPoi' summary(object, ...) ## S3 method for class 'summary.glmGamPoi' print(x, ...) ## S3 method for class 'summary.glmGamPoi' format(x, ...)
x |
the glmGamPoi object |
... |
additional parameters, currently ignored |
object |
the glmGamPoi object that is summarized |
The print()
methods return the object x
. The
format()
method returns a string. The summary()
method returns an object of class summary.glmGamPoi
.
Create a 'SingleCellExperiment' containing pseudo-bulk samples
pseudobulk( data, group_by, ..., aggregation_functions = list(counts = "rowSums2", .default = "rowMeans2"), col_data = NULL, make_colnames = TRUE, verbose = TRUE )
pseudobulk( data, group_by, ..., aggregation_functions = list(counts = "rowSums2", .default = "rowMeans2"), col_data = NULL, make_colnames = TRUE, verbose = TRUE )
data |
a 'SingleCellExperiment' or an object of a related class |
group_by |
an unquoted expression that can refer to columns in
the 'colData()'. All observations with the same factor level are aggregated.
The argument follows the same logic as |
... |
named expressions that summarize columns in 'colData()'. Each expression
must produce a value of length 1. The arguments follow the same logic
as |
aggregation_functions |
a named list with functions that are used to
aggregate the assays in the |
col_data |
additional data with |
make_colnames |
a boolean that decides if the column names are the concatenated
values of |
verbose |
a boolean that indicates if information about the process are printed Default: |
a SingleCellExperiment object
library(SingleCellExperiment) data <- data.frame(sample = sample(c("samp1", "samp2", "samp3"), size = 50, replace = TRUE), celltype = sample(c("T cells", "B cells", "Macrophages"), size = 50, replace = TRUE), size = rnorm(n = 50, mean = 40, sd = 15)) Y <- matrix(rnbinom(n = 100 * 50, mu = 3, size = 1/3.1), nrow = 100, ncol = 50) sce <- SingleCellExperiment(Y, colData = data) aggr_sce <- pseudobulk(sce, group_by = vars(sample, celltype), size = mean(size)) aggr_sce colData(aggr_sce)
library(SingleCellExperiment) data <- data.frame(sample = sample(c("samp1", "samp2", "samp3"), size = 50, replace = TRUE), celltype = sample(c("T cells", "B cells", "Macrophages"), size = 50, replace = TRUE), size = rnorm(n = 50, mean = 40, sd = 15)) Y <- matrix(rnbinom(n = 100 * 50, mu = 3, size = 1/3.1), nrow = 100, ncol = 50) sce <- SingleCellExperiment(Y, colData = data) aggr_sce <- pseudobulk(sce, group_by = vars(sample, celltype), size = mean(size)) aggr_sce colData(aggr_sce)
Extract Residuals of Gamma Poisson Model
## S3 method for class 'glmGamPoi' residuals( object, type = c("deviance", "pearson", "randomized_quantile", "working", "response"), ... )
## S3 method for class 'glmGamPoi' residuals( object, type = c("deviance", "pearson", "randomized_quantile", "working", "response"), ... )
object |
a fit of type |
type |
the type of residual that is calculated. See details for more information.
Default: |
... |
currently ignored. |
This method can calculate a range of different residuals:
The deviance for the Gamma-Poisson model is
and the residual accordingly is
The Pearson residual is
The randomized quantile residual was originally developed
by Dunn & Smyth, 1995. Please see that publication or statmod::qresiduals()
for more
information.
The working residuals are .
The response residuals are
a matrix with the same size as fit$data
. If fit$data
contains a DelayedArray
than the
result will be a DelayedArray
as well.
glm_gp()
and 'stats::residuals.glm()
Solve the equation Y = A B for A or B
solve_lm_for_A(Y, B, w = NULL) solve_lm_for_B(Y, A, w = NULL)
solve_lm_for_A(Y, B, w = NULL) solve_lm_for_B(Y, A, w = NULL)
Y |
the left side of the equation |
w |
a vector with weights. If |
A , B
|
the known matrix on the right side of the equation |
Conduct a quasi-likelihood ratio test for a Gamma-Poisson fit.
test_de( fit, contrast, reduced_design = NULL, full_design = fit$model_matrix, subset_to = NULL, pseudobulk_by = NULL, pval_adjust_method = "BH", sort_by = NULL, decreasing = FALSE, n_max = Inf, max_lfc = 10, compute_lfc_se = FALSE, verbose = FALSE )
test_de( fit, contrast, reduced_design = NULL, full_design = fit$model_matrix, subset_to = NULL, pseudobulk_by = NULL, pval_adjust_method = "BH", sort_by = NULL, decreasing = FALSE, n_max = Inf, max_lfc = 10, compute_lfc_se = FALSE, verbose = FALSE )
fit |
object of class |
contrast |
The contrast to test. Can be a single column name (quoted or as a string)
that is removed from the full model matrix of |
reduced_design |
a specification of the reduced design used as a comparison to see what
how much better |
full_design |
option to specify an alternative |
subset_to |
a vector with the same length as |
pseudobulk_by |
DEPRECTATED, please use the pseudobulk function instead. |
pval_adjust_method |
one of the p-value adjustment method from
p.adjust.methods. Default: |
sort_by |
the name of the column or an expression used to sort the result. If |
decreasing |
boolean to decide if the result is sorted increasing or decreasing
order. Default: |
n_max |
the maximum number of rows to return. Default: |
max_lfc |
set the maximum absolute log fold change that is returned. Large log fold changes
occur for lowly expressed genes because the ratio of two small numbers can be impractically large. For example, limiting
the range of log fold changes can clarify the patterns in a volcano plot. Default: |
compute_lfc_se |
Compute standard errors for the log fold changes, and add a column |
verbose |
a boolean that indicates if information about the individual steps are printed
while fitting the GLM. Default: |
The cond()
helper function simplifies the specification of a contrast for complex experimental designs.
Instead of working out which combination of coefficients corresponds to a research question,
you can simply specify the two conditions that you want to compare.
You can only call the cond
function inside the contrast
argument. The arguments are the selected factor levels
for each covariate. To compare two conditions, simply subtract the two cond
calls. Internally, the package
calls model.matrix using the provided values and the original formula from the fit to produce a vector.
Subtracting two of these vectors produces a contrast vector. Missing covariates are filled with the first factor level
or zero for numerical covariates.
a data.frame
with the following columns
the rownames of the input data
the p-values of the quasi-likelihood ratio test
the adjusted p-values returned from p.adjust()
the F-statistic:
the degrees of freedom of the test: ncol(design) - ncol(reduced_design)
the degrees of freedom of the fit: ncol(data) - ncol(design) + df_0
the log2-fold change. If the alternative model is specified by reduced_design
will
be NA
.
the standard error of the log2-fold change. Only calculated if compute_lfc_se == TRUE
.
Lund, S. P., Nettleton, D., McCarthy, D. J., & Smyth, G. K. (2012). Detecting differential expression in RNA-sequence data using quasi-likelihood with shrunken dispersion estimates. Statistical Applications in Genetics and Molecular Biology, 11(5). https://doi.org/10.1515/1544-6115.1826.
# Make Data Y <- matrix(rnbinom(n = 30 * 100, mu = 4, size = 0.3), nrow = 30, ncol = 100) annot <- data.frame(mouse = sample(LETTERS[1:6], size = 100, replace = TRUE), celltype = sample(c("Tcell", "Bcell", "Macrophages"), size = 100, replace = TRUE), cont1 = rnorm(100), cont2 = rnorm(100, mean = 30)) annot$condition <- ifelse(annot$mouse %in% c("A", "B", "C"), "ctrl", "treated") head(annot) se <- SummarizedExperiment::SummarizedExperiment(Y, colData = annot) # Fit model fit <- glm_gp(se, design = ~ condition + celltype + cont1 + cont2) # Test with reduced design res <- test_de(fit, reduced_design = ~ celltype + cont1 + cont2) head(res) # Test with contrast argument, the results are identical res2 <- test_de(fit, contrast = conditiontreated) head(res2) # Test with explicit specification of the conditions # The results are still identical res3 <- test_de(fit, contrast = cond(condition = "treated", celltype = "Bcell") - cond(condition = "ctrl", celltype = "Bcell")) head(res3) # The column names of fit$Beta are valid variables in the contrast argument colnames(fit$Beta) # You can also have more complex contrasts: # the following compares cont1 vs cont2: test_de(fit, cont1 - cont2, n_max = 4) # You can also sort the output test_de(fit, cont1 - cont2, n_max = 4, sort_by = "pval") test_de(fit, cont1 - cont2, n_max = 4, sort_by = - abs(f_statistic)) # If the data has multiple samples, it is a good # idea to aggregate the cell counts by samples. # This is called forming a "pseudobulk". se_reduced <- pseudobulk(se, group_by = vars(mouse, condition, celltype), cont1 = mean(cont1), cont2 = min(cont2)) fit_reduced <- glm_gp(se_reduced, design = ~ condition + celltype) test_de(fit_reduced, contrast = "conditiontreated", n_max = 4) test_de(fit_reduced, contrast = cond(condition = "treated", celltype = "Macrophages") - cond(condition = "ctrl", celltype = "Macrophages"), n_max = 4)
# Make Data Y <- matrix(rnbinom(n = 30 * 100, mu = 4, size = 0.3), nrow = 30, ncol = 100) annot <- data.frame(mouse = sample(LETTERS[1:6], size = 100, replace = TRUE), celltype = sample(c("Tcell", "Bcell", "Macrophages"), size = 100, replace = TRUE), cont1 = rnorm(100), cont2 = rnorm(100, mean = 30)) annot$condition <- ifelse(annot$mouse %in% c("A", "B", "C"), "ctrl", "treated") head(annot) se <- SummarizedExperiment::SummarizedExperiment(Y, colData = annot) # Fit model fit <- glm_gp(se, design = ~ condition + celltype + cont1 + cont2) # Test with reduced design res <- test_de(fit, reduced_design = ~ celltype + cont1 + cont2) head(res) # Test with contrast argument, the results are identical res2 <- test_de(fit, contrast = conditiontreated) head(res2) # Test with explicit specification of the conditions # The results are still identical res3 <- test_de(fit, contrast = cond(condition = "treated", celltype = "Bcell") - cond(condition = "ctrl", celltype = "Bcell")) head(res3) # The column names of fit$Beta are valid variables in the contrast argument colnames(fit$Beta) # You can also have more complex contrasts: # the following compares cont1 vs cont2: test_de(fit, cont1 - cont2, n_max = 4) # You can also sort the output test_de(fit, cont1 - cont2, n_max = 4, sort_by = "pval") test_de(fit, cont1 - cont2, n_max = 4, sort_by = - abs(f_statistic)) # If the data has multiple samples, it is a good # idea to aggregate the cell counts by samples. # This is called forming a "pseudobulk". se_reduced <- pseudobulk(se, group_by = vars(mouse, condition, celltype), cont1 = mean(cont1), cont2 = min(cont2)) fit_reduced <- glm_gp(se_reduced, design = ~ condition + celltype) test_de(fit_reduced, contrast = "conditiontreated", n_max = 4) test_de(fit_reduced, contrast = cond(condition = "treated", celltype = "Macrophages") - cond(condition = "ctrl", celltype = "Macrophages"), n_max = 4)
Quote grouping variables
vars(...)
vars(...)
... |
the quoted expression |
ggplot2::vars, dplyr::vars