Title: | Variance Stabilizing Transformation for Gamma-Poisson Models |
---|---|
Description: | Variance-stabilizing transformations help with the analysis of heteroskedastic data (i.e., data where the variance is not constant, like count data). This package provide two types of variance stabilizing transformations: (1) methods based on the delta method (e.g., 'acosh', 'log(x+1)'), (2) model residual based (Pearson and randomized quantile residuals). |
Authors: | Constantin Ahlmann-Eltze [aut, cre] |
Maintainer: | Constantin Ahlmann-Eltze <[email protected]> |
License: | GPL-3 |
Version: | 1.13.0 |
Built: | 2024-12-18 04:41:19 UTC |
Source: | https://github.com/bioc/transformGamPoi |
Delta method-based variance stabilizing transformation
acosh_transform( data, overdispersion = 0.05, size_factors = TRUE, ..., on_disk = NULL, verbose = FALSE ) shifted_log_transform( data, overdispersion = 0.05, pseudo_count = 1/(4 * overdispersion), size_factors = TRUE, minimum_overdispersion = 0.001, ..., on_disk = NULL, verbose = FALSE )
acosh_transform( data, overdispersion = 0.05, size_factors = TRUE, ..., on_disk = NULL, verbose = FALSE ) shifted_log_transform( data, overdispersion = 0.05, pseudo_count = 1/(4 * overdispersion), size_factors = TRUE, minimum_overdispersion = 0.001, ..., on_disk = NULL, verbose = FALSE )
data |
any matrix-like object (e.g. matrix, dgCMatrix, DelayedArray, HDF5Matrix)
with one column per sample and row per gene. It can also be an object of type |
overdispersion |
the simplest count model is the Poisson model. However, the Poisson model
assumes that
Note that |
size_factors |
in large scale experiments, each sample is typically of different size
(for example different sequencing depths). A size factor is an internal mechanism of GLMs to
correct for this effect. |
... |
additional parameters for |
on_disk |
a boolean that indicates if the dataset is loaded into memory or if it is kept on disk
to reduce the memory usage. Processing in memory can be significantly faster than on disk.
Default: |
verbose |
boolean that decides if information about the individual steps are printed.
Default: |
pseudo_count |
instead of specifying the overdispersion, the
|
minimum_overdispersion |
the |
a matrix (or a vector if the input is a vector) with the transformed values.
acosh_transform
: acosh(2 * alpha * x + 1)
shifted_log_transform
:
Ahlmann-Eltze, Constantin, and Wolfgang Huber. "Transformation and Preprocessing of Single-Cell RNA-Seq Data." bioRxiv (2021).
Ahlmann-Eltze, Constantin, and Wolfgang Huber. "glmGamPoi: Fitting Gamma-Poisson Generalized Linear Models on Single Cell Count Data." Bioinformatics (2020)
Dunn, Peter K., and Gordon K. Smyth. "Randomized quantile residuals." Journal of Computational and Graphical Statistics 5.3 (1996): 236-244.
Hafemeister, Christoph, and Rahul Satija. "Normalization and variance stabilization of single-cell RNA-seq data using regularized negative binomial regression." Genome biology 20.1 (2019): 1-15.
Hafemeister, Christoph, and Rahul Satija. "Analyzing scRNA-seq data with the sctransform and offset models." (2020)
Lause, Jan, Philipp Berens, and Dmitry Kobak. "Analytic Pearson residuals for normalization of single-cell RNA-seq UMI data." Genome Biology (2021).
acosh_transform
, shifted_log_transform
, and residual_transform
# Load a single cell dataset sce <- TENxPBMCData::TENxPBMCData("pbmc4k") # Reduce size for this example set.seed(1) sce_red <- sce[sample(which(rowSums2(counts(sce)) > 0), 1000), sample(ncol(sce), 200)] assay(sce_red, "acosh") <- acosh_transform(sce_red) assay(sce_red, "shifted_log") <- shifted_log_transform(sce_red) plot(rowMeans2(assay(sce_red, "acosh")), rowVars(assay(sce_red, "acosh")), log = "x") points(rowMeans2(assay(sce_red, "shifted_log")), rowVars(assay(sce_red, "shifted_log")), col = "red") # Sqrt transformation sqrt_dat <- acosh_transform(sce_red, overdispersion = 0, size_factor = 1) plot(2 * sqrt(assay(sce_red))[,1], sqrt_dat[,1]); abline(0,1)
# Load a single cell dataset sce <- TENxPBMCData::TENxPBMCData("pbmc4k") # Reduce size for this example set.seed(1) sce_red <- sce[sample(which(rowSums2(counts(sce)) > 0), 1000), sample(ncol(sce), 200)] assay(sce_red, "acosh") <- acosh_transform(sce_red) assay(sce_red, "shifted_log") <- shifted_log_transform(sce_red) plot(rowMeans2(assay(sce_red, "acosh")), rowVars(assay(sce_red, "acosh")), log = "x") points(rowMeans2(assay(sce_red, "shifted_log")), rowVars(assay(sce_red, "shifted_log")), col = "red") # Sqrt transformation sqrt_dat <- acosh_transform(sce_red, overdispersion = 0, size_factor = 1) plot(2 * sqrt(assay(sce_red))[,1], sqrt_dat[,1]); abline(0,1)
Fit an intercept Gamma-Poisson model that corrects for sequencing depth and return the residuals as variance stabilized results for further downstream application, for which no proper count-based method exist or is performant enough (e.g., clustering, dimensionality reduction).
residual_transform( data, residual_type = c("randomized_quantile", "pearson", "analytic_pearson"), clipping = FALSE, overdispersion = 0.05, size_factors = TRUE, offset_model = TRUE, overdispersion_shrinkage = TRUE, ridge_penalty = 2, on_disk = NULL, return_fit = FALSE, verbose = FALSE, ... )
residual_transform( data, residual_type = c("randomized_quantile", "pearson", "analytic_pearson"), clipping = FALSE, overdispersion = 0.05, size_factors = TRUE, offset_model = TRUE, overdispersion_shrinkage = TRUE, ridge_penalty = 2, on_disk = NULL, return_fit = FALSE, verbose = FALSE, ... )
data |
any matrix-like object (e.g. matrix, dgCMatrix, DelayedArray, HDF5Matrix)
with one column per sample and row per gene. It can also be an object of type |
residual_type |
a string that specifies what kind of residual is returned as variance stabilized-value.
The two above options are the most common choices, however you can use any |
clipping |
a single boolean or numeric value specifying that all residuals are in the range
|
overdispersion |
the simplest count model is the Poisson model. However, the Poisson model
assumes that
Note that |
offset_model |
boolean to decide if |
overdispersion_shrinkage , size_factors
|
arguments that are passed to the underlying
call to |
ridge_penalty |
another argument that is passed to |
on_disk |
a boolean that indicates if the dataset is loaded into memory or if it is kept on disk
to reduce the memory usage. Processing in memory can be significantly faster than on disk.
Default: |
return_fit |
boolean to decide if the matrix of residuals is returned directly ( |
verbose |
boolean that decides if information about the individual steps are printed.
Default: |
... |
additional parameters passed to |
Internally, this method uses the glmGamPoi
package. The function goes through the following steps
fit model using glmGamPoi::glm_gp()
plug in the trended overdispersion estimates
call glmGamPoi::residuals.glmGamPoi()
to calculate the residuals.
a matrix (or a vector if the input is a vector) with the transformed values. If return_fit = TRUE
,
a list is returned with two elements: fit
and Residuals
.
Ahlmann-Eltze, Constantin, and Wolfgang Huber. "glmGamPoi: Fitting Gamma-Poisson Generalized Linear Models on Single Cell Count Data." Bioinformatics (2020)
Dunn, Peter K., and Gordon K. Smyth. "Randomized quantile residuals." Journal of Computational and Graphical Statistics 5.3 (1996): 236-244.
Hafemeister, Christoph, and Rahul Satija. "Normalization and variance stabilization of single-cell RNA-seq data using regularized negative binomial regression." Genome biology 20.1 (2019): 1-15.
Hafemeister, Christoph, and Rahul Satija. "Analyzing scRNA-seq data with the sctransform and offset models." (2020)
Lause, Jan, Philipp Berens, and Dmitry Kobak. "Analytic Pearson residuals for normalization of single-cell RNA-seq UMI data." Genome Biology (2021).
glmGamPoi::glm_gp()
, glmGamPoi::residuals.glmGamPoi()
, sctransform::vst()
,
statmod::qresiduals()
# Load a single cell dataset sce <- TENxPBMCData::TENxPBMCData("pbmc4k") # Reduce size for this example set.seed(1) sce_red <- sce[sample(which(rowSums2(counts(sce)) > 0), 1000), sample(ncol(sce), 200)] counts(sce_red) <- as.matrix(counts(sce_red)) # Residual Based Variance Stabilizing Transformation rq <- residual_transform(sce_red, residual_type = "randomized_quantile", verbose = TRUE) pearson <- residual_transform(sce_red, residual_type = "pearson", verbose = TRUE) # Plot first two principal components pearson_pca <- prcomp(t(pearson), rank. = 2) rq_pca <- prcomp(t(rq), rank. = 2) plot(rq_pca$x, asp = 1) points(pearson_pca$x, col = "red")
# Load a single cell dataset sce <- TENxPBMCData::TENxPBMCData("pbmc4k") # Reduce size for this example set.seed(1) sce_red <- sce[sample(which(rowSums2(counts(sce)) > 0), 1000), sample(ncol(sce), 200)] counts(sce_red) <- as.matrix(counts(sce_red)) # Residual Based Variance Stabilizing Transformation rq <- residual_transform(sce_red, residual_type = "randomized_quantile", verbose = TRUE) pearson <- residual_transform(sce_red, residual_type = "pearson", verbose = TRUE) # Plot first two principal components pearson_pca <- prcomp(t(pearson), rank. = 2) rq_pca <- prcomp(t(rq), rank. = 2) plot(rq_pca$x, asp = 1) points(pearson_pca$x, col = "red")
Variance Stabilizing Transformation for Gamma Poisson Data
transformGamPoi( data, transformation = c("acosh", "shifted_log", "randomized_quantile_residuals", "pearson_residuals", "analytic_pearson_residuals"), overdispersion = 0.05, size_factors = TRUE, ..., on_disk = NULL, verbose = FALSE )
transformGamPoi( data, transformation = c("acosh", "shifted_log", "randomized_quantile_residuals", "pearson_residuals", "analytic_pearson_residuals"), overdispersion = 0.05, size_factors = TRUE, ..., on_disk = NULL, verbose = FALSE )
data |
any matrix-like object (e.g. matrix, dgCMatrix, DelayedArray, HDF5Matrix)
with one column per sample and row per gene. It can also be an object of type |
transformation |
one of |
overdispersion |
the simplest count model is the Poisson model. However, the Poisson model
assumes that
Note that |
size_factors |
in large scale experiments, each sample is typically of different size
(for example different sequencing depths). A size factor is an internal mechanism of GLMs to
correct for this effect. |
... |
additional parameters passed to |
on_disk |
a boolean that indicates if the dataset is loaded into memory or if it is kept on disk
to reduce the memory usage. Processing in memory can be significantly faster than on disk.
Default: |
verbose |
boolean that decides if information about the individual steps are printed.
Default: |
a matrix (or a vector if the input is a vector) with the transformed values.
Ahlmann-Eltze, Constantin, and Wolfgang Huber. "Transformation and Preprocessing of Single-Cell RNA-Seq Data." bioRxiv (2021).
Ahlmann-Eltze, Constantin, and Wolfgang Huber. "glmGamPoi: Fitting Gamma-Poisson Generalized Linear Models on Single Cell Count Data." Bioinformatics (2020)
Dunn, Peter K., and Gordon K. Smyth. "Randomized quantile residuals." Journal of Computational and Graphical Statistics 5.3 (1996): 236-244.
Hafemeister, Christoph, and Rahul Satija. "Normalization and variance stabilization of single-cell RNA-seq data using regularized negative binomial regression." Genome biology 20.1 (2019): 1-15.
Hafemeister, Christoph, and Rahul Satija. "Analyzing scRNA-seq data with the sctransform and offset models." (2020)
Lause, Jan, Philipp Berens, and Dmitry Kobak. "Analytic Pearson residuals for normalization of single-cell RNA-seq UMI data." Genome Biology (2021).
acosh_transform
, shifted_log_transform
, and residual_transform
# Load a single cell dataset sce <- TENxPBMCData::TENxPBMCData("pbmc4k") # Reduce size for this example set.seed(1) sce_red <- sce[sample(which(rowSums2(counts(sce)) > 0), 1000), sample(ncol(sce), 200)] assay(sce_red, "acosh") <- transformGamPoi(sce_red, "acosh") assay(sce_red, "shifted_log") <- transformGamPoi(sce_red, "shifted_log") # Residual Based Variance Stabilizing Transformation rq <- transformGamPoi(sce_red, transformation = "randomized_quantile", on_disk = FALSE, verbose = TRUE) pearson <- transformGamPoi(sce_red, transformation = "pearson", on_disk = FALSE, verbose = TRUE) plot(rowMeans2(counts(sce_red)), rowVars(assay(sce_red, "acosh")), log = "x") points(rowMeans2(counts(sce_red)), rowVars(assay(sce_red, "shifted_log")), col = "red") points(rowMeans2(counts(sce_red)), rowVars(rq), col = "blue") # Plot first two principal components acosh_pca <- prcomp(t(assay(sce_red, "acosh")), rank. = 2) rq_pca <- prcomp(t(rq), rank. = 2) pearson_pca <- prcomp(t(pearson), rank. = 2) plot(acosh_pca$x, asp = 1) points(rq_pca$x, col = "blue") points(pearson_pca$x, col = "green")
# Load a single cell dataset sce <- TENxPBMCData::TENxPBMCData("pbmc4k") # Reduce size for this example set.seed(1) sce_red <- sce[sample(which(rowSums2(counts(sce)) > 0), 1000), sample(ncol(sce), 200)] assay(sce_red, "acosh") <- transformGamPoi(sce_red, "acosh") assay(sce_red, "shifted_log") <- transformGamPoi(sce_red, "shifted_log") # Residual Based Variance Stabilizing Transformation rq <- transformGamPoi(sce_red, transformation = "randomized_quantile", on_disk = FALSE, verbose = TRUE) pearson <- transformGamPoi(sce_red, transformation = "pearson", on_disk = FALSE, verbose = TRUE) plot(rowMeans2(counts(sce_red)), rowVars(assay(sce_red, "acosh")), log = "x") points(rowMeans2(counts(sce_red)), rowVars(assay(sce_red, "shifted_log")), col = "red") points(rowMeans2(counts(sce_red)), rowVars(rq), col = "blue") # Plot first two principal components acosh_pca <- prcomp(t(assay(sce_red, "acosh")), rank. = 2) rq_pca <- prcomp(t(rq), rank. = 2) pearson_pca <- prcomp(t(pearson), rank. = 2) plot(acosh_pca$x, asp = 1) points(rq_pca$x, col = "blue") points(pearson_pca$x, col = "green")