Title: | A Statistical Framework for Modeling and Identifying Differential Distributions in Single-cell RNA-sequencing Data |
---|---|
Description: | We present a novel statistical framework for identifying differential distributions in single-cell RNA-sequencing (scRNA-seq) data between treatment conditions by modeling gene expression read counts using generalized linear models (GLMs). We model each gene independently under each treatment condition using error distributions Poisson (P), Negative Binomial (NB), Zero-inflated Poisson (ZIP) and Zero-inflated Negative Binomial (ZINB) with log link function and model based normalization for differences in sequencing depth. Since all four distributions considered in our framework belong to the same family of distributions, we first perform a Kolmogorov-Smirnov (KS) test to select genes belonging to the family of ZINB distributions. Genes passing the KS test will be then modeled using GLMs. Model selection is done by calculating the Bayesian Information Criterion (BIC) and likelihood ratio test (LRT) statistic. |
Authors: | Malindrie Dharmaratne [cre, aut] |
Maintainer: | Malindrie Dharmaratne <[email protected]> |
License: | GPL-3 |
Version: | 1.13.0 |
Built: | 2025-01-17 04:55:26 UTC |
Source: | https://github.com/bioc/scShapes |
This function returns a list of genes changing shape between conditions and the list of genes changing distribution from a unimodal to distribution to a zero-inflated distribution
change_shape(shapes_genes)
change_shape(shapes_genes)
shapes_genes |
A dataframe consisting of distributions followed by each gene passing the KS test. Where rows are genes and each column the treatment condition. |
A list of two lists with genes changing distribution shape between conditions. The list "All" contains all the genes changing distribution. The list "Uni2ZI" contains the genes changing distribution form a unimodal distribution in one condition to a zero-inflated distribution in another condition.
This function is used to preprocess matrix of read counts to only keep genes with a certain number of nonzero entries.
filter_counts(counts, perc.zero = 0.1)
filter_counts(counts, perc.zero = 0.1)
counts |
A non-negative integer matrix of scRNA-seq raw read counts. The rows of the matrix are genes and columns are samples/cells. |
perc.zero |
A numeric value between 0 and 1 that represents the proportion of zeros per gene in the processed dataset. |
An object of class Matrix
with genes removed if
they have more than perc.zero
zeros.
# load toy example data data(scData) # apply the filter_counts function to filter out genes if they have # more than 10% zero scData_filt <- filter_counts(scData$counts, perc.zero = 0.1)
# load toy example data data(scData) # apply the filter_counts function to filter out genes if they have # more than 10% zero scData_filt <- filter_counts(scData$counts, perc.zero = 0.1)
This function is used to fit genes with GLM
fit_models(counts, cexpr, lib.size, formula = NULL, model = NULL, BPPARAM)
fit_models(counts, cexpr, lib.size, formula = NULL, model = NULL, BPPARAM)
counts |
A non-negative integer matrix of scRNA-seq filtered read counts
containing genes belonging to the family of ZINB distributions selected from
|
cexpr |
A dataframe that contains the covariate values.
The rows of the dataframe are the corresponding samples/cells from the counts
matrix from |
lib.size |
A numeric vector that contains the total number of counts
per cell from the counts matrix from |
formula |
A regression formula to fit the covariates in the ZINB GLM. |
model |
A specific model to fit (1:P, 2:NB, 3:ZIP, 4:ZINB, NULL:All) |
BPPARAM |
configuration parameter related to the method of parallel execution.
For further information on how to set-up parallel execution refer to
|
A list of models fitted by 'glm'
data(scData) # apply the fit_models function to subset genes belonging to the # family of ZINB distributions, selceted from ks_test function. library(BiocParallel) scData_models <- fit_models(counts=scData$counts, cexpr=scData$covariates, lib.size=scData$lib_size, BPPARAM=bpparam())
data(scData) # apply the fit_models function to subset genes belonging to the # family of ZINB distributions, selceted from ks_test function. library(BiocParallel) scData_models <- fit_models(counts=scData$counts, cexpr=scData$covariates, lib.size=scData$lib_size, BPPARAM=bpparam())
This function is used to perform the likelihood ratio test on the models chosen
based on the BIC values from best_model
to check for model adequacy.
gof_model(lbic, cexpr, lib.size, formula = NULL, BPPARAM)
gof_model(lbic, cexpr, lib.size, formula = NULL, BPPARAM)
lbic |
A list of genes together with filtered read counts based on the selected
distribution from |
cexpr |
A dataframe that contains the covariate values.
The rows of the dataframe are the corresponding samples/cells from the counts
matrix from |
lib.size |
A numeric vector that contains the total number of counts
per cell from the counts matrix from |
formula |
A regression formula to fit the covariates in the ZINB GLM. |
BPPARAM |
configuration parameter related to the method of parallel execution.
For further information on how to set-up parallel execution refer to
|
A list of genes with the p-values from performing the GOF tests.
data(scData) # apply the gof_model function to perform the likelihood ratio # test on the models selected by using the lbic_model function library(BiocParallel) scData_models <- fit_models(counts=scData$counts, cexpr=scData$covariates, lib.size=scData$lib_size, BPPARAM=bpparam()) scData_bicvals <- model_bic(scData_models) scData_least.bic <- lbic_model(scData_bicvals, scData$counts) scData_gof <- gof_model(scData_least.bic, cexpr=scData$covariates, lib.size=scData$lib_size, BPPARAM=bpparam())
data(scData) # apply the gof_model function to perform the likelihood ratio # test on the models selected by using the lbic_model function library(BiocParallel) scData_models <- fit_models(counts=scData$counts, cexpr=scData$covariates, lib.size=scData$lib_size, BPPARAM=bpparam()) scData_bicvals <- model_bic(scData_models) scData_least.bic <- lbic_model(scData_bicvals, scData$counts) scData_gof <- gof_model(scData_least.bic, cexpr=scData$covariates, lib.size=scData$lib_size, BPPARAM=bpparam())
This function is used to select genes significant from the
ks_test
.
ks_sig(ks.pval.unadj)
ks_sig(ks.pval.unadj)
ks.pval.unadj |
The output from |
List object containing the significant gene indices from the KS test, their adjusted p-values
data(scData) # apply the ks_test function to subset genes belonging to the # family of ZINB distributions. library(BiocParallel) scData_KS <- ks_test(counts=scData$counts, cexpr=scData$covariates, lib.size=scData$lib_size, BPPARAM=bpparam()) # apply the ks_sig function to select genes significant from # the Kolmogorov Smirnov test. scData_KS_sig <- ks_sig(scData_KS)
data(scData) # apply the ks_test function to subset genes belonging to the # family of ZINB distributions. library(BiocParallel) scData_KS <- ks_test(counts=scData$counts, cexpr=scData$covariates, lib.size=scData$lib_size, BPPARAM=bpparam()) # apply the ks_sig function to select genes significant from # the Kolmogorov Smirnov test. scData_KS_sig <- ks_sig(scData_KS)
This function is used to perform Kolmogorv-Smirnov test on the
filtered sparse counts matrix from filter_counts
to select genes
belonging to the family of ZINB distributions
ks_test(counts, cexpr, lib.size, formula = NULL, BPPARAM)
ks_test(counts, cexpr, lib.size, formula = NULL, BPPARAM)
counts |
A non-negative integer matrix of scRNA-seq filtered read counts
from |
cexpr |
A dataframe that contains the covariate values.
The rows of the dataframe are the corresponding samples/cells from the counts
matrix from |
lib.size |
A numeric vector that contains the total number of counts
per cell from the counts matrix from |
formula |
A regression formula to fit the covariates in the ZINB GLM. |
BPPARAM |
configuration parameter related to the method of parallel execution.
For further information on how to set-up parallel execution refer to
|
List object containing the p-values from the KS test.
#' # load toy example data data(scData) # apply the ks_test function to subset genes belonging to the # family of ZINB distributions. library(BiocParallel) scData_KS <- ks_test(counts=scData$counts, cexpr=scData$covariates, lib.size=scData$lib_size, BPPARAM=bpparam())
#' # load toy example data data(scData) # apply the ks_test function to subset genes belonging to the # family of ZINB distributions. library(BiocParallel) scData_KS <- ks_test(counts=scData$counts, cexpr=scData$covariates, lib.size=scData$lib_size, BPPARAM=bpparam())
This function is used to select the best fit model for each gene based on the least BIC value
lbic_model(bic.value, counts)
lbic_model(bic.value, counts)
bic.value |
A dataframe of BIC values from fitting GLM using
error distributions P, NB, ZIP, ZINB; the output from |
counts |
A non-negative integer matrix of scRNA-seq filtered read counts
containing genes belonging to the family of ZINB distributions selected from
|
A list of genes chosen to be following one of the 4 distributions
P, NB, ZIP, ZINB based on the least BIC value and the corresponding subset
of counts from filter_counts
data(scData) # apply the lbic_model function to select the model with the least # BIC value on the matrix of BIC values obtained after running # model_bic function. library(BiocParallel) scData_models <- fit_models(counts=scData$counts, cexpr=scData$covariates, lib.size=scData$lib_size, BPPARAM=bpparam()) scData_bicvals <- model_bic(scData_models) scData_least.bic <- lbic_model(scData_bicvals, scData$counts)
data(scData) # apply the lbic_model function to select the model with the least # BIC value on the matrix of BIC values obtained after running # model_bic function. library(BiocParallel) scData_models <- fit_models(counts=scData$counts, cexpr=scData$covariates, lib.size=scData$lib_size, BPPARAM=bpparam()) scData_bicvals <- model_bic(scData_models) scData_least.bic <- lbic_model(scData_bicvals, scData$counts)
A Statistical Framework for Modeling and Identifying Differential Distributions in Single-cell RNA-sequencing Data
logo()
logo()
... Reports the logo of the package:scShapes.
This function is used to calculate the Bayesian information criterion of the models fitted in
fit_models
.
model_bic(fit_list)
model_bic(fit_list)
fit_list |
A list of models fitted from |
A dataframe containing the BIC values for each distribution type (P, NB, ZIP, ZINB).
data(scData) # apply the model_bic function to calculate the BIC values on the models # obtained after running fit_models function. library(BiocParallel) scData_models <- fit_models(counts=scData$counts, cexpr=scData$covariates, lib.size=scData$lib_size, BPPARAM=bpparam()) scData_bicvals <- model_bic(scData_models)
data(scData) # apply the model_bic function to calculate the BIC values on the models # obtained after running fit_models function. library(BiocParallel) scData_models <- fit_models(counts=scData$counts, cexpr=scData$covariates, lib.size=scData$lib_size, BPPARAM=bpparam()) scData_bicvals <- model_bic(scData_models)
This function returns model parameters based on the best fit distribution
as selected by distr_fit
and models fitted by fit_models
model_param(fit.model, gof.res, model = NULL)
model_param(fit.model, gof.res, model = NULL)
fit.model |
A list of models fitted by 'glm' from |
gof.res |
A list of selected model distributions for genes |
model |
A specific model to fit (1:P, 2:NB, 3:ZIP, 4:ZINB, NULL:All) |
A list of model parameters estimated. Estimated model parameters include mean (for all 4 models), theta (over-dispersion parameter for NB & ZINB models), pi (zero-inflation parameter for ZIP & ZINB models).
data(scData) # apply the model_param function to extract the parameters of the best fit # model obtained by running the select_model function library(BiocParallel) scData_models <- fit_models(counts=scData$counts, cexpr=scData$covariates, lib.size=scData$lib_size, BPPARAM=bpparam()) scData_bicvals <- model_bic(scData_models) scData_least.bic <- lbic_model(scData_bicvals, scData$counts) scData_gof <- gof_model(scData_least.bic, cexpr=scData$covariates, lib.size=scData$lib_size, BPPARAM=bpparam()) scData_fit <- select_model(scData_gof) scData_params <- model_param(scData_models, scData_fit, model=NULL)
data(scData) # apply the model_param function to extract the parameters of the best fit # model obtained by running the select_model function library(BiocParallel) scData_models <- fit_models(counts=scData$counts, cexpr=scData$covariates, lib.size=scData$lib_size, BPPARAM=bpparam()) scData_bicvals <- model_bic(scData_models) scData_least.bic <- lbic_model(scData_bicvals, scData$counts) scData_gof <- gof_model(scData_least.bic, cexpr=scData$covariates, lib.size=scData$lib_size, BPPARAM=bpparam()) scData_fit <- select_model(scData_gof) scData_params <- model_param(scData_models, scData_fit, model=NULL)
Toy example data list of scRNA-seq counts,
information on covariates, and library sizes for randomly
generated 20 genes to illustrate how to use the functions
of the package scShapes
data(scData)
data(scData)
A list of three lists labeled 'counts', 'covariates', 'lib_size'. 'counts' an RNA-seq counts matrix of 20 genes and 500 cells; 'covariates' a dataframe of covariates corresponding to the RNA-seq counts where rows are cells of the counts matrix; 'lib_size' a numeric vector of library sizes corresponding to the columns of the RNA-seq counts matrix.
An RData object
# load toy example data data(scData)
# load toy example data data(scData)
This function is used to select the distribution of best fit for scRNA-seq count data
select_model(lrt.value)
select_model(lrt.value)
lrt.value |
A list of genes with the p-values from performing the GOF tests
from |
A list of selected model distributions for genes scShapes selects.
data(scData) # apply the select_model function to the best fit model from the results of # the gof_model function library(BiocParallel) scData_models <- fit_models(counts=scData$counts, cexpr=scData$covariates, lib.size=scData$lib_size, BPPARAM=bpparam()) scData_bicvals <- model_bic(scData_models) scData_least.bic <- lbic_model(scData_bicvals, scData$counts) scData_gof <- gof_model(scData_least.bic, cexpr=scData$covariates, lib.size=scData$lib_size, BPPARAM=bpparam()) scData_fit <- select_model(scData_gof)
data(scData) # apply the select_model function to the best fit model from the results of # the gof_model function library(BiocParallel) scData_models <- fit_models(counts=scData$counts, cexpr=scData$covariates, lib.size=scData$lib_size, BPPARAM=bpparam()) scData_bicvals <- model_bic(scData_models) scData_least.bic <- lbic_model(scData_bicvals, scData$counts) scData_gof <- gof_model(scData_least.bic, cexpr=scData$covariates, lib.size=scData$lib_size, BPPARAM=bpparam()) scData_fit <- select_model(scData_gof)