Title: | Robust Outlier-aware Estimation of Composition and Heterogeneity for Single-cell Data |
---|---|
Description: | A robust and outlier-aware method for testing differential tissue composition from single-cell data. This model can infer changes in tissue composition and heterogeneity, and can produce realistic data simulations based on any existing dataset. This model can also transfer knowledge from a large set of integrated datasets to increase accuracy further. |
Authors: | Stefano Mangiola [aut, cre] |
Maintainer: | Stefano Mangiola <[email protected]> |
License: | GPL-3 |
Version: | 1.9.0 |
Built: | 2024-09-24 05:36:46 UTC |
Source: | https://github.com/bioc/sccomp |
A DESCRIPTION OF THE PACKAGE
Stan Development Team (2020). RStan: the R interface to Stan. R package version 2.21.2. https://mc-stan.org
Example data set containing cell counts per cell cluster
data(counts_obj)
A tidy data frame.
This function runs the data modelling and statistical test for the hypothesis that a cell_type includes outlier biological replicates.
multi_beta_glm( .data, formula = ~1, .sample, check_outliers = FALSE, approximate_posterior_inference = TRUE, cores = detect_cores(), seed = sample(1e+05, 1) )
.data |
A tibble including a cell_type name column | sample name column | read counts column | factor columns | Pvalue column | a significance column |
formula |
A formula. The sample formula used to perform the differential cell_type abundance analysis |
.sample |
A column name as symbol. The sample identifier |
check_outliers |
A boolean. Whether to check for outliers before the fit. |
approximate_posterior_inference |
A boolean. Whether the inference of the joint posterior distribution should be approximated with variational Bayes. This reduces execution time. |
cores |
An integer. How many cores to use for parallel calculations. |
seed |
An integer. Used for development and testing purposes |
A nested tibble tbl with cell_type-wise information: sample wise data | plot | ppc samples failed | exposure deleterious outliers
This function plots a summary of the results of the model.
plot_summary(.data, significance_threshold = 0.025)
.data |
A tibble including a cell_group name column | sample name column | read counts column | factor columns | Pvalue column | a significance column |
significance_threshold |
A real. FDR threshold for labelling significant cell-groups. |
A ggplot
data("counts_obj")

estimate = sccomp_estimate(
  counts_obj,
  ~ type, ~1, sample, cell_group, count,
  approximate_posterior_inference = "all",
  cores = 1
)

# estimate |> plot_summary()
This function plots a summary of the results of the model.
## S3 method for class 'sccomp_tbl'
plot(x, ...)
x |
A tibble including a cell_group name column | sample name column | read counts column | factor columns | Pvalue column | a significance column |
... |
Additional parameters, such as significance_threshold (a real; the FDR threshold for labelling significant cell-groups). |
A ggplot
data("counts_obj")

estimate = sccomp_estimate(
  counts_obj,
  ~ type, ~1, sample, cell_group, count,
  cores = 1
)

# estimate |> plot()
This function plots a boxplot of the results of the model.
sccomp_boxplot(.data, factor, significance_threshold = 0.025)
.data |
A tibble including a cell_group name column | sample name column | read counts column | factor columns | Pvalue column | a significance column |
factor |
A character string for a factor of interest included in the model |
significance_threshold |
A real. FDR threshold for labelling significant cell-groups. |
A ggplot
data("counts_obj")

estimate = sccomp_estimate(
  counts_obj,
  ~ type, ~1, sample, cell_group, count,
  cores = 1
) |>
  sccomp_test()

# estimate |> sccomp_boxplot()
The sccomp_estimate function performs linear modeling on a table of cell counts, which includes a cell-group identifier, sample identifier, integer count, and factors (continuous or discrete). The user can define a linear model with an input R formula, where the first factor is the factor of interest. Alternatively, sccomp accepts single-cell data containers (e.g., Seurat, SingleCellExperiment, cell metadata, or group-size) and derives the count data from cell metadata.
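When the input is cell-level metadata (one row per cell), the count table described above can be obtained by tabulating cells per sample and cell group. The base R sketch below illustrates that idea on made-up data; the data frame `cell_metadata` and its values are hypothetical, and this is not sccomp's internal code.

```r
# Hypothetical cell-level metadata: one row per cell (values are made up)
cell_metadata = data.frame(
  sample     = c("s1", "s1", "s1", "s2", "s2", "s2", "s2"),
  cell_group = c("B", "T", "T", "B", "B", "T", "T"),
  type       = c("benign", "benign", "benign", "cancer", "cancer", "cancer", "cancer"),
  stringsAsFactors = FALSE
)

# Count cells for every sample / cell-group pair
counts = as.data.frame(
  table(sample = cell_metadata$sample, cell_group = cell_metadata$cell_group),
  responseName = "count",
  stringsAsFactors = FALSE
)

# Attach the sample-level factor(s) back to the count table
sample_factors = unique(cell_metadata[, c("sample", "type")])
counts = merge(counts, sample_factors, by = "sample")

print(counts)
```

A table of this shape has the same layout as the one used in the examples on this page (a sample column, a cell_group column, an integer count column and the factor columns).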
sccomp_estimate( .data, formula_composition = ~1, formula_variability = ~1, .sample, .cell_group, .count = NULL, cores = detectCores(), bimodal_mean_variability_association = FALSE, percent_false_positive = 5, variational_inference = TRUE, prior_mean = list(intercept = c(0, 1), coefficients = c(0, 1)), prior_overdispersion_mean_association = list(intercept = c(5, 2), slope = c(0, 0.6), standard_deviation = c(10, 20)), .sample_cell_group_pairs_to_exclude = NULL, verbose = TRUE, enable_loo = FALSE, noise_model = "multi_beta_binomial", exclude_priors = FALSE, use_data = TRUE, mcmc_seed = sample(1e+05, 1), max_sampling_iterations = 20000, pass_fit = TRUE, approximate_posterior_inference = NULL )
.data |
A tibble including cell_group name column, sample name column, read counts column (optional depending on the input class), and factor columns. |
formula_composition |
A formula describing the model for differential abundance. |
formula_variability |
A formula describing the model for differential variability. |
.sample |
A column name as symbol for the sample identifier. |
.cell_group |
A column name as symbol for the cell_group identifier. |
.count |
A column name as symbol for the cell_group abundance (read count). |
cores |
Number of cores to use for parallel calculations. |
bimodal_mean_variability_association |
Boolean for modeling mean-variability as bimodal. |
percent_false_positive |
Real number between 0 and 100 for outlier identification. |
variational_inference |
Boolean for using variational Bayes for posterior inference, which is faster. Setting this argument to FALSE runs the full Bayesian (Hamiltonian Monte Carlo) inference, which is slower but is the gold standard. |
prior_mean |
List with prior knowledge about the mean distribution, with entries for the intercept and the coefficients. |
prior_overdispersion_mean_association |
List with prior knowledge about mean/variability association. |
.sample_cell_group_pairs_to_exclude |
Column name with boolean for sample/cell-group pairs exclusion. |
verbose |
Boolean to print progression. |
enable_loo |
Boolean to enable model comparison using the LOO package. |
noise_model |
Character string for the noise model (e.g., 'multi_beta_binomial'). |
exclude_priors |
Boolean to run a prior-free model. |
use_data |
Boolean to run the model data-free. |
mcmc_seed |
Integer for MCMC reproducibility. |
max_sampling_iterations |
Integer to limit maximum iterations for large datasets. |
pass_fit |
Boolean to include the Stan fit as attribute in the output. |
approximate_posterior_inference |
DEPRECATED please use the |
A nested tibble tbl, with the following columns:
cell_group - column including the cell groups being tested
parameter - The parameter being estimated, from the design matrix described with the input formula_composition and formula_variability
factor - The factor in the formula corresponding to the covariate, if it exists (e.g. it does not exist in the case of Intercept or contrasts, which are usually combinations of parameters)
c_lower - Lower (2.5%) quantile of the posterior distribution for a composition (c) parameter.
c_effect - Mean of the posterior distribution for a composition (c) parameter.
c_upper - Upper (97.5%) quantile of the posterior distribution for a composition (c) parameter.
c_pH0 - Probability of the null hypothesis (no difference) for a composition (c). This is not a p-value.
c_FDR - False-discovery rate of the null hypothesis (no difference) for a composition (c).
c_n_eff - Effective sample size - the number of independent draws in the sample, the higher the better (mc-stan.org/docs/2_25/cmdstan-guide/stansummary.html).
c_R_k_hat - R statistic, a measure of chain equilibrium, should be within 0.05 of 1.0 (mc-stan.org/docs/2_25/cmdstan-guide/stansummary.html).
v_lower - Lower (2.5%) quantile of the posterior distribution for a variability (v) parameter
v_effect - Mean of the posterior distribution for a variability (v) parameter
v_upper - Upper (97.5%) quantile of the posterior distribution for a variability (v) parameter
v_pH0 - Probability of the null hypothesis (no difference) for a variability (v). This is not a p-value.
v_FDR - False-discovery rate of the null hypothesis (no difference), for a variability (v).
v_n_eff - Effective sample size for a variability (v) parameter - the number of independent draws in the sample, the higher the better (mc-stan.org/docs/2_25/cmdstan-guide/stansummary.html).
v_R_k_hat - R statistic for a variability (v) parameter, a measure of chain equilibrium, should be within 0.05 of 1.0 (mc-stan.org/docs/2_25/cmdstan-guide/stansummary.html).
count_data - Nested input count data.
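The columns listed above lend themselves to simple filtering. The sketch below uses a mock results table built by hand (the values and the parameter name are made up, not real model output) to show one way to select cell groups by c_FDR and to flag estimates whose c_R_k_hat is more than 0.05 away from 1.0. The 0.025 cut-off mirrors the default significance_threshold of the plotting functions; treating it as the right threshold for any given analysis is an assumption.

```r
# Mock results table using the documented column names (values are made up)
results = data.frame(
  cell_group = c("B", "T", "NK"),
  parameter  = c("typecancer", "typecancer", "typecancer"),
  c_effect   = c(1.2, -0.1, 0.8),
  c_FDR      = c(0.001, 0.60, 0.02),
  c_R_k_hat  = c(1.01, 1.00, 1.08),
  stringsAsFactors = FALSE
)

# Cell groups whose composition effect passes an FDR threshold of 0.025
significant = results[results$c_FDR < 0.025, ]

# Flag estimates whose R statistic is within 0.05 of 1.0
results$converged = abs(results$c_R_k_hat - 1) <= 0.05

print(significant$cell_group)
print(results[, c("cell_group", "c_R_k_hat", "converged")])
```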
data("counts_obj")

estimate = sccomp_estimate(
  counts_obj,
  ~ type, ~1, sample, cell_group, count,
  cores = 1
)
The function for linear modelling takes as input a table of cell counts with columns containing a cell-group identifier, a sample identifier, an integer count, and the factors (continuous or discrete). The user can define a linear model with an input R formula, where the first factor is the factor of interest. Alternatively, sccomp accepts single-cell data containers (Seurat, SingleCellExperiment, cell metadata or group-size). In this case, sccomp derives the count data from cell metadata.
sccomp_glm( .data, formula_composition = ~1, formula_variability = ~1, .sample, .cell_group, .count = NULL, contrasts = NULL, prior_mean_variable_association = list(intercept = c(5, 2), slope = c(0, 0.6), standard_deviation = c(20, 40)), check_outliers = TRUE, bimodal_mean_variability_association = FALSE, enable_loo = FALSE, cores = detectCores(), percent_false_positive = 5, approximate_posterior_inference = "none", test_composition_above_logit_fold_change = 0.2, .sample_cell_group_pairs_to_exclude = NULL, verbose = FALSE, noise_model = "multi_beta_binomial", exclude_priors = FALSE, use_data = TRUE, mcmc_seed = sample(1e+05, 1), max_sampling_iterations = 20000, pass_fit = TRUE )
.data |
A tibble including a cell_group name column | sample name column | read counts column (optional depending on the input class) | factor columns. |
formula_composition |
A formula. The formula describing the model for differential abundance, for example ~treatment. |
formula_variability |
A formula. The formula describing the model for differential variability, for example ~treatment. In most cases, if differential variability is of interest, the formula should only include the factor of interest, as a large amount of data is needed to define variability depending on each factor. |
.sample |
A column name as symbol. The sample identifier |
.cell_group |
A column name as symbol. The cell_group identifier |
.count |
A column name as symbol. The cell_group abundance (read count). Used only for data frame count output. The variable in this column should be of class integer. |
contrasts |
A vector of character strings. For example if your formula is |
prior_mean_variable_association |
A list of the form list(intercept = c(5, 2), slope = c(0, 0.6), standard_deviation = c(20, 40)). For the intercept and slope parameters, we specify mean and standard deviation, while for standard_deviation, we specify shape and rate. This is used to incorporate prior knowledge about the mean/variability association of cell-type proportions. |
check_outliers |
A boolean. Whether to check for outliers before the fit. |
bimodal_mean_variability_association |
A boolean. Whether to model the mean-variability as bimodal, as often needed in the case of single-cell RNA sequencing data, and not usually for CyTOF and microbiome data. The plot summary_plot()$credible_intervals_2D can be used to assess whether the bimodality should be modelled. |
enable_loo |
A boolean. Enable model comparison by the R package LOO. This is helpful when you want to compare the fit between two models, for example, analogously to ANOVA, between a one-factor model and an intercept-only model. |
cores |
An integer. How many cores to use for parallel calculations. |
percent_false_positive |
A real number between 0 and 100 (exclusive). This is used to identify outliers with a specific false positive rate. |
approximate_posterior_inference |
A boolean. Whether the inference of the joint posterior distribution should be approximated with variational Bayes. This reduces execution time. |
test_composition_above_logit_fold_change |
A positive real number. It is the effect threshold used for the hypothesis test. A value of 0.2 corresponds to a change in cell proportion of about 10% for a cell type with a baseline proportion of 50%; that is, a cell type goes from 45% to 50%. When the baseline proportion is closer to 0 or 1, this effect threshold has a consistent value on the unconstrained logit scale. |
.sample_cell_group_pairs_to_exclude |
A column name that includes a boolean variable for the sample/cell-group pairs to be ignored in the fit. This argument is for pro-users. |
verbose |
A boolean. Prints progression. |
noise_model |
A character string. The two noise models available are multi_beta_binomial (default) and dirichlet_multinomial. |
exclude_priors |
A boolean. Whether to run a prior-free model, for benchmarking purposes. |
use_data |
A boolean. Whether to run the model data-free. This can be used for a prior predictive check. |
mcmc_seed |
An integer. Used for Markov-chain Monte Carlo reproducibility. By default, a random number is sampled from 1 to 100000. This itself can be controlled by set.seed(). |
max_sampling_iterations |
An integer. This limits the maximum number of iterations when a large dataset is used, to limit the computation time. |
pass_fit |
A boolean. Whether to pass the Stan fit as an attribute in the output. Because the Stan fit can be very large, setting this to FALSE lowers the memory footprint of the saved output. |
A nested tibble tbl, with the following columns:
cell_group - column including the cell groups being tested
parameter - The parameter being estimated, from the design matrix described with the input formula_composition and formula_variability
factor - The factor in the formula corresponding to the covariate, if it exists (e.g. it does not exist in the case of Intercept or contrasts, which are usually combinations of parameters)
c_lower - Lower (2.5%) quantile of the posterior distribution for a composition (c) parameter.
c_effect - Mean of the posterior distribution for a composition (c) parameter.
c_upper - Upper (97.5%) quantile of the posterior distribution for a composition (c) parameter.
c_pH0 - Probability of the null hypothesis (no difference) for a composition (c). This is not a p-value.
c_FDR - False-discovery rate of the null hypothesis (no difference) for a composition (c).
c_n_eff - Effective sample size - the number of independent draws in the sample, the higher the better (mc-stan.org/docs/2_25/cmdstan-guide/stansummary.html).
c_R_k_hat - R statistic, a measure of chain equilibrium, should be within 0.05 of 1.0 (mc-stan.org/docs/2_25/cmdstan-guide/stansummary.html).
v_lower - Lower (2.5%) quantile of the posterior distribution for a variability (v) parameter
v_effect - Mean of the posterior distribution for a variability (v) parameter
v_upper - Upper (97.5%) quantile of the posterior distribution for a variability (v) parameter
v_pH0 - Probability of the null hypothesis (no difference) for a variability (v). This is not a p-value.
v_FDR - False-discovery rate of the null hypothesis (no difference), for a variability (v).
v_n_eff - Effective sample size for a variability (v) parameter - the number of independent draws in the sample, the higher the better (mc-stan.org/docs/2_25/cmdstan-guide/stansummary.html).
v_R_k_hat - R statistic for a variability (v) parameter, a measure of chain equilibrium, should be within 0.05 of 1.0 (mc-stan.org/docs/2_25/cmdstan-guide/stansummary.html).
count_data - Nested input count data.
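The lower, effect and upper columns are described above as the 2.5% quantile, mean and 97.5% quantile of a posterior distribution. As a minimal illustration of that summary, the sketch below applies it to simulated draws; the draws come from rnorm() as a stand-in and are not output of the model.

```r
set.seed(123)

# Simulated posterior draws for one parameter (stand-in for real model output)
draws = rnorm(4000, mean = 0.5, sd = 0.1)

# Summarise the draws the way the lower / effect / upper columns are described
summary_row = c(
  lower  = unname(quantile(draws, 0.025)),
  effect = mean(draws),
  upper  = unname(quantile(draws, 0.975))
)

print(round(summary_row, 3))
```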
data("counts_obj")

estimate = sccomp_glm(
  counts_obj,
  ~ type, ~1, sample, cell_group, count,
  check_outliers = FALSE,
  cores = 1
)
This function replicates counts from a real-world dataset.
sccomp_predict( fit, formula_composition = NULL, new_data = NULL, number_of_draws = 500, mcmc_seed = sample(1e+05, 1) )
fit |
The result of sccomp_estimate. |
formula_composition |
A formula. The formula describing the model for differential abundance, for example ~treatment. This formula can be a sub-formula of your estimated model; in this case all other factors will be factored out. |
new_data |
A sample-wise data frame including the columns that represent the factors in your formula. If you want to predict proportions for 10 samples, there should be 10 rows. T |
number_of_draws |
An integer. How many copies of the data you want to draw from the model joint posterior distribution. |
mcmc_seed |
An integer. Used for Markov-chain Monte Carlo reproducibility. By default, a random number is sampled from 1 to 100000. This itself can be controlled by set.seed(). |
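As a sketch of what a sample-wise new_data table might look like, the block below builds a data frame with one row per sample to predict, using the factor `type` and the levels "benign" and "cancer" that appear in the examples on this page. The sample names are hypothetical, and whether a sample identifier column is required is an assumption.

```r
# Hypothetical design for 10 new samples: one row per sample
new_data = data.frame(
  sample = paste0("new_sample_", 1:10),
  type   = rep(c("benign", "cancer"), each = 5),
  stringsAsFactors = FALSE
)

print(dim(new_data))
print(table(new_data$type))
```

Following the usage shown above, such a table would be supplied through the new_data argument together with a matching formula_composition (here ~ type).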
A nested tibble tbl with cell_group-wise statistics
data("counts_obj")

if(.Platform$OS.type == "unix")
  sccomp_estimate(
    counts_obj,
    ~ type, ~1, sample, cell_group, count,
    cores = 1
  ) |>
  sccomp_predict()
The function for linear modelling takes as input a table of cell counts with columns containing a cell-group identifier, a sample identifier, an integer count, and the factors (continuous or discrete). The user can define a linear model with an input R formula, where the first factor is the factor of interest. Alternatively, sccomp accepts single-cell data containers (Seurat, SingleCellExperiment, cell metadata or group-size). In this case, sccomp derives the count data from cell metadata.
sccomp_remove_outliers( .estimate, percent_false_positive = 5, cores = detectCores(), variational_inference = TRUE, verbose = TRUE, mcmc_seed = sample(1e+05, 1), max_sampling_iterations = 20000, enable_loo = FALSE, approximate_posterior_inference = NULL )
.estimate |
A tibble including a cell_group name column | sample name column | read counts column (optional depending on the input class) | factor columns. |
percent_false_positive |
A real number between 0 and 100 (exclusive). This is used to identify outliers with a specific false positive rate. |
cores |
An integer. How many cores to use for parallel calculations. |
variational_inference |
Boolean for using variational Bayes for posterior inference, which is faster. Setting this argument to FALSE runs the full Bayesian (Hamiltonian Monte Carlo) inference, which is slower but is the gold standard. |
verbose |
A boolean. Prints progression. |
mcmc_seed |
An integer. Used for Markov-chain Monte Carlo reproducibility. By default, a random number is sampled from 1 to 100000. This itself can be controlled by set.seed(). |
max_sampling_iterations |
An integer. This limits the maximum number of iterations when a large dataset is used, to limit the computation time. |
enable_loo |
A boolean. Enable model comparison by the R package LOO. This is helpful when you want to compare the fit between two models, for example, analogously to ANOVA, between a one-factor model and an intercept-only model. |
approximate_posterior_inference |
DEPRECATED please use the |
A nested tibble tbl, with the following columns:
cell_group - column including the cell groups being tested
parameter - The parameter being estimated, from the design matrix described with the input formula_composition and formula_variability
factor - The factor in the formula corresponding to the covariate, if it exists (e.g. it does not exist in the case of Intercept or contrasts, which are usually combinations of parameters)
c_lower - Lower (2.5%) quantile of the posterior distribution for a composition (c) parameter.
c_effect - Mean of the posterior distribution for a composition (c) parameter.
c_upper - Upper (97.5%) quantile of the posterior distribution for a composition (c) parameter.
c_n_eff - Effective sample size - the number of independent draws in the sample, the higher the better (mc-stan.org/docs/2_25/cmdstan-guide/stansummary.html).
c_R_k_hat - R statistic, a measure of chain equilibrium, should be within 0.05 of 1.0 (mc-stan.org/docs/2_25/cmdstan-guide/stansummary.html).
v_lower - Lower (2.5%) quantile of the posterior distribution for a variability (v) parameter
v_effect - Mean of the posterior distribution for a variability (v) parameter
v_upper - Upper (97.5%) quantile of the posterior distribution for a variability (v) parameter
v_n_eff - Effective sample size for a variability (v) parameter - the number of independent draws in the sample, the higher the better (mc-stan.org/docs/2_25/cmdstan-guide/stansummary.html).
v_R_k_hat - R statistic for a variability (v) parameter, a measure of chain equilibrium, should be within 0.05 of 1.0 (mc-stan.org/docs/2_25/cmdstan-guide/stansummary.html).
count_data - Nested input count data.
data("counts_obj")

estimate = sccomp_estimate(
  counts_obj,
  ~ type, ~1, sample, cell_group, count,
  cores = 1
) |>
  sccomp_remove_outliers(cores = 1)
This function uses the model to remove unwanted variation from a dataset using the estimates of the model. For example, if you fit your data with the formula ~ factor_1 + factor_2 and use the formula ~ factor_1 to remove unwanted variation, factor_2 will be factored out.
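The idea of factoring out a covariate can be illustrated with an ordinary linear model. The sketch below is only an analogy on simulated data using lm(); it is not sccomp's implementation, which works from the fitted Bayesian model. The data frame `toy` and its values are made up.

```r
set.seed(1)

# Toy data: the outcome depends on two binary factors (values are simulated)
toy = data.frame(
  factor_1 = rep(c(0, 1), each = 10),
  factor_2 = rep(c(0, 1), times = 10)
)
toy$y = 1 + 2 * toy$factor_1 + 3 * toy$factor_2 + rnorm(20, sd = 0.1)

# Fit the full model, then subtract the estimated factor_2 contribution
fit = lm(y ~ factor_1 + factor_2, data = toy)
toy$y_adjusted = toy$y - coef(fit)[["factor_2"]] * toy$factor_2

# After adjustment the factor_2 effect is gone; the factor_1 effect is unchanged
refit = lm(y_adjusted ~ factor_1 + factor_2, data = toy)
print(round(coef(refit), 3))
```

In this analogy, keeping ~ factor_1 corresponds to retaining the factor_1 effect while the variation attributed to factor_2 is removed from the adjusted values.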
sccomp_remove_unwanted_variation( .data, formula_composition = ~1, formula_variability = NULL )
.data |
A tibble. The result of sccomp_estimate. |
formula_composition |
A formula. The formula describing the model for differential abundance, for example ~treatment. This formula can be a sub-formula of your estimated model; in this case all other factors will be factored out. |
formula_variability |
A formula. The formula describing the model for differential variability, for example ~treatment. In most cases, if differential variability is of interest, the formula should only include the factor of interest, as a large amount of data is needed to define variability depending on each factor. This formula can be a sub-formula of your estimated model; in this case all other factors will be factored out. |
A nested tibble tbl with cell_group-wise statistics
data("counts_obj")

estimates = sccomp_estimate(
  counts_obj,
  ~ type, ~1, sample, cell_group, count,
  cores = 1
)

sccomp_remove_unwanted_variation(estimates)
This function replicates counts from a real-world dataset.
sccomp_replicate( fit, formula_composition = NULL, formula_variability = NULL, number_of_draws = 1, mcmc_seed = sample(1e+05, 1) )
fit |
The result of sccomp_estimate. |
formula_composition |
A formula. The formula describing the model for differential abundance, for example ~treatment. This formula can be a sub-formula of your estimated model; in this case all other factors will be factored out. |
formula_variability |
A formula. The formula describing the model for differential variability, for example ~treatment. In most cases, if differential variability is of interest, the formula should only include the factor of interest, as a large amount of data is needed to define variability depending on each factor. This formula can be a sub-formula of your estimated model; in this case all other factors will be factored out. |
number_of_draws |
An integer. How many copies of the data you want to draw from the model joint posterior distribution. |
mcmc_seed |
An integer. Used for Markov-chain Monte Carlo reproducibility. By default, a random number is sampled from 1 to 100000. This itself can be controlled by set.seed(). |
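The default mcmc_seed in the usage above is the expression sample(1e+05, 1), so its value follows R's global random number state. The short base R sketch below shows that fixing the state with set.seed() makes that expression reproducible; the seed value 42 is arbitrary.

```r
# With a fixed RNG state, the default seed expression yields the same value
set.seed(42)
seed_a = sample(1e+05, 1)

set.seed(42)
seed_b = sample(1e+05, 1)

print(identical(seed_a, seed_b))  # TRUE
```

Passing an explicit integer as mcmc_seed is an alternative that does not depend on the global random number state.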
A nested tibble tbl with cell_group-wise statistics
data("counts_obj")

if(.Platform$OS.type == "unix")
  sccomp_estimate(
    counts_obj,
    ~ type, ~1, sample, cell_group, count,
    cores = 1
  ) |>
  sccomp_replicate()
This function tests contrasts from a sccomp result.
sccomp_test( .data, contrasts = NULL, percent_false_positive = 5, test_composition_above_logit_fold_change = 0.2, pass_fit = TRUE )
.data |
A tibble. The result of sccomp_estimate. |
contrasts |
A vector of character strings. For example if your formula is |
percent_false_positive |
A real number between 0 and 100 (exclusive). This is used to identify outliers with a specific false positive rate. |
test_composition_above_logit_fold_change |
A positive real number. It is the effect threshold used for the hypothesis test. A value of 0.2 corresponds to a change in cell proportion of about 10% for a cell type with a baseline proportion of 50%; that is, a cell type goes from 45% to 50%. When the baseline proportion is closer to 0 or 1, this effect threshold has a consistent value on the unconstrained logit scale. |
pass_fit |
A boolean. Whether to pass the Stan fit as an attribute in the output. Because the Stan fit can be very large, setting this to FALSE lowers the memory footprint of the saved output. |
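The relationship between the logit-scale threshold and proportions can be checked with base R's qlogis() (logit) and plogis() (inverse logit). The sketch below is a worked calculation only; the rare baseline of 2% is an illustrative value that does not come from the package documentation.

```r
baseline  = 0.50
threshold = 0.2

# Proportion that sits `threshold` logit units below the 50% baseline
lower_prop = plogis(qlogis(baseline) - threshold)
print(round(lower_prop, 3))  # 0.45

# The same logit shift near the boundary gives a much smaller absolute change
rare_baseline = 0.02
rare_shifted  = plogis(qlogis(rare_baseline) + threshold)
print(rare_shifted - rare_baseline)
```

The first result reproduces the 45% to 50% example; the second shows why the threshold is defined on the logit scale rather than as a fixed difference in proportion.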
A nested tibble tbl with cell_group-wise statistics
data("counts_obj")

estimates = sccomp_estimate(
  counts_obj,
  ~ 0 + type, ~1, sample, cell_group, count,
  cores = 1
) |>
  sccomp_test("typecancer - typebenign")
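The contrast string in the example combines names of the form factor name plus level. Assuming, as the example suggests, that these names follow the columns of the design matrix, base R's model.matrix() can be used to preview them for an intercept-free formula. The toy data frame below is made up.

```r
# Toy sample table with the factor used in the example formula
toy = data.frame(type = factor(c("benign", "cancer", "benign", "cancer")))

# Without an intercept, each level of `type` gets its own design-matrix column
design = model.matrix(~ 0 + type, data = toy)

print(colnames(design))  # "typebenign" "typecancer"
```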
Example SingleCellExperiment data set. SingleCellExperiment data objects can be directly used with sccomp_glm function.
data(sce_obj)
A SingleCellExperiment object. SingleCellExperiment data objects can be directly used with sccomp_glm function.
Example Seurat data set. Seurat data objects can be directly used with sccomp_glm function.
data(seurat_obj)
A Seurat object
This function simulates counts from a linear model.
simulate_data( .data, .estimate_object, formula_composition, formula_variability = NULL, .sample = NULL, .cell_group = NULL, .coefficients = NULL, variability_multiplier = 5, number_of_draws = 1, mcmc_seed = sample(1e+05, 1) )
.data |
A tibble including a cell_group name column | sample name column | read counts column | factor columns | Pvalue column | a significance column |
.estimate_object |
The result of sccomp_estimate execution. This is used for sampling from real-data properties. |
formula_composition |
A formula. The sample formula used to perform the differential cell_group abundance analysis |
formula_variability |
A formula. The formula describing the model for differential variability, for example ~treatment. In most cases, if differential variability is of interest, the formula should only include the factor of interest, as a large amount of data is needed to define variability depending on each factor. |
.sample |
A column name as symbol. The sample identifier |
.cell_group |
A column name as symbol. The cell_group identifier |
.coefficients |
The column names for coefficients, for example, c(b_0, b_1) |
variability_multiplier |
A real scalar. This can be used for artificially increasing the variability of the simulation for benchmarking purposes. |
number_of_draws |
An integer. How many copies of the data you want to draw from the model joint posterior distribution. |
mcmc_seed |
An integer. Used for Markov-chain Monte Carlo reproducibility. By default, a random number is sampled from 1 to 100000. This itself can be controlled by set.seed(). |
A nested tibble tbl with cell_group-wise statistics
data("counts_obj")
library(dplyr)

estimate = sccomp_estimate(
  counts_obj,
  ~ type, ~1, sample, cell_group, count,
  cores = 1
)

# Set coefficients for cell_groups. In this case all coefficients are 0 for simplicity.
counts_obj = counts_obj |> mutate(b_0 = 0, b_1 = 0)

# Simulate data
simulate_data(counts_obj, estimate, ~type, ~1, sample, cell_group, c(b_0, b_1))
This function tests contrasts from a sccomp result.
test_contrasts( .data, contrasts = NULL, percent_false_positive = 5, test_composition_above_logit_fold_change = 0.2, pass_fit = TRUE )
.data |
A tibble. The result of sccomp_glm. |
contrasts |
A vector of character strings. For example if your formula is |
percent_false_positive |
A real number between 0 and 100 (exclusive). This is used to identify outliers with a specific false positive rate. |
test_composition_above_logit_fold_change |
A positive real number. It is the effect threshold used for the hypothesis test. A value of 0.2 corresponds to a change in cell proportion of about 10% for a cell type with a baseline proportion of 50%; that is, a cell type goes from 45% to 50%. When the baseline proportion is closer to 0 or 1, this effect threshold has a consistent value on the unconstrained logit scale. |
pass_fit |
A boolean. Whether to pass the Stan fit as an attribute in the output. Because the Stan fit can be very large, setting this to FALSE lowers the memory footprint of the saved output. |
A nested tibble tbl with cell_group-wise statistics
data("counts_obj")

estimates = sccomp_glm(
  counts_obj,
  ~ 0 + type, ~1, sample, cell_group, count,
  check_outliers = FALSE,
  cores = 1
) |>
  test_contrasts("typecancer - typebenign")