| Title: | R/Bioconductor interface to the Sanity model gene expression analysis |
|---|---|
| Description: | a Bayesian normalization procedure derived from first principles. Sanity estimates expression values and associated error bars directly from raw unique molecular identifier (UMI) counts without any tunable parameters. |
| Authors: | Teo Sakel [aut, cre] (ORCID: <https://orcid.org/0000-0001-9946-9498>), MCIU/AEI [fnd] (ROR: <https://ror.org/05r0vyz12>, DOI: 10.13039/501100011033) |
| Maintainer: | Teo Sakel <[email protected]> |
| License: | GPL (>= 3) |
| Version: | 1.3.0 |
| Built: | 2026-05-30 09:53:24 UTC |
| Source: | https://github.com/bioc/SanityR |
Calculates the expected squared Euclidean distance between two cells using a hierarchical model that shrinks noisy gene differences toward zero.
calculateSanityDistance( x, assay = "logcounts", assay.sd = "logcounts_sd", gene_sd = "sanity_activity_sd", gene_mu = "sanity_log_activity_mean", mu_sd = "sanity_log_activity_mean_sd", snr_cutoff = 1, nbin = 400L, subset.row = NULL, BPPARAM = bpparam() )calculateSanityDistance( x, assay = "logcounts", assay.sd = "logcounts_sd", gene_sd = "sanity_activity_sd", gene_mu = "sanity_log_activity_mean", mu_sd = "sanity_log_activity_mean_sd", snr_cutoff = 1, nbin = 400L, subset.row = NULL, BPPARAM = bpparam() )
x |
A SingleCellExperiment or SummarizedExperiment object which stores the results of the Sanity analysis. |
assay |
The name of the assay containing the log normalized counts matrix. |
assay.sd |
The name of the assay containing the standard deviation of the log-normalized counts |
gene_sd |
The name of the column in the |
gene_mu |
The name of the column in the |
mu_sd |
The name of the column in the |
snr_cutoff |
A numeric value indicating the minimum signal-to-noise ratio (SNR) to consider a gene. |
nbin |
Number of bins to use when calculating prior variance of the true distance. |
subset.row |
A vector of row indices or logical vector indicating which rows to use. |
BPPARAM |
A BiocParallelParam object specifying the parallelization strategy. |
The method calculates the expected squared Euclidean distance between two cells,
adjusting for uncertainty in gene expression estimates. For each gene ,
the contribution to the squared distance between cells and is:
where:
(observed difference in Sanity's estimates)
(combined error variance)
(shrinkage factor)
The shrinkage factor balances the observed gene expression differences
against their measurement uncertainty . For genes with
high-confidence estimates (), it preserves the
observed differences while for noisy genes (), it shrinks
the result towards the common expected biological variation inferred from the
data ().
The function returns the square root of the expected squared distance
The key hyperparameter controls the prior distribution of :
Thus:
: the 2 cells have identical expression states.
: the 2 cells have independent expression states.
The function implements numerical integration over using a grid
of nbin values to compute the expected value of the squared distance across
all possible .
Signal-to-Noise Ratio (SNR) is defined as the ratio of the variance of log-normalized counts across cells versus the mean variance (i.e. error bars) for each genes.
A dist object containing the expected pairwise distances between cells.
sce <- simulate_branched_random_walk(N_gene = 500, N_path = 10, length_path = 10) sce <- Sanity(sce) # necessary step before computing distances d <- calculateSanityDistance(sce) # Downstream analysis and visualization hc <- hclust(d, method = "ward.D2") plot(hc)sce <- simulate_branched_random_walk(N_gene = 500, N_path = 10, length_path = 10) sce <- Sanity(sce) # necessary step before computing distances d <- calculateSanityDistance(sce) # Downstream analysis and visualization hc <- hclust(d, method = "ward.D2") plot(hc)
This function provides a user-friendly interface to the Sanity model for gene expression analysis.
Sanity(x, ...) ## S4 method for signature 'ANY' Sanity( x, size.factors = NULL, vmin = 0.001, vmax = 50, nbin = 160L, a = 1, b = 0, BPPARAM = bpparam() ) ## S4 method for signature 'SummarizedExperiment' Sanity(x, ..., assay.type = "counts", name = "logcounts", subset.row = NULL) ## S4 method for signature 'SingleCellExperiment' Sanity(x, size.factors = sizeFactors(x), ...)Sanity(x, ...) ## S4 method for signature 'ANY' Sanity( x, size.factors = NULL, vmin = 0.001, vmax = 50, nbin = 160L, a = 1, b = 0, BPPARAM = bpparam() ) ## S4 method for signature 'SummarizedExperiment' Sanity(x, ..., assay.type = "counts", name = "logcounts", subset.row = NULL) ## S4 method for signature 'SingleCellExperiment' Sanity(x, size.factors = sizeFactors(x), ...)
x |
A numeric matrix of counts where features are rows and columns are cells. Alternatively, a SummarizedExperiment or a SingleCellExperiment containing such counts. |
... |
For the generic, further arguments to pass to each method. For the For the |
size.factors |
A numeric vector of cell-specific size factors.
Alternatively |
vmin |
The minimum value for the gene-level variance (must be > 0). |
vmax |
The maximum value for the gene-level variance. |
nbin |
Number of variance bins to use. |
a, b
|
Gamma prior parameter (see Details). |
BPPARAM |
A BiocParallelParam object specifying whether the calculations should be parallelized. |
assay.type |
A string specifying the assay of |
name |
String containing an assay name for storing the output normalized values. |
subset.row |
A vector specifying the subset of rows of |
The method models gene activity using a Bayesian framework, assuming
a Gamma prior on expression and integrating over cell-level variability.
It returns posterior estimates for mean expression (mu), cell-specific
deviations (delta), and their variances, as well as expression variance
(var). Expected log-normalized counts are computed by combining mean
expression and cell-specific log-fold changes. The standard deviation of
log-counts is computed by summing the variances of the components.
If no size.factors are provided, they are assumed all equal so that all
cells have the same library size mean(colSums(x)).
The model adopts a Bayesian framework by placing a Gamma prior Gamma(a, b)
over the gene activity, where a is the shape and b the rate parameter,
respectively. This allows for flexible regularization and uncertainty
modeling. The posterior likelihood is estimated by integrating over possible
values of the variance in expression.
Intuitively:
a acts as a pseudo-count added to the total count of the gene.
b acts as a pseudo-count penalizing deviations from the average.
expression — i.e., it regularizes the total number of UMIs that differ from
the expected value.
Setting a = 1 and b = 0 corresponds to an uninformative (uniform) prior,
which was used in the original Sanity model publication.
For matrix-like object it returns a named list with the following elements
(symbols as defined in the Supplementary Text of the publication):
Posterior mean of log expression across cells .
Posterior variance of the mean expression .
Posterior variance of expression across cells .
Vector of log fold-changes for each cell relative to .
Posterior variance of the cell-level fold-changes .
Normalized likelihood across the evaluated variance grid for diagnostics.
If called on a SingleCellExperiment or SummarizedExperiment
it appends the following columns to the rowData slot:
mu
sqrt(var_mu)
sqrt(var)
and appends the following assays (assuming name = "logcounts"):
mu + delta
sqrt(var_mu + var_delta)
Breda, J., Zavolan, M., & van Nimwegen, E. (2021). Bayesian inference of gene expression states from single-cell RNA-seq data. Nature Biotechnology, 39, 1008–1016. https://doi.org/10.1038/s41587-021-00875-x
library(SingleCellExperiment) sce <- simulate_independent_cells(N_cell = 500, N_gene = 100) # Standard Sanity normalization sce_norm <- Sanity(sce) logcounts(sce_norm)[1:5,1:5] # Using size factors sf <- colSums(counts(sce)) sizeFactors(sce) <- sf / mean(sf) sce_norm2 <- Sanity(sce) logcounts(sce_norm2)[1:5,1:5]library(SingleCellExperiment) sce <- simulate_independent_cells(N_cell = 500, N_gene = 100) # Standard Sanity normalization sce_norm <- Sanity(sce) logcounts(sce_norm)[1:5,1:5] # Using size factors sf <- colSums(counts(sce)) sizeFactors(sce) <- sf / mean(sf) sce_norm2 <- Sanity(sce) logcounts(sce_norm2)[1:5,1:5]
These functions generate synthetic single-cell RNA-seq datasets based the methods described in original Sanity publication for benchmarking.
simulate_independent_cells( cell_size = NULL, gene_size = NULL, N_cell = NULL, N_gene = NULL, ltq_var_rate = 0.5 ) simulate_branched_random_walk( cell_size = NULL, gene_size = NULL, N_gene = NULL, ltq_var_rate = 0.5, N_path = 149L, length_path = 13L )simulate_independent_cells( cell_size = NULL, gene_size = NULL, N_cell = NULL, N_gene = NULL, ltq_var_rate = 0.5 ) simulate_branched_random_walk( cell_size = NULL, gene_size = NULL, N_gene = NULL, ltq_var_rate = 0.5, N_path = 149L, length_path = 13L )
cell_size |
Optional vector of real or simulated total UMI counts per cell.
If |
gene_size |
Optional vector of real or simulated total UMI counts per gene.
If |
N_cell |
Integer. Number of cells to simulate.
(For |
N_gene |
Integer. Number of genes to simulate. If |
ltq_var_rate |
Rate parameter for the exponential distribution used to
simulate per-gene variance (default: |
N_path |
(Only for |
length_path |
(Only for |
simulate_independent_cells: gene expression values are generated
independently for each cell. This results in uncorrelated expression patterns
across the dataset.
simulate_branched_random_walk: cells follow a branched random walk
through gene expression space, producing correlated gene expression patterns
that reflect pseudo-temporal differentiation trajectories.
A SingleCellExperiment object containing:
assays$counts: Simulated UMI count matrix.
assays$logFC: Simulated log fold-changes for each gene-cell pair.
rowData: Gene-level metadata including ltq_mean and ltq_var.
colData: Cell-level metadata including predecessor for simulated_branched_random_walk.
A Single-Cell Transcriptomic Map of the Human and Mouse Pancreas Reveals Inter- and Intra-cell Population Structure Baron, Maayan et al. Cell Systems, Volume 3, Issue 4, 346 - 360.e4 https://doi.org/10.1016/j.cels.2016.08.011
# Simulate dataset with independent gene expression sce_indep <- simulate_independent_cells(N_cell = 100, N_gene = 50) # Simulate dataset with a branched random walk trajectory sce_branch <- simulate_branched_random_walk(N_path = 20, length_path = 5, N_gene = 50)# Simulate dataset with independent gene expression sce_indep <- simulate_independent_cells(N_cell = 100, N_gene = 50) # Simulate dataset with a branched random walk trajectory sce_branch <- simulate_branched_random_walk(N_path = 20, length_path = 5, N_gene = 50)