| Title: | Batch-Aware Cell Quality Control for Single-Cell RNA-seq |
|---|---|
| Description: | scBatchQC provides a hierarchical empirical Bayes framework for quality control in multi-sample, multi-batch single-cell RNA-seq experiments. Unlike per-sample QC tools, scBatchQC jointly models QC metric distributions (library size, gene count, mitochondrial fraction) and doublet rates across batches, enabling calibrated cell-level QC calls that account for batch structure. The package operates natively on SingleCellExperiment objects and returns augmented colData with per-cell QC flags and batch-adjusted doublet scores. |
| Authors: | Subhadip Jana [aut, cre] (ORCID: <https://orcid.org/0009-0003-7860-2853>) |
| Maintainer: | Subhadip Jana <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 0.99.3 |
| Built: | 2026-06-28 11:13:33 UTC |
| Source: | https://github.com/bioc/scBatchQC |
scBatchQC provides a hierarchical empirical Bayes framework for quality control (QC) in multi-sample, multi-batch single-cell RNA-sequencing (scRNA-seq) experiments.
Unlike per-sample QC tools such as scuttle::isOutlier, which
apply a single global MAD threshold, scBatchQC jointly models
QC metric distributions (library size, gene count, mitochondrial
fraction) and doublet rates across batches. This prevents
over-filtering of high-quality batches and under-filtering of
low-quality ones — a common but underappreciated problem in
multi-batch scRNA-seq workflows.
batchAwareQCMetrics: Compute per-cell QC metrics and flag outliers using batch-harmonized MAD thresholds.
estimateBatchDoubletRate: Model expected doublet rates per batch as a function of cells loaded and protocol.
harmonizeQCThresholds: Inspect and update harmonized thresholds at arbitrary MAD stringency.
plotBatchQC: Visualize QC metric distributions per batch with threshold overlays.
All functions accept and return
SingleCellExperiment objects.
Results are stored as additional columns in colData().
Maintainer: Subhadip Jana (ORCID: <https://orcid.org/0009-0003-7860-2853>)
Authors: Subhadip Jana (ORCID: <https://orcid.org/0009-0003-7860-2853>)
Amezquita RA et al. (2020). Orchestrating single-cell analysis with Bioconductor. Nature Methods, 17, 137-145.
Useful links:
Report bugs at https://github.com/SubhadipJana1409/scBatchQC/issues
Computes per-cell quality control metrics and identifies outlier cells
using a hierarchical empirical Bayes approach. Unlike
scuttle::isOutlier,
which applies a single global MAD threshold, batchAwareQCMetrics
estimates batch-specific MAD scales and shrinks them toward a global
prior, preventing over-filtering in high-quality batches and
under-filtering in low-quality ones.

    batchAwareQCMetrics(
      sce,
      batch = NULL,
      metrics = c("sum", "detected", "subsets_MT_percent"),
      nmads = 3,
      mt_pattern = "^MT-",
      shrink_strength = 0.5,
      BPPARAM = SerialParam()
    )

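| Argument | Description |
|---|---|
| sce | A SingleCellExperiment object. |
| batch | A character string naming the colData column that holds batch labels. Default: NULL. |
| metrics | A character vector of QC metrics to evaluate. Default: c("sum", "detected", "subsets_MT_percent"). |
| nmads | A numeric scalar giving the number of MADs from the median at which a cell is flagged. Default: 3. |
| mt_pattern | A regular expression identifying mitochondrial genes by row name. Default: "^MT-". |
| shrink_strength | A numeric scalar controlling how strongly batch-specific estimates are shrunk toward the global prior. Default: 0.5. |
| BPPARAM | A BiocParallelParam object controlling parallel execution. Default: SerialParam(). |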
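For each QC metric $m$ and batch $b$, the function estimates:
Per-batch median $\tilde{x}_{bm}$ and MAD $s_{bm}$.
A shrinkage weight $w_b$ based on batch cell count.
A global prior $(\mu_m, \sigma_m)$ pooled across batches.
A harmonized threshold $T_{bm} = \hat{\mu}_{bm} \pm \text{nmads} \times \hat{\sigma}_{bm}$,
where $\hat{\mu}_{bm}$ and $\hat{\sigma}_{bm}$ are the
shrinkage estimates.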
A cell is flagged as an outlier if any QC metric exceeds its batch-specific harmonized threshold.
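The estimation steps above can be sketched in base R. This is a minimal illustration under stated assumptions, not the package's implementation: the helper name `harmonized_thresholds`, the constant `k`, and the weight formula `n / (n + shrink_strength * k)` are all hypothetical choices made only to show how a small batch is pulled toward the pooled estimate more strongly than a large one.

```r
# Sketch of batch-harmonized MAD thresholds (illustrative, not the package's code).
set.seed(1)
# Toy log library sizes: a large, clean batch and a small, noisier batch
x <- c(rnorm(500, mean = 8, sd = 0.3), rnorm(40, mean = 7, sd = 0.8))
batch <- rep(c("B1", "B2"), times = c(500, 40))

harmonized_thresholds <- function(x, batch, nmads = 3, shrink_strength = 0.5,
                                  k = 100) {
  # Global prior: median and MAD pooled over all cells
  mu_global <- median(x)
  sd_global <- mad(x)
  stats <- lapply(split(x, batch), function(v) {
    n <- length(v)
    # ASSUMPTION: the weight on the batch's own estimate grows with cell count
    w <- n / (n + shrink_strength * k)
    mu <- w * median(v) + (1 - w) * mu_global
    s  <- w * mad(v) + (1 - w) * sd_global
    c(n = n, weight = w, lower = mu - nmads * s, upper = mu + nmads * s)
  })
  as.data.frame(do.call(rbind, stats))
}

thr <- harmonized_thresholds(x, batch)
# Flag a cell when it falls outside its own batch's harmonized interval
outlier <- x < thr[batch, "lower"] | x > thr[batch, "upper"]
table(outlier, batch)
```

In this sketch, setting `shrink_strength = 0` gives a weight of 1 for every batch, which reduces to plain per-batch MAD thresholds.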
The input sce with the following additions to
colData:
scBatchQC_sum: library size (total UMI count).
scBatchQC_detected: number of detected genes.
scBatchQC_subsets_MT_percent: mitochondrial fraction.
scBatchQC_outlier: logical flag; TRUE if the cell
fails any QC threshold.
scBatchQC_outlier_reason: character string naming which
metric(s) caused the flag.
Lun ATL et al. (2016). A step-by-step workflow for low-level analysis of single-cell RNA sequencing data with Bioconductor. F1000Research, 5, 2122.
See also: estimateBatchDoubletRate,
harmonizeQCThresholds, plotBatchQC

    library(SingleCellExperiment)

    # Simulate a minimal SCE with two batches
    set.seed(42)
    # One Poisson draw per matrix entry (200 genes x 100 cells)
    counts <- matrix(rpois(200 * 100, lambda = 5), nrow = 200, ncol = 100)
    rownames(counts) <- paste0("Gene", seq_len(200))
    # Rename the first ten genes so they match the default mt_pattern
    rownames(counts)[1:10] <- paste0("MT-", seq_len(10))
    colnames(counts) <- paste0("Cell", seq_len(100))
    sce <- SingleCellExperiment(assays = list(counts = counts))
    sce$batch <- rep(c("B1", "B2"), each = 50)

    sce <- batchAwareQCMetrics(sce, batch = "batch")
    table(sce$scBatchQC_outlier, sce$batch)

Returns the per-batch summary DataFrame.

    batchSummary(x, ...)

    ## S4 method for signature 'BQCResult'
    batchSummary(x, ...)

| Argument | Description |
|---|---|
| x | A BQCResult object. |
| ... | Additional arguments (not used). |
A DataFrame of per-batch statistics.

    library(S4Vectors)

    obj <- BQCResult(
      qcFlags = DataFrame(low_lib = c(FALSE, TRUE)),
      doubletScores = c(0.04, 0.06),
      batchSummary = DataFrame(batch = "B1", doublet_rate_est = 0.04)
    )
    batchSummary(obj)

Create a new BQCResult object.

    BQCResult(qcFlags, doubletScores, batchSummary, params = list())

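| Argument | Description |
|---|---|
| qcFlags | A DataFrame of per-cell logical QC flags. |
| doubletScores | A numeric vector of doublet scores. |
| batchSummary | A DataFrame of per-batch statistics, one row per batch. |
| params | A list of analysis parameters. |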
A BQCResult object.

    library(S4Vectors)

    qf <- DataFrame(low_lib = c(FALSE, TRUE, FALSE))
    obj <- BQCResult(
      qcFlags = qf,
      doubletScores = c(0.04, 0.06, 0.05),
      batchSummary = DataFrame(batch = "B1", doublet_rate_est = 0.04)
    )
    obj

An S4 class to store the output of batch-aware quality control
applied to a SingleCellExperiment object. Slots hold
per-cell QC flags, batch-level doublet rate estimates, and
harmonized QC thresholds for each batch.
qcFlags: A DataFrame with per-cell logical QC flags,
one row per cell and one column per QC metric.
doubletScores: A numeric vector of batch-adjusted
doublet probability scores, one per cell.
batchSummary: A DataFrame with one row per batch
summarising estimated doublet rate and harmonized thresholds.
params: A list storing the parameters used in the
analysis (batch variable name, NMAD multiplier, etc.).
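To make the slot layout concrete, here is a minimal sketch of how a container with these four slots and one accessor could be declared with the methods package. It is not the package's source: the names `DemoBQCResult` and `demoDoubletScores` are hypothetical, base data.frames stand in for S4Vectors DataFrames, and the validity check (one doublet score per row of flags) is an assumption inferred from the slot descriptions.

```r
library(methods)

# Hypothetical stand-in for BQCResult; base data.frames replace DataFrame
setClass("DemoBQCResult",
  representation(
    qcFlags = "data.frame",
    doubletScores = "numeric",
    batchSummary = "data.frame",
    params = "list"
  ),
  validity = function(object) {
    # ASSUMPTION: both per-cell slots must describe the same number of cells
    if (nrow(object@qcFlags) != length(object@doubletScores)) {
      return("qcFlags must have one row per doublet score")
    }
    TRUE
  }
)

# Accessor generic and method with an (x, ...) signature
setGeneric("demoDoubletScores",
           function(x, ...) standardGeneric("demoDoubletScores"))
setMethod("demoDoubletScores", "DemoBQCResult",
          function(x, ...) x@doubletScores)

obj <- new("DemoBQCResult",
  qcFlags = data.frame(low_lib = c(FALSE, TRUE)),
  doubletScores = c(0.04, 0.06),
  batchSummary = data.frame(batch = "B1", doublet_rate_est = 0.04),
  params = list(nmads = 3)
)
demoDoubletScores(obj)
```

Going through an accessor rather than reading `obj@doubletScores` directly keeps calling code independent of the slot layout.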
Returns the per-cell doublet score vector.

    doubletScores(x, ...)

    ## S4 method for signature 'BQCResult'
    doubletScores(x, ...)

| Argument | Description |
|---|---|
| x | A BQCResult object. |
| ... | Additional arguments (not used). |
A numeric vector of doublet scores.

    library(S4Vectors)

    obj <- BQCResult(
      qcFlags = DataFrame(low_lib = c(FALSE, TRUE)),
      doubletScores = c(0.04, 0.06),
      batchSummary = DataFrame(batch = "B1", doublet_rate_est = 0.04)
    )
    doubletScores(obj)

Estimates the expected doublet rate for each batch in a
multi-sample experiment stored as a SingleCellExperiment. The doublet
rate is modelled as a linear function of per-batch technical
covariates (number of cells loaded, median library size, protocol
type), enabling principled flagging of likely doublets across
batches with heterogeneous capture efficiencies.

    estimateBatchDoubletRate(
      sce,
      batch = NULL,
      cells_loaded = NULL,
      protocol = NULL,
      observed_doublets = NULL,
      return_sce = TRUE
    )

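| Argument | Description |
|---|---|
| sce | A SingleCellExperiment object. |
| batch | A character string naming the colData column that holds batch labels. Default: NULL. |
| cells_loaded | A named numeric vector giving the number of cells loaded per batch, with names matching the batch labels. Default: NULL. |
| protocol | A named character vector giving the protocol used for each batch, with names matching the batch labels. Default: NULL. |
| observed_doublets | A character string naming a colData column that holds doublet calls from an external tool, used to calibrate the rate model. Default: NULL. |
| return_sce | Logical. If TRUE, return the input sce with the estimated doublet rate added to colData; if FALSE, return a per-batch DataFrame. Default: TRUE. |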
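Doublet rates in droplet-based scRNA-seq follow approximately:
$$r \approx k \times N$$
where $N$ is the number of cells loaded per batch and
$k$ is a rate constant whose value depends on the platform (for example, 10x Genomics Chromium).
estimateBatchDoubletRate allows $k$ to vary by batch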
covariates (e.g. protocol, operator) by fitting a linear model on
the log-transformed per-batch cell count. When external doublet
simulations are not desired, this gives a lightweight alternative to
full simulation-based tools like scDblFinder.
Optionally, if the user supplies observed doublet calls from an
external tool in colData, the function will calibrate the
rate model against those observations.
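The proportional relationship between doublet rate and cells loaded, and the optional calibration against observed calls, can be illustrated with a short base R sketch. Everything numeric here is an assumption for illustration: `k_default = 8e-6` is a hypothetical placeholder rather than a value taken from the package, and the `observed` fractions are made up. The calibration step uses an ordinary `lm()` fit on the log scale, which is one plausible reading of the description above and not a statement of the package's internals.

```r
# Sketch: doublet rate proportional to cells loaded, r = k * N.
# ASSUMPTION: k_default is a hypothetical placeholder constant.
k_default <- 8e-6
cells_loaded <- c(B1 = 5000, B2 = 8000, B3 = 12000, B4 = 16000)
rate_fixed <- k_default * cells_loaded

# Hypothetical observed doublet fractions from an external caller
observed <- c(B1 = 0.035, B2 = 0.070, B3 = 0.090, B4 = 0.135)

# Calibrate on the log scale: log(r) = a + b * log(N)
fit <- lm(log(observed) ~ log(cells_loaded))
rate_calibrated <- exp(fitted(fit))

data.frame(
  batch = names(cells_loaded),
  cells_loaded = cells_loaded,
  rate_fixed = rate_fixed,
  rate_calibrated = unname(rate_calibrated)
)
```

A fitted slope close to 1 on the log scale would indicate that the observed rates are roughly proportional to cells loaded; a slope far from 1 suggests the fixed-constant model is a poor fit for these batches.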
If return_sce = TRUE: the input sce with a
scBatchQC_doublet_rate column in colData giving the
estimated doublet probability for each cell's batch.
If return_sce = FALSE: a DataFrame with one row per
batch and columns batch, n_cells_obs,
cells_loaded, doublet_rate_est, protocol.
See also: batchAwareQCMetrics, plotBatchQC

    library(SingleCellExperiment)

    set.seed(42)
    # One Poisson draw per matrix entry (200 genes x 100 cells)
    counts <- matrix(rpois(200 * 100, lambda = 5), nrow = 200, ncol = 100)
    rownames(counts) <- paste0("Gene", seq_len(200))
    colnames(counts) <- paste0("Cell", seq_len(100))
    sce <- SingleCellExperiment(assays = list(counts = counts))
    sce$batch <- rep(c("B1", "B2"), each = 50)

    # Cells loaded per batch, named by batch label
    cells_loaded <- c(B1 = 5000, B2 = 8000)
    sce <- estimateBatchDoubletRate(sce,
      batch = "batch",
      cells_loaded = cells_loaded
    )
    sce$scBatchQC_doublet_rate

Given a SingleCellExperiment
that has already been processed by batchAwareQCMetrics,
harmonizeQCThresholds returns the per-batch QC threshold
table and optionally updates colData with revised flags at a
user-specified stringency.
This is useful for interactive threshold exploration or for
downstream reporting: instead of re-running the full QC pipeline,
the user can sweep nmads and inspect how the number of
flagged cells changes per batch.
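The idea of sweeping the stringency can be shown without the package. The sketch below uses simulated mitochondrial percentages and plain per-batch thresholds (median plus a multiple of the MAD, with no shrinkage), so it illustrates the sweep only and does not reproduce harmonizeQCThresholds itself. The helper `count_flagged` and the simulated data are hypothetical.

```r
# Sketch: sweep the MAD multiplier and count flagged cells per batch.
set.seed(2)
pct_mt <- c(rgamma(300, shape = 4, rate = 1), rgamma(200, shape = 6, rate = 1))
batch  <- rep(c("B1", "B2"), times = c(300, 200))

count_flagged <- function(x, batch, nmads) {
  # Upper-tail threshold per batch: median + nmads * MAD (no shrinkage here)
  upper <- tapply(x, batch, function(v) median(v) + nmads * mad(v))
  flag <- x > as.vector(upper[batch])
  tapply(flag, batch, sum)
}

sweep_nmads <- c(2, 2.5, 3, 4)
# One row per stringency level, one column per batch
n_flagged <- t(sapply(sweep_nmads,
                      function(k) count_flagged(pct_mt, batch, k)))
rownames(n_flagged) <- paste0("nmads=", sweep_nmads)
n_flagged
```

Because the threshold rises with the multiplier, the counts in each column can only stay the same or fall as you move down the rows.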

    harmonizeQCThresholds(
      sce,
      batch = NULL,
      nmads = 3,
      shrink_strength = 0.5,
      update_sce = FALSE
    )

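| Argument | Description |
|---|---|
| sce | A SingleCellExperiment object already processed by batchAwareQCMetrics. |
| batch | A character string naming the colData column that holds batch labels. Default: NULL. |
| nmads | A numeric scalar giving the MAD multiplier at which thresholds are evaluated. Default: 3. |
| shrink_strength | A numeric scalar controlling how strongly batch-specific estimates are shrunk toward the global prior. Default: 0.5. |
| update_sce | Logical. If TRUE, update colData with the revised flags and include the updated sce in the result. Default: FALSE. |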
A list with components:
thresholds: A named list of per-metric
data.frames (rows = batches, columns = lower
and upper).
n_flagged: A DataFrame with one row per
batch and one column per metric showing the number of cells
that would be flagged at these thresholds.
sce: The (possibly updated) sce, returned
only when update_sce = TRUE.
See also: batchAwareQCMetrics,
estimateBatchDoubletRate, plotBatchQC

    library(SingleCellExperiment)

    set.seed(42)
    # One Poisson draw per matrix entry (200 genes x 100 cells)
    counts <- matrix(rpois(200 * 100, lambda = 5), nrow = 200, ncol = 100)
    rownames(counts) <- paste0("Gene", seq_len(200))
    rownames(counts)[1:10] <- paste0("MT-", seq_len(10))
    colnames(counts) <- paste0("Cell", seq_len(100))
    sce <- SingleCellExperiment(assays = list(counts = counts))
    sce$batch <- rep(c("B1", "B2"), each = 50)
    sce <- batchAwareQCMetrics(sce, batch = "batch")

    # Explore with 2.5 MADs instead of default 3
    result <- harmonizeQCThresholds(sce, batch = "batch", nmads = 2.5)
    result$n_flagged

Produces a panel of violin plots showing per-batch distributions of
QC metrics, with harmonized threshold lines overlaid. Useful for
inspecting whether batchAwareQCMetrics thresholds are
sensible and for comparing batch quality visually.
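For readers who want to see what such a panel involves, the following is a minimal ggplot2 sketch of a single metric: violins per batch, jittered points coloured by an outlier flag, and a per-batch threshold line. It is an illustration on simulated data, not the code behind plotBatchQC. The thresholds are plain per-batch median plus 3 MADs, and the data frame and column names are hypothetical.

```r
library(ggplot2)

set.seed(3)
df <- data.frame(
  batch = rep(c("B1", "B2"), times = c(300, 200)),
  pct_mt = c(rgamma(300, shape = 4, rate = 1), rgamma(200, shape = 6, rate = 1))
)

# Per-batch upper threshold: median + 3 MADs (unshrunken, for illustration)
thr <- aggregate(pct_mt ~ batch, data = df,
                 FUN = function(v) median(v) + 3 * mad(v))
names(thr)[2] <- "upper"
df$outlier <- df$pct_mt > thr$upper[match(df$batch, thr$batch)]

p <- ggplot(df, aes(x = batch, y = pct_mt)) +
  geom_violin(fill = "grey90") +
  geom_jitter(aes(colour = outlier), width = 0.15, size = 0.4, alpha = 0.4) +
  # A zero-height error bar draws a short horizontal line over each violin
  geom_errorbar(data = thr, aes(x = batch, ymin = upper, ymax = upper),
                inherit.aes = FALSE, width = 0.8, linetype = "dashed") +
  labs(x = "Batch", y = "Mitochondrial %", colour = "Outlier")
p
```

Drawing the threshold per batch rather than as one global horizontal line is what makes it visible when two batches are being held to different cutoffs.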

    plotBatchQC(
      sce,
      batch = NULL,
      metrics = NULL,
      show_thresholds = TRUE,
      nmads = 3,
      colour_by = "scBatchQC_outlier",
      point_size = 0.4,
      point_alpha = 0.4
    )

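| Argument | Description |
|---|---|
| sce | A SingleCellExperiment object. |
| batch | A character string naming the colData column that holds batch labels. Default: NULL. |
| metrics | A character vector of QC metrics to plot. Default: NULL. |
| show_thresholds | Logical. If TRUE, overlay harmonized threshold lines. Default: TRUE. |
| nmads | Passed to the threshold computation as the MAD multiplier. Default: 3. |
| colour_by | A character string naming the colData column used to colour points. Default: "scBatchQC_outlier". |
| point_size | Numeric. Jitter point size. Default: 0.4. |
| point_alpha | Numeric. Jitter point alpha. Default: 0.4. |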
A ggplot2 object. Can be modified with standard
ggplot2 functions or saved with ggsave().
See also: batchAwareQCMetrics,
harmonizeQCThresholds

    library(SingleCellExperiment)

    set.seed(42)
    # One Poisson draw per matrix entry (200 genes x 100 cells)
    counts <- matrix(rpois(200 * 100, lambda = 5), nrow = 200, ncol = 100)
    rownames(counts) <- paste0("Gene", seq_len(200))
    rownames(counts)[1:10] <- paste0("MT-", seq_len(10))
    colnames(counts) <- paste0("Cell", seq_len(100))
    sce <- SingleCellExperiment(assays = list(counts = counts))
    sce$batch <- rep(c("B1", "B2"), each = 50)
    sce <- batchAwareQCMetrics(sce, batch = "batch")

    plotBatchQC(sce, batch = "batch")

Returns the per-cell QC flag DataFrame.

    qcFlags(x, ...)

    ## S4 method for signature 'BQCResult'
    qcFlags(x, ...)

| Argument | Description |
|---|---|
| x | A BQCResult object. |
| ... | Additional arguments (not used). |
A DataFrame of per-cell logical QC flags.

    library(S4Vectors)

    obj <- BQCResult(
      qcFlags = DataFrame(low_lib = c(FALSE, TRUE)),
      doubletScores = c(0.04, 0.06),
      batchSummary = DataFrame(batch = "B1", doublet_rate_est = 0.04)
    )
    qcFlags(obj)

Prints a compact summary of a BQCResult object.

    ## S4 method for signature 'BQCResult'
    show(object)

| Argument | Description |
|---|---|
| object | A BQCResult object. |
Invisibly returns object.