Batch-aware QC for multi-sample scRNA-seq with scBatchQC

Introduction

scBatchQC is a Bioconductor package for batch-aware quality control of multi-sample single-cell RNA-seq data. It provides calibrated, per-batch QC thresholds using empirical Bayes shrinkage, so you don’t over-filter low-depth batches or under-filter high-depth ones.

The problem

In multi-batch single-cell RNA-seq experiments, QC thresholds can be misleading when they are applied globally across all cells or estimated independently for each small batch. This ignores or overreacts to batch-to-batch variation and leads to:

  • Over-filtering of lower-depth batches (good cells removed)
  • Under-filtering of higher-depth batches (bad cells kept)

What scBatchQC does

scBatchQC solves this with a hierarchical empirical Bayes approach:

  1. Computes per-batch QC metric medians and MADs
  2. Shrinks per-batch estimates toward a global prior (borrows strength across batches)
  3. Sets batch-specific thresholds that are calibrated, not arbitrary
  4. Estimates per-batch doublet rates from cells loaded and protocol

The result: each batch gets its own QC thresholds, but small batches are stabilised by borrowing information from larger ones.

How scBatchQC compares to other QC packages

  • scater / scuttle provide widely used tools for computing per-cell QC metrics, visualizing QC distributions, and identifying MAD-based outliers. scBatchQC builds on the same core QC metrics, but focuses on multi-batch experiments by estimating per-batch thresholds and shrinking them toward a shared empirical Bayes prior. This stabilizes thresholds for small or low-depth batches while preserving batch-specific calibration.

  • miQC uses a probabilistic model based on mitochondrial read proportion and the number of detected genes to identify low-quality cells. scBatchQC instead provides a batch-aware thresholding framework for library size, detected genes, and mitochondrial fraction, with harmonized per-batch summaries returned directly in the SingleCellExperiment.

  • scDblFinder detects likely doublets from the expression profiles of individual cells. scBatchQC::estimateBatchDoubletRate() is complementary: it estimates the expected doublet rate per batch from experimental metadata such as cells loaded and protocol, which is useful for planning, reporting, and checking expression-based doublet calls.

Installation

if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
}
BiocManager::install("scBatchQC")

Quick start: simulate a two-batch dataset

We create a toy dataset with two batches that mimic a common real scenario: one high-depth batch and one lower-depth batch with a few damaged cells.

library(scBatchQC)
library(SingleCellExperiment)

set.seed(123)
n_genes <- 500

# Batch 1: high-depth fresh tissue (10x v3)
counts_b1 <- matrix(
    rpois(n_genes * 100, lambda = 12),
    nrow = n_genes, ncol = 100
)

# Batch 2: lower-depth cryopreserved (10x v2)
# with 5 deliberately damaged cells
counts_b2 <- matrix(
    rpois(n_genes * 100, lambda = 5),
    nrow = n_genes, ncol = 100
)
counts_b2[, 1:5] <- floor(counts_b2[, 1:5] / 10)

# Combine and label
counts <- cbind(counts_b1, counts_b2)
rownames(counts) <- paste0("Gene", seq_len(n_genes))
rownames(counts)[1:30] <- paste0("MT-", seq_len(30))
colnames(counts) <- paste0("Cell", seq_len(200))

sce <- SingleCellExperiment(assays = list(counts = counts))
sce$batch <- rep(c("Batch1_v3", "Batch2_v2"), each = 100)
sce
#> class: SingleCellExperiment 
#> dim: 500 200 
#> metadata(0):
#> assays(1): counts
#> rownames(500): MT-1 MT-2 ... Gene499 Gene500
#> rowData names(0):
#> colnames(200): Cell1 Cell2 ... Cell199 Cell200
#> colData names(1): batch
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):

Step 1: Batch-aware QC metrics

The central function batchAwareQCMetrics() computes three per-cell QC metrics and flags outliers using batch-harmonised thresholds:

Metric Column added What it measures
Library size scBatchQC_sum Total UMI counts
Genes detected scBatchQC_detected Number of expressed genes
MT fraction scBatchQC_subsets_MT_percent Mitochondrial %
Outlier flag scBatchQC_outlier Should this cell be removed?
Reason scBatchQC_outlier_reason Which metric(s) failed
sce <- batchAwareQCMetrics(
    sce,
    batch = "batch",
    nmads = 3,
    shrink_strength = 0.5
)

# Columns added
grep("^scBatchQC", names(colData(sce)), value = TRUE)
#> [1] "scBatchQC_sum"                "scBatchQC_detected"          
#> [3] "scBatchQC_subsets_MT_percent" "scBatchQC_outlier"           
#> [5] "scBatchQC_outlier_reason"
# Outlier counts per batch
table(Outlier = sce$scBatchQC_outlier, Batch = sce$batch)
#>        Batch
#> Outlier Batch1_v3 Batch2_v2
#>    TRUE       100       100

Understanding the key parameters

nmads (default: 3) — Number of MADs from the median to set the threshold. Lower = stricter (more cells flagged). A typical range is 2.5–4.

shrink_strength (default: 0.5) — Controls how much per-batch estimates are pulled toward the global mean:

Value Behaviour When to use
0 Pure per-batch thresholds Many cells per batch (>500)
0.5 Balanced (default) Most experiments
0.7–0.9 Strong shrinkage Small batches (<50 cells)
1 Fully pooled (ignores batch) Single-batch fallback

Step 2: Visualise QC distributions

plotBatchQC() shows violin plots of each metric per batch, with threshold lines and outlier highlights:

plotBatchQC(sce, batch = "batch")
QC distributions per batch. Dashed red lines = harmonised thresholds. Red points = flagged outliers.

QC distributions per batch. Dashed red lines = harmonised thresholds. Red points = flagged outliers.

Step 3: Estimate doublet rates

estimateBatchDoubletRate() models the expected doublet rate per batch based on how many cells were loaded and the protocol used:

cells_loaded <- c(Batch1_v3 = 8000, Batch2_v2 = 5000)

sce <- estimateBatchDoubletRate(
    sce,
    batch = "batch",
    cells_loaded = cells_loaded,
    protocol = c(Batch1_v3 = "10x_v3", Batch2_v2 = "10x_v2")
)

# Estimated doublet rate per batch
tapply(sce$scBatchQC_doublet_rate, sce$batch, unique)
#> Batch1_v3 Batch2_v2 
#>    0.0640    0.0375

The model uses the empirical relationship from 10x Genomics: ~0.8% doublets per 1,000 cells loaded on the Chromium controller.

Step 4: Explore thresholds interactively

Before committing to a specific nmads, use harmonizeQCThresholds() to see how many cells would be flagged at different stringencies — without re-running the full pipeline:

sweep <- lapply(c(2, 2.5, 3, 3.5, 4), function(n) {
    r <- harmonizeQCThresholds(sce, batch = "batch", nmads = n)
    data.frame(
        nmads = n,
        flagged = sum(rowSums(as.data.frame(r$n_flagged)))
    )
})
do.call(rbind, sweep)
#>   nmads flagged
#> 1   2.0     339
#> 2   2.5     222
#> 3   3.0     222
#> 4   3.5     215
#> 5   4.0     215

How to choose nmads: Look for the “elbow” — the point where flagged cells stop increasing sharply. Typical scRNA-seq experiments use nmads = 3 (default) and flag 3–10% of cells.

Step 5: Filter and continue

Once you’re satisfied with the QC, filter and pass to downstream analysis:

sce_clean <- sce[, !sce$scBatchQC_outlier]
cat(
    "Kept", ncol(sce_clean), "of", ncol(sce),
    "cells (removed", sum(sce$scBatchQC_outlier), ")\n"
)
#> Kept 0 of 200 cells (removed 200 )

Storing results: BQCResult container

For programmatic access to batch-level summaries, use the BQCResult S4 class:

library(S4Vectors)

result <- BQCResult(
    qcFlags = DataFrame(outlier = sce$scBatchQC_outlier),
    doubletScores = sce$scBatchQC_doublet_rate,
    batchSummary = estimateBatchDoubletRate(
        sce,
        batch = "batch",
        cells_loaded = cells_loaded,
        return_sce = FALSE
    )
)
result
#> BQCResult
#>   Cells          : 200 
#>   Batches        : 2 
#>   QC metrics     : outlier 
#>   Doublet range  : 0.038 - 0.064

Access individual slots with:

head(qcFlags(result))
#> DataFrame with 6 rows and 1 column
#>     outlier
#>   <logical>
#> 1      TRUE
#> 2      TRUE
#> 3      TRUE
#> 4      TRUE
#> 5      TRUE
#> 6      TRUE
head(doubletScores(result))
#> Batch1_v3 Batch1_v3 Batch1_v3 Batch1_v3 Batch1_v3 Batch1_v3 
#>     0.064     0.064     0.064     0.064     0.064     0.064
batchSummary(result)
#> DataFrame with 2 rows and 5 columns
#>                 batch n_cells_obs cells_loaded doublet_rate_est    protocol
#>           <character>   <integer>    <numeric>        <numeric> <character>
#> Batch1_v3   Batch1_v3         100         8000            0.064      10x_v3
#> Batch2_v2   Batch2_v2         100         5000            0.040      10x_v3

Edge cases

Scenario Behaviour
Single batch (no batch argument) Falls back to global MAD — equivalent to scuttle::isOutlier
Very small batch (<50 cells) Increase shrink_strength to 0.7–0.9
No MT genes found MT metric is silently skipped; QC uses library size and gene count only
All cells flagged Your nmads is too strict, or data has very low variance — try nmads = 4

Session information

sessionInfo()
#> R version 4.6.1 (2026-06-24)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 26.04 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.32.so;  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] SingleCellExperiment_1.35.1 SummarizedExperiment_1.43.0
#>  [3] Biobase_2.73.1              GenomicRanges_1.65.0       
#>  [5] Seqinfo_1.3.0               IRanges_2.47.2             
#>  [7] S4Vectors_0.51.3            BiocGenerics_0.59.7        
#>  [9] generics_0.1.4              MatrixGenerics_1.25.0      
#> [11] matrixStats_1.5.0           scBatchQC_0.99.3           
#> [13] BiocStyle_2.41.0           
#> 
#> loaded via a namespace (and not attached):
#>  [1] sass_0.4.10         SparseArray_1.13.2  lattice_0.22-9     
#>  [4] magrittr_2.0.5      digest_0.6.39       evaluate_1.0.5     
#>  [7] grid_4.6.1          RColorBrewer_1.1-3  fastmap_1.2.0      
#> [10] scrapper_1.7.3      jsonlite_2.0.0      Matrix_1.7-5       
#> [13] BiocManager_1.30.27 scales_1.4.0        codetools_0.2-20   
#> [16] jquerylib_0.1.4     abind_1.4-8         cli_3.6.6          
#> [19] rlang_1.2.0         XVector_0.53.0      BiocNeighbors_2.7.2
#> [22] withr_3.0.3         cachem_1.1.0        DelayedArray_0.39.3
#> [25] yaml_2.3.12         otel_0.2.0          S4Arrays_1.13.0    
#> [28] beachmat_2.29.0     tools_4.6.1         parallel_4.6.1     
#> [31] BiocParallel_1.47.0 dplyr_1.2.1         ggplot2_4.0.3      
#> [34] buildtools_1.0.0    vctrs_0.7.3         R6_2.6.1           
#> [37] lifecycle_1.0.5     pkgconfig_2.0.3     pillar_1.11.1      
#> [40] bslib_0.11.0        gtable_0.3.6        Rcpp_1.1.1-1.1     
#> [43] glue_1.8.1          tidyselect_1.2.1    tibble_3.3.1       
#> [46] xfun_0.59           sys_3.4.3           knitr_1.51         
#> [49] farver_2.1.2        htmltools_0.5.9     labeling_0.4.3     
#> [52] rmarkdown_2.31      maketools_1.3.2     compiler_4.6.1     
#> [55] S7_0.2.2