--- title: "Batch-aware QC for multi-sample scRNA-seq with scBatchQC" author: - name: "Subhadip Jana" email: "subhadipjana1409@gmail.com" date: "`r Sys.Date()`" output: BiocStyle::html_document: toc: true toc_float: true number_sections: true vignette: > %\VignetteIndexEntry{Batch-aware QC for multi-sample scRNA-seq with scBatchQC} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 7, fig.height = 5, message = FALSE, warning = FALSE ) ``` # Introduction `scBatchQC` is a Bioconductor package for **batch-aware quality control** of multi-sample single-cell RNA-seq data. It provides calibrated, per-batch QC thresholds using empirical Bayes shrinkage, so you don't over-filter low-depth batches or under-filter high-depth ones. ## The problem In multi-batch single-cell RNA-seq experiments, QC thresholds can be misleading when they are applied globally across all cells or estimated independently for each small batch. This ignores or overreacts to batch-to-batch variation and leads to: - **Over-filtering** of lower-depth batches (good cells removed) - **Under-filtering** of higher-depth batches (bad cells kept) ## What scBatchQC does `scBatchQC` solves this with a **hierarchical empirical Bayes** approach: 1. Computes per-batch QC metric medians and MADs 2. Shrinks per-batch estimates toward a global prior (borrows strength across batches) 3. Sets batch-specific thresholds that are calibrated, not arbitrary 4. Estimates per-batch doublet rates from cells loaded and protocol The result: each batch gets its own QC thresholds, but small batches are stabilised by borrowing information from larger ones. ## How scBatchQC compares to other QC packages - **scater / scuttle** provide widely used tools for computing per-cell QC metrics, visualizing QC distributions, and identifying MAD-based outliers. `scBatchQC` builds on the same core QC metrics, but focuses on multi-batch experiments by estimating per-batch thresholds and shrinking them toward a shared empirical Bayes prior. This stabilizes thresholds for small or low-depth batches while preserving batch-specific calibration. - **miQC** uses a probabilistic model based on mitochondrial read proportion and the number of detected genes to identify low-quality cells. `scBatchQC` instead provides a batch-aware thresholding framework for library size, detected genes, and mitochondrial fraction, with harmonized per-batch summaries returned directly in the `SingleCellExperiment`. - **scDblFinder** detects likely doublets from the expression profiles of individual cells. `scBatchQC::estimateBatchDoubletRate()` is complementary: it estimates the expected doublet rate per batch from experimental metadata such as cells loaded and protocol, which is useful for planning, reporting, and checking expression-based doublet calls. # Installation ```{r install, eval = FALSE} if (!requireNamespace("BiocManager", quietly = TRUE)) { install.packages("BiocManager") } BiocManager::install("scBatchQC") ``` # Quick start: simulate a two-batch dataset We create a toy dataset with two batches that mimic a common real scenario: one high-depth batch and one lower-depth batch with a few damaged cells. ```{r quickstart} library(scBatchQC) library(SingleCellExperiment) set.seed(123) n_genes <- 500 # Batch 1: high-depth fresh tissue (10x v3) counts_b1 <- matrix( rpois(n_genes * 100, lambda = 12), nrow = n_genes, ncol = 100 ) # Batch 2: lower-depth cryopreserved (10x v2) # with 5 deliberately damaged cells counts_b2 <- matrix( rpois(n_genes * 100, lambda = 5), nrow = n_genes, ncol = 100 ) counts_b2[, 1:5] <- floor(counts_b2[, 1:5] / 10) # Combine and label counts <- cbind(counts_b1, counts_b2) rownames(counts) <- paste0("Gene", seq_len(n_genes)) rownames(counts)[1:30] <- paste0("MT-", seq_len(30)) colnames(counts) <- paste0("Cell", seq_len(200)) sce <- SingleCellExperiment(assays = list(counts = counts)) sce$batch <- rep(c("Batch1_v3", "Batch2_v2"), each = 100) sce ``` # Step 1: Batch-aware QC metrics The central function `batchAwareQCMetrics()` computes three per-cell QC metrics and flags outliers using batch-harmonised thresholds: | Metric | Column added | What it measures | |--------|-------------|------------------| | Library size | `scBatchQC_sum` | Total UMI counts | | Genes detected | `scBatchQC_detected` | Number of expressed genes | | MT fraction | `scBatchQC_subsets_MT_percent` | Mitochondrial % | | **Outlier flag** | `scBatchQC_outlier` | Should this cell be removed? | | Reason | `scBatchQC_outlier_reason` | Which metric(s) failed | ```{r qc_metrics} sce <- batchAwareQCMetrics( sce, batch = "batch", nmads = 3, shrink_strength = 0.5 ) # Columns added grep("^scBatchQC", names(colData(sce)), value = TRUE) ``` ```{r qc_table} # Outlier counts per batch table(Outlier = sce$scBatchQC_outlier, Batch = sce$batch) ``` ## Understanding the key parameters **`nmads`** (default: 3) — Number of MADs from the median to set the threshold. Lower = stricter (more cells flagged). A typical range is 2.5–4. **`shrink_strength`** (default: 0.5) — Controls how much per-batch estimates are pulled toward the global mean: | Value | Behaviour | When to use | |-------|-----------|-------------| | 0 | Pure per-batch thresholds | Many cells per batch (>500) | | 0.5 | Balanced (default) | Most experiments | | 0.7–0.9 | Strong shrinkage | Small batches (<50 cells) | | 1 | Fully pooled (ignores batch) | Single-batch fallback | # Step 2: Visualise QC distributions `plotBatchQC()` shows violin plots of each metric per batch, with threshold lines and outlier highlights: ```{r plot_qc, fig.height = 8, fig.cap = "QC distributions per batch. Dashed red lines = harmonised thresholds. Red points = flagged outliers."} plotBatchQC(sce, batch = "batch") ``` # Step 3: Estimate doublet rates `estimateBatchDoubletRate()` models the expected doublet rate per batch based on how many cells were loaded and the protocol used: ```{r doublet_rates} cells_loaded <- c(Batch1_v3 = 8000, Batch2_v2 = 5000) sce <- estimateBatchDoubletRate( sce, batch = "batch", cells_loaded = cells_loaded, protocol = c(Batch1_v3 = "10x_v3", Batch2_v2 = "10x_v2") ) # Estimated doublet rate per batch tapply(sce$scBatchQC_doublet_rate, sce$batch, unique) ``` The model uses the empirical relationship from 10x Genomics: ~0.8% doublets per 1,000 cells loaded on the Chromium controller. # Step 4: Explore thresholds interactively Before committing to a specific `nmads`, use `harmonizeQCThresholds()` to see how many cells would be flagged at different stringencies — without re-running the full pipeline: ```{r threshold_sweep} sweep <- lapply(c(2, 2.5, 3, 3.5, 4), function(n) { r <- harmonizeQCThresholds(sce, batch = "batch", nmads = n) data.frame( nmads = n, flagged = sum(rowSums(as.data.frame(r$n_flagged))) ) }) do.call(rbind, sweep) ``` **How to choose `nmads`**: Look for the "elbow" — the point where flagged cells stop increasing sharply. Typical scRNA-seq experiments use nmads = 3 (default) and flag 3–10% of cells. # Step 5: Filter and continue Once you're satisfied with the QC, filter and pass to downstream analysis: ```{r filter} sce_clean <- sce[, !sce$scBatchQC_outlier] cat( "Kept", ncol(sce_clean), "of", ncol(sce), "cells (removed", sum(sce$scBatchQC_outlier), ")\n" ) ``` # Storing results: BQCResult container For programmatic access to batch-level summaries, use the `BQCResult` S4 class: ```{r bqc_result} library(S4Vectors) result <- BQCResult( qcFlags = DataFrame(outlier = sce$scBatchQC_outlier), doubletScores = sce$scBatchQC_doublet_rate, batchSummary = estimateBatchDoubletRate( sce, batch = "batch", cells_loaded = cells_loaded, return_sce = FALSE ) ) result ``` Access individual slots with: ```{r accessors} head(qcFlags(result)) head(doubletScores(result)) batchSummary(result) ``` # Edge cases | Scenario | Behaviour | |----------|-----------| | **Single batch** (no `batch` argument) | Falls back to global MAD — equivalent to `scuttle::isOutlier` | | **Very small batch** (<50 cells) | Increase `shrink_strength` to 0.7–0.9 | | **No MT genes found** | MT metric is silently skipped; QC uses library size and gene count only | | **All cells flagged** | Your `nmads` is too strict, or data has very low variance — try `nmads = 4` | # Recommended full workflow ```{r full_workflow, eval = FALSE} # Step 1: batch-aware QC sce <- batchAwareQCMetrics(sce, batch = "batch", nmads = 3) # Step 2: doublet rates sce <- estimateBatchDoubletRate( sce, batch = "batch", cells_loaded = my_cells_loaded ) # Step 3: visualise plotBatchQC(sce, batch = "batch") # Step 4: filter sce_clean <- sce[, !sce$scBatchQC_outlier] # Step 5: downstream (scran, Seurat, Harmony, etc.) ``` # Session information ```{r session} sessionInfo() ```