---
title: "Batch-aware QC for multi-sample scRNA-seq with scBatchQC"
author:
  - name: "Subhadip Jana"
    email: "subhadipjana1409@gmail.com"
date: "`r Sys.Date()`"
output:
  BiocStyle::html_document:
    toc: true
    toc_float: true
    number_sections: true
vignette: >
  %\VignetteIndexEntry{Batch-aware QC for multi-sample scRNA-seq with scBatchQC}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>",
    fig.width = 7,
    fig.height = 5,
    message = FALSE,
    warning = FALSE
)
```

# Introduction

`scBatchQC` is a Bioconductor package for **batch-aware quality
control** of multi-sample single-cell RNA-seq data. It provides
calibrated, per-batch QC thresholds using empirical Bayes shrinkage,
so you don't over-filter low-depth batches or under-filter
high-depth ones.

## The problem

In multi-batch single-cell RNA-seq experiments, QC thresholds can be
misleading when they are applied globally across all cells or estimated
independently for each small batch. This ignores or overreacts to
batch-to-batch variation and leads to:
 
- **Over-filtering** of lower-depth batches (good cells removed)
- **Under-filtering** of higher-depth batches (bad cells kept)

## What scBatchQC does

`scBatchQC` solves this with a **hierarchical empirical Bayes**
approach:

1. Computes per-batch QC metric medians and MADs
2. Shrinks per-batch estimates toward a global prior (borrows
   strength across batches)
3. Sets batch-specific thresholds that are calibrated, not arbitrary
4. Estimates per-batch doublet rates from cells loaded and protocol

The result: each batch gets its own QC thresholds, but small batches
are stabilised by borrowing information from larger ones.

## How scBatchQC compares to other QC packages

- **scater / scuttle** provide widely used tools for computing per-cell QC
  metrics, visualizing QC distributions, and identifying MAD-based outliers.
  `scBatchQC` builds on the same core QC metrics, but focuses on multi-batch
  experiments by estimating per-batch thresholds and shrinking them toward a
  shared empirical Bayes prior. This stabilizes thresholds for small or
  low-depth batches while preserving batch-specific calibration.

- **miQC** uses a probabilistic model based on mitochondrial read proportion
  and the number of detected genes to identify low-quality cells. `scBatchQC`
  instead provides a batch-aware thresholding framework for library size,
  detected genes, and mitochondrial fraction, with harmonized per-batch
  summaries returned directly in the `SingleCellExperiment`.

- **scDblFinder** detects likely doublets from the expression profiles of
  individual cells. `scBatchQC::estimateBatchDoubletRate()` is complementary:
  it estimates the expected doublet rate per batch from experimental metadata
  such as cells loaded and protocol, which is useful for planning, reporting,
  and checking expression-based doublet calls.

# Installation

```{r install, eval = FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
}
BiocManager::install("scBatchQC")
```

# Quick start: simulate a two-batch dataset

We create a toy dataset with two batches that mimic a common real
scenario: one high-depth batch and one lower-depth batch with a few
damaged cells.

```{r quickstart}
library(scBatchQC)
library(SingleCellExperiment)

set.seed(123)
n_genes <- 500

# Batch 1: high-depth fresh tissue (10x v3)
counts_b1 <- matrix(
    rpois(n_genes * 100, lambda = 12),
    nrow = n_genes, ncol = 100
)

# Batch 2: lower-depth cryopreserved (10x v2)
# with 5 deliberately damaged cells
counts_b2 <- matrix(
    rpois(n_genes * 100, lambda = 5),
    nrow = n_genes, ncol = 100
)
counts_b2[, 1:5] <- floor(counts_b2[, 1:5] / 10)

# Combine and label
counts <- cbind(counts_b1, counts_b2)
rownames(counts) <- paste0("Gene", seq_len(n_genes))
rownames(counts)[1:30] <- paste0("MT-", seq_len(30))
colnames(counts) <- paste0("Cell", seq_len(200))

sce <- SingleCellExperiment(assays = list(counts = counts))
sce$batch <- rep(c("Batch1_v3", "Batch2_v2"), each = 100)
sce
```

# Step 1: Batch-aware QC metrics

The central function `batchAwareQCMetrics()` computes three per-cell
QC metrics and flags outliers using batch-harmonised thresholds:

| Metric | Column added | What it measures |
|--------|-------------|------------------|
| Library size | `scBatchQC_sum` | Total UMI counts |
| Genes detected | `scBatchQC_detected` | Number of expressed genes |
| MT fraction | `scBatchQC_subsets_MT_percent` | Mitochondrial % |
| **Outlier flag** | `scBatchQC_outlier` | Should this cell be removed? |
| Reason | `scBatchQC_outlier_reason` | Which metric(s) failed |

```{r qc_metrics}
sce <- batchAwareQCMetrics(
    sce,
    batch = "batch",
    nmads = 3,
    shrink_strength = 0.5
)

# Columns added
grep("^scBatchQC", names(colData(sce)), value = TRUE)
```

```{r qc_table}
# Outlier counts per batch
table(Outlier = sce$scBatchQC_outlier, Batch = sce$batch)
```

## Understanding the key parameters

**`nmads`** (default: 3) — Number of MADs from the median to set the
threshold. Lower = stricter (more cells flagged). A typical range is
2.5–4.

**`shrink_strength`** (default: 0.5) — Controls how much per-batch
estimates are pulled toward the global mean:

| Value | Behaviour | When to use |
|-------|-----------|-------------|
| 0 | Pure per-batch thresholds | Many cells per batch (>500) |
| 0.5 | Balanced (default) | Most experiments |
| 0.7–0.9 | Strong shrinkage | Small batches (<50 cells) |
| 1 | Fully pooled (ignores batch) | Single-batch fallback |

# Step 2: Visualise QC distributions

`plotBatchQC()` shows violin plots of each metric per batch, with
threshold lines and outlier highlights:

```{r plot_qc, fig.height = 8, fig.cap = "QC distributions per batch. Dashed red lines = harmonised thresholds. Red points = flagged outliers."}
plotBatchQC(sce, batch = "batch")
```

# Step 3: Estimate doublet rates

`estimateBatchDoubletRate()` models the expected doublet rate per
batch based on how many cells were loaded and the protocol used:

```{r doublet_rates}
cells_loaded <- c(Batch1_v3 = 8000, Batch2_v2 = 5000)

sce <- estimateBatchDoubletRate(
    sce,
    batch = "batch",
    cells_loaded = cells_loaded,
    protocol = c(Batch1_v3 = "10x_v3", Batch2_v2 = "10x_v2")
)

# Estimated doublet rate per batch
tapply(sce$scBatchQC_doublet_rate, sce$batch, unique)
```

The model uses the empirical relationship from 10x Genomics:
~0.8% doublets per 1,000 cells loaded on the Chromium controller.

# Step 4: Explore thresholds interactively

Before committing to a specific `nmads`, use
`harmonizeQCThresholds()` to see how many cells would be flagged at
different stringencies — without re-running the full pipeline:

```{r threshold_sweep}
sweep <- lapply(c(2, 2.5, 3, 3.5, 4), function(n) {
    r <- harmonizeQCThresholds(sce, batch = "batch", nmads = n)
    data.frame(
        nmads = n,
        flagged = sum(rowSums(as.data.frame(r$n_flagged)))
    )
})
do.call(rbind, sweep)
```

**How to choose `nmads`**: Look for the "elbow" — the point where
flagged cells stop increasing sharply. Typical scRNA-seq experiments
use nmads = 3 (default) and flag 3–10% of cells.

# Step 5: Filter and continue

Once you're satisfied with the QC, filter and pass to downstream
analysis:

```{r filter}
sce_clean <- sce[, !sce$scBatchQC_outlier]
cat(
    "Kept", ncol(sce_clean), "of", ncol(sce),
    "cells (removed", sum(sce$scBatchQC_outlier), ")\n"
)
```

# Storing results: BQCResult container

For programmatic access to batch-level summaries, use the `BQCResult`
S4 class:

```{r bqc_result}
library(S4Vectors)

result <- BQCResult(
    qcFlags = DataFrame(outlier = sce$scBatchQC_outlier),
    doubletScores = sce$scBatchQC_doublet_rate,
    batchSummary = estimateBatchDoubletRate(
        sce,
        batch = "batch",
        cells_loaded = cells_loaded,
        return_sce = FALSE
    )
)
result
```

Access individual slots with:

```{r accessors}
head(qcFlags(result))
head(doubletScores(result))
batchSummary(result)
```

# Edge cases

| Scenario | Behaviour |
|----------|-----------|
| **Single batch** (no `batch` argument) | Falls back to global MAD — equivalent to `scuttle::isOutlier` |
| **Very small batch** (<50 cells) | Increase `shrink_strength` to 0.7–0.9 |
| **No MT genes found** | MT metric is silently skipped; QC uses library size and gene count only |
| **All cells flagged** | Your `nmads` is too strict, or data has very low variance — try `nmads = 4` |

# Recommended full workflow

```{r full_workflow, eval = FALSE}
# Step 1: batch-aware QC
sce <- batchAwareQCMetrics(sce, batch = "batch", nmads = 3)

# Step 2: doublet rates
sce <- estimateBatchDoubletRate(
    sce,
    batch = "batch",
    cells_loaded = my_cells_loaded
)

# Step 3: visualise
plotBatchQC(sce, batch = "batch")

# Step 4: filter
sce_clean <- sce[, !sce$scBatchQC_outlier]

# Step 5: downstream (scran, Seurat, Harmony, etc.)
```

# Session information

```{r session}
sessionInfo()
```