---
title: "Fast donor-weighted DE for multi-donor scRNA-seq with scFastDE"
author:
  - name: "Subhadip Jana"
    email: "subhadipjana1409@gmail.com"
date: "`r Sys.Date()`"
output:
  BiocStyle::html_document:
    toc: true
    toc_float: true
    number_sections: true
vignette: >
  %\VignetteIndexEntry{Fast donor-weighted DE for multi-donor scRNA-seq with scFastDE}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
    collapse  = TRUE,
    comment   = "#>",
    fig.width = 7,
    fig.height = 5,
    message   = FALSE,
    warning   = FALSE
)
```

# Introduction

Multi-donor single-cell RNA-seq studies are now the norm for disease studies,
clinical trials, and population-level atlases. The standard approach to
differential expression (DE) in these datasets is pseudo-bulk analysis:
aggregate per-cell counts into per-donor profiles, then apply bulk DE tools
like `DESeq2` or `edgeR`.

This works well but has three underappreciated problems:

**Problem 1: Speed.** Bulk DE tools fit one model per gene in a serial R
loop. On a dataset with 30,000 genes and 50 donors, this loop runs 30,000
times. At scale — 100k+ cells, multiple cell types — this becomes the
analysis bottleneck.

**Problem 2: Donor weighting.** Standard pseudo-bulk tools treat a donor
with 5 cells identically to a donor with 500 cells. The 5-cell pseudo-bulk
is far noisier, yet gets equal weight in the DE model.

**Problem 3: Paired designs.** Many studies collect cells from the same
donors under multiple conditions (e.g. before/after treatment, or
stimulated/unstimulated). Naively aggregating by donor alone mixes
conditions together, destroying the treatment signal.

`scFastDE` addresses all three:

1. **Vectorised testing** — the entire pseudo-bulk matrix is passed to
   `limma::lmFit` in one call, which uses LAPACK routines to fit all gene
   models simultaneously.
2. **Cell-count weights** — each sample is weighted by `sqrt(n_cells)`,
   giving principled influence to well-represented samples.
3. **Sparse donor guard** — donors with too few cells are removed before
   aggregation, preventing noisy pseudo-bulks from distorting results.
4. **Auto-detect paired designs** — `fastDE` automatically detects whether
   donors appear in multiple conditions and builds the appropriate linear
   model (paired `~ 0 + condition + donor` or unpaired `~ 0 + condition`).

# Installation

```{r install, eval = FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("scFastDE")
```

# Quick start: unpaired design

In an **unpaired design**, each donor belongs to exactly one condition
(e.g. Donors 1–6 = healthy, Donors 7–12 = disease). `scFastDE`
automatically detects this and uses a simple `~ 0 + condition` model.

```{r quickstart}
library(scFastDE)
library(SingleCellExperiment)

# Simulate a multi-donor, two-condition dataset (unpaired)
set.seed(2024)
n_genes <- 500
n_cells <- 120

counts <- matrix(rpois(n_genes * n_cells, 8L),
                 nrow = n_genes, ncol = n_cells)
rownames(counts) <- paste0("Gene", seq_len(n_genes))
colnames(counts) <- paste0("Cell", seq_len(n_cells))

# Inject DE signal: first 20 genes upregulated in treatment
counts[1:20, 61:120] <- counts[1:20, 61:120] * 4L

sce <- SingleCellExperiment(assays = list(counts = counts))
sce$donor     <- rep(paste0("Donor", 1:12), each = 10)
sce$cell_type <- "CD4_Tcell"
sce$condition <- rep(c("ctrl", "treat"), each = 60)

sce
```

# Step 1: Filter sparse donors

Before aggregation, remove donor-cell type combinations with too few cells.

```{r filter}
sce <- filterSparseDonors(sce,
                           donor     = "donor",
                           cell_type = "cell_type",
                           min_cells = 5)
ncol(sce)
```

# Step 2: Inspect pseudo-bulk profiles

```{r pseudobulk}
pb <- fastPseudobulk(sce,
                      donor       = "donor",
                      cell_type   = "cell_type",
                      target_type = "CD4_Tcell")

cat("Pseudo-bulk matrix:", nrow(pb$pseudobulk),
    "genes x", ncol(pb$pseudobulk), "donors\n")

cat("\nDonor weights (sqrt of cell count):\n")
print(round(pb$donor_weights, 2))
```

Notice that all donors have equal cell counts here (10 each), so weights
are identical. In real data, weights will vary substantially.

# Step 3: Run differential expression

```{r fastde}
result <- fastDE(sce,
                  donor       = "donor",
                  cell_type   = "cell_type",
                  condition   = "condition",
                  target_type = "CD4_Tcell",
                  min_cells   = 5,
                  min_cpm     = 1,
                  min_donors  = 2)

result
```

```{r de_results}
# Top significant genes
dt <- as.data.frame(deTable(result))
dt_sorted <- dt[order(dt$adj.P.Val), ]
head(dt_sorted[, c("gene", "logFC", "P.Value", "adj.P.Val")], 15)
```

```{r de_summary}
# Summary
sig <- sum(dt$adj.P.Val < 0.05, na.rm = TRUE)
cat(sprintf("\n%d / %d genes significant (FDR < 0.05)\n",
            sig, nrow(dt)))
cat(sprintf("Injected genes recovered: %d / 20\n",
            sum(paste0("Gene", 1:20) %in%
                rownames(dt)[dt$adj.P.Val < 0.05])))
```

# Step 4: Visualise results

```{r volcano, fig.cap = "Volcano plot of DE results. Red = upregulated, blue = downregulated at FDR < 0.05 and |logFC| > 1."}
plotDEResults(result, fdr_thresh = 0.05, lfc_thresh = 1, top_n = 10)
```

# The donor weighting advantage

Here we demonstrate why weighting matters. We compare `fastDE` output
when one donor has far fewer cells than the others.

```{r weighting_demo}
# Create an unbalanced dataset
set.seed(99)
sce_unbal <- sce
# Simulate donor 1 having only 2 cells by keeping just 2
keep <- c(which(sce$donor != "Donor1"),
          which(sce$donor == "Donor1")[1:2])
sce_unbal <- sce_unbal[, keep]

pb_unbal <- fastPseudobulk(sce_unbal,
                             donor       = "donor",
                             cell_type   = "cell_type",
                             target_type = "CD4_Tcell")

cat("Donor weights (unbalanced):\n")
print(round(pb_unbal$donor_weights, 2))
```

Donor1's weight (`sqrt(2) ≈ 1.41`) is much lower than the other donors
(`sqrt(10) ≈ 3.16`), so its noisy pseudo-bulk has proportionally less
influence on the DE estimates — without being discarded entirely.

# Paired design support

Many single-cell experiments use a **paired design** where the same donors
contribute cells under multiple conditions — for example, stimulated vs
unstimulated PBMCs from the same individuals (Kang et al. 2018).

`scFastDE` **automatically detects** paired designs: when the same donor
IDs appear in more than one condition, `fastDE` switches to a paired model
that:

1. Aggregates pseudo-bulk per **donor × condition** pair (not per donor)
2. Builds a `~ 0 + condition + donor` design matrix that blocks on donor
3. Extracts the treatment effect while accounting for inter-donor variation

This is critical because naively aggregating by donor alone would mix
conditions together, destroying the treatment signal entirely.

## Simulated paired design

```{r paired_setup}
# Simulate paired data: SAME 6 donors in BOTH conditions
set.seed(2025)
n_genes <- 500
n_donors <- 6
cells_per_donor_cond <- 15
n_cells <- n_donors * 2 * cells_per_donor_cond  # 180 cells

counts <- matrix(rpois(n_genes * n_cells, 8L),
                 nrow = n_genes, ncol = n_cells)
rownames(counts) <- paste0("Gene", seq_len(n_genes))
colnames(counts) <- paste0("Cell", seq_len(n_cells))

# Inject DE signal: first 25 genes upregulated in stimulated condition
stim_cols <- seq(n_donors * cells_per_donor_cond + 1, n_cells)
counts[1:25, stim_cols] <- counts[1:25, stim_cols] * 4L

sce_paired <- SingleCellExperiment(assays = list(counts = counts))

# KEY: same donors (D1–D6) appear in BOTH conditions
sce_paired$donor <- rep(
    rep(paste0("D", seq_len(n_donors)), each = cells_per_donor_cond),
    2
)
sce_paired$cell_type <- "Monocyte"
sce_paired$condition <- rep(c("ctrl", "stim"),
                             each = n_donors * cells_per_donor_cond)

# Verify: each donor appears in both conditions
cat("Cells per donor and condition:\n")
print(table(sce_paired$donor, sce_paired$condition))
```

## Pseudo-bulk by donor × condition

When the `condition` argument is passed, `fastPseudobulk` creates one
pseudo-bulk column per donor-condition pair:

```{r paired_pseudobulk}
pb_paired <- fastPseudobulk(sce_paired,
                              donor       = "donor",
                              cell_type   = "cell_type",
                              target_type = "Monocyte",
                              condition   = "condition")

cat(sprintf("Pseudo-bulk: %d genes × %d samples\n",
            nrow(pb_paired$pseudobulk),
            ncol(pb_paired$pseudobulk)))
cat("  (6 donors × 2 conditions = 12 samples)\n\n")

cat("Sample info:\n")
print(pb_paired$sample_info)
```

## Running fastDE on paired data

`fastDE` auto-detects the paired design and reports it:

```{r paired_de}
result_paired <- fastDE(sce_paired,
                         donor       = "donor",
                         cell_type   = "cell_type",
                         condition   = "condition",
                         target_type = "Monocyte",
                         min_cells   = 5,
                         min_cpm     = 1,
                         min_donors  = 2)

result_paired
```

Note the output shows `Design: paired` and `Samples: 12` (not 6).

```{r paired_results}
dt_paired <- as.data.frame(deTable(result_paired))
dt_paired_sorted <- dt_paired[order(dt_paired$adj.P.Val), ]

sig_paired <- sum(dt_paired$adj.P.Val < 0.05, na.rm = TRUE)
recovered  <- sum(paste0("Gene", 1:25) %in%
                  rownames(dt_paired)[dt_paired$adj.P.Val < 0.05])

cat(sprintf("Significant genes (FDR < 0.05): %d / %d\n",
            sig_paired, nrow(dt_paired)))
cat(sprintf("Injected DE genes recovered:    %d / 25\n", recovered))
```

```{r paired_volcano, fig.cap = "Volcano plot for paired design. The paired model correctly blocks on donor, recovering the injected DE signal."}
plotDEResults(result_paired, fdr_thresh = 0.05, lfc_thresh = 1,
             top_n = 10)
```

## Why paired design matters

The paired model accounts for donor-level variation (age, genetics, etc.)
as a blocking factor. This has two major benefits:

- **More power**: by removing inter-donor noise, the residual variance
  is smaller, making it easier to detect true condition effects.
- **Correct aggregation**: without condition-aware aggregation, ctrl and
  stim cells from the same donor would be mixed into a single pseudo-bulk,
  washing out the treatment signal completely.

On real data like the Kang et al. 2018 IFN-β PBMC dataset, the paired
model recovers **5,550 DE genes** (including all 10 known IFN-β response
genes with logFCs of 2–8), whereas naively aggregating by donor alone
recovers **zero**.

# Session information

```{r session}
sessionInfo()
```