Fast donor-weighted DE for multi-donor scRNA-seq with scFastDE

Introduction

Multi-donor single-cell RNA-seq studies are now the norm for disease studies, clinical trials, and population-level atlases. The standard approach to differential expression (DE) in these datasets is pseudo-bulk analysis: aggregate per-cell counts into per-donor profiles, then apply bulk DE tools like DESeq2 or edgeR.

This works well but has three underappreciated problems:

Problem 1: Speed. Bulk DE tools fit one model per gene in a serial R loop. On a dataset with 30,000 genes and 50 donors, this loop runs 30,000 times. At scale — 100k+ cells, multiple cell types — this becomes the analysis bottleneck.

Problem 2: Donor weighting. Standard pseudo-bulk tools treat a donor with 5 cells identically to a donor with 500 cells. The 5-cell pseudo-bulk is far noisier, yet gets equal weight in the DE model.

Problem 3: Paired designs. Many studies collect cells from the same donors under multiple conditions (e.g. before/after treatment, or stimulated/unstimulated). Naively aggregating by donor alone mixes conditions together, destroying the treatment signal.

scFastDE addresses all three:

  1. Vectorised testing — the entire pseudo-bulk matrix is passed to limma::lmFit in one call, which uses LAPACK routines to fit all gene models simultaneously.
  2. Cell-count weights — each sample is weighted by sqrt(n_cells), giving principled influence to well-represented samples.
  3. Sparse donor guard — donors with too few cells are removed before aggregation, preventing noisy pseudo-bulks from distorting results.
  4. Auto-detect paired designsfastDE automatically detects whether donors appear in multiple conditions and builds the appropriate linear model (paired ~ 0 + condition + donor or unpaired ~ 0 + condition).

Installation

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("scFastDE")

Quick start: unpaired design

In an unpaired design, each donor belongs to exactly one condition (e.g. Donors 1–6 = healthy, Donors 7–12 = disease). scFastDE automatically detects this and uses a simple ~ 0 + condition model.

library(scFastDE)
library(SingleCellExperiment)

# Simulate a multi-donor, two-condition dataset (unpaired)
set.seed(2024)
n_genes <- 500
n_cells <- 120

counts <- matrix(rpois(n_genes * n_cells, 8L),
                 nrow = n_genes, ncol = n_cells)
rownames(counts) <- paste0("Gene", seq_len(n_genes))
colnames(counts) <- paste0("Cell", seq_len(n_cells))

# Inject DE signal: first 20 genes upregulated in treatment
counts[1:20, 61:120] <- counts[1:20, 61:120] * 4L

sce <- SingleCellExperiment(assays = list(counts = counts))
sce$donor     <- rep(paste0("Donor", 1:12), each = 10)
sce$cell_type <- "CD4_Tcell"
sce$condition <- rep(c("ctrl", "treat"), each = 60)

sce
#> class: SingleCellExperiment 
#> dim: 500 120 
#> metadata(0):
#> assays(1): counts
#> rownames(500): Gene1 Gene2 ... Gene499 Gene500
#> rowData names(0):
#> colnames(120): Cell1 Cell2 ... Cell119 Cell120
#> colData names(3): donor cell_type condition
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):

Step 1: Filter sparse donors

Before aggregation, remove donor-cell type combinations with too few cells.

sce <- filterSparseDonors(sce,
                           donor     = "donor",
                           cell_type = "cell_type",
                           min_cells = 5)
ncol(sce)
#> [1] 120

Step 2: Inspect pseudo-bulk profiles

pb <- fastPseudobulk(sce,
                      donor       = "donor",
                      cell_type   = "cell_type",
                      target_type = "CD4_Tcell")

cat("Pseudo-bulk matrix:", nrow(pb$pseudobulk),
    "genes x", ncol(pb$pseudobulk), "donors\n")
#> Pseudo-bulk matrix: 500 genes x 12 donors

cat("\nDonor weights (sqrt of cell count):\n")
#> 
#> Donor weights (sqrt of cell count):
print(round(pb$donor_weights, 2))
#>  Donor1  Donor2  Donor3  Donor4  Donor5  Donor6  Donor7  Donor8  Donor9 Donor10 
#>    3.16    3.16    3.16    3.16    3.16    3.16    3.16    3.16    3.16    3.16 
#> Donor11 Donor12 
#>    3.16    3.16

Notice that all donors have equal cell counts here (10 each), so weights are identical. In real data, weights will vary substantially.

Step 3: Run differential expression

result <- fastDE(sce,
                  donor       = "donor",
                  cell_type   = "cell_type",
                  condition   = "condition",
                  target_type = "CD4_Tcell",
                  min_cells   = 5,
                  min_cpm     = 1,
                  min_donors  = 2)

result
#> FDEResult
#>   Genes tested   : 500 
#>   Samples        : 12 
#>   Significant    : 149 (adj.P.Val < 0.05)
#>   Cell type      : CD4_Tcell 
#>   Condition      : condition 
#>   Design         : unpaired
# Top significant genes
dt <- as.data.frame(deTable(result))
dt_sorted <- dt[order(dt$adj.P.Val), ]
head(dt_sorted[, c("gene", "logFC", "P.Value", "adj.P.Val")], 15)
#>          gene    logFC      P.Value    adj.P.Val
#> Gene10 Gene10 1.940020 2.392649e-89 1.196324e-86
#> Gene11 Gene11 1.869497 4.761455e-86 1.190364e-83
#> Gene5   Gene5 2.019865 1.125598e-84 1.875997e-82
#> Gene16 Gene16 1.850108 2.794630e-84 3.493288e-82
#> Gene6   Gene6 1.954038 3.604269e-84 3.604269e-82
#> Gene3   Gene3 1.940079 1.012996e-80 8.441630e-79
#> Gene2   Gene2 1.814565 9.484184e-80 5.977602e-78
#> Gene15 Gene15 1.917421 9.564163e-80 5.977602e-78
#> Gene7   Gene7 1.837197 6.391214e-79 3.550675e-77
#> Gene20 Gene20 1.918719 1.290645e-78 6.453225e-77
#> Gene17 Gene17 1.795905 1.479058e-77 6.722991e-76
#> Gene12 Gene12 1.874153 5.052708e-77 2.105295e-75
#> Gene8   Gene8 1.758143 5.758639e-77 2.214861e-75
#> Gene13 Gene13 1.850643 3.588716e-75 1.281684e-73
#> Gene14 Gene14 1.710195 1.754404e-73 5.848012e-72
# Summary
sig <- sum(dt$adj.P.Val < 0.05, na.rm = TRUE)
cat(sprintf("\n%d / %d genes significant (FDR < 0.05)\n",
            sig, nrow(dt)))
#> 
#> 149 / 500 genes significant (FDR < 0.05)
cat(sprintf("Injected genes recovered: %d / 20\n",
            sum(paste0("Gene", 1:20) %in%
                rownames(dt)[dt$adj.P.Val < 0.05])))
#> Injected genes recovered: 20 / 20

Step 4: Visualise results

plotDEResults(result, fdr_thresh = 0.05, lfc_thresh = 1, top_n = 10)
Volcano plot of DE results. Red = upregulated, blue = downregulated at FDR < 0.05 and |logFC| > 1.

Volcano plot of DE results. Red = upregulated, blue = downregulated at FDR < 0.05 and |logFC| > 1.

The donor weighting advantage

Here we demonstrate why weighting matters. We compare fastDE output when one donor has far fewer cells than the others.

# Create an unbalanced dataset
set.seed(99)
sce_unbal <- sce
# Simulate donor 1 having only 2 cells by keeping just 2
keep <- c(which(sce$donor != "Donor1"),
          which(sce$donor == "Donor1")[1:2])
sce_unbal <- sce_unbal[, keep]

pb_unbal <- fastPseudobulk(sce_unbal,
                             donor       = "donor",
                             cell_type   = "cell_type",
                             target_type = "CD4_Tcell")

cat("Donor weights (unbalanced):\n")
#> Donor weights (unbalanced):
print(round(pb_unbal$donor_weights, 2))
#>  Donor2  Donor3  Donor4  Donor5  Donor6  Donor7  Donor8  Donor9 Donor10 Donor11 
#>    3.16    3.16    3.16    3.16    3.16    3.16    3.16    3.16    3.16    3.16 
#> Donor12  Donor1 
#>    3.16    1.41

Donor1’s weight (sqrt(2) ≈ 1.41) is much lower than the other donors (sqrt(10) ≈ 3.16), so its noisy pseudo-bulk has proportionally less influence on the DE estimates — without being discarded entirely.

Paired design support

Many single-cell experiments use a paired design where the same donors contribute cells under multiple conditions — for example, stimulated vs unstimulated PBMCs from the same individuals (Kang et al. 2018).

scFastDE automatically detects paired designs: when the same donor IDs appear in more than one condition, fastDE switches to a paired model that:

  1. Aggregates pseudo-bulk per donor × condition pair (not per donor)
  2. Builds a ~ 0 + condition + donor design matrix that blocks on donor
  3. Extracts the treatment effect while accounting for inter-donor variation

This is critical because naively aggregating by donor alone would mix conditions together, destroying the treatment signal entirely.

Simulated paired design

# Simulate paired data: SAME 6 donors in BOTH conditions
set.seed(2025)
n_genes <- 500
n_donors <- 6
cells_per_donor_cond <- 15
n_cells <- n_donors * 2 * cells_per_donor_cond  # 180 cells

counts <- matrix(rpois(n_genes * n_cells, 8L),
                 nrow = n_genes, ncol = n_cells)
rownames(counts) <- paste0("Gene", seq_len(n_genes))
colnames(counts) <- paste0("Cell", seq_len(n_cells))

# Inject DE signal: first 25 genes upregulated in stimulated condition
stim_cols <- seq(n_donors * cells_per_donor_cond + 1, n_cells)
counts[1:25, stim_cols] <- counts[1:25, stim_cols] * 4L

sce_paired <- SingleCellExperiment(assays = list(counts = counts))

# KEY: same donors (D1–D6) appear in BOTH conditions
sce_paired$donor <- rep(
    rep(paste0("D", seq_len(n_donors)), each = cells_per_donor_cond),
    2
)
sce_paired$cell_type <- "Monocyte"
sce_paired$condition <- rep(c("ctrl", "stim"),
                             each = n_donors * cells_per_donor_cond)

# Verify: each donor appears in both conditions
cat("Cells per donor and condition:\n")
#> Cells per donor and condition:
print(table(sce_paired$donor, sce_paired$condition))
#>     
#>      ctrl stim
#>   D1   15   15
#>   D2   15   15
#>   D3   15   15
#>   D4   15   15
#>   D5   15   15
#>   D6   15   15

Pseudo-bulk by donor × condition

When the condition argument is passed, fastPseudobulk creates one pseudo-bulk column per donor-condition pair:

pb_paired <- fastPseudobulk(sce_paired,
                              donor       = "donor",
                              cell_type   = "cell_type",
                              target_type = "Monocyte",
                              condition   = "condition")

cat(sprintf("Pseudo-bulk: %d genes × %d samples\n",
            nrow(pb_paired$pseudobulk),
            ncol(pb_paired$pseudobulk)))
#> Pseudo-bulk: 500 genes × 12 samples
cat("  (6 donors × 2 conditions = 12 samples)\n\n")
#>   (6 donors × 2 conditions = 12 samples)

cat("Sample info:\n")
#> Sample info:
print(pb_paired$sample_info)
#>    sample_id donor condition
#> 1  D1___ctrl    D1      ctrl
#> 2  D2___ctrl    D2      ctrl
#> 3  D3___ctrl    D3      ctrl
#> 4  D4___ctrl    D4      ctrl
#> 5  D5___ctrl    D5      ctrl
#> 6  D6___ctrl    D6      ctrl
#> 7  D1___stim    D1      stim
#> 8  D2___stim    D2      stim
#> 9  D3___stim    D3      stim
#> 10 D4___stim    D4      stim
#> 11 D5___stim    D5      stim
#> 12 D6___stim    D6      stim

Running fastDE on paired data

fastDE auto-detects the paired design and reports it:

result_paired <- fastDE(sce_paired,
                         donor       = "donor",
                         cell_type   = "cell_type",
                         condition   = "condition",
                         target_type = "Monocyte",
                         min_cells   = 5,
                         min_cpm     = 1,
                         min_donors  = 2)

result_paired
#> FDEResult
#>   Genes tested   : 500 
#>   Samples        : 12 
#>   Significant    : 343 (adj.P.Val < 0.05)
#>   Cell type      : Monocyte 
#>   Condition      : condition 
#>   Design         : paired

Note the output shows Design: paired and Samples: 12 (not 6).

dt_paired <- as.data.frame(deTable(result_paired))
dt_paired_sorted <- dt_paired[order(dt_paired$adj.P.Val), ]

sig_paired <- sum(dt_paired$adj.P.Val < 0.05, na.rm = TRUE)
recovered  <- sum(paste0("Gene", 1:25) %in%
                  rownames(dt_paired)[dt_paired$adj.P.Val < 0.05])

cat(sprintf("Significant genes (FDR < 0.05): %d / %d\n",
            sig_paired, nrow(dt_paired)))
#> Significant genes (FDR < 0.05): 343 / 500
cat(sprintf("Injected DE genes recovered:    %d / 25\n", recovered))
#> Injected DE genes recovered:    25 / 25
plotDEResults(result_paired, fdr_thresh = 0.05, lfc_thresh = 1,
             top_n = 10)
Volcano plot for paired design. The paired model correctly blocks on donor, recovering the injected DE signal.

Volcano plot for paired design. The paired model correctly blocks on donor, recovering the injected DE signal.

Why paired design matters

The paired model accounts for donor-level variation (age, genetics, etc.) as a blocking factor. This has two major benefits:

  • More power: by removing inter-donor noise, the residual variance is smaller, making it easier to detect true condition effects.
  • Correct aggregation: without condition-aware aggregation, ctrl and stim cells from the same donor would be mixed into a single pseudo-bulk, washing out the treatment signal completely.

On real data like the Kang et al. 2018 IFN-β PBMC dataset, the paired model recovers 5,550 DE genes (including all 10 known IFN-β response genes with logFCs of 2–8), whereas naively aggregating by donor alone recovers zero.

Session information

sessionInfo()
#> R version 4.6.1 (2026-06-24)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 26.04 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.32.so;  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] SingleCellExperiment_1.35.1 SummarizedExperiment_1.43.0
#>  [3] Biobase_2.73.1              GenomicRanges_1.65.0       
#>  [5] Seqinfo_1.3.0               IRanges_2.47.2             
#>  [7] S4Vectors_0.51.3            BiocGenerics_0.59.7        
#>  [9] generics_0.1.4              MatrixGenerics_1.25.0      
#> [11] matrixStats_1.5.0           scFastDE_0.99.3            
#> [13] BiocStyle_2.41.0           
#> 
#> loaded via a namespace (and not attached):
#>  [1] sass_0.4.10         SparseArray_1.13.2  lattice_0.22-9     
#>  [4] digest_0.6.39       evaluate_1.0.5      grid_4.6.1         
#>  [7] RColorBrewer_1.1-3  fastmap_1.2.0       jsonlite_2.0.0     
#> [10] Matrix_1.7-5        limma_3.69.2        BiocManager_1.30.27
#> [13] scales_1.4.0        codetools_0.2-20    jquerylib_0.1.4    
#> [16] abind_1.4-8         cli_3.6.6           rlang_1.2.0        
#> [19] XVector_0.53.0      withr_3.0.3         cachem_1.1.0       
#> [22] DelayedArray_0.39.3 yaml_2.3.12         otel_0.2.0         
#> [25] S4Arrays_1.13.0     tools_4.6.1         parallel_4.6.1     
#> [28] BiocParallel_1.47.0 ggplot2_4.0.3       vctrs_0.7.3        
#> [31] buildtools_1.0.0    R6_2.6.1            lifecycle_1.0.5    
#> [34] bslib_0.11.0        gtable_0.3.6        glue_1.8.1         
#> [37] statmod_1.5.2       xfun_0.59           sys_3.4.3          
#> [40] knitr_1.51          farver_2.1.2        htmltools_0.5.9    
#> [43] labeling_0.4.3      rmarkdown_2.31      maketools_1.3.2    
#> [46] compiler_4.6.1      S7_0.2.2