---
title: "Dreamlet analysis of single cell RNA-seq"
subtitle: 'Linear (mixed) model analysis of pseudobulk data'
author: "Developed by [Gabriel Hoffman](http://gabrielhoffman.github.io/)"
date: "Run on `r format(Sys.time())`"
documentclass: article
vignette: >
%\VignetteIndexEntry{Dreamlet analysis of single cell RNA-seq}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
%\usepackage[utf8]{inputenc}
output:
BiocStyle::html_document:
toc: true
toc_float: true
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(
echo = TRUE,
warning = FALSE,
message = FALSE,
error = FALSE,
tidy = FALSE,
dev = c("png"),
cache = TRUE
)
```
# Introduction
As the scale of single cell/nucleus RNA-seq has increased, so has the complexity of study designs. Analysis of datasets with simple study designs can be performed using linear model as in the [muscat package](https://bioconductor.org/packages/muscat). Yet analysis of datsets with complex study designs such as repeated measures or many technical batches can benefit from linear mixed model analysis to model to correlation structure between samples. We previously developed [dream](https://doi.org/10.1093/bioinformatics/btaa687) to apply linear mixed models to bulk RNA-seq data using a [limma](https://bioconductor.org/packages/release/bioc/html/limma.html)-style workflow. Dreamlet extends the previous work of dream and [muscat](https://www.nature.com/articles/s41467-020-19894-4) to apply linear mixed models to pseudobulk data. Dreamlet also supports linear models and facilitates application of 1) [variancePartition](https://bioconductor.org/packages/variancePartition) to quantify the contribution of multiple variables to expression variation, and 2) [zenith](https://github.com/GabrielHoffman/zenith) to perform gene set analysis on the differential expression signatures.
# Installation
To install this package, start R (version "4.3") and enter:
```{r install, eval=FALSE}
if (!require("BiocManager", quietly = TRUE)) {
install.packages("BiocManager")
}
# Select release #1 or #2
# 1) Bioconductor release
BiocManager::install("dreamlet")
# 2) Latest stable release
devtools::install_github("DiseaseNeurogenomics/dreamlet")
```
# Process single cell count data
Here we perform analysis of PBMCs from 8 individuals stimulated with interferon-β ([Kang, et al, 2018, Nature Biotech](https://www.nature.com/articles/nbt.4042)). This is a small dataset that does not have repeated measures or high dimensional batch effects, so the sophisticated features of dreamlet are not strictly necessary. But this gives us an opportunity to walk through a standard dreamlet workflow.
## Preprocess data
Here, single cell RNA-seq data is downloaded from [ExperimentHub](https://bioconductor.org/packages/ExperimentHub/).
```{r preprocess.data}
library(dreamlet)
library(muscat)
library(ExperimentHub)
library(zenith)
library(scater)
# Download data, specifying EH2259 for the Kang, et al study
eh <- ExperimentHub()
sce <- eh[["EH2259"]]
sce$ind = as.character(sce$ind)
# only keep singlet cells with sufficient reads
sce <- sce[rowSums(counts(sce) > 0) > 0, ]
sce <- sce[, colData(sce)$multiplets == "singlet"]
# compute QC metrics
qc <- perCellQCMetrics(sce)
# remove cells with few or many detected genes
ol <- isOutlier(metric = qc$detected, nmads = 2, log = TRUE)
sce <- sce[, !ol]
# set variable indicating stimulated (stim) or control (ctrl)
sce$StimStatus <- sce$stim
```
## Aggregate to pseudobulk
Dreamlet, like muscat, performs analysis at the pseudobulk-level by summing raw counts across cells for a given sample and cell type. `aggregateToPseudoBulk` is substantially faster for large on-disk datasets than [`muscat::aggregateData`](https://rdrr.io/bioc/muscat/man/aggregateData.html).
```{r aggregate}
# Since 'ind' is the individual and 'StimStatus' is the stimulus status,
# create unique identifier for each sample
sce$id <- paste0(sce$StimStatus, sce$ind)
# Create pseudobulk data by specifying cluster_id and sample_id
# Count data for each cell type is then stored in the `assay` field
# assay: entry in assayNames(sce) storing raw counts
# cluster_id: variable in colData(sce) indicating cell clusters
# sample_id: variable in colData(sce) indicating sample id for aggregating cells
pb <- aggregateToPseudoBulk(sce,
assay = "counts",
cluster_id = "cell",
sample_id = "id",
verbose = FALSE
)
# one 'assay' per cell type
assayNames(pb)
```
# Voom for pseudobulk
Apply [voom](https://rdrr.io/bioc/limma/man/voom.html)-style normalization for pseudobulk counts within each cell cluster using [voomWithDreamWeights](https://gabrielhoffman.github.io/variancePartition/reference/voomWithDreamWeights.html) to handle random effects (if specified).
```{r dreamlet, fig.width=8, fig.height=8}
# Normalize and apply voom/voomWithDreamWeights
res.proc <- processAssays(pb, ~StimStatus, min.count = 5)
# the resulting object of class dreamletProcessedData stores
# normalized data and other information
res.proc
```
`processAssays()` retains samples with at least `min.cells` in a given cell type. While dropping a few samples usually is not a problem, in some cases dropping sames can mean that a variable included in the regression formula no longer has any variation. For example, dropping all stimulated samples from analysis of a given cell type would be mean the variable `StimStatus` has no variation and is perfectly colinear with the intercept term. This colinearity issue is detected internally and variables with these problem are dropped from the regression formula for that particular cell type. The number of samples retained and the resulting formula used in each cell type can be accessed as follows. In this analysis, samples are dropped from 3 cell types but the original formula remains valid in each case.
```{r details}
# view details of dropping samples
details(res.proc)
```
Here the mean-variance trend from voom is shown for each cell type. Cell types with sufficient number of cells and reads show a clear mean-variance trend. While in rare cell types like megakaryocytes, fewer genes have sufficient reads and the trend is less apparent.
```{r voom.plots, fig.height=7}
# show voom plot for each cell clusters
plotVoom(res.proc)
# Show plots for subset of cell clusters
# plotVoom( res.proc[1:3] )
# Show plots for one cell cluster
# plotVoom( res.proc[["B cells"]])
```
# Variance partitioning
The [variancePartition](http://bioconductor.org/packages/variancePartition/) package uses linear and linear mixed models to quanify the contribution of multiple sources of expression variation at the gene-level. For each gene it fits a linear (mixed) model and evalutes the fraction of expression variation explained by each variable.
Variance fractions can be visualized at the gene-level for each cell type using a bar plot, or genome-wide using a violin plot.
```{r vp}
# run variance partitioning analysis
vp.lst <- fitVarPart(res.proc, ~StimStatus)
```
```{r vp.barplt, fig.height=3}
# Show variance fractions at the gene-level for each cell type
genes <- vp.lst$gene[2:4]
plotPercentBars(vp.lst[vp.lst$gene %in% genes, ])
```
```{r vp.violin, fig.height=7}
# Summarize variance fractions genome-wide for each cell type
plotVarPart(vp.lst, label.angle = 60)
```
# Differential expression
Since the normalized expression data and metadata are stored within `res.proc`, only the regression formula remains to be specified. Here we only included the stimulus status, but analyses of larger datasets can include covariates and random effects. With formula `~ StimStatus`, an intercept is fit and coefficient `StimStatusstim` log fold change between simulated and controls.
```{r test}
# Differential expression analysis within each assay,
# evaluated on the voom normalized data
res.dl <- dreamlet(res.proc, ~StimStatus)
# names of estimated coefficients
coefNames(res.dl)
# the resulting object of class dreamletResult
# stores results and other information
res.dl
```
## Volcano plots
The volcano plot can indicate the strength of the differential expression signal with each cell type. Red points indicate FDR < 0.05.
```{r plotVolcano, fig.width=8, fig.height=8}
plotVolcano(res.dl, coef = "StimStatusstim")
```
## Gene-level heatmap
For each cell type and specified gene, show z-statistic from `dreamlet` analysis. Grey indicates that insufficient reads were observed to include the gene in the analysis.
```{r plotGeneHeatmap, fig.height=3}
genes <- c("ISG20", "ISG15")
plotGeneHeatmap(res.dl, coef = "StimStatusstim", genes = genes)
```
## Extract results
Each entry in `res.dl` stores a model fit by [`dream()`](https://rdrr.io/bioc/variancePartition/man/dream-method.html), and results can be extracted using `topTable()` as in `limma` by specifying the coefficient of interest. The results shows the gene name, log fold change, average expression, t-statistic, p-value, FDR (i.e. adj.P.Val).
```{r extract}
# results from full analysis
topTable(res.dl, coef = "StimStatusstim")
# only B cells
topTable(res.dl[["B cells"]], coef = "StimStatusstim")
```
## Forest plot
A forest plot shows the log fold change and standard error of a given gene across all cell types. The color indicates the FDR.
```{r forrest}
plotForest(res.dl, coef = "StimStatusstim", gene = "ISG20")
```
## Box plot
Examine the expression of ISG20 between stimulation conditions within CD14+ Monocytes. Use `extractData()` to create a `tibble` with gene expression data and metadata from `colData()` from one cell type.
```{r boxplot}
# get data
df <- extractData(res.proc, "CD14+ Monocytes", genes = "ISG20")
# set theme
thm <- theme_classic() +
theme(aspect.ratio = 1, plot.title = element_text(hjust = 0.5))
# make plot
ggplot(df, aes(StimStatus, ISG20)) +
geom_boxplot() +
thm +
ylab(bquote(Expression ~ (log[2] ~ CPM))) +
ggtitle("ISG20")
```
## Advanced used of contrasts
A hypothesis test of the _difference_ between two or more coefficients can be performed using contrasts. The contrast matrix is evaluated for each cell type in the backend, so only the contrast string must be supplied to `dreamlet()`.
```{r dreamlet.contrasts}
# create a contrasts called 'Diff' that is the difference between expression
# in the stimulated and controls.
# More than one can be specified
contrasts <- c(Diff = "StimStatusstim - StimStatusctrl")
# Evalaute the regression model without an intercept term.
# Instead estimate the mean expression in stimulated, controls and then
# set Diff to the difference between the two
res.dl2 <- dreamlet(res.proc, ~ 0 + StimStatus, contrasts = contrasts)
# see estimated coefficients
coefNames(res.dl2)
# Volcano plot of Diff
plotVolcano(res.dl2[1:2], coef = "Diff")
```
This new `Diff` variable can then be used downstream for analysis asking for a coefficient. But note that since there is no intercept term in this model, the meaning of `StimStatusstim` changes here. When the formula is ` 0 + StimStatus` then `StimStatusstim` is the mean expression in stimulated samples.
For further information about using contrasts see [makeContrastsDream()](https://gabrielhoffman.github.io/variancePartition/reference/makeContrastsDream.html) and [vignette](https://gabrielhoffman.github.io/variancePartition/articles/dream.html#advanced-hypothesis-testing-1).
# Gene set analysis
While standard enrichment methods like Fishers exact test, requires specifying a FDR cutoff to identify differentially expressed genes. However, dichotomizing differential expression results is often too simple and ignores the quantitative variation captured by the differential expression test statistics. Here we use `zenith`, a wrapper for [`limma::camera`](https://rdrr.io/bioc/limma/man/camera.html), to perform gene set analysis using the full spectrum of differential expression test statistics. `zenith/camera` is a [competetive test](https://www.nature.com/articles/nrg.2016.29) that compares the mean test statistic for genes in a given gene set, to genes not in that set while accounting for correlation between genes.
Here, `zenith_gsa` takes a `dreamletResult` object, the coefficient of interest, and gene sets as a `GeneSetCollection` object from [GSEABase](https://bioconductor.org/packages/GSEABase/).
```{r zenith}
# Load Gene Ontology database
# use gene 'SYMBOL', or 'ENSEMBL' id
# use get_MSigDB() to load MSigDB
# Use Cellular Component (i.e. CC) to reduce run time here
go.gs <- get_GeneOntology("CC", to = "SYMBOL")
# Run zenith gene set analysis on result of dreamlet
res_zenith <- zenith_gsa(res.dl, coef = "StimStatusstim", go.gs)
# examine results for each ell type and gene set
head(res_zenith)
```
## Heatmap of top genesets
```{r heatmap, fig.height=10, fig.width=8}
# for each cell type select 5 genesets with largest t-statistic
# and 1 geneset with the lowest
# Grey boxes indicate the gene set could not be evaluted because
# to few genes were represented
plotZenithResults(res_zenith, 5, 1)
```
### All gene sets with FDR < 30%
Here, show all genes with FDR < 5% in any cell type
```{r heatmap2, fig.height=10, fig.width=8}
# get genesets with FDR < 30%
# Few significant genesets because uses Cellular Component (i.e. CC)
gs <- unique(res_zenith$Geneset[res_zenith$FDR < 0.3])
# keep only results of these genesets
df <- res_zenith[res_zenith$Geneset %in% gs, ]
# plot results, but with no limit based on the highest/lowest t-statistic
plotZenithResults(df, Inf, Inf)
```
# Comparing expression in clusters
Identifying genes that are differentially expressed between cell clusters incorporates a paired analysis design, since each individual is observed for each cell cluster.
```{r dreamletCompareClusters}
# test differential expression between B cells and the rest of the cell clusters
ct.pairs <- c("CD4 T cells", "rest")
fit <- dreamletCompareClusters(pb, ct.pairs, method = "fixed")
# The coefficient 'compare' is the value logFC between test and baseline:
# compare = cellClustertest - cellClusterbaseline
df_Bcell <- topTable(fit, coef = "compare")
head(df_Bcell)
```
# Gene-cluster specificity
Evaluate the specificity of each gene for each cluster, retaining only highly expressed genes:
```{r cellTypeSpecificity}
df_cts <- cellTypeSpecificity(pb)
# retain only genes with total CPM summed across cell type > 100
df_cts <- df_cts[df_cts$totalCPM > 100, ]
# Violin plot of specificity score for each cell type
plotViolin(df_cts)
```
Highlight expression fraction for most specific gene from each cell type:
```{r plotPercentBars}
genes <- rownames(df_cts)[apply(df_cts, 2, which.max)]
plotPercentBars(df_cts, genes = genes)
dreamlet::plotHeatmap(df_cts, genes = genes)
```
# Session Info
```{r session, echo=FALSE}
sessionInfo()
```