muscat
For details on the concept and technicalities of DS analysis, and the methods presented here, consider having a look at our publication:
Crowell HL, Soneson C*, Germain P-L*, Calini D,
Collin L, Raposo C, Malhotra D, and Robinson MD:
muscat detects subpopulation-specific state transitions from
multi-sample multi-condition single-cell transcriptomics data.
Nature Communications 11, 6077 (2020).
DOI: 10.1038/s41467-020-19894-4
A fundamental task in the analysis of single-cell RNA-sequencing
(scRNA-seq) data is the identification of systematic transcriptional
changes (Stegle, Teichmann, and Marioni
2015). Such analyses are a critical step in the understanding of
molecular responses, and have applications in development, in
perturbation studies or in disease.
Most of the current scRNA-seq differential expression (DE) analysis
methods are designed to test one set of cells against another (or more
generally, multiple sets together), and can be used to compare cell
clusters (e.g., for identifying marker genes) or across conditions
(cells from one condition versus another) (Soneson and Robinson 2018). In such statistical
models, the cells are the experimental units and thus represent the
population that inferences will extrapolate to.
Using established terminology, we refer to cell identity as
the combination of cell type, a stable molecular signature, and
cell state, a transient snapshot of a cell’s molecular events
(Wagner, Regev, and Yosef 2016; Trapnell
2015). This classification is inherently arbitrary, but still
provides a basis for biological interpretation and a framework for
discovering interesting expression patterns from scRNA-seq datasets. For
example, T cells could be defined as a single (albeit diverse) cell type
or could be divided into discrete subtypes, if relevant information to
categorize each cell at this level were available. In either case, the
framework presented here would be able to focus on the cell type of
interest and look for changes (in expression) across samples.
Given the emergence of multi-sample multi-group scRNA-seq datasets, the
goal becomes making sample-level inferences (i.e., experimental units
are samples). Thus, differential state (DS) analysis is defined as
following a given cell type across a set of samples (e.g., individuals)
and experimental conditions (e.g., treatments), in order to identify
cell-type-specific responses, i.e., changes in cell state. DS analysis:
i) should be able to detect diluted changes that only affect a single
cell type, a subset of cell types or even a subset of a single
subpopulation; and, ii) is intended to be orthogonal to clustering or
cell type assignment.
The starting point for a DS analysis is a (sparse) matrix of gene expression, either as counts or some kind of normalized data, where rows = genes and columns = cells. Each cell additionally has a cluster (subpopulation) label as well as a sample label; metadata should accompany the list of samples, such that they can be organized into comparable groups with sample-level replicates (e.g., via a design matrix).
The approach presented here is modular and thus subpopulation labels could originate from an earlier step in the analysis, such as clustering (Duò, Robinson, and Soneson 2018; Freytag et al. 2018), perhaps after integration (Butler et al. 2018; Stuart et al. 2019) or after labeling of clusters (Diaz-Mejia et al. 2019) or after cell-level type assignment (Zhang et al. 2019).
For this vignette, we will use a SingleCellExperiment (SCE) containing 10x droplet-based scRNA-seq PBCM data from 8 Lupus patients obtained before and after 6h-treatment with IFN-β (Kang et al. 2018). The complete raw data, as well as gene and cell metadata is available through the NCBI GEO, accession number GSE96583.
The Kang et al. (2018) dataset has been
made available through Bioconductor’s ExperimentHub
and can
be loaded into R as follows: We first initialize a Hub instance to
search for and load available data with the ExperimentHub
function, and store the complete list of records in the variable
eh
. Using query
, we then retrieve any records
that match our keyword(s) of interest, as well as their corresponding
accession ID (EH1234).
## ExperimentHub with 3 records
## # snapshotDate(): 2024-10-24
## # $dataprovider: NCI_GDC, GEO
## # $species: Homo sapiens
## # $rdataclass: character, SingleCellExperiment, BSseq
## # additional mcols(): taxonomyid, genome, description,
## # coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
## # rdatapath, sourceurl, sourcetype
## # retrieve records with, e.g., 'object[["EH1661"]]'
##
## title
## EH1661 | Whole Genome Bisulfit Sequencing Data for 47 samples
## EH1662 | Whole Genome Bisulfit Sequencing Data for 47 samples
## EH2259 | Kang18_8vs8
Finally, we load the data of interest into R via [[
and
the corresponding accession ID. The dataset contains >35,000 genes
and ~29,000 cells:
## class: SingleCellExperiment
## dim: 35635 29065
## metadata(0):
## assays(1): counts
## rownames(35635): MIR1302-10 FAM138A ... MT-ND6 MT-CYB
## rowData names(2): ENSEMBL SYMBOL
## colnames(29065): AAACATACAATGCC-1 AAACATACATTTCC-1 ... TTTGCATGGTTTGG-1
## TTTGCATGTCTTAC-1
## colData names(5): ind stim cluster cell multiplets
## reducedDimNames(1): TSNE
## mainExpName: NULL
## altExpNames(0):
The scater package (McCarthy et al. 2017) provides a variety of tools for preprocessing and quality control of single-cell transcriptomic data. For completeness, we will apply some minimal filtering steps to
For more thorough preprocessing, we refer to the Quality control with scater vignette.
## [1] 18890 29065
We use perCellQCMetrics
to compute various per-cell
quality control metrics, and proceed with filtering cells and genes as
noted above:
# calculate per-cell quality control (QC) metrics
library(scater)
qc <- perCellQCMetrics(sce)
# remove cells with few or many detected genes
ol <- isOutlier(metric = qc$detected, nmads = 2, log = TRUE)
sce <- sce[, !ol]
dim(sce)
## [1] 18890 26820
## [1] 7118 26820
Finally, we use logNormCounts
to calculate log2-transformed normalized
expression values by dividing each count by its size factor, adding a
pseudo-count of 1, and log-transforming1.
Alternatively, expression values could be obtained via
vst
(variance stabilizing transformation) from the sctransform
package (Hafemeister and Satija 2019),
which returns Pearson residuals from a regularized negative binomial
regression model that can be interpreted as normalized expression
values:
By default, scater’s
functions will try to access the assay data specified via argument
exprs_values
(default logcounts
) for
e.g. visualization and dimension reduction. When an alternative assay
such as the vstresiduals
above should be used, it is thus
necessary to explicitly specify this, for example, via
runUMAP(sce, exprs_values = "vstresiduals")
to compute UMAP
cell embeddings on the assay data compute above.
muscat
expects a certain format of the input SCE.
Specifically, the following cell metadata (colData
) columns
have to be provided:
"sample_id"
: unique sample identifiers (e.g.,
PeterPan_ref1, Nautilus_trt3, …)"cluster_id"
: subpopulation (cluster) assignments
(e.g., T cells, monocytes, …)"group_id"
: experimental group/condition (e.g.,
control/treatment, healthy/diseased, …)sce$id <- paste0(sce$stim, sce$ind)
(sce <- prepSCE(sce,
kid = "cell", # subpopulation assignments
gid = "stim", # group IDs (ctrl/stim)
sid = "id", # sample IDs (ctrl/stim.1234)
drop = TRUE)) # drop all other colData columns
## class: SingleCellExperiment
## dim: 7118 26820
## metadata(1): experiment_info
## assays(2): counts logcounts
## rownames(7118): NOC2L HES4 ... S100B PRMT2
## rowData names(2): ENSEMBL SYMBOL
## colnames(26820): AAACATACAATGCC-1 AAACATACATTTCC-1 ... TTTGCATGGTTTGG-1
## TTTGCATGTCTTAC-1
## colData names(3): cluster_id sample_id group_id
## reducedDimNames(1): TSNE
## mainExpName: NULL
## altExpNames(0):
For consistency and easy accession throughout this vignette, we will store cluster and sample IDs, as well as the number of clusters and samples into the following simple variables:
As we will be aggregating measurements at the cluster-sample level,
it is of particular importance to check the number of cells captured for
each such instance. While aggregateData
(see Section
@ref(sec-agg)) allows excluding cluster-sample combinations with less
than a threshold number of cells, clusters or samples with overall very
low cell-counts may be excluded from further analysis at this point
already.
For the Kang et al. (2018) dataset, for example, one might consider removing the Dendritic cells and Megakaryocytes clusters, as these contain less than 50 cells across all samples.
##
## B cells CD14+ Monocytes CD4 T cells CD8 T cells Dendritic cells
## ctrl101 113 186 336 95 5
## ctrl1015 476 783 919 226 11
## ctrl1016 144 419 526 671 10
## ctrl1039 30 116 202 30 5
## ctrl107 51 222 197 31 2
## ctrl1244 134 429 1215 82 28
## ctrl1256 240 383 1136 156 12
## ctrl1488 234 317 1343 78 17
## stim101 144 222 437 121 20
## stim1015 357 683 814 153 17
## stim1016 129 361 426 600 10
## stim1039 39 154 318 40 7
## stim107 56 185 217 22 7
## stim1244 94 318 980 46 19
## stim1256 211 369 1047 133 16
## stim1488 283 370 1658 73 34
##
## FCGR3A+ Monocytes Megakaryocytes NK cells
## ctrl101 81 12 84
## ctrl1015 232 25 208
## ctrl1016 126 15 151
## ctrl1039 28 5 20
## ctrl107 29 5 49
## ctrl1244 53 19 131
## ctrl1256 50 15 275
## ctrl1488 99 21 120
## stim101 126 7 120
## stim1015 222 24 224
## stim1016 124 12 239
## stim1039 36 13 32
## stim107 36 4 51
## stim1244 35 14 136
## stim1256 73 21 257
## stim1488 139 35 187
The dimension reductions (DR) available within the SCE can be
accessed via reducedDims
from the scater
package. The data provided by Kang et al.
(2018) already contains t-SNE coordinates; however, we can of
course compute additional dimension reductions using one of scater’s
runX
functions:
Using scater’s
plotReducedDim
function, we can plot t-SNE and UMAP
representations colored by cluster and group IDs, respectively. We
additionally create a small wrapper function, .plot_dr()
,
to improve the readability of color legends and simplify the plotting
theme:
# wrapper to prettify reduced dimension plots
.plot_dr <- function(sce, dr, col)
plotReducedDim(sce, dimred = dr, colour_by = col) +
guides(fill = guide_legend(override.aes = list(alpha = 1, size = 3))) +
theme_minimal() + theme(aspect.ratio = 1)
For our dataset, the t-SNE and UMAP colored by
cluster_id
s show that cell-populations are well-separated
from one another. IFN-β
stimulation manifests as a severe shift in the low-dimensional
projection of cells when coloring by group_id
s, indicating
widespread, genome-scale transcriptional changes.
# downsample to max. 100 cells per cluster
cs_by_k <- split(colnames(sce), sce$cluster_id)
cs100 <- unlist(sapply(cs_by_k, function(u)
sample(u, min(length(u), 100))))
# plot t-SNE & UMAP colored by cluster & group ID
for (dr in c("TSNE", "UMAP"))
for (col in c("cluster_id", "group_id"))
.plot_dr(sce[, cs100], dr, col)
To test for state changes across conditions, we will consider two types of approaches: i) mixed models that act directly on cell-level measurements; and ii) aggregation-based methods that act on pseudobulk data. For both approaches, each gene is tested for state changes in each cluster. Thus, a total of #(genes) × #(clusters) tests will be performed per comparison of interest. The following schematic summarizes the data representation considered by cell- and sample-level approaches, respectively:
In order to leverage existing robust bulk RNA-seq DE frameworks, such as edgeR (Robinson, McCarthy, and Smyth 2010), DESeq2 (Love, Huber, and Anders 2014), and limma (Ritchie et al. 2015), we first aggregate measurements for each sample (in each cluster) to obtain pseudobulk data.
In general, aggregateData()
will aggregate the data by
the colData
variables specified with argument
by
, and return a SingleCellExperiment
containing pseudobulk data.
For DS analysis, measurements must be aggregated at the cluster-sample
level (default by = c("cluster_id", "sample_id"
). In this
case, the returned SingleCellExperiment
will contain one
assay per cluster, where rows = genes and columns = samples. Arguments
assay
and fun
specify the input data and
summary statistic, respectively, to use for aggregation.
While, in principle, various combinations of input data
(raw/(log-)normalized counts, CPM ect.) and summary statistics (sum,
mean, median) could be applied, we here default to the sum of raw
counts:
pb <- aggregateData(sce,
assay = "counts", fun = "sum",
by = c("cluster_id", "sample_id"))
# one sheet per subpopulation
assayNames(pb)
## [1] "B cells" "CD14+ Monocytes" "CD4 T cells"
## [4] "CD8 T cells" "Dendritic cells" "FCGR3A+ Monocytes"
## [7] "Megakaryocytes" "NK cells"
## NOC2L HES4 ISG15 TNFRSF18 TNFRSF4 SDF4
## ctrl101 12 0 13 8 1 13
## ctrl1015 37 4 136 55 16 42
## ctrl1016 13 3 42 10 0 15
## ctrl1039 2 1 16 3 1 1
## ctrl107 7 0 6 3 4 5
## ctrl1244 24 7 25 30 11 7
## ctrl1256 22 1 65 30 11 17
## ctrl1488 19 1 34 27 6 23
## stim101 12 8 1362 3 1 9
## stim1015 34 27 4022 21 2 19
## stim1016 7 8 1271 2 4 6
## stim1039 3 3 393 1 1 2
## stim107 4 2 556 1 0 3
## stim1244 13 10 830 8 2 5
## stim1256 14 10 2235 10 2 6
## stim1488 18 5 2927 13 1 19
Prior to conducting any formal testing, we can compute a multi-dimensional scaling (MDS) plot of aggregated signal to explore overall sample similarities.
pbMDS
takes as input any SCE containg PB data as
returned by aggregateData
, and computes MDS dimensions
using edgeR.
Ideally, such a representation of the data should separate both clusters
and groups from one another. Vice versa, samples from the same cluster
or group should cluster together.
In our MDS plot on pseudo-bulk counts (Fig. @ref(fig:pb-mds)), we can observe that the first dimension (MDS1) clearly separates cell populations (clusters), while the second (MDS2) separates control and stimulated samples (groups). Furthermore, the two T-cell clusters fall close to each other.
If you’re not satisfied with how the plot looks, here’s an example of
how to modify the ggplot
-object from above in various
ways:
# use very distinctive shaping of groups & change cluster colors
pb_mds <- pb_mds +
scale_shape_manual(values = c(17, 4)) +
scale_color_manual(values = RColorBrewer::brewer.pal(8, "Set2"))
# change point size & alpha
pb_mds$layers[[1]]$aes_params$size <- 5
pb_mds$layers[[1]]$aes_params$alpha <- 0.6
pb_mds
Once we have assembled the pseudobulk data, we can test for DS using
pbDS
. By default, a ∼ group_id
model is fit, and the last coefficient of the linear model is tested to
be equal to zero.
# run DS analysis
res <- pbDS(pb, verbose = FALSE)
# access results table for 1st comparison
tbl <- res$table[[1]]
# one data.frame per cluster
names(tbl)
## [1] "B cells" "CD14+ Monocytes" "CD4 T cells"
## [4] "CD8 T cells" "Dendritic cells" "FCGR3A+ Monocytes"
## [7] "Megakaryocytes" "NK cells"
## gene cluster_id logFC logCPM F p_val p_adj.loc p_adj.glb
## 1 NOC2L B cells -0.29336 6.5 2.0e+00 1.7e-01 3.2e-01 3.0e-01
## 2 ISG15 B cells 5.47362 12.2 9.5e+02 1.9e-21 6.2e-19 2.4e-18
## 3 TNFRSF18 B cells -1.35217 6.4 2.6e+01 3.0e-05 2.1e-04 1.9e-04
## 4 CPSF3L B cells -0.14902 6.1 4.2e-01 5.2e-01 6.8e-01 6.6e-01
## 5 AURKAIP1 B cells -0.01113 7.8 8.1e-03 9.3e-01 9.6e-01 9.6e-01
## 6 MRPL20 B cells 0.23430 7.4 2.5e+00 1.3e-01 2.5e-01 2.3e-01
Depening on the complexity of the experimental design (e.g., when
there are more than two groups present), comparison(s) of interest may
need to be specified explicitly. We can provide pbDS
with a
design matrix capturing the experimental design using
model.matrix
(package stats), and a contrast
matrix that specifies our comparison of interesting using
makeContrasts
from the limma
package. Alternatively, the comparison(s) of interest (or a list
thereof) can be specified with via coefs
(see
?glmQLFTest
for details). For the Kang et al. (2018) dataset, we want to carry out
a single comparison of stimulated against control samples, thus placing
"ctrl"
on the right-hand side as the reference
condition:
Alternative to the above sample-level approach, we fit (for each
gene) a mixed model (MM) to the cell-level measurement data.
muscat
provides implementations of MM that use 3 main
approaches:
In each case, a ∼ 1 + group_id + (1 | sample_id) model is
fit for each gene, optimizing the log-likelihood (i.e.,
REML = FALSE
). P-values are calculated using the estimates
of degrees of freedom specifying by argument df
(default
"Satterthwaite"
). Fitting, testing and moderation are
applied subpopulation-wise. For differential testing, mmDS
will only consider:
n_cells
cells (default 10)
in at least n_samples
samples (default 2)min_count
(default 1) in at
least min_cells
(default 20)Mixed model based approaches can be run directly on cell-level measurements, and do not require prior aggregation:
To get a general overview of the differential testing results, we first filter them to retain hits FDR < 5% and abs(logFC) > 1, and count the number and frequency of differential findings by cluster. Finally, we can view the top hits (lowest adj. p-value) in each cluster.
# filter FDR < 5%, abs(logFC) > 1 & sort by adj. p-value
tbl_fil <- lapply(tbl, function(u) {
u <- dplyr::filter(u, p_adj.loc < 0.05, abs(logFC) > 1)
dplyr::arrange(u, p_adj.loc)
})
# nb. of DS genes & % of total by cluster
n_de <- vapply(tbl_fil, nrow, numeric(1))
p_de <- format(n_de / nrow(sce) * 100, digits = 3)
data.frame("#DS" = n_de, "%DS" = p_de, check.names = FALSE)
## #DS %DS
## B cells 238 3.344
## CD14+ Monocytes 1046 14.695
## CD4 T cells 323 4.538
## CD8 T cells 94 1.321
## Dendritic cells 117 1.644
## FCGR3A+ Monocytes 385 5.409
## Megakaryocytes 28 0.393
## NK cells 171 2.402
# view top 2 hits in each cluster
top2 <- bind_rows(lapply(tbl_fil, top_n, 2, p_adj.loc))
format(top2[, -ncol(top2)], digits = 2)
## gene cluster_id logFC logCPM F p_val p_adj.loc p_adj.glb
## 1 MGST3 B cells -1.1 6.5 12.5 0.00162 0.0073 0.0064
## 2 DNAJC15 B cells 1.1 7.1 12.5 0.00161 0.0073 0.0063
## 3 MCM5 B cells -1.1 6.4 10.0 0.00401 0.0161 0.0138
## 4 CCL3L1 CD14+ Monocytes 1.1 5.8 9.2 0.00627 0.0118 0.0201
## 5 SERPINB2 CD14+ Monocytes -1.1 7.2 8.2 0.00910 0.0164 0.0275
## 6 CPNE2 CD4 T cells -1.1 4.1 8.2 0.00853 0.0406 0.0261
## 7 HOPX CD4 T cells -1.0 4.1 8.2 0.00870 0.0412 0.0265
## 8 TUBA4A CD8 T cells -1.1 7.3 17.8 0.00031 0.0025 0.0015
## 9 XBP1 CD8 T cells 1.0 8.3 15.9 0.00056 0.0043 0.0025
## 10 RALA Dendritic cells -1.0 9.1 8.6 0.01010 0.0301 0.0300
## 11 H2AFZ Dendritic cells -1.0 10.2 7.3 0.01617 0.0445 0.0445
## 12 DNAJC15 FCGR3A+ Monocytes 1.2 7.1 8.5 0.00809 0.0181 0.0249
## 13 CBWD1 FCGR3A+ Monocytes 1.1 6.6 7.3 0.01343 0.0275 0.0381
## 14 TMEM123 Megakaryocytes 1.1 9.7 17.7 0.00077 0.0077 0.0033
## 15 NT5C3A Megakaryocytes 1.3 9.7 14.1 0.00195 0.0181 0.0075
## 16 FOXP1 NK cells -1.0 6.7 10.8 0.00293 0.0149 0.0106
## 17 ITM2C NK cells -1.1 7.4 10.6 0.00326 0.0160 0.0116
Besides filter DS results based on magnitude (logFCs) and
significance (FDR), it is often worthwhile to also consider the
expression frequencies of each gene, i.e., the fraction of cells that
express a given gene in each sample and/or group.
muscat
provides wrapper, calcExprFreqs
to
compute cluster-sample/-group wise expression frequencies. Here, a gene
is considered to be expressed when the specified measurement value
(argument assay
) falls above a certain threshold (argument
th
). Note that, assay = "counts"
and
th = 0
(default) amounts to the fraction of cells for which
a respective gene has been detected.
calcExprFreqs
will return a SingleCellExperiment
object, where sheets (assays) = clusters, rows = genes, and columns =
samples (and groups, if group_id
s are present in the
colData
of the input SCE).
## [1] "B cells" "CD14+ Monocytes" "CD4 T cells"
## [4] "CD8 T cells" "Dendritic cells" "FCGR3A+ Monocytes"
## [7] "Megakaryocytes" "NK cells"
## NOC2L HES4 ISG15 TNFRSF18 TNFRSF4
## ctrl101 0.09734513 0.000000000 0.08849558 0.061946903 0.008849558
## ctrl1015 0.07773109 0.008403361 0.18067227 0.090336134 0.021008403
## ctrl1016 0.08333333 0.013888889 0.20833333 0.048611111 0.000000000
## ctrl1039 0.06666667 0.033333333 0.33333333 0.100000000 0.033333333
## ctrl107 0.09803922 0.000000000 0.07843137 0.058823529 0.019607843
## ctrl1244 0.12686567 0.037313433 0.10447761 0.126865672 0.029850746
## ctrl1256 0.08750000 0.004166667 0.18750000 0.091666667 0.025000000
## ctrl1488 0.06837607 0.004273504 0.11965812 0.076923077 0.012820513
## stim101 0.08333333 0.055555556 0.98611111 0.020833333 0.006944444
## stim1015 0.08963585 0.072829132 0.99719888 0.044817927 0.005602241
## stim1016 0.03875969 0.054263566 1.00000000 0.007751938 0.023255814
## stim1039 0.05128205 0.051282051 1.00000000 0.025641026 0.025641026
## stim107 0.05357143 0.035714286 1.00000000 0.017857143 0.000000000
## stim1244 0.11702128 0.095744681 0.98936170 0.053191489 0.021276596
## stim1256 0.06635071 0.037914692 0.99526066 0.037914692 0.004739336
## stim1488 0.05653710 0.014134276 0.99646643 0.031802120 0.003533569
## ctrl 0.08509142 0.009845288 0.15963432 0.084388186 0.018284107
## stim 0.07235339 0.050266565 0.99543031 0.033511043 0.008377761
We can use the obtained frequencies to, for instance, only retain genes that are expressed in an average of 10% of cells in at least 1 group:
gids <- levels(sce$group_id)
frq10 <- vapply(as.list(assays(frq)),
function(u) apply(u[, gids] > 0.1, 1, any),
logical(nrow(sce)))
t(head(frq10))
## NOC2L HES4 ISG15 TNFRSF18 TNFRSF4 SDF4
## B cells FALSE FALSE TRUE FALSE FALSE FALSE
## CD14+ Monocytes FALSE TRUE TRUE FALSE FALSE TRUE
## CD4 T cells FALSE FALSE TRUE FALSE FALSE FALSE
## CD8 T cells FALSE FALSE TRUE FALSE FALSE FALSE
## Dendritic cells FALSE TRUE TRUE FALSE TRUE TRUE
## FCGR3A+ Monocytes FALSE TRUE TRUE FALSE FALSE FALSE
## Megakaryocytes FALSE FALSE TRUE FALSE FALSE FALSE
## NK cells FALSE FALSE TRUE TRUE FALSE TRUE
tbl_fil2 <- lapply(kids, function(k)
dplyr::filter(tbl_fil[[k]],
gene %in% names(which(frq10[, k]))))
# nb. of DS genes & % of total by cluster
n_de <- vapply(tbl_fil2, nrow, numeric(1))
p_de <- format(n_de / nrow(sce) * 100, digits = 3)
data.frame("#DS" = n_de, "%DS" = p_de, check.names = FALSE)
## #DS %DS
## B cells 227 3.189
## CD14+ Monocytes 622 8.738
## CD4 T cells 155 2.178
## CD8 T cells 94 1.321
## Dendritic cells 117 1.644
## FCGR3A+ Monocytes 383 5.381
## Megakaryocytes 28 0.393
## NK cells 163 2.290
Especially when testing multiple contrasts or coefficients, the
results returned by runDS
may become very complex and
unhandy for exploration or exporting. Results can be formatted using
resDS
, which provides two alternative modes for formatting:
bind = "row"/"col"
.
When bind = "row"
, results from all comparisons will be
merged vertically (analogous to do.call("rbind", ...)
) into
a tidy format table, with column contrast/coef
specifying
the comparison.
Otherwise, bind = "col"
, results will be merged
horizontally into a single wide table where all results for a given gene
and cluster are kept in one row. An identifier of the respective
contrast of coefficient is then appended to the column names. This
format is useful when wanting to view a specific gene’s behavior across,
for example, multiple treatments, but will become messy when
many comparisons are included.
Expression frequencies computed with calcExprFreqs
, as
well as cluster-sample level avg. CPM, can be included in the results by
setting frq/cpm = TRUE
. Alternatively, if the former have
been pre-computed, they can be supplied directly as an input to
resDS
(see example below).
# tidy format; attach pre-computed expression frequencies
resDS(sce, res, bind = "row", frq = frq)
# big-table (wide) format; attach CPMs
resDS(sce, res, bind = "col", cpm = TRUE)
Alternatively, if expression frequencies have not been pre-computed
with calcExprFreqs
, they may be added to the results table
directly by specifying frq = TRUE
:
DS analysis aims at identifying population-specific changes in state (or expression) across conditions. In this setting, key questions of interest arise, e.g., which genes are DE in only a single (or very few) clusters? How many DE genes are shared between clusters? In summary, what is the general concordance in differential findings between clusters?
To gain an impression of the between-cluster (dis-)agreement on DE genes, we generate an UpSet-plot that visualizes the number of DE genes that are shared across or unique to certain clusters:
An UpSet plot as the one above tells us, for instance, that 185 genes are differential for all subpopulations; 387 across both Monocytes clusters; and 159 only in the B cells cluster.
The code chunk generates a set of t-SNEs colored by gene expression for the top-8 DS genes. To match the affected cells to their cluster and experimental group, see the t-SNEs colored by cluster and group ID from above.
# pull top-8 DS genes across all clusters
top8 <- bind_rows(tbl_fil) %>%
slice_min(p_adj.loc, n = 8,
with_ties = FALSE) %>%
pull("gene")
# for ea. gene in 'top8', plot t-SNE colored by its expression
ps <- lapply(top8, function(g)
.plot_dr(sce[, cs100], "TSNE", g) +
ggtitle(g) + theme(legend.position = "none"))
# arrange plots
plot_grid(plotlist = ps, ncol = 4, align = "vh")
For changes of high interest, we can view the cell-level expression
profiles of a specific gene across samples or groups using
plotExpression
(scater
package). Here, we generate violin plots for the top-6 DS genes (lowest
adj. p-value) in the B cells cluster2.
plotExpression(sce[, sce$cluster_id == "B cells"],
features = tbl_fil$`B cells`$gene[seq_len(6)],
x = "sample_id", colour_by = "group_id", ncol = 3) +
guides(fill = guide_legend(override.aes = list(size = 5, alpha = 1))) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Especially when wanting to gain an overview of numerous DE testing results for many clusters, both dimension reduction and cell-level visualizations require a lot of space can become cumbersome to interpret. In this setting, it is thus recommended to visualize aggregated measures, e.g., mean expressions by cluster sample.
Alternatively, pbHeatmap
provides a set of options
regarding which cluster(s), gene(s), and comparison to include
(arguments k
, g
and c
,
respectively). For example, the following options render a heatmap
visualizing the top 20 DS genes for the B cells cluster:
Similarly, we can visualize the cluster-sample means of a single gene of interest across all clusters in order to identify cell-types that are affected similarly by different experimental conditions:
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] UpSetR_1.4.0 scater_1.35.0
## [3] scuttle_1.17.0 SingleCellExperiment_1.29.1
## [5] SummarizedExperiment_1.37.0 Biobase_2.67.0
## [7] GenomicRanges_1.59.1 GenomeInfoDb_1.43.2
## [9] IRanges_2.41.1 S4Vectors_0.45.2
## [11] MatrixGenerics_1.19.0 matrixStats_1.4.1
## [13] ExperimentHub_2.15.0 AnnotationHub_3.15.0
## [15] BiocFileCache_2.15.0 dbplyr_2.5.0
## [17] BiocGenerics_0.53.3 generics_0.1.3
## [19] purrr_1.0.2 muscat_1.21.0
## [21] limma_3.63.2 ggplot2_3.5.1
## [23] dplyr_1.1.4 cowplot_1.1.3
## [25] patchwork_1.3.0 BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] RcppAnnoy_0.0.22 splines_4.4.2 filelock_1.0.3
## [4] bitops_1.0-9 tibble_3.2.1 lifecycle_1.0.4
## [7] Rdpack_2.6.2 edgeR_4.5.0 doParallel_1.0.17
## [10] globals_0.16.3 lattice_0.22-6 MASS_7.3-61
## [13] backports_1.5.0 magrittr_2.0.3 sass_0.4.9
## [16] rmarkdown_2.29 jquerylib_0.1.4 yaml_2.3.10
## [19] sctransform_0.4.1 DBI_1.2.3 buildtools_1.0.0
## [22] minqa_1.2.8 RColorBrewer_1.1-3 abind_1.4-8
## [25] zlibbioc_1.52.0 EnvStats_3.0.0 glmmTMB_1.1.10
## [28] rappdirs_0.3.3 circlize_0.4.16 GenomeInfoDbData_1.2.13
## [31] ggrepel_0.9.6 pbkrtest_0.5.3 irlba_2.3.5.1
## [34] listenv_0.9.1 maketools_1.3.1 parallelly_1.39.0
## [37] codetools_0.2-20 DelayedArray_0.33.2 tidyselect_1.2.1
## [40] shape_1.4.6.1 UCSC.utils_1.3.0 farver_2.1.2
## [43] lme4_1.1-35.5 ScaledMatrix_1.15.0 viridis_0.6.5
## [46] jsonlite_1.8.9 GetoptLong_1.0.5 BiocNeighbors_2.1.1
## [49] iterators_1.0.14 foreach_1.5.2 tools_4.4.2
## [52] progress_1.2.3 Rcpp_1.0.13-1 blme_1.0-6
## [55] glue_1.8.0 gridExtra_2.3 SparseArray_1.7.2
## [58] xfun_0.49 mgcv_1.9-1 DESeq2_1.47.1
## [61] withr_3.0.2 numDeriv_2016.8-1.1 BiocManager_1.30.25
## [64] fastmap_1.2.0 boot_1.3-31 fansi_1.0.6
## [67] caTools_1.18.3 digest_0.6.37 rsvd_1.0.5
## [70] mime_0.12 R6_2.5.1 colorspace_2.1-1
## [73] Cairo_1.6-2 gtools_3.9.5 RSQLite_2.3.8
## [76] RhpcBLASctl_0.23-42 utf8_1.2.4 tidyr_1.3.1
## [79] variancePartition_1.37.1 data.table_1.16.2 corpcor_1.6.10
## [82] prettyunits_1.2.0 httr_1.4.7 S4Arrays_1.7.1
## [85] uwot_0.2.2 pkgconfig_2.0.3 gtable_0.3.6
## [88] blob_1.2.4 ComplexHeatmap_2.23.0 XVector_0.47.0
## [91] sys_3.4.3 remaCor_0.0.18 htmltools_0.5.8.1
## [94] TMB_1.9.15 clue_0.3-66 scales_1.3.0
## [97] png_0.1-8 fANCOVA_0.6-1 reformulas_0.4.0
## [100] knitr_1.49 reshape2_1.4.4 rjson_0.2.23
## [103] curl_6.0.1 nlme_3.1-166 nloptr_2.1.1
## [106] cachem_1.1.0 GlobalOptions_0.1.2 stringr_1.5.1
## [109] BiocVersion_3.21.1 KernSmooth_2.23-24 parallel_4.4.2
## [112] vipor_0.4.7 AnnotationDbi_1.69.0 pillar_1.9.0
## [115] grid_4.4.2 vctrs_0.6.5 gplots_3.2.0
## [118] BiocSingular_1.23.0 beachmat_2.23.2 cluster_2.1.6
## [121] beeswarm_0.4.0 evaluate_1.0.1 mvtnorm_1.3-2
## [124] cli_3.6.3 locfit_1.5-9.10 compiler_4.4.2
## [127] rlang_1.1.4 crayon_1.5.3 future.apply_1.11.3
## [130] labeling_0.4.3 plyr_1.8.9 ggbeeswarm_0.7.2
## [133] stringi_1.8.4 viridisLite_0.4.2 BiocParallel_1.41.0
## [136] Biostrings_2.75.1 lmerTest_3.1-3 munsell_0.5.1
## [139] aod_1.3.3 Matrix_1.7-1 hms_1.1.3
## [142] bit64_4.5.2 future_1.34.0 KEGGREST_1.47.0
## [145] statmod_1.5.0 rbibutils_2.3 memoise_2.0.1
## [148] broom_1.0.7 bslib_0.8.0 bit_4.5.0
Note that, in this workflow, expression values are used for visualization only, and that differential analyses are performed on pseudobulks (section @ref(sec-pbDS)) or the count data directly (section @ref(sec-mmDS)).↩︎
Note that, as DS testing is done at the cluster-level, we need to subset the cells that have been assigned to the corresponding cluster for plotting.↩︎