CATALYST
Most of the pipeline and visualizations presented herein have been adapted from Nowicka et al. (2019)’s “CyTOF workflow: differential discovery in high-throughput high-dimensional cytometry datasets” available here.
# load required packages
library(CATALYST)
library(cowplot)
library(flowCore)
library(diffcyt)
library(scater)
library(SingleCellExperiment)
PBMC_fs
:flowSet
holding PBMCs samples from 4 patients, each
containing between 500 and 1000 cells. For each sample, the expression
of 10 cell surface and 14 signaling markers was measured before (REF)
and upon BCR/FcR-XL stimulation (BCRXL) with B cell receptor/Fc receptor
crosslinking for 30’, resulting in a total of 8 samples.PBMC_panel
:fcs_colname
column), its targeted protein marker
(antigen
column), and the marker_class
(“type”
or “state”).PBMC_md
:file_name
, sample_id
,
condition
, and patient_id
.## A flowSet with 8 experiments.
##
## column names(24): CD3(110:114)Dd CD45(In115)Dd ... HLA-DR(Yb174)Dd
## CD7(Yb176)Dd
## fcs_colname antigen marker_class
## 1 CD3(110:114)Dd CD3 type
## 2 CD45(In115)Dd CD45 type
## 3 pNFkB(Nd142)Dd pNFkB state
## 4 pp38(Nd144)Dd pp38 state
## 5 CD4(Nd145)Dd CD4 type
## 6 CD20(Sm147)Dd CD20 type
## file_name sample_id condition patient_id
## 1 PBMC_patient1_BCRXL.fcs BCRXL1 BCRXL Patient1
## 2 PBMC_patient1_Ref.fcs Ref1 Ref Patient1
## 3 PBMC_patient2_BCRXL.fcs BCRXL2 BCRXL Patient2
## 4 PBMC_patient2_Ref.fcs Ref2 Ref Patient2
## 5 PBMC_patient3_BCRXL.fcs BCRXL3 BCRXL Patient3
## 6 PBMC_patient3_Ref.fcs Ref3 Ref Patient3
The code snippet below demonstrates how to construct a
flowSet
from a set of FCS files. However, we also give the
option to directly specify the path to a set of FCS files (see next
section).
# download exemplary set of FCS files
url <- "http://imlspenticton.uzh.ch/robinson_lab/cytofWorkflow"
zip <- "PBMC8_fcs_files.zip"
download.file(paste0(url, "/", zip), destfile = zip, mode = "wb")
unzip(zip)
# read in FCS files as flowSet
fcs <- list.files(pattern = ".fcs$")
fs <- read.flowSet(fcs, transformation = FALSE, truncate_max_range = FALSE)
Data used and returned throughout differential analysis are held in
objects of the SingleCellExperiment
class. To bring the data into the appropriate format,
prepData()
requires the following inputs:
x
: a flowSet
holding the raw measurement
data, or a character string that specifies a path to a set of FCS
files.panel
: a 2 column data.frame that contains for each
marker of interest i) its column name in the raw input data, and ii) its
targeted protein marker.md
: a data.frame with columns describing the
experimental design.Optionally, features
will specify which columns
(channels) to keep from the input data. Here, we keep all measurement
parameters (default value features = NULL
).
## class: SingleCellExperiment
## dim: 24 5428
## metadata(2): experiment_info chs_by_fcs
## assays(2): counts exprs
## rownames(24): CD3 CD45 ... HLA-DR CD7
## rowData names(3): channel_name marker_name marker_class
## colnames: NULL
## colData names(3): sample_id condition patient_id
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
We provide flexibility in the way the panel and metadata table can be
set up. Specifically, column names are allowed to differ from the
example above, and multiple factors (patient ID, conditions, batch etc.)
can be specified. Arguments panel_cols
and
md_cols
should then be used to specify which columns hold
the required information. An example is given below:
# alter panel column names
panel2 <- PBMC_panel
colnames(panel2)[1:2] <- c("channel_name", "marker")
# alter metadata column names & add 2nd condition
md2 <- PBMC_md
colnames(md2) <- c("file", "sampleID", "cond1", "patientID")
md2$cond2 <- rep(c("A", "B"), 4)
# construct SCE
prepData(PBMC_fs, panel2, md2,
panel_cols = list(channel = "channel_name", antigen = "marker"),
md_cols = list(file = "file", id = "sampleID",
factors = c("cond1", "cond2", "patientID")))
Note that, independent of the input panel and metadata tables, the
constructor will fix the names of mandatory slots for latter data
accession (sample_id
in the rowData
,
channel_name
and marker_name
in the
colData
). The md
table will be stored under
experiment_info
inside the metadata
.
cluster
: FlowSOM clustering &
ConsensusClusterPlus metaclusteringCATALYST
provides a simple wrapper to perform high resolution
FlowSOM
clustering and lower resolution
ConsensusClusterPlus
metaclustering. By default, the data
will be initially clustered into xdim = 10
x
ydim = 10
= 100 groups. Secondly, the function will
metacluster populations into 2 through maxK
(default 20)
clusters. To make analyses reproducible, the random seed may be set via
seed
. By default, if the
rowData(sce)$marker_class
column is specified, the set of
markers with marker class "type"
will be used for
clustering (argument features = "type"
). Alternatively, the
markers that should be used for clustering can be specified
manually.
Let K = xdim
x ydim
be the number of FlowSOM
clusters. cluster
will add information to the following
slots of the input SingleCellExperiment
:
rowData
:
marker_class
: factor "type"
or
"state"
. Specifies whether a marker has been used for
clustering or not, respectively.colData
:
cluster_id
: cluster ID as inferred by FlowSOM.
One of 1, …, K.metadata
:
SOM_codes
: a table with dimensions K x (# type
markers). Contains the SOM codes.cluster_codes
: a table with dimensions K x
(maxK
+ 1). Contains the cluster codes for all
metaclusterings.delta_area
: a ggplot
object (see below for
details).mergeClusters
: Manual cluster mergingProvided with a 2 column data.frame containing
old_cluster
and new_cluster
IDs,
mergeClusters
allows for manual cluster merging of any
clustering available within the input SingleCellExperiment
(i.e. the xdim
x ydim
FlowSOM
clusters, and any of the 2-maxK
ConsensusClusterPlus
metaclusters). For latter accession (visualization, differential
testing), the function will assign a unique ID (specified with
id
) to each merging, and add a column to the
cluster_codes
inside the metadata
slot of the
input SingleCellExperiment
.
## # A tibble: 6 × 2
## old_cluster new_cluster
## <dbl> <chr>
## 1 1 B-cells IgM+
## 2 2 surface-
## 3 3 NK cells
## 4 4 CD8 T-cells
## 5 5 B-cells IgM-
## 6 6 monocytes
sce <- mergeClusters(sce, k = "meta20", table = merging_table, id = "merging1")
head(cluster_codes(sce))[, seq_len(10)]
## som100 meta2 meta3 meta4 meta5 meta6 meta7 meta8 meta9 meta10
## 1 1 1 1 1 1 1 1 1 1 1
## 2 2 1 1 1 1 1 1 1 1 1
## 3 3 1 1 1 1 1 1 1 1 1
## 4 4 1 1 1 1 1 1 2 2 2
## 5 5 1 1 1 1 1 1 2 2 2
## 6 6 1 1 1 1 1 1 2 2 2
The delta area represents the amount of extra cluster stability gained when clustering into k groups as compared to k-1 groups. It can be expected that high stability of clusters can be reached when clustering into the number of groups that best fits the data. The “natural” number of clusters present in the data should thus corresponds to the value of k where there is no longer a considerable increase in stability (pleateau onset). For more details, the user can refer to the original description of the consensus clustering method (Monti et al. 2003).
plotCounts
: Number of cells measured per sampleThe number of cells measured per sample may be plotted with
plotCounts
. This plot should be used as a guide together
with other readouts to identify samples where not enough cells were
assayed. Here, the grouping of samples (x-axis) is controlled by
group_by
; bars can be colored by a an additional cell
metadata variable (argument color_by
):
As opposed to plotting absolute cell counts, argument
prop
can be used to visualize relative abundances
(frequencies) instead:
pbMDS
: Pseudobulk-level MDS plotA multi-dimensional scaling (MDS) plot on aggregated measurement
values may be rendered with pbMDS
. Such a plot will give a
sense of similarities between cluster and/or samples in an unsupervised
way and of key difference in expression before conducting any formal
testing.
Arguments by
, assay
and fun
control the aggregation strategy, allowing to compute pseudobulks by
sample (by = "sample_id"
), cluster
(by = "cluster_id"
) or cluster-sample instances
(by = "both"
) using the specified assay
data
and summarry statistic (argument fun
)1. When
by != "sample_id"
, i.e., when aggregating by cluster or
cluster-sample, argument k
specifies the clustering to use.
The features to include in the computation of reduced dimensions may be
specified via argument features
.
Arguments color_by
, label_by
,
shape_by
can be used to color, label, shape pseudobulk
instances by cell metadata variables of interest. Moreover,
size_by = TRUE
will scale point sizes proportional to the
number of cells that went into aggregation. Finally, a custom color
palette may be supplied to argument pal.
A multi-dimensional scaling (MDS) plot on median marker expression by sample has the potential to reveal global proteomic differences across conditions or other experimental metadata. Here, we color points by condition (to reveal treatment effects) and further shape them by patient (to highlight patient effects). In our example, we can see a clear horizontal (MDS dim. 1) separation between reference (REF) and stimulation condition (BCRXL), while patients are, to a lesser extent, separated vertically (MDS dim. 2):
Complementary to the visualize above, we can generate an MDS plot on pseudobulks computed for each cluster-sample instance. Here, we color point by cluster (to highlight similarity between cell subpopulations), and shape them by condition (to reveal subpopulation-specific expression changes across conditions). In this example, we can see that cluster-sample instances of the same cell subpopulations group together. Meanwhile, most subpopulations exhibit a shift between instances where samples come from different treatment groups:
clrDR
: Reduced dimension plot on CLR of
proportionsA dimensionality reduction plot on centered log-ratios (CLR) of
sample/cluster proportions across clusters/samples can be rendered with
clrDR
. Here, we view each sample (cluster) as a vector of
cluster (sample) proportions. Complementary to pbMDS
, such
a plot will give a sense of similarities between samples/clusters in
terms of their composition.
Centered log-ratio
Let si = s1, ..., sS
denote one of S samples, ki = k1, ..., kK
one of K clusters, and pk(si)
be the proportion of cells from sample si in cluster
k. The centered log-ratio
(CLR) on a given sample’s cluster composition is then defined as:
$$\text{clr}_{sk} = \log p_k(s_i) - \frac{1}{K}\sum_{i=1}^K \log p_k(s_i)$$
Thus, each sample s gives a
vector with length K with mean
0, and the CLRs computed across all instances can be represented as a
matrix with dimensions S × K.
We can embed the CLR matrix into a lower dimensional space in which
points represent samples; or embed its transpose, in which case points
represent samples. Distances in this lower-dimensional space will then
represent the similarity in cluster compositions between samples and the
in sample compositions between clusters, respectively.
Dimensionality reduction
In principle, clrDR
allows any dimension reduction to be
applied on the CLRs, with dims
(default
c(1, 2)
) specifying which dimensions to visualize. The
default method dr = "PCA"
will include the percentage of
variance explained by each principal component (PC) in the axis labels.
Noteworthily, distances between points in the lower-dimensional
space are meaningful only for linear DR methods (PCA and MDS), and
results obtained from other methods should be interpreted with caution.
The output plot’s aspect ratio should thus be kept as is for PCA and
MDS; meanwhile, non-linear DR methods can use
aspect.ratio = 1
, rendering a squared plot.
Interpreting loadings
For dr = "PCA"
, PC loadings will be represented as arrows
that may be interpreted as follows: 0° (180°) between vectors indicates
a strong positive (negative) relation between them, while vectors that
are orthogonal to one another (90°) are roughly independent.
When a vector points towards a given quadrant, the variability in
proportions for the points within this quadrant are largely driven by
the corresponding variable. Here, only the relative orientation of
loading vectors to each other and to the PC axes is meaningful; however,
the sign of loadings (i.e., whether an arrow points left or right) can
be flipped when re-computing PCs.
Aesthetics
Cell metadata variables to color points and PC loading arrows by are
determined by arguments point/arrow_col
, with
point/arrow_pal
specifying the color palettes to use for
each layer. For example, rather than coloring samples by their unique
identifiers, we may choose to use their condition for coloring instead
to highlight differences across groups. In addition, point sizes may be
scaled by the number of cells in a given sample/cluster (when
by = "sample/cluster_id"
) via setting
size_by = TRUE
.
Argument arrow_len
(default 0.5) controls the length of PC
loading vectors relative to the largest absolute xy-coordinate. When
specified, PC loading vectors will be re-scaled to improve their
visibility: A value of 1 will stretch vectors such that the largest
loading will touch on the outer most point. Importantly, while absolute
arrow lengths are not interpretable, their relative length is.
We here visualize the first two PCs computed on CLRs of sample proportions across clusters: small distances between samples mean similar cluster compositions between them, while large distances are indicative of differences in cluster proportions. In our example, PC 1 clearly separates treatment groups:
As an alternative to the plot above, we can visualize the first two PCs computed on the CLR matrix’ transpose. Here, we can observe that most of the variability in cluster-compositions across samples is driven by BCRXL samples (PC1):
plotExprHeatmap
: Heatmap of aggregated marker
expressionsplotExprHeatmap
allows generating heatmaps of aggregated
(pseudobulk) marker expressions. Argument assay
(default
"exprs"
) and fun
(default
"median"
) control the aggregation. Depending on argument
by
, aggregation will be performed by
"sample_id"
, "cluster"
, or
"both"
.
Scaling
plotExprHeatmap
supports various scaling strategies that
are controlled by arguments scale
and q
, and
will greatly alter the visualization and its interpretation2:
When scale = "first"
, the specified assay data will
be scaled between 0 and 1 using lower (q
) and upper
(1-q
) quantiles as boundaries, where q = 0.01
by default. This way, while losing information on absolut expression
values, marker expressions will stay comparable, and visualization is
improved in cases where the expression range varies greatly between
markers.
When scale = "last"
, assay data will be aggregated
first and scaled subsequently. Thus, each marker’s value range will be
[0,1]. While all comparability between markers is lost, such scaling
will improve seeing differences across, e.g., samples or
clusters.
When scale = "never"
, no scaling (and quantile
trimming) is applied. The resulting heatmap will thus display
raw pseudobulk data (e.g., median expression values).
Hierarchical clustering
Various arguments control whether rows/columns should be hierarchically
clustered (and re-ordered) accordingly (row/col_clust
), and
whether to include the resulting dendrograms in the heatmap
(row/col_dend
). Here, clustering is performed using
dist
followed by hclust
on the assay data
matrix with the specified method
as distance metric
(default euclidean
) and linkage
for
agglomeration (default average
).
Cell count annotation
Optionally, argument bars
specifies whether to include a
barplot of cell counts, which can further be annotated with relative
abundances (%) via perc = TRUE
. These will correspong to
cell counts by cluster (when by != "sample_id"
), and by
sample otherwise.
Sample and cluster annotations
By default (row/col_anno = TRUE
), for axes corresponding to
samples (y-axis for by = "sample_id"
and x-axis for
by = "both"
), annotations will be drawn for all non-numeric
cell metadata variables. Alternatively, a specific subset of annotations
can be included for only a subset of variables by specifying
row/col_anno
to be a character vector in (see
examples).
For axes corresponding to clusters (y-axis for
by = "cluster_id"
and "both"
), annotations
will be drawn for the specified clustering(s) (arguments k
and m
).
The following examples shall cover the 3 different modes of
plotExprHeatmap
:
# scale each marker between 0 and 1
# (after aggregation & without trimming)
plotExprHeatmap(sce, features = "state",
scale = "last", q = 0, bars = FALSE)
When by != "sample_id"
, the clustering to use for
aggregation is specified via k
, and an additional
metaclusting may be included as an annotation (argument
m
):
# medians of scaled & trimmed type-marker expressions by cluster
plotExprHeatmap(sce, features = "type",
by = "cluster_id", k = "meta12", m = "meta8",
scale = "first", q = 0.01, perc = TRUE, col_dend = FALSE)
Finally, we can visualize the aggregated expression of specific
markers by cluster and sample via by = "both"
3:
plotPbExprs
: Pseudobulk expression boxplotA combined boxplot and jitter of aggregated marker intensities can be
generated via plotPbExprs()
. Here, argument
features
(default "state"
, which is equivalent
to state_markers(sce)
) controls which markers to include.
features = NULL
will include all markers (and is equivalent
to rownames(sce)
).
The specified assay
values (default "exprs"
)
will be aggregated using fun
(default
"median"
) as summary statistic, resulting in one pseudobulk
value per sample, or cluster-sample (when one of facet_by
,
color_by
or group_by
is set to
"cluster_id"
).
In order to compare medians for each cluster, and potentially
identify changes across conditions early on, we specify
facet = "cluster_id"
:
Alternatively, we can facet the above plot by antigen
in
order to compare marker expressions calculated over all cells across
conditions:
Thirdly, we can investigate how consistent type-markers are expressed
across clusters. To this end, we specify
color_by = "cluster_id"
in order to aggregate expression
values by both clusters and samples. The resulting plot gives an
indication how good our selection of type markers is: Ideally,
their expression should be fairly specific to a subset of clusters.
plotPbExprs(sce, k = "meta10", features = "type",
group_by = "cluster_id", color_by = "sample_id",
size_by = TRUE, geom = "points", jitter = FALSE, ncol = 5)
Finally, we can investigate how variable state-marker expressions are
across clusters by setting group_by = "cluster_id"
(to
aggregate by both, samples and clusters) and
color_by = "condition"
(to additionally group samples by
treatment). This type of visualization yields similar information as the
plot above: We can observe that there are global shifts in expression
for a set of markers (e.g., pNFkB, pp38), and rather
subpopulation-specific changes for others (e.g., pS6).
plotClusterExprs
: Marker-densities by clusterDistributions of marker intensities (arcsinh-transformed) across cell
populations of interest can be plotted with
plotClusterExprs
. We specify features = "type"
(equivalent to type_markers(sce)
), to include type-markers
only. Here, blue densities (top row) are calculated over all cells and
serve as a reference.
plotAbundances
: Relative population abundancesRelative population abundances for any clustering of interest can be
plotted with plotAbundances
. Argument by
will
specify whether to plot proportions for each sample or cluster;
group_by
determines the grouping within each panel as well
ascolor coding.
by = "sample_id"
, the function displays each
sample’s cell type composition, and the size of a given stripe reflects
the proportion of the corresponding cell type the given sample. Argument
group_by
then specifies the facetting.by = "cluster_id"
, argument group_by
then specifies the grouping and color coding.plotFreqHeatmap
: Heatmap of cluster fequenciesComplementary to plotAbundances
, a heatmap of relative
cluster abundances by cluster can be generated with
plotFreqHeatmap
. By default
(normalize = TRUE
), frequencies will we standarized for
each cluster, across samples. Analogous to plotExprHeatmap
(Sec. @ref(plotExprHeatmap)), arguments row/col_clust/dend
control hierarchical clustering or rows (clusters) and columns
(samples), and whether the resulting dendrograms should be display;
k
specifies the clustering across which to compute
abundances, and m
a secondary clustering for display.
Again, bars
and perc
can be used to include a
labelled cell count barplot.
plotMultiHeatmap
: Multi-panel HeatmapsplotMultiHeatmap
provides flexible options to combine
expression and cluster frequency heatmaps generated with
plotExprHeatmap
(Sec. @ref(plotExprHeatmap)) and
plotFreqHeatmap
(Sec. @ref(plotFreqHeatmap)), respectively,
with arguments hm1
and hm2
controlling the
panel contents.
Panel contents
In its 1st panel, plotMultiHeatmap
will display pseudobulks
by cluster. Here, hm1
may be used to specify a set of
markers (subset of rownames(sce)
), or
"type"/"state"
for type/state_markers(sce)
when marker_classes(sce)
have been specified. The 1st
heatmap can be turned off altogether by setting
hm1 = FALSE
.
Anologously, for hm2 = "type"/"state"
, an expression
heatmap of type-/state-markers will be displayed as 2nd heatmap (see
Sec. @ref(plotMultiHeatmap-example1)). hm2 = "abundances"
will render cluster frequencies by sample (see Sec.
@ref(plotMultiHeatmap-example2)). As opposed to argument
hm1
, however, when hm2
specifies one or
multiple marker(s), a separate heatmap of pseudobulks by cluster-sample
will be drawn for each marker (see Sec.
@ref(plotMultiHeatmap-example3)).
Row and column annotations
The clustering to aggregate by is specified with argument
k
. Optionally, a metaclustering of interest may be provided
via m
; here, clustering m
is merely included
as an additional annotation (and not used for any computation). Thus,
m
serves to visually inspect the quality a lower-resolution
clustering or manual merging.
When an x-axis corresponds to samples, plotMultiHeatmap
will include column annotations for cell metadata variables (columns in
colData
) that map uniquely to each sample (e.g., condition,
patient ID). These annotations can be omitted via
col_anno = FALSE
, or reduced (e.g.,
col_anno = "condition"
to include only a single
annotation); by default (col_anno = TRUE
), all available
metadata is included.
Argument handling
Most of plotMultiHeatmap
’s arguments take a single value as
input. For example, row annotations and dendrograms are automatically
removed for all but the 1st panel. Nevertheless, a subset of arguments
may be set uniquely for each heatmap; these are:
scale
and q
controlling the scaling
strategy for expression heatmapscol_clust/dend
specifying whether or not to
column-cluster each heatmap and draw the resulting dendrogramsWhen any of these arguments is of length one, the specified value
will be recycled for both heatmaps. E.g., setting
scale = "never"
will have all expression heatmaps show
unscaled data.
To demonstrate plotMultiHeatmap
’s basic functionality
and handling of arguments, we plot expression heatmaps for both type-
(hm1 = "type"
) and state-markers
(hm2 = "state"
), and choose to include a column dendrogram
for the 2nd heatmap only:
As a second example, we plot an expession heatmap for a selection of
markers (here, those starting with “CD”:
hm1 = c("CDx", "CDy", ...
)4 next to the relative
cluster abundances across samples (hm2 = "abundances"
). We
also add a barplot for the cell counts in each cluster
(bars = TRUE
) along with labels for their relative
abundance (perc = TRUE
):
In this final example, we view a selection of markers side-by-side,
and omit the 1st panel (hm1 = FALSE
). We also retain the
ordering of samples (column order) across panels
(col_clust = FALSE
). In this case,
plotMultiHeatmap
will drop column names (sample IDs) for
all but the first panel to avoid repeating these labels and overcrowding
the plot. Lastly, we set scale = "never"
to visualize
raw (unscaled) pseudobulks (= median expressions by
cluster-sample):
The number of cells in cytometry data is typically large, and for
visualization of cells in a two-dimensional space it is often sufficient
to run dimension reductions on a subset of the data. Thus,
CATALYST
provides the wrapper function runDR
to apply any of the dimension reductions available from
BiocStyle::Biocpkg("scater")
using
features
;
either a subset of rownames(.)
or, e.g.,
"type"
for type_markers(.)
(if
marker_classes(.)
have been specified).cells
;
either NULL
for all cells, or n
to sample a
random subset of n cells per sample.To make results reproducible, the random seed should be set via
set.seed
prior to computing reduced
dimensions:
Alternatively, dimension reductions can be computed using one of
scater’s
runX
functions (X = "TSNE", "UMAP", ...
). Note
that, by default, scater
expects expression values to be
stored in the logcounts
assay of the SCE; specification of
exprs_values = "exprs"
is thus required:
DRs available within the SCE can be viewed via
reducedDimNames
and accessed with
reducedDim(s)
:
## [1] "UMAP"
## [,1] [,2]
## [1,] 1.08818090 -1.3844522
## [2,] 0.48954475 -5.5837769
## [3,] -6.84185374 1.5087381
## [4,] -0.07974524 -1.3480611
## [5,] 2.60971939 1.2238499
## [6,] -2.74877203 0.3680429
While scater’s
plotReducedDim
function can be used to visualize DRs,
CATALYST
provides the plotDR
wrapper,
specifically to allow for coloring cells by the various clusterings
available, and to support facetting by metadata factors (e.g.,
experimental condition, sample IDs):
SCEs constructed with prepData
can be filtered using the
filterSCE
function, which allows for filtering of both
cells and markers according to conditional statements in
dplyr
-style. When filtering on cluster_id
s,
argument k
specifies which clustering to use (the default
NULL
uses colData
column
"cluster_id"
). Two examples are given below:
##
## Ref1 BCRXL1
## 881 528
CATALYST
has been designed to be compatible with the diffcyt
package (Weber et al. 2019), which
implements statistical methods for differential discovery in
high-dimensional cytometry (including flow cytometry, mass cytometry or
CyTOF, and oligonucleotide-tagged cytometry) using high-resolution
clustering and moderated tests. The input to the diffcyt
pipeline can either be raw data, or a SingleCellExperiment
object. We give an exmaple of the latter below.
Please refer to the diffcyt
vignette
and R documentation (??diffcyt
) for more detailed
information.
# create design & contrast matrix
design <- createDesignMatrix(ei(sce), cols_design = "condition")
contrast <- createContrast(c(0, 1))
# test for
# - differential abundance (DA) of clusters
# - differential states (DS) within clusters
res_DA <- diffcyt(sce, clustering_to_use = "meta10",
analysis_type = "DA", method_DA = "diffcyt-DA-edgeR",
design = design, contrast = contrast, verbose = FALSE)
res_DS <- diffcyt(sce, clustering_to_use = "meta10",
analysis_type = "DS", method_DS = "diffcyt-DS-limma",
design = design, contrast = contrast, verbose = FALSE)
# extract result tables
tbl_DA <- rowData(res_DA$res)
tbl_DS <- rowData(res_DS$res)
plotDiffHeatmap
: Heatmap of differential testing
resultsDifferential testing results returned by diffcyt
can be displayed with the plotDiffHeatmap
function.
For differential abundance (DA) tests, plotDiffHeatmap
will display relative cluster abundances by samples; for differential
state (DS) tests, plotDiffHeatmap
will display aggregated
marker expressions by sample.
Filtering
The results to retain for visualization can be filtered via
fdr
: threshold on adjusted p-values below
which to keep a resultlfc
: threshold on absolute logFCs above which
to keep a resultThe number of top findings to display can be specified with
top_n
(default 20). When all = TRUE
,
significance and logFC filtering will be skipped, and all
top_n
results are shown.
Annotations
Analogous to plotFreq/Expr/MuliHeatmap
, when
col_anno = TRUE
, plotDiffHeatmap
will include
column annotations for cell metadata variables (columns in
colData
) that map uniquely to each sample (e.g., condition,
patient ID). These annotations can be omitted via
col_anno = FALSE
, or reduced (e.g.,
col_anno = "condition"
to include only a single
annotation).
When row_anno = TRUE
, cluster (DA) and cluster-marker
instances (DS) will be marked as significant if their adjusted
p-value falls below the threshold value specified with fdr
.
A second annotation will be drawn for the logFCs.
Normalization
When normalize = TRUE
, the heatmap will display Z-score
normalized values. For DA, cluster frequencies will be
arcsine-square-root scaled prior to normalization. While losing
information on absolution frequency/expression values, this option will
make differences across samples and conditions more notable.
We here set all = TRUE
to display top-20 DA analysis
results, without filtering on adjusted p-values and logFCs. Since
differential testing was performed on 10 clusters only, this will simply
include all available results.
By setting fdr = 0.05
despite not filtering on
significance, we can control the right-hand side annotation:
Via setting fdr = 0.05
, we here display the top DS
analysis results in terms of significance. Alternative to the example
above, we sort these according their logFCs
(sort_by = "lfc"
), and include only a selected sample
annotation (col_anno = "condition"
):
As an alternative to leaving the selection of markers and clusters to
their ordering (significance), we can visualize a specific subset of
results (e.g., a selected marker or cluster) using
filterSCE
:
Heatmap and annotation colors are controlled via arguments
hm_pal
and fdr/lfc_pal
, respectively. Here’s
an example how these can be modified:
Conversion from SCE to flowFrame
s/flowSet
,
which in turn can be writting to FCS files using flowCore’s
write.FCS
function, is not straightforward. It is not
recommended to directly write FCS via
write.FCS(flowFrame(t(assay(sce))))
, as this can lead to
invalid FCS files or the data being shown on an inappropriate scale in
e.g. Cytobank. Instead, CATALYST
provides the
sce2fcs
function to facilitate correct back-conversion.
sce2fcs
allows specification of a colData
column to split the SCE by (argument split_by
), e.g., to
split the data by cluster; whether to keep or drop any cell metadata
(argument keep_cd
) and dimension reductions (argument
keep_dr
) available within the object; and which assay data
to use (argument assay
)5:
# store final clustering in cell metadata
sce$mm <- cluster_ids(sce, "merging1")
# convert to 'flowSet' with one frame per cluster
(fs <- sce2fcs(sce, split_by = "mm"))
## A flowSet with 8 experiments.
##
## column names(24): CD3(110:114)Dd CD45(In115)Dd ... HLA-DR(Yb174)Dd
## CD7(Yb176)Dd
# split check: number of cells per barcode ID
# equals number of cells in each 'flowFrame'
all(c(fsApply(fs, nrow)) == table(sce$mm))
## [1] TRUE
## [1] "B-cells IgM+" "B-cells IgM-" "CD4 T-cells" "CD8 T-cells" "DC"
## [6] "NK cells" "monocytes" "surface-"
Having converted out SCE to a flowSet
, we can write out
each of its flowFrame
s to an FCS file with a meaningful
filename that retains the cluster of origin:
While FlowSOM
has proven to perform well in systematic comparisons of clustering
algorithms for CyTOF data (Weber and Robinson
2016; Freytag et al. 2018), it is not the only method out there.
Here we demonstrate how clustering assignments from another clustering
method, say, Rphenograph, could be
incorporated into the SCE to make use of the visualizations available in
CATALYST
. Analogous to the example below, virtually any
clustering algorithm could be applied, however, with the following
limitation:
The ConsensusClusterPlus
metaclusterings applied to the
initial FlowSOM
clustering by CATALYST
’s
cluster
function have a hierarchical cluster structure.
Thus, clustering IDs can be matched from a higher resolution (e.g. 100
SOM clusters) to any lower resolution (e.g., 2 through 20 metaclusters).
This is not guaranteed for other clustering algorithms. Thus, we store
only a single resolution in the cell metadata column
cluster_id
, and a single column under metadata
slot cluster_codes
containing the unique cluster IDs.
Adding additional resolutions to the cluster_codes
will
fail if cluster IDs can not be matched uniquely across conditions, which
will be the case for any non-hierarchical clustering method.
# subset type-marker expression matrix
es <- assay(sce, "exprs")
es <- es[type_markers(sce), ]
# run clustering method X
# (here, we just split the cells into
# equal chunks according to CD33 expression)
cs <- split(seq_len(ncol(sce)), cut(es["CD33", ], nk <- 10))
kids <- lapply(seq_len(nk), function(i) {
rep(i, length(cs[[i]]))
})
kids <- factor(unlist(kids))
# store cluster IDs in cell metadata & codes in metadata
foo <- sce
foo$cluster_id[unlist(cs)] <- unlist(kids)
metadata(foo)$cluster_codes <- data.frame(
custom = factor(levels(kids), levels = levels(kids)))
# tabulate cluster assignments
table(cluster_ids(foo, "custom"))
##
## 1 2 3 4 5 6 7 8 9 10
## 5 34 427 3326 611 436 335 176 64 14
Most of CATALYST
’s plotting functions return
ggplot
objects whose aesthetics can (in general) be
modified easily. However, while e.g. theme aesthetics and color scales
can simply be added to the plot, certain modifications can be achieved
only through overwriting elements stored in the object, and thus require
a decent understanding of its structure.
Other functions (plotExprHeatmap
,
plotMultiHeatmap
and plotDiffHeatmap
) generate
objects of the Heatmap
of HeatmapList
class
from the ComplexHeatmap
package, and are harder to modify once created. Therefore,
CATALYST
tries to expose a reasonable amount of arguments
to the user that control key aesthetics such as the palettes used for
coloring clusters and heatmaps.
The examples below serve to illustrate how some less exposed
ggplot
aesthetics can be modified in retrospect, and the
effects of different arguments that control visualization of ComplexHeatmap
outputs.
ggplot
sp <- plotPbExprs(sce, k = "meta4", facet_by = "cluster_id")
# facetting layout is 2x2; plot all side-by-side instead
p$facet$params$nrow <- 1
# remove points
p$layers <- p$layers[-1]
# overwrite default colors
p <- p + scale_color_manual(values = c("royalblue", "orange"))
# remove x-axis title, change angle & decrease size of labels
(p + labs(x = NULL) + theme(axis.text.x = element_text(angle = 90, size = 8, vjust = 0.5)))
ComplexHeatmap
splotMultiHeatmap(sce,
k = "meta8",
m = "meta4",
hm2 = "abundances",
# include all dendrograms
row_dend = TRUE,
col_dend = TRUE,
# exclude sample annotations
col_anno = FALSE,
# primary & merging cluster palettes
k_pal = hcl.colors(8, "Vik"),
m_pal = hcl.colors(4, "Tropic"),
# 1st & 2nd heatmap coloring
hm1_pal = c("grey95", "blue"),
hm2_pal = c("grey95", "red3"))
# minimal heatmap
plotExprHeatmap(sce,
row_anno = FALSE, # don't annotate samples
row_clust = FALSE, # keep samples in original order
col_clust = FALSE, # keep markers in original order
bin_anno = FALSE, # don't annotate bins
bars = FALSE, # don't include sample sizes
scale = "last", # aggregate, then scale
hm_pal = hcl.colors(10, "YlGnBu", rev = TRUE))
# complete heatmap
plotExprHeatmap(sce, row_anno = TRUE, # annotate samples
row_clust = TRUE, col_clust = TRUE, # cluster samples/markers
row_dend = TRUE, col_dend = TRUE, # include dendrograms
bin_anno = TRUE, # annotate bins with value
bars = TRUE, perc = TRUE, # include barplot of sample sizes
hm_pal = c("grey95", "orange"))
ComplexHeatmap
sWhile plotMultiHeatmap
provides a convenient way to
combine pseudoublk expression heatmaps across clusters or cluster-sample
combinations with heatmaps of relative cluster abundances,
Heatmap
objects can be, in principle, combined arbitrarily
with a few notes of caution:
Heatmap
should use the same clustering for
aggregation (argument k
).Heatmap
should have unique identifier (slot
@name
); otherwise, a warning is given.Heatmap
’s legend should have a unique title (slot
@matrix_color_mapping@name
);# specify clustering to aggregate by
k <- "meta11"
# median type-marker expression by cluster
p1 <- plotExprHeatmap(sce, features = "type",
by = "cluster_id", k = k, m = "meta7")
# median state-marker expression by cluster
p2 <- plotExprHeatmap(sce, features = "state",
by = "cluster_id", k = k, row_anno = FALSE)
# relative cluster abundances by sample
p3 <- plotFreqHeatmap(sce, k = k, perc = TRUE,
row_anno = FALSE, col_clust = FALSE)
# make legend titles unique
p1@name <- p1@matrix_color_mapping@name <- "type"
p2@name <- p2@matrix_color_mapping@name <- "state"
p1 + p2 + p3
# specify clustering to aggregate by
k <- "meta9"
# relative cluster abundances by sample
p <- plotFreqHeatmap(sce, k = k,
bars = FALSE, hm_pal = c("white", "black"),
row_anno = FALSE, col_clust = FALSE, col_anno = FALSE)
# specify unique coloring
cs <- c(pp38 = "maroon", pBtk = "green4")
# median expression of selected markers by cluster-sample
for (f in names(cs)) {
q <- plotExprHeatmap(sce, features = f,
by = "both", k = k, scale = "never",
row_anno = FALSE, col_clust = FALSE,
hm_pal = c("white", cs[f]))
# make identifier & legend title unique
q@name <- q@matrix_color_mapping@name <- f
# remove column annotation names
for (i in seq_along(q@top_annotation@anno_list))
q@top_annotation@anno_list[[i]]@name_param$show <- FALSE
# remove redundant sample names
q@column_names_param$show <- FALSE
p <- p + q
}
# add heatmap of median expression across all features
p + plotExprHeatmap(sce, features = NULL,
by = "cluster_id", k = k, row_anno = FALSE)
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] scater_1.35.0 ggplot2_3.5.1
## [3] scuttle_1.17.0 diffcyt_1.27.0
## [5] flowCore_2.19.0 cowplot_1.1.3
## [7] CATALYST_1.31.2 SingleCellExperiment_1.29.1
## [9] SummarizedExperiment_1.37.0 Biobase_2.67.0
## [11] GenomicRanges_1.59.1 GenomeInfoDb_1.43.2
## [13] IRanges_2.41.1 S4Vectors_0.45.2
## [15] BiocGenerics_0.53.3 generics_0.1.3
## [17] MatrixGenerics_1.19.0 matrixStats_1.4.1
## [19] BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] RColorBrewer_1.1-3 sys_3.4.3
## [3] jsonlite_1.8.9 shape_1.4.6.1
## [5] magrittr_2.0.3 ggbeeswarm_0.7.2
## [7] TH.data_1.1-2 nloptr_2.1.1
## [9] farver_2.1.2 rmarkdown_2.29
## [11] GlobalOptions_0.1.2 zlibbioc_1.52.0
## [13] vctrs_0.6.5 Cairo_1.6-2
## [15] minqa_1.2.8 rstatix_0.7.2
## [17] htmltools_0.5.8.1 S4Arrays_1.7.1
## [19] plotrix_3.8-4 BiocNeighbors_2.1.1
## [21] broom_1.0.7 SparseArray_1.7.2
## [23] Formula_1.2-5 sass_0.4.9
## [25] bslib_0.8.0 plyr_1.8.9
## [27] sandwich_3.1-1 zoo_1.8-12
## [29] cachem_1.1.0 buildtools_1.0.0
## [31] igraph_2.1.1 lifecycle_1.0.4
## [33] iterators_1.0.14 pkgconfig_2.0.3
## [35] rsvd_1.0.5 Matrix_1.7-1
## [37] R6_2.5.1 fastmap_1.2.0
## [39] GenomeInfoDbData_1.2.13 clue_0.3-66
## [41] digest_0.6.37 colorspace_2.1-1
## [43] ggnewscale_0.5.0 irlba_2.3.5.1
## [45] ggpubr_0.6.0 beachmat_2.23.2
## [47] labeling_0.4.3 cytolib_2.19.0
## [49] fansi_1.0.6 colorRamps_2.3.4
## [51] nnls_1.6 httr_1.4.7
## [53] polyclip_1.10-7 abind_1.4-8
## [55] compiler_4.4.2 withr_3.0.2
## [57] doParallel_1.0.17 ConsensusClusterPlus_1.71.0
## [59] backports_1.5.0 BiocParallel_1.41.0
## [61] viridis_0.6.5 carData_3.0-5
## [63] ggforce_0.4.2 ggsignif_0.6.4
## [65] MASS_7.3-61 drc_3.0-1
## [67] DelayedArray_0.33.2 rjson_0.2.23
## [69] FlowSOM_2.15.0 gtools_3.9.5
## [71] tools_4.4.2 vipor_0.4.7
## [73] beeswarm_0.4.0 glue_1.8.0
## [75] nlme_3.1-166 grid_4.4.2
## [77] Rtsne_0.17 cluster_2.1.6
## [79] reshape2_1.4.4 gtable_0.3.6
## [81] tidyr_1.3.1 data.table_1.16.2
## [83] ScaledMatrix_1.15.0 BiocSingular_1.23.0
## [85] car_3.1-3 utf8_1.2.4
## [87] XVector_0.47.0 ggrepel_0.9.6
## [89] foreach_1.5.2 pillar_1.9.0
## [91] stringr_1.5.1 limma_3.63.2
## [93] circlize_0.4.16 splines_4.4.2
## [95] dplyr_1.1.4 tweenr_2.0.3
## [97] lattice_0.22-6 FNN_1.1.4.1
## [99] survival_3.7-0 RProtoBufLib_2.19.0
## [101] tidyselect_1.2.1 locfit_1.5-9.10
## [103] ComplexHeatmap_2.23.0 maketools_1.3.1
## [105] knitr_1.49 gridExtra_2.3
## [107] edgeR_4.5.0 xfun_0.49
## [109] statmod_1.5.0 stringi_1.8.4
## [111] UCSC.utils_1.3.0 boot_1.3-31
## [113] yaml_2.3.10 evaluate_1.0.1
## [115] codetools_0.2-20 tibble_3.2.1
## [117] BiocManager_1.30.25 cli_3.6.3
## [119] uwot_0.2.2 munsell_0.5.1
## [121] jquerylib_0.1.4 Rcpp_1.0.13-1
## [123] png_0.1-8 XML_3.99-0.17
## [125] parallel_4.4.2 lme4_1.1-35.5
## [127] viridisLite_0.4.2 mvtnorm_1.3-2
## [129] scales_1.3.0 ggridges_0.5.6
## [131] purrr_1.0.2 crayon_1.5.3
## [133] GetoptLong_1.0.5 rlang_1.1.4
## [135] multcomp_1.4-26
By default, median expression values are computed.↩︎
Regardless of the chosen scaling strategy, row and column clusterings will be performed on unscaled data.↩︎
In this case, only a single features is allowed as input.↩︎
hm1 = NULL
would include all
markers.↩︎
Only count-like data should be written to FCS files and is guaranteed to show with appropriate scale in Cytobank!↩︎
plotCounts
:
Number of cells measured per samplepbMDS
:
Pseudobulk-level MDS plotclrDR
:
Reduced dimension plot on CLR of proportionsplotExprHeatmap
: Heatmap of
aggregated marker expressionsplotPbExprs
:
Pseudobulk expression boxplotplotClusterExprs
:
Marker-densities by clusterplotAbundances
:
Relative population abundancesplotFreqHeatmap
: Heatmap of
cluster fequenciesplotMultiHeatmap
:
Multi-panel Heatmaps