This vignette will demonstrate a full single-cell lineage analysis
workflow, with particular emphasis on the processes of lineage
reconstruction and pseudotime inference. We will make use of the
slingshot
package proposed in (Street et al. 2017) and show how it may be
applied in a broad range of settings.
The goal of slingshot
is to use clusters of cells to
uncover global structure and convert this structure into smooth lineages
represented by one-dimensional variables, called “pseudotime.” We
provide tools for learning cluster relationships in an unsupervised or
semi-supervised manner and constructing smooth curves representing each
lineage, along with visualization methods for each step.
The minimal input to slingshot
is a matrix representing
the cells in a reduced-dimensional space and a vector of cluster labels.
With these two inputs, we then:
getLineages
function.getCurves
function.We will work with two simulated datasets in this vignette. The first
(referred to as the “single-trajectory” dataset) is generated below and
designed to represent a single lineage in which one third of the genes
are associated with the transition. This dataset will be contained in a
SingleCellExperiment
object (Lun and
Risso 2017) and will be used to demonstrate a full
“start-to-finish” workflow.
# generate synthetic count data representing a single lineage
means <- rbind(
# non-DE genes
matrix(rep(rep(c(0.1,0.5,1,2,3), each = 300),100),
ncol = 300, byrow = TRUE),
# early deactivation
matrix(rep(exp(atan( ((300:1)-200)/50 )),50), ncol = 300, byrow = TRUE),
# late deactivation
matrix(rep(exp(atan( ((300:1)-100)/50 )),50), ncol = 300, byrow = TRUE),
# early activation
matrix(rep(exp(atan( ((1:300)-100)/50 )),50), ncol = 300, byrow = TRUE),
# late activation
matrix(rep(exp(atan( ((1:300)-200)/50 )),50), ncol = 300, byrow = TRUE),
# transient
matrix(rep(exp(atan( c((1:100)/33, rep(3,100), (100:1)/33) )),50),
ncol = 300, byrow = TRUE)
)
counts <- apply(means,2,function(cell_means){
total <- rnbinom(1, mu = 7500, size = 4)
rmultinom(1, total, cell_means)
})
rownames(counts) <- paste0('G',1:750)
colnames(counts) <- paste0('c',1:300)
sce <- SingleCellExperiment(assays = List(counts = counts))
The second dataset (the “bifurcating” dataset) consists of a matrix
of coordinates (as if obtained by PCA, ICA, diffusion maps, etc.) along
with cluster labels generated by k-means clustering. This dataset
represents a bifurcating trajectory and it will allow us to demonstrate
some of the additional functionality offered by
slingshot
.
library(slingshot, quietly = FALSE)
data("slingshotExample")
rd <- slingshotExample$rd
cl <- slingshotExample$cl
dim(rd) # data representing cells in a reduced dimensional space
## [1] 140 2
## [1] 140
To begin our analysis of the single lineage dataset, we need to reduce the dimensionality of our data and filtering out uninformative genes is a typical first step. This will greatly improve the speed of downstream analyses, while keeping the loss of information to a minimum.
For the gene filtering step, we retained any genes robustly expressed in at least enough cells to constitute a cluster, making them potentially interesting cell-type marker genes. We set this minimum cluster size to 10 cells and define a gene as being “robustly expressed” if it has a simulated count of at least 3 reads.
Another important early step in most RNA-Seq analysis pipelines is
the choice of normalization method. This allows us to remove unwanted
technical or biological artifacts from the data, such as batch,
sequencing depth, cell cycle effects, etc. In practice, it is valuable
to compare a variety of normalization techniques and compare them along
different evaluation criteria, for which we recommend the
scone
package (Cole and Risso
2018). We also note that the order of these steps may change
depending on the choice of method. ZINB-WaVE (Risso et al. 2018) performs dimensionality
reduction while accounting for technical variables and MNN (Haghverdi et al. 2018) corrects for batch
effects after dimensionality reduction.
Since we are working with simulated data, we do not need to worry about batch effects or other potential confounders. Hence, we will proceed with full quantile normalization, a well-established method which forces each cell to have the same distribution of expression values.
The fundamental assumption of slingshot
is that cells
which are transcriptionally similar will be close to each other in some
reduced-dimensional space. Since we use Euclidean distances in
constructing lineages and measuring pseudotime, it is important to have
a low-dimensional representation of the data.
There are many methods available for this task and we will
intentionally avoid the issue of determining which is the “best” method,
as this likely depends on the type of data, method of collection,
upstream computational choices, and many other factors. We will
demonstrate two methods of dimensionality reduction: principal
components analysis (PCA) and uniform manifold approximation and
projection (UMAP, via the uwot
package).
When performing PCA, we do not scale the genes by their variance because we do not believe that all genes are equally informative. We want to find signal in the robustly expressed, highly variable genes, not dampen this signal by forcing equal variance across genes. When plotting, we make sure to set the aspect ratio, so as not to distort the perceived distances.
pca <- prcomp(t(log1p(assays(sce)$norm)), scale. = FALSE)
rd1 <- pca$x[,1:2]
plot(rd1, col = rgb(0,0,0,.5), pch=16, asp = 1)
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following object is masked from 'package:S4Vectors':
##
## expand
rd2 <- uwot::umap(t(log1p(assays(sce)$norm)))
colnames(rd2) <- c('UMAP1', 'UMAP2')
plot(rd2, col = rgb(0,0,0,.5), pch=16, asp = 1)
We will add both dimensionality reductions to the
SingleCellExperiment
object, but continue our analysis
focusing on the PCA results.
The final input to slingshot
is a vector of cluster
labels for the cells. If this is not provided, slingshot
will treat the data as a single cluster and fit a standard principal
curve. However, we recommend clustering the cells even in datasets where
only a single lineage is expected, as it allows for the potential
discovery of novel branching events.
The clusters identified in this step will be used to determine the global structure of the underlying lineages (that is, their number, when they branch off from one another, and the approximate locations of those branching events). This is different than the typical goal of clustering single-cell data, which is to identify all biologically relevant cell types present in the dataset. For example, when determining global lineage structure, there is no need to distinguish between immature and mature neurons since both cell types will, presumably, fall along the same segment of a lineage.
For our analysis, we implement two clustering methods which similarly
assume that Euclidean distance in a low-dimensional space reflect
biological differences between cells: Gaussian mixture modeling and
k-means. The former is
implemented in the mclust
package (Scrucca et al. 2016) and features an automated
method for determining the number of clusters based on the Bayesian
information criterion (BIC).
## Package 'mclust' version 6.1.1
## Type 'citation("mclust")' for citing this R package in publications.
##
## Attaching package: 'mclust'
## The following object is masked from 'package:mgcv':
##
## mvn
cl1 <- Mclust(rd1)$classification
colData(sce)$GMM <- cl1
library(RColorBrewer)
plot(rd1, col = brewer.pal(9,"Set1")[cl1], pch=16, asp = 1)
While k-means does not have a similar functionality, we have shown in (Street et al. 2017) that simultaneous principal curves are quite robust to the choice of k, so we select a k of 4 somewhat arbitrarily. If this is too low, we may miss a true branching event and if it is too high or there is an abundance of small clusters, we may begin to see spurious branching events.
At this point, we have everything we need to run
slingshot
on our simulated dataset. This is a two-step
process composed of identifying the global lineage structure with a
cluster-based minimum spanning tree (MST) and fitting simultaneous
principal curves to describe each lineage.
These two steps can be run separately with the
getLineages
and getCurves
functions, or
together with the wrapper function, slingshot
(recommended). We will use the wrapper function for the analysis of the
single-trajectory dataset, but demonstrate the usage of the individual
functions later, on the bifurcating dataset.
The slingshot
wrapper function performs both steps of
trajectory inference in a single call. The necessary inputs are a
reduced dimensional matrix of coordinates and a set of cluster labels.
These can be separate objects or, in the case of the single-trajectory
data, elements contained in a SingleCellExperiment
object.
To run slingshot
with the dimensionality reduction
produced by PCA and cluster labels identified by Gaussian mixutre
modeling, we would do the following:
As noted above, if no clustering results are provided, it is assumed
that all cells are part of the same cluster and a single curve will be
constructed. If no dimensionality reduction is provided,
slingshot
will use the first element of the list returned
by reducedDims
.
The output is a SingleCellExperiment
object with
slingshot
results incorporated. All of the results are
stored in a PseudotimeOrdering
object, which is added to
the colData
of the original object and can be accessed via
colData(sce)$slingshot
. Additionally, all inferred
pseudotime variables (one per lineage) are added to the
colData
, individually. To extract all
slingshot
results in a single object, we can use either the
as.PseudotimeOrdering
or as.SlingshotDataSet
functions, depending on the form in which we want it.
PseudotimeOrdering
objects are an extension of
SummarizedExperiment
objects, which are flexible containers
that will be useful for most purposes. SlingshotDataSet
objects are primarily used for visualization, as several plotting
methods are included with the package. Below, we visuzalize the inferred
lineage for the single-trajectory data with points colored by
pseudotime.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 8.631 21.121 21.414 34.363 43.185
library(grDevices)
colors <- colorRampPalette(brewer.pal(11,'Spectral')[-6])(100)
plotcol <- colors[cut(sce$slingPseudotime_1, breaks=100)]
plot(reducedDims(sce)$PCA, col = plotcol, pch=16, asp = 1)
lines(SlingshotDataSet(sce), lwd=2, col='black')
We can also see how the lineage structure was intially estimated by
the cluster-based minimum spanning tree by using the type
argument.
After running slingshot
, we are often interested in
finding genes that change their expression over the course of
development. We will demonstrate this type of analysis using the
tradeSeq
package (Van den Berge et
al. 2020).
For each gene, we will fit a general additive model (GAM) using a
negative binomial noise distribution to model the (potentially
nonlinear) relationshipships between gene expression and pseudotime. We
will then test for significant associations between expression and
pseudotime using the associationTest
.
library(tradeSeq)
# fit negative binomial GAM
sce <- fitGAM(sce)
# test for dynamic expression
ATres <- associationTest(sce)
We can then pick out the top genes based on p-values and visualize their expression over developmental time with a heatmap. Here we use the top 250 most dynamically expressed genes.
Here, we provide further details and highlight some additional
functionality of the slingshot
package. We will use the
included slingshotExample
dataset for illustrative
purposes. This dataset was designed to represent cells in a low
dimensional space and comes with a set of cluster labels generated by
k-means clustering. Rather
than constructing a full SingleCellExperiment
object, which
requires gene-level data, we will use the low-dimensional matrix of
coordinates directly and provide the cluster labels as an additional
argument.
The getLineages
function takes as input an
n
× p
matrix
and a vector of clustering results of length n
. It maps
connections between adjacent clusters using a minimum spanning tree
(MST) and identifies paths through these connections that represent
lineages. The output of this function is a
PseudotimeOrdering
containing the inputs as well as the
inferred MST (represented by an igraph
object) and lineages
(ordered vectors of cluster names).
This analysis can be performed in an entirely unsupervised manner or
in a semi-supervised manner by specifying known initial and terminal
point clusters. If we do not specify a starting point,
slingshot
selects one based on parsimony, maximizing the
number of clusters shared between lineages before a split. If there are
no splits or multiple clusters produce the same parsimony score, the
starting cluster is chosen arbitrarily.
In the case of our simulated data, slingshot
selects
Cluster 1 as the starting cluster. However, we generally recommend the
specification of an initial cluster based on prior knowledge (either
time of sample collection or established gene markers). This
specification will have no effect on how the MST is constructed, but it
will impact how branching curves are constructed.
## class: PseudotimeOrdering
## dim: 140 2
## metadata(3): lineages mst slingParams
## pathStats(2): pseudotime weights
## cellnames(140): cell-1 cell-2 ... cell-139 cell-140
## cellData names(2): reducedDim clusterLabels
## pathnames(2): Lineage1 Lineage2
## pathData names(0):
plot(rd, col = brewer.pal(9,"Set1")[cl], asp = 1, pch = 16)
lines(SlingshotDataSet(lin1), lwd = 3, col = 'black')
At this step, slingshot
also allows for the
specification of known endpoints. Clusters which are specified as
terminal cell states will be constrained to have only one connection
when the MST is constructed (ie., they must be leaf nodes). This
constraint could potentially impact how other parts of the tree are
drawn, as shown in the next example where we specify Cluster 3 as an
endpoint.
lin2 <- getLineages(rd, cl, start.clus= '1', end.clus = '3')
plot(rd, col = brewer.pal(9,"Set1")[cl], asp = 1, pch = 16)
lines(SlingshotDataSet(lin2), lwd = 3, col = 'black', show.constraints = TRUE)
This type of supervision can be useful for ensuring that results are consistent with prior biological knowledge. Specifically, it can prevent known terminal cell fates from being categorized as transitory states.
There are a few additional arguments we could have passed to
getLineages
for greater control:
dist.method
is a character specifying how the distances
between clusters should be computed. The default value is
"slingshot"
, which uses the distance between cluster
centers, normalized by their full, joint covariance matrix. In the
presence of small clusters (fewer cells than the dimensionality of the
data), this will automatically switch to using the diagonal joint
covariance matrix. Other options include simple
(Euclidean), scaled.full
, scaled.diag
, and
mnn
(mutual nearest neighbor-based distance).omega
is a granularity parameter, allowing the user to
set an upper limit on connection distances. It represents the distance
between every real cluster and an artificial .OMEGA
cluster, which is removed after fitting the MST. This can be useful for
identifying outlier clusters which are not a part of any lineage or for
separating distinct trajectories. This may be a numeric argument or a
logical value. A value of TRUE
indicates that a heuristic
of 1.5
(median edge length of unsupervised MST)
should be used,
implying that the maximum allowable distance will be 3 times that distance.After constructing the MST, getLineages
identifies paths
through the tree to designate as lineages. At this stage, a lineage will
consist of an ordered set of cluster names, starting with the root
cluster and ending with a leaf. The output of getLineages
is a PseudotimeOrdering
which holds all of the inputs as
well as a list of lineages and some additional information on how they
were constructed. This object is then used as the input to the
getCurves
function.
In order to model development along these various lineages, we will
construct smooth curves with the function getCurves
. Using
smooth curves based on all the cells eliminates the problem of cells
projecting onto vertices of piece-wise linear trajectories and makes
slingshot
more robust to noise in the clustering
results.
In order to construct smooth lineages, getCurves
follows
an iterative process similar to that of principal curves presented in
(Hastie and Stuetzle 1989). When there is
only a single lineage, the resulting curve is simply the principal curve
through the center of the data, with one adjustment: the initial curve
is constructed with the linear connections between cluster centers
rather than the first prinicpal component of the data. This adjustment
adds stability and typically hastens the algorithm’s convergence.
When there are two or more lineages, we add an additional step to the algorithm: averaging curves near shared cells. Both lineages should agree fairly well on cells that have yet to differentiate, so at each iteration we average the curves in the neighborhood of these cells. This increases the stability of the algorithm and produces smooth branching lineages.
## class: PseudotimeOrdering
## dim: 140 2
## metadata(4): lineages mst slingParams curves
## pathStats(2): pseudotime weights
## cellnames(140): cell-1 cell-2 ... cell-139 cell-140
## cellData names(2): reducedDim clusterLabels
## pathnames(2): Lineage1 Lineage2
## pathData names(0):
plot(rd, col = brewer.pal(9,"Set1")[cl], asp = 1, pch = 16)
lines(SlingshotDataSet(crv1), lwd = 3, col = 'black')
The output of getCurves
is an updated
PseudotimeOrdering
which now contains the simultaneous
principal curves and additional information on how they were fit. The
slingPseudotime
function extracts a cells-by-lineages
matrix of pseudotime values for each cell along each lineage, with
NA
values for cells that were not assigned to a particular
lineage. The slingCurveWeights
function extracts a similar
matrix of weights assigning each cell to one or more lineages.
The curves objects can be accessed using the slingCurves
function, which will return a list of principal_curve
objects. These objects consist of the following slots:
s
: the matrix of points that make up the curve. These
correspond to the orthogonal projections of the data points.ord
: indices which can be used to put the cells along a
curve in order based their projections.lambda
: arclengths along the curve from the beginning
to each cell’s projection. A matrix of these values is returned by the
slingPseudotime
function.dist_ind
: the squared distances between data points and
their projections onto the curve.dist
: the sum of the squared projection distances.w
: the vector of weights along this lineage. Cells that
were unambiguously assigned to this lineage will have a weight of
1
, while cells assigned to other lineages will have a
weight of 0
. It is possible for cells to have weights of
1
(or very close to 1) for multiple lineages, if they are
positioned before a branching event. A matrix of these values is
returned by the slingCurveWeights
function.For large datasets, we highgly recommend using the
approx_points
argument with slingshot
(or
getCurves
). This allows the user to specify the resolution
of the curves (ie. the number of unique points). While the MST
construction operates on clusters, the process of iteratively projecting
all points onto one or more curves may become computationally burdensome
as the size of the dataset grows. For this reason, we set the default
value for approx_points
to either 150 or the number of cells in the dataset,
whichever is smaller. This should greatly ease the computational cost of
exploratory analysis while having minimal impact on the resulting
trajectories.
For maximally “dense” curves, set approx_points = FALSE
,
and the curves will have as many points as there are cells in the
dataset. However, note that each projection step in the iterative
curve-fitting process will now have a computational complexity
proportional to n2
(where n is the number of
cells). In the presence of branching lineages, these dense curves will
also affect the complexity of curve averaging and shrinkage.
We recommend a value of 100-200, hence the default value of 150. Note that restricting the number of
unique points along the curve does not impose a similar limit
on the number of unique pseudotime values, as demonstrated below. Even
with an unrealistically low value of 5
for approx_points
, we still see a full color gradient from
the pseudotime values:
sce5 <- slingshot(sce, clusterLabels = 'GMM', reducedDim = 'PCA',
approx_points = 5)
colors <- colorRampPalette(brewer.pal(11,'Spectral')[-6])(100)
plotcol <- colors[cut(sce5$slingPseudotime_1, breaks=100)]
plot(reducedDims(sce5)$PCA, col = plotcol, pch=16, asp = 1)
lines(SlingshotDataSet(sce5), lwd=2, col='black')
In some cases, we are interested in identifying multiple, disjoint
trajectories. Slingshot handles this problem in the initial MST
construction by introducing an artificial cluster, called
omega
. This artificial cluster is separated from every real
cluster by a fixed length, meaning that the maximum distance between any
two real clusters is twice this length. Practically, this sets a limit
on the maximum edge length allowable in the MST. Setting
omega = TRUE
will implement a rule of thumb whereby the
maximum allowable edge length is equal to 3 times the median edge length of the MST
constructed without the artificial cluster (note: this is equivalent to
saying that the default value for omega_scale
is 1.5).
rd2 <- rbind(rd, cbind(rd[,2]-12, rd[,1]-6))
cl2 <- c(cl, cl + 10)
pto2 <- slingshot(rd2, cl2, omega = TRUE, start.clus = c(1,11))
plot(rd2, pch=16, asp = 1,
col = c(brewer.pal(9,"Set1"), brewer.pal(8,"Set2"))[cl2])
lines(SlingshotDataSet(pto2), type = 'l', lwd=2, col='black')
After fitting the MST, slingshot
proceeds to fit
simultaneous principal curves as usual, with each trajectory being
handled separately.
Sometimes, we may want to use only a subset of the cells to determine
a trajectory, or we may get new data that we want to project onto an
existing trajectory. In either case, we will need a way to determine the
positions of new cells along a previously constructed trajectory. For
this, we can use the predict
function (since this function
is not native to slingshot
, see
?`predict,PseudotimeOrdering-method`
for
documentation).
# our original PseudotimeOrdering
pto <- sce$slingshot
# simulate new cells in PCA space
newPCA <- reducedDim(sce, 'PCA') + rnorm(2*ncol(sce), sd = 2)
# project onto trajectory
newPTO <- slingshot::predict(pto, newPCA)
This will yield a new, hybrid object with the trajectories (curves)
from the original data, but the pseudotime values and weights for the
new cells. For reference, the original cells are shown in grey below,
but they are not included in the output from predict
.
newplotcol <- colors[cut(slingPseudotime(newPTO)[,1], breaks=100)]
plot(reducedDims(sce)$PCA, col = 'grey', bg = 'grey', pch=21, asp = 1,
xlim = range(newPCA[,1]), ylim = range(newPCA[,2]))
lines(SlingshotDataSet(sce), lwd=2, col = 'black')
points(slingReducedDim(newPTO), col = newplotcol, pch = 16)
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] mclust_6.1.1 uwot_0.2.2
## [3] Matrix_1.7-1 mgcv_1.9-1
## [5] nlme_3.1-166 RColorBrewer_1.1-3
## [7] slingshot_2.15.0 TrajectoryUtils_1.15.0
## [9] princurve_2.1.6 SingleCellExperiment_1.29.1
## [11] SummarizedExperiment_1.37.0 Biobase_2.67.0
## [13] GenomicRanges_1.59.1 GenomeInfoDb_1.43.2
## [15] IRanges_2.41.1 S4Vectors_0.45.2
## [17] BiocGenerics_0.53.3 generics_0.1.3
## [19] MatrixGenerics_1.19.0 matrixStats_1.4.1
## [21] BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] sass_0.4.9 SparseArray_1.7.2
## [3] lattice_0.22-6 digest_0.6.37
## [5] magrittr_2.0.3 sparseMatrixStats_1.19.0
## [7] evaluate_1.0.1 grid_4.4.2
## [9] fastmap_1.2.0 jsonlite_1.8.9
## [11] BiocManager_1.30.25 httr_1.4.7
## [13] UCSC.utils_1.3.0 codetools_0.2-20
## [15] jquerylib_0.1.4 abind_1.4-8
## [17] cli_3.6.3 rlang_1.1.4
## [19] crayon_1.5.3 XVector_0.47.0
## [21] splines_4.4.2 cachem_1.1.0
## [23] DelayedArray_0.33.2 yaml_2.3.10
## [25] FNN_1.1.4.1 S4Arrays_1.7.1
## [27] tools_4.4.2 GenomeInfoDbData_1.2.13
## [29] buildtools_1.0.0 R6_2.5.1
## [31] lifecycle_1.0.4 zlibbioc_1.52.0
## [33] irlba_2.3.5.1 pkgconfig_2.0.3
## [35] bslib_0.8.0 glue_1.8.0
## [37] Rcpp_1.0.13-1 xfun_0.49
## [39] sys_3.4.3 knitr_1.49
## [41] htmltools_0.5.8.1 igraph_2.1.1
## [43] DelayedMatrixStats_1.29.0 rmarkdown_2.29
## [45] maketools_1.3.1 compiler_4.4.2