tradeSeq
is an
R
package that allows analysis of gene expression along
trajectories. While it has been developed and applied to single-cell
RNA-sequencing (scRNA-seq) data, its applicability extends beyond that,
and also allows the analysis of, e.g., single-cell ATAC-seq data along
trajectories or bulk RNA-seq time series datasets. For every gene in the
dataset, tradeSeq
fits a generalized additive model (GAM)
by building on the mgcv
R package. It then allows
statistical inference on the GAM by assessing contrasts of the
parameters of the fitted GAM model, aiding in interpreting complex
datasets. All details about the tradeSeq
model and
statistical tests are described in our preprint (Van den Berge et al. 2020).
In this vignette, we analyze a subset of the data from (Paul et al. 2015). A
SingleCellExperiment
object of the data has been provided
with the tradeSeq
package and can be retrieved as shown below. The data and UMAP reduced
dimensions were derived from following the Monocle
3 vignette.
In this vignette, we use tradeSeq
downstream of
slinghsot
(Street et al. 2018)
but it can be used downstream of any trajectory. In particular, if you
want to use tradeSeq
downstream of
monocle3
(Cao et al. 2019),
please refer to our Monocle
vignette.
To install the package, simply run
library(tradeSeq)
library(RColorBrewer)
library(SingleCellExperiment)
library(slingshot)
# For reproducibility
RNGversion("3.5.0")
palette(brewer.pal(8, "Dark2"))
data(countMatrix, package = "tradeSeq")
counts <- as.matrix(countMatrix)
rm(countMatrix)
data(crv, package = "tradeSeq")
data(celltype, package = "tradeSeq")
Here we fit the tradeSeq
negative binomial generalized
additive model (NB-GAM). Please see the fitGAM
vignette for an extensive description on how to fit the models, tune
its options and modify its output.
We first need to decide on the number of knots. This is done using the evaluateK function. This takes a little time to run so it is not run here.
For more explanation on the output from evaluateK,
we refer users to the fitGAM
vignette. Here, we pick nknots = 6
.
We then fit the models by running the fitGAM function. By default, the gene-wise NB-GAM estimates one smoother for every lineage using the negative binomial distribution. Please refer to the fitGAM vignette Additional to add additional covariates to the model, speed up computation or allow for custom normalization, amongst others.
set.seed(7)
pseudotime <- slingPseudotime(crv, na = FALSE)
cellWeights <- slingCurveWeights(crv)
sce <- fitGAM(counts = counts, pseudotime = pseudotime, cellWeights = cellWeights,
nknots = 6, verbose = FALSE)
The model may be hard to fit for some genes, and hence the fitting
procedure may not converge for all of the genes in a dataset, especially
in datasets with complex patterns and/or many lineages. You can check
the convergence of each gene as shown below, where a TRUE
value corresponds to a converged model fit, and a FALSE
value corresponds to a model that hasn’t been able to converge
fully.
##
## TRUE
## 240
A first exploration of the data analysis may consist of checking
whether gene expression is associated with a particular lineage. The
statistical test performed here, implemented in the
associationTest
function, is testing the null hypothesis
that all smoother coefficients are equal to each other. This can be
interpreted as testing whether the average gene expression is
significantly changing along pseudotime.
## waldStat df pvalue meanLogFC
## Acin1 115.6657 10 0.000000e+00 0.3155838
## Actb NA NA NA 0.5610723
## Ak2 84.7587 10 5.828671e-14 0.7030388
## Alad NA NA NA 1.0476606
## Alas1 NA NA NA 1.1210974
## Aldoa 203.7472 10 0.000000e+00 0.4340672
In order to discover marker genes of the progenitor or differentiated
cell population, researchers may be interested in assessing differential
expression between the progenitor cell population (i.e., the starting
point of a lineage) with the differentiated cell type population (i.e.,
the end point of a lineage). The function startVsEndTest
uses a Wald test to assess the null hypothesis that the average
expression at the starting point of the smoother (progenitor population)
is equal to the average expression at the end point of the smoother
(differentiated population). The test basically involves a comparison
between two smoother coefficients for every lineage. The function
startVsEndTest
performs a global test across all lineages
by default (i.e. it compares the start and end positions for all
lineages simultaneously), but you can also assess all lineages
separately by setting lineages=TRUE
. Below, we adopt an
omnibus test across the two lineages.
We can visualize the estimated smoothers for the third most significant gene.
oStart <- order(startRes$waldStat, decreasing = TRUE)
sigGeneStart <- names(sce)[oStart[3]]
plotSmoothers(sce, counts, gene = sigGeneStart)
Alternatively, we can color the cells in UMAP space with that gene’s expression.
The startVsEndTest
compares two points on a lineage, and
by default it is comparing the start point with the end point. However,
this is a specific form of a more general capability of the
startVsEndTest
to compare any two points on any lineage. If
the interest lies in comparing any two custom pseudotime values, one can
specify this using the pseudotimeValues
arguments in
startVsEndTest
. For example, below we’d like to compare the
expression for each gene at pseudotime values of 0.8 and 0.1.
tradeSeq
can discover marker genes for the
differentiated cell types by comparing the average expression between
end points of the lineage-specific smoothers. This is implemented in the
diffEndTest
function. By default, diffEndTest
performs a global test, testing the null hypothesis that the average
expression at the endpoints is equal for all lineages using a
multivariate Wald test. If more than two lineages are present, one can
assess all pairwise comparisons using the pairwise=TRUE
argument.
We can plot the most significant gene using the
plotSmoothers
function.
o <- order(endRes$waldStat, decreasing = TRUE)
sigGene <- names(sce)[o[1]]
plotSmoothers(sce, counts, sigGene)
Alternatively, we can color the cells in UMAP space with that gene’s expression.
Asides from testing at the level of the differentiated cell type,
researchers may be interested in assessing the expression pattern of a
gene over pseudotime. The function patternTest
implements a
statistical method that checks whether the smoothed gene expression is
equal along pseudotime between two or multiple lineages. In practice, we
use 2 * k points with k being the number of knots as
defined in fitGAM
, equally distributed along pseudotime,
that are compared between two (or multiple) lineages, and this number
can be changed using the nPoints
argument.
patternRes <- patternTest(sce)
oPat <- order(patternRes$waldStat, decreasing = TRUE)
head(rownames(patternRes)[oPat])
## [1] "Mpo" "Prtn3" "Car2" "Ctsg" "Elane" "Calr"
The patternTest
has an argument eigenThresh
that corresponds to an eigenvalue threshold for deciding on the rank of
the variance-covariance matrix of the contrasts defined by the
patternTest
. Lower values are more lenient to adding more
information but also decrease computational stability. While the default
used to correspond to 1e-8
, we have recently increased this
to 1e-2
for increased computational stability.
patternTest
with
diffEndTest
resultsWe find genes at the top that are also ranked as DE for the
differentiated cell type. What is especially interesting are genes that
have different expression patterns but no different expression at the
differentiated cell type level. We therefore sort the genes according to
the sum of square of their rank in increasing Wald statistics for the
patternTest
and their rank in decreasing Wald statistics
for the diffEndTest
.
library(ggplot2)
patternRes$Gene <- rownames(patternRes)
patternRes$pattern <- patternRes$waldStat
patternRes <- patternRes[, c("Gene", "pattern")]
endRes$Gene <- rownames(endRes)
endRes$end <- endRes$waldStat
endRes <- endRes[, c("Gene", "end")]
compare <- merge(patternRes, endRes, by = "Gene", all = FALSE)
compare$transientScore <-
rank(-compare$end, ties.method = "min")^2 + rank(compare$pattern, ties.method = "random")^2
ggplot(compare, aes(x = log(pattern), y = log(end))) +
geom_point(aes(col = transientScore)) +
labs(x = "patternTest Wald Statistic (log scale)",
y = "diffEndTest Wald Statistic (log scale)") +
scale_color_continuous(low = "yellow", high = "red") +
theme_classic()
Or, we can visualize the expression in UMAP space of the top gene.
topTransient <- compare[which.max(compare$transientScore), "Gene"]
plotSmoothers(sce, counts, gene = topTransient)
Interestingly, we recover the Irf8 gene in the top 5 genes according to that ranking.
## [1] "Mt1" "Irf8" "Nedd4" "Hint1" "Eif4g1"
We can also plot the Irf8 gene.
Another question of interest is to find a list of genes that are
differentially expressed between lineages at a particular region,
e.g. around the separation of two or multiple lineages. The function
earlyDETest
implements a statistical method to test the
null hypothesis of whether the average gene expression smoothers are
equal between lineages in a region defined by two user-specified knots.
Again, the knots can be visualized with the plotGeneCount
function. By selecting the region covering the first two knot points to
test for differential patterns between the lineages, we check which
genes are behaving differently around the bifurcation point.
plotGeneCount(curve = crv, counts = counts,
clusters = apply(slingClusterLabels(crv), 1, which.max),
models = sce)
earlyDERes <- earlyDETest(sce, knots = c(1, 2))
oEarly <- order(earlyDERes$waldStat, decreasing = TRUE)
head(rownames(earlyDERes)[oEarly])
## [1] "Car1" "Car2" "Klf1" "Vim" "Mt1" "Ermap"
In large datasets with thousands of single cells, small fold changes
can become statistically significant due to the very high sample size.
However, these differences may not always be biologically meaningful. To
tackle this problem, all DE tests in tradeSeq
have an
l2fc
argument which specifies the absolute value of the
log2 fold-change cut-off to test against. For example, setting
l2fc=log2(2)
will test which genes have a fold change that
is significantly higher than 2 or significantly lower than 1/2 within or
between the lineages of interest.
While tradeSeq
provides readily implemented
functionality to cluster genes using the clusterExperiment
package, users may want to cluster genes using other clustering methods.
Fitted values from the tradeSeq
models, which can
subsequently be used for clustering, can be extracted using the
predictSmooth
or predictCells
functions.
predictCells
predicts the estimated expression for each
cell (i.e. it is a fitted value for that cell) on the count scale.
predictSmooth
, instead, returns values of the estimated
smoother on a grid of pseudotimes. Either of these may be used as input
to a clustering method; roughly, predictSmooth
may be
considered as a cleaner version of predictCells
. By default
predictSmooth
returns a data.frame
specifying
the estimated smoother for each gene. A matrix containing these values,
where each row is a gene, and each column is a point along the lineage
on the uniform grid, can be obtained by setting tidy=FALSE
in predictSmooth
.
tradeSeq provides the functionality to cluster genes according to
their expression pattern along the lineages with the
clusterExpressionPatterns
function. A number of equally
spaced points for every lineage are selected to perform the clustering,
and the number of points can be selected with the nPoints
argument. The genes
argument specifies which genes you want
to cluster (e.g., all genes with differential expression patterns).
Here, we use 20 points along each lineage to cluster the first 40 genes
in the dataset. The clustering itself occurs by the
clusterExperiment
package (Risso et
al. 2018), hence the user may select any clustering algorithm
that’s built into that package, or custom clustering algorithms
implemented by the user. For a list of built-in clustering algorithms
within clusterExperiment
, run
clusterExperiment::listBuiltInFunctions()
on the command
line.
library(clusterExperiment)
nPointsClus <- 20
clusPat <- clusterExpressionPatterns(sce, nPoints = nPointsClus,
genes = rownames(counts)[1:100])
## 36 parameter combinations, 36 use sequential method, 36 use subsampling method
## Running Clustering on Parameter Combinations...
## done.
The first 4 clusters can be visualized using the normalized expression upon which the clustering is based. Please note that the code below would only works for a trajectory with two lineages. Modify the code appropriately if using with a dataset with 3 lineages or more.
cUniq <- unique(clusterLabels)
cUniq <- cUniq[!cUniq == -1] # remove unclustered genes
for (xx in cUniq[1:4]) {
cId <- which(clusterLabels == xx)
p <- ggplot(data = data.frame(x = 1:nPointsClus,
y = rep(range(clusPat$yhatScaled[cId, ]),
nPointsClus / 2)),
aes(x = x, y = y)) +
geom_point(alpha = 0) +
labs(title = paste0("Cluster ", xx), x = "Pseudotime", y = "Normalized expression") +
theme_classic() +
theme(plot.title = element_text(hjust = 0.5))
for (ii in 1:length(cId)) {
geneId <- rownames(clusPat$yhatScaled)[cId[ii]]
p <- p +
geom_line(data = data.frame(x = rep(1:nPointsClus, 2),
y = clusPat$yhatScaled[geneId, ],
lineage = rep(0:1, each = nPointsClus)),
aes(col = as.character(lineage), group = lineage), lwd = 1.5)
}
p <- p + guides(color = FALSE) +
scale_color_manual(values = c("orange", "darkseagreen3"),
breaks = c("0", "1"))
print(p)
}
## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
A number of tests have been implemented in tradeSeq, but researchers may be interested in other hypotheses that current implementations may not be able to address. We therefore welcome contributions on GitHub on novel tests based on the tradeSeq model. Similarly, you may also request novel tests to be implemented in tradeSeq by the developers, preferably by adding an issue on the GitHub repository. If we feel that the suggested test is widely applicable, we will implement it in tradeSeq.
## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## Random number generation:
## RNG: Mersenne-Twister
## Normal: Inversion
## Sample: Rounding
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] clusterExperiment_2.27.0 ggplot2_3.5.1
## [3] mgcv_1.9-1 nlme_3.1-166
## [5] slingshot_2.14.0 TrajectoryUtils_1.14.0
## [7] princurve_2.1.6 SingleCellExperiment_1.28.0
## [9] SummarizedExperiment_1.36.0 Biobase_2.67.0
## [11] GenomicRanges_1.59.0 GenomeInfoDb_1.43.0
## [13] IRanges_2.41.0 S4Vectors_0.44.0
## [15] BiocGenerics_0.53.1 generics_0.1.3
## [17] MatrixGenerics_1.19.0 matrixStats_1.4.1
## [19] RColorBrewer_1.1-3 tradeSeq_1.21.0
## [21] knitr_1.48 rmarkdown_2.28
##
## loaded via a namespace (and not attached):
## [1] sys_3.4.3 jsonlite_1.8.9 magrittr_2.0.3
## [4] farver_2.1.2 zlibbioc_1.52.0 vctrs_0.6.5
## [7] locfdr_1.1-8 memoise_2.0.1 htmltools_0.5.8.1
## [10] S4Arrays_1.6.0 progress_1.2.3 Rhdf5lib_1.28.0
## [13] rhdf5_2.50.0 SparseArray_1.6.0 sass_0.4.9
## [16] bslib_0.8.0 plyr_1.8.9 cachem_1.1.0
## [19] uuid_1.2-1 buildtools_1.0.0 igraph_2.1.1
## [22] lifecycle_1.0.4 iterators_1.0.14 pkgconfig_2.0.3
## [25] rsvd_1.0.5 Matrix_1.7-1 R6_2.5.1
## [28] fastmap_1.2.0 GenomeInfoDbData_1.2.13 digest_0.6.37
## [31] colorspace_2.1-1 AnnotationDbi_1.69.0 phylobase_0.8.12
## [34] irlba_2.3.5.1 RSQLite_2.3.7 beachmat_2.23.0
## [37] labeling_0.4.3 fansi_1.0.6 httr_1.4.7
## [40] abind_1.4-8 compiler_4.4.1 rngtools_1.5.2
## [43] bit64_4.5.2 withr_3.0.2 doParallel_1.0.17
## [46] BiocParallel_1.41.0 DBI_1.2.3 viridis_0.6.5
## [49] highr_0.11 HDF5Array_1.35.1 MASS_7.3-61
## [52] DelayedArray_0.33.1 zinbwave_1.28.0 tools_4.4.1
## [55] rncl_0.8.7 ape_5.8 glue_1.8.0
## [58] rhdf5filters_1.18.0 grid_4.4.1 gridBase_0.4-7
## [61] cluster_2.1.6 reshape2_1.4.4 ade4_1.7-22
## [64] gtable_0.3.6 tidyr_1.3.1 hms_1.1.3
## [67] BiocSingular_1.23.0 ScaledMatrix_1.14.0 xml2_1.3.6
## [70] utf8_1.2.4 XVector_0.46.0 foreach_1.5.2
## [73] pillar_1.9.0 stringr_1.5.1 limma_3.63.0
## [76] genefilter_1.89.0 softImpute_1.4-1 splines_4.4.1
## [79] dplyr_1.1.4 lattice_0.22-6 survival_3.7-0
## [82] bit_4.5.0 annotate_1.85.0 tidyselect_1.2.1
## [85] registry_0.5-1 locfit_1.5-9.10 Biostrings_2.75.0
## [88] maketools_1.3.1 pbapply_1.7-2 gridExtra_2.3
## [91] edgeR_4.4.0 xfun_0.48 statmod_1.5.0
## [94] NMF_0.28 stringi_1.8.4 UCSC.utils_1.2.0
## [97] yaml_2.3.10 evaluate_1.0.1 codetools_0.2-20
## [100] kernlab_0.9-33 tibble_3.2.1 BiocManager_1.30.25
## [103] cli_3.6.3 xtable_1.8-4 munsell_0.5.1
## [106] jquerylib_0.1.4 Rcpp_1.0.13 png_0.1-8
## [109] XML_3.99-0.17 parallel_4.4.1 blob_1.2.4
## [112] RNeXML_2.4.11 prettyunits_1.2.0 viridisLite_0.4.2
## [115] scales_1.3.0 purrr_1.0.2 crayon_1.5.3
## [118] rlang_1.1.4 KEGGREST_1.47.0