PhenoPath models gene expression expression y in terms of a latent pathway score (pseudotime) z. Uniquely, the evolution of genes along the trajectory isn’t common to each gene but can be perturbed by an additional sample-specific covariate β. For example, this could be the mutational status of each sample or a drug that each sample was exposed to.
This software infers both the latent pathway scores zn and interaction coefficients βpg for samples n = 1, …, N, covariates p = 1, …, P and genes G = 1, …, G.
Inference is performed using co-ordinate ascent mean field variational inference. This attempts to find a set of approximating distributions q(θ) = ∏iqi(θi) for each variable i by minimising the KL divergence between the KL divergence between the variational distribution and the true posterior. For a good overview of variational inference see Blei, Kucukelbir, and McAuliffe (2016).
We can simulate data according to the PhenoPath model via a call to
simulate_phenopath()
:
## Warning: `as_data_frame()` was deprecated in tibble 2.0.0.
## ℹ Please use `as_tibble()` (with slightly different semantics) to convert to a
## tibble, or `as.data.frame()` to convert to a data frame.
## ℹ The deprecated feature was likely used in the phenopath package.
## Please report the issue to the authors.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: The `x` argument of `as_tibble.matrix()` must have unique column names if
## `.name_repair` is omitted as of tibble 2.0.0.
## ℹ Using compatibility `.name_repair`.
## ℹ The deprecated feature was likely used in the tibble package.
## Please report the issue at <https://github.com/tidyverse/tibble/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
This returns a list with four entries:
## List of 4
## $ parameters: tibble [40 × 4] (S3: tbl_df/tbl/data.frame)
## ..$ alpha : num [1:40] 1 1 1 -1 1 1 1 -1 1 1 ...
## ..$ lambda: num [1:40] 0 0 0 0 0 0 0 0 0 0 ...
## ..$ beta : num [1:40] 0 0 0 0 0 0 0 0 0 0 ...
## ..$ regime: chr [1:40] "de" "de" "de" "de" ...
## $ y : num [1:100, 1:40] 1.001 0.999 1.001 1.002 0.999 ...
## $ x : num [1:100] 1 1 1 1 1 1 1 1 1 1 ...
## $ z : num [1:100] -0.713 -0.3512 1.6085 -0.0218 0.0426 ...
## NULL
parameters
is a data frame with the simulated
parameters, with a column for each of the parameters α, β and λ, and a row for each gene. There is
an additional column specifying from which regime that gene has been
simulated (see ?simulate_phenopath
for details).y
is the N × G matrix of gene
expressionx
is the N-length vector of covariatesz
is the true latent pseudotimesBy default this simulates the model for N = 100 cells and G = 40 genes.
For 8 representative genes we can visualise what the expression looks like over pseudotime:
genes_to_extract <- c(1,3,11,13,21,23,31,33)
expression_df <- as.data.frame(sim$y[,genes_to_extract])
names(expression_df) <- paste0("gene_", genes_to_extract)
df_gex <- as_tibble(expression_df) %>%
mutate(x = factor(sim[['x']]), z = sim[['z']]) %>%
gather(gene, expression, -x, -z)
ggplot(df_gex, aes(x = z, y = expression, color = x)) +
geom_point() +
facet_wrap(~ gene, nrow = 2) +
scale_color_brewer(palette = "Set1")
We see for the first two genes there is differential expression only, genes 3 and 4 show a pseudotime dependence, genes 5 and 6 show pseudotime-covariate interactions (but marginally no differential expression), while genes 7 and 8 show differential expression, pseudotime dependence and pseudotime-covariate interactions.
We can further plot this in PCA space, coloured by both covariate and pseudotime to get an idea of how these look in the reduced dimension:
PhenoPath fits models with a call to the phenopath
function, which requires at least an expression matrix y
and a covariate vector x
. The expression data should
represent something comparable to logged counts, such as log2(TPM + 1).
Note that buy default PhenoPath centre-scales the input expression.
For use with SummarizedExperiment
s see the section on input formats. For this example we
choose an ELBO tolerance of 1e-6
and ELBO calculations
thinned by 40
. A discussion of how to control variational
inference can be found below.
## Iteration ELBO Change (%)
## [ 40 ] -371.028993280215 Inf
## [ 80 ] -370.889836821025 0.000937990512105118
## [ 120 ] -370.810424921065 0.000535394197569484
## [ 160 ] -370.764023874668 0.000312874519971683
## [ 200 ] -370.736371282354 0.000186470727285231
## [ 240 ] -370.719625484182 0.000112927648152733
## [ 280 ] -370.709354564873 6.92653097539588e-05
## [ 320 ] -370.702991977764 4.29089274084775e-05
## [ 360 ] -370.699020180316 2.67858642149206e-05
## [ 400 ] -370.696526275701 1.68190449510297e-05
## [ 440 ] -370.694953393122 1.06076611305052e-05
## [ 480 ] -370.693958071552 6.7125559262458e-06
## [ 520 ] -370.693326646958 4.25840275431553e-06
## [ 560 ] -370.692925319371 2.70660403704523e-06
## [ 600 ] -370.692669877555 1.72273312430862e-06
## [ 640 ] -370.692507117574 1.09767514679963e-06
## [ 680 ] -370.692403328497 6.99967650996422e-07
## PhenoPath fit with 100 cells and 40 genes
The phenopath
function will print progress on
iterations, ELBO, and % change in ELBO that may be turned off by setting
verbose = FALSE
in the call.
Once the model has been fit it is important to check convergence with
a call to plot_elbo(fit)
to ensure the ELBO is
approximately flat:
The posterior mean estimates of the pseudotimes z sit in fit$m_z
that
can be extracted using the trajectory
function. We can
visualise these compared to both the true pseudotimes and the first
principal component of the data:
## Warning: `qplot()` was deprecated in ggplot2 3.4.0.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
We see that this has high correlation with the true pseudotimes:
## [1] -0.9976558
Next, we’re interested in the interactions between the latent space and the covariates. There are three functions to help us here:
interaction_effects
retrieves the posterior interaction
effect sizesinteraction_sds
retrieves the posterior interaction
standard deviationssignificant_interactions
applies a Bayesian significant
test to the interaction parametersNote that if P = 1 (ie there is only one covariate) each of these will return a vector, while if P > 1 then a matrix is returned.
Alternatively, one can call the interactions
function
that returns a data frame with the following entries:
feature
The feature (usually gene)covariate
The covariate, specified from the order
originally supplied to the call to phenopath
interaction_effect_size
The effect size of the
interaction (β from the
statistical model)significant
Boolean for whether the interaction effect
is significantly different from 0chi
The precision of the ARD prior on βpathway_loading
The pathway loading λ, showing the overall effect for
each gene marginalised over the covariateIn our simulated data above, the first 20 genes were simulated with no interaction effect while the latter 20 were simulated with interaction effects. We can pull out the interaction effect sizes, standard deviations, and significance test results into a data frame and plot:
gene_names <- paste0("gene", seq_len(ncol(fit$m_beta)))
df_beta <- data_frame(beta = interaction_effects(fit),
beta_sd = interaction_sds(fit),
is_sig = significant_interactions(fit),
gene = gene_names)
## Warning: `data_frame()` was deprecated in tibble 1.1.0.
## ℹ Please use `tibble()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
df_beta$gene <- fct_relevel(df_beta$gene, gene_names)
ggplot(df_beta, aes(x = gene, y = beta, color = is_sig)) +
geom_point() +
geom_errorbar(aes(ymin = beta - 2 * beta_sd, ymax = beta + 2 * beta_sd)) +
theme(axis.text.x = element_text(angle = 90, hjust = 1),
axis.title.x = element_blank()) +
ylab(expression(beta)) +
scale_color_brewer(palette = "Set2", name = "Significant")
A typical analysis might follow by graphing the largest effect size; we can do this as follows:
which_largest <- which.max(df_beta$beta)
df_large <- data_frame(
y = sim[['y']][, which_largest],
x = factor(sim[['x']]),
z = sim[['z']]
)
ggplot(df_large, aes(x = z, y = y, color = x)) +
geom_point() +
scale_color_brewer(palette = "Set1") +
stat_smooth()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
SummarizedExperiment
as inputAlternatively you might have expression values in an
SummarizedExperiment
. For single-cell data it is highly
recommended to use the SummarizedExperiment
in which case data is stored in a class derived from
SummarizedExperiment
.
We’ll first construct an example SummarizedExperiment
using our previous simulation data:
suppressPackageStartupMessages(library(SummarizedExperiment))
exprs_mat <- t(sim$y)
pdata <- data.frame(x = sim$x)
sce <- SummarizedExperiment(assays = list(exprs = exprs_mat),
colData = pdata)
sce
## class: SummarizedExperiment
## dim: 40 100
## metadata(0):
## assays(1): exprs
## rownames: NULL
## rowData names(0):
## colnames: NULL
## colData names(1): x
Note that PhenoPath will use by default is in the exprs
assay of a SummarizedExperiment
(ie
assay(sce, "exprs")
) as gene expression values, which can
be changed using the sce_assay
string in the column to
phenopath
.
We can then pass the x covariates to PhenoPath in three ways:
colData(sce)
to
usecolData(sce)
For our example, these three look like
Note that if the column of the SCESet is a factor it is coerced into a one-hot encoding. The intercept term is then removed as this is taken care of by the λ coefficient automatically, and the columns centre-scaled.
The posterior distribution is naturally multi-modal and the use of
variational inference means we essentially dive straight into the
nearest local maximum. Therefore, correct initialisation of the latent
space is important. This is controlled through the z_init
argument.
PhenoPath allows for three different initialisations:
For our example these three look like
There are several parameters that tune the coordinate ascent variational inference (CAVI):
maxiter
maximum number of iterations to run CAVI
forelbo_tol
the percentage change in the ELBO
below which the model is considered convergedthin
Computing the ELBO is expensive, so only compute
the ELBO every thin
iterationsFor example:
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] SummarizedExperiment_1.37.0 Biobase_2.67.0
## [3] GenomicRanges_1.59.1 GenomeInfoDb_1.43.2
## [5] IRanges_2.41.2 S4Vectors_0.45.2
## [7] BiocGenerics_0.53.3 generics_0.1.3
## [9] MatrixGenerics_1.19.0 matrixStats_1.4.1
## [11] phenopath_1.31.0 forcats_1.0.0
## [13] tidyr_1.3.1 ggplot2_3.5.1
## [15] dplyr_1.1.4 BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] gtable_0.3.6 xfun_0.49 bslib_0.8.0
## [4] lattice_0.22-6 vctrs_0.6.5 tools_4.4.2
## [7] tibble_3.2.1 pkgconfig_2.0.3 Matrix_1.7-1
## [10] RColorBrewer_1.1-3 lifecycle_1.0.4 GenomeInfoDbData_1.2.13
## [13] compiler_4.4.2 farver_2.1.2 munsell_0.5.1
## [16] codetools_0.2-20 htmltools_0.5.8.1 sys_3.4.3
## [19] buildtools_1.0.0 sass_0.4.9 yaml_2.3.10
## [22] pillar_1.10.0 crayon_1.5.3 jquerylib_0.1.4
## [25] cachem_1.1.0 DelayedArray_0.33.3 abind_1.4-8
## [28] nlme_3.1-166 tidyselect_1.2.1 digest_0.6.37
## [31] purrr_1.0.2 maketools_1.3.1 labeling_0.4.3
## [34] splines_4.4.2 fastmap_1.2.0 grid_4.4.2
## [37] colorspace_2.1-1 cli_3.6.3 SparseArray_1.7.2
## [40] magrittr_2.0.3 S4Arrays_1.7.1 withr_3.0.2
## [43] scales_1.3.0 UCSC.utils_1.3.0 rmarkdown_2.29
## [46] XVector_0.47.1 httr_1.4.7 evaluate_1.0.1
## [49] knitr_1.49 mgcv_1.9-1 rlang_1.1.4
## [52] Rcpp_1.0.13-1 glue_1.8.0 BiocManager_1.30.25
## [55] jsonlite_1.8.9 R6_2.5.1