Introduction to PhenoPath

Overview of PhenoPath

The PhenoPath model

PhenoPath models gene expression expression y in terms of a latent pathway score (pseudotime) z. Uniquely, the evolution of genes along the trajectory isn’t common to each gene but can be perturbed by an additional sample-specific covariate β. For example, this could be the mutational status of each sample or a drug that each sample was exposed to.

This software infers both the latent pathway scores zn and interaction coefficients βpg for samples n = 1, …, N, covariates p = 1, …, P and genes G = 1, …, G.

Mean-field variational inference

Inference is performed using co-ordinate ascent mean field variational inference. This attempts to find a set of approximating distributions q(θ) = ∏iqi(θi) for each variable i by minimising the KL divergence between the KL divergence between the variational distribution and the true posterior. For a good overview of variational inference see Blei, Kucukelbir, and McAuliffe (2016).

Example on simulated data

Simulating data

We can simulate data according to the PhenoPath model via a call to simulate_phenopath():

set.seed(123L)
sim <- simulate_phenopath()
## Warning: `as_data_frame()` was deprecated in tibble 2.0.0.
## ℹ Please use `as_tibble()` (with slightly different semantics) to convert to a
##   tibble, or `as.data.frame()` to convert to a data frame.
## ℹ The deprecated feature was likely used in the phenopath package.
##   Please report the issue to the authors.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: The `x` argument of `as_tibble.matrix()` must have unique column names if
## `.name_repair` is omitted as of tibble 2.0.0.
## ℹ Using compatibility `.name_repair`.
## ℹ The deprecated feature was likely used in the tibble package.
##   Please report the issue at <https://github.com/tidyverse/tibble/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

This returns a list with four entries:

print(str(sim))
## List of 4
##  $ parameters: tibble [40 × 4] (S3: tbl_df/tbl/data.frame)
##   ..$ alpha : num [1:40] 1 1 1 -1 1 1 1 -1 1 1 ...
##   ..$ lambda: num [1:40] 0 0 0 0 0 0 0 0 0 0 ...
##   ..$ beta  : num [1:40] 0 0 0 0 0 0 0 0 0 0 ...
##   ..$ regime: chr [1:40] "de" "de" "de" "de" ...
##  $ y         : num [1:100, 1:40] 1.001 0.999 1.001 1.002 0.999 ...
##  $ x         : num [1:100] 1 1 1 1 1 1 1 1 1 1 ...
##  $ z         : num [1:100] -0.713 -0.3512 1.6085 -0.0218 0.0426 ...
## NULL
  • parameters is a data frame with the simulated parameters, with a column for each of the parameters α, β and λ, and a row for each gene. There is an additional column specifying from which regime that gene has been simulated (see ?simulate_phenopath for details).
  • y is the N × G matrix of gene expression
  • x is the N-length vector of covariates
  • z is the true latent pseudotimes

By default this simulates the model for N = 100 cells and G = 40 genes.

For 8 representative genes we can visualise what the expression looks like over pseudotime:

genes_to_extract <- c(1,3,11,13,21,23,31,33)
expression_df <- as.data.frame(sim$y[,genes_to_extract])
names(expression_df) <- paste0("gene_", genes_to_extract)

df_gex <- as_tibble(expression_df) %>% 
  mutate(x = factor(sim[['x']]), z = sim[['z']]) %>% 
  gather(gene, expression, -x, -z)

ggplot(df_gex, aes(x = z, y = expression, color = x)) +
  geom_point() +
  facet_wrap(~ gene, nrow = 2) +
  scale_color_brewer(palette = "Set1")

We see for the first two genes there is differential expression only, genes 3 and 4 show a pseudotime dependence, genes 5 and 6 show pseudotime-covariate interactions (but marginally no differential expression), while genes 7 and 8 show differential expression, pseudotime dependence and pseudotime-covariate interactions.

We can further plot this in PCA space, coloured by both covariate and pseudotime to get an idea of how these look in the reduced dimension:

pca_df <- as_tibble(as.data.frame(prcomp(sim$y)$x[,1:2])) %>% 
  mutate(x = factor(sim[['x']]), z = sim[['z']])

ggplot(pca_df, aes(x = PC1, y = PC2, color = x)) +
  geom_point() + scale_colour_brewer(palette = "Set1")

ggplot(pca_df, aes(x = PC1, y = PC2, color = z)) +
  geom_point()

Fit PhenoPath model

PhenoPath fits models with a call to the phenopath function, which requires at least an expression matrix y and a covariate vector x. The expression data should represent something comparable to logged counts, such as log2(TPM + 1). Note that buy default PhenoPath centre-scales the input expression.

For use with SummarizedExperiments see the section on input formats. For this example we choose an ELBO tolerance of 1e-6 and ELBO calculations thinned by 40. A discussion of how to control variational inference can be found below.

fit <- phenopath(sim$y, sim$x, elbo_tol = 1e-6, thin = 40)
## Iteration    ELBO    Change (%) 
## [ 40 ]    -371.028993280215   Inf 
## [ 80 ]    -370.889836821025   0.000937990512105118 
## [ 120 ]   -370.810424921065   0.000535394197569484 
## [ 160 ]   -370.764023874668   0.000312874519971683 
## [ 200 ]   -370.736371282354   0.000186470727285231 
## [ 240 ]   -370.719625484182   0.000112927648152733 
## [ 280 ]   -370.709354564873   6.92653097539588e-05 
## [ 320 ]   -370.702991977764   4.29089274084775e-05 
## [ 360 ]   -370.699020180316   2.67858642149206e-05 
## [ 400 ]   -370.696526275701   1.68190449510297e-05 
## [ 440 ]   -370.694953393122   1.06076611305052e-05 
## [ 480 ]   -370.693958071552   6.7125559262458e-06 
## [ 520 ]   -370.693326646958   4.25840275431553e-06 
## [ 560 ]   -370.692925319371   2.70660403704523e-06 
## [ 600 ]   -370.692669877555   1.72273312430862e-06 
## [ 640 ]   -370.692507117574   1.09767514679963e-06 
## [ 680 ]   -370.692403328497   6.99967650996422e-07
print(fit)
## PhenoPath fit with 100 cells and 40 genes

The phenopath function will print progress on iterations, ELBO, and % change in ELBO that may be turned off by setting verbose = FALSE in the call.

Once the model has been fit it is important to check convergence with a call to plot_elbo(fit) to ensure the ELBO is approximately flat:

plot_elbo(fit)

Examining results

The posterior mean estimates of the pseudotimes z sit in fit$m_z that can be extracted using the trajectory function. We can visualise these compared to both the true pseudotimes and the first principal component of the data:

qplot(sim$z, trajectory(fit)) +
  xlab("True z") + ylab("Phenopath z")
## Warning: `qplot()` was deprecated in ggplot2 3.4.0.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
qplot(sim$z, pca_df$PC1) +
  xlab("True z") + ylab("PC1")

We see that this has high correlation with the true pseudotimes:

cor(sim$z, trajectory(fit))
## [1] -0.9976558

Next, we’re interested in the interactions between the latent space and the covariates. There are three functions to help us here:

  • interaction_effects retrieves the posterior interaction effect sizes
  • interaction_sds retrieves the posterior interaction standard deviations
  • significant_interactions applies a Bayesian significant test to the interaction parameters

Note that if P = 1 (ie there is only one covariate) each of these will return a vector, while if P > 1 then a matrix is returned.

Alternatively, one can call the interactions function that returns a data frame with the following entries:

  • feature The feature (usually gene)
  • covariate The covariate, specified from the order originally supplied to the call to phenopath
  • interaction_effect_size The effect size of the interaction (β from the statistical model)
  • significant Boolean for whether the interaction effect is significantly different from 0
  • chi The precision of the ARD prior on β
  • pathway_loading The pathway loading λ, showing the overall effect for each gene marginalised over the covariate

In our simulated data above, the first 20 genes were simulated with no interaction effect while the latter 20 were simulated with interaction effects. We can pull out the interaction effect sizes, standard deviations, and significance test results into a data frame and plot:

gene_names <- paste0("gene", seq_len(ncol(fit$m_beta)))
df_beta <- data_frame(beta = interaction_effects(fit),
                      beta_sd = interaction_sds(fit),
                      is_sig = significant_interactions(fit),
                      gene = gene_names)
## Warning: `data_frame()` was deprecated in tibble 1.1.0.
## ℹ Please use `tibble()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
df_beta$gene <- fct_relevel(df_beta$gene, gene_names)

ggplot(df_beta, aes(x = gene, y = beta, color = is_sig)) + 
  geom_point() +
  geom_errorbar(aes(ymin = beta - 2 * beta_sd, ymax = beta + 2 * beta_sd)) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1),
        axis.title.x = element_blank()) +
  ylab(expression(beta)) +
  scale_color_brewer(palette = "Set2", name = "Significant")

A typical analysis might follow by graphing the largest effect size; we can do this as follows:

which_largest <- which.max(df_beta$beta)

df_large <- data_frame(
  y = sim[['y']][, which_largest],
  x = factor(sim[['x']]),
  z = sim[['z']]
)

ggplot(df_large, aes(x = z, y = y, color = x)) +
  geom_point() +
  scale_color_brewer(palette = "Set1") +
  stat_smooth()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Advanced

Using an SummarizedExperiment as input

Alternatively you might have expression values in an SummarizedExperiment. For single-cell data it is highly recommended to use the SummarizedExperiment in which case data is stored in a class derived from SummarizedExperiment.

We’ll first construct an example SummarizedExperiment using our previous simulation data:

suppressPackageStartupMessages(library(SummarizedExperiment))
exprs_mat <- t(sim$y)
pdata <- data.frame(x = sim$x)
sce <- SummarizedExperiment(assays = list(exprs = exprs_mat), 
                            colData = pdata)
sce
## class: SummarizedExperiment 
## dim: 40 100 
## metadata(0):
## assays(1): exprs
## rownames: NULL
## rowData names(0):
## colnames: NULL
## colData names(1): x

Note that PhenoPath will use by default is in the exprs assay of a SummarizedExperiment (ie assay(sce, "exprs")) as gene expression values, which can be changed using the sce_assay string in the column to phenopath.

We can then pass the x covariates to PhenoPath in three ways:

  1. As a vector or matrix as before
  2. As a character that names a column of colData(sce) to use
  3. A formula to build a model matrix from colData(sce)

For our example, these three look like

fit <- phenopath(sce, sim$x) # 1
fit <- phenopath(sce, "x") # 2
fit <- phenopath(sce, ~ x) # 3

Note that if the column of the SCESet is a factor it is coerced into a one-hot encoding. The intercept term is then removed as this is taken care of by the λ coefficient automatically, and the columns centre-scaled.

Initialisation of latent space

The posterior distribution is naturally multi-modal and the use of variational inference means we essentially dive straight into the nearest local maximum. Therefore, correct initialisation of the latent space is important. This is controlled through the z_init argument.

PhenoPath allows for three different initialisations:

  1. An integer specifying a principal component of the data to initialise to
  2. A vector specifying the initial values
  3. Random initialisation from standard normal distribution

For our example these three look like

fit <- phenopath(sim$y, sim$x, z_init = 1) # 1, initialise to first principal component
fit <- phenopath(sim$y, sim$x, z_init = sim$z) # 2, initialise to true values
fit <- phenopath(sim$y, sim$x, z_init = "random") # 3, random initialisation

Controlling variational inference

There are several parameters that tune the coordinate ascent variational inference (CAVI):

  1. maxiter maximum number of iterations to run CAVI for
  2. elbo_tol the percentage change in the ELBO below which the model is considered converged
  3. thin Computing the ELBO is expensive, so only compute the ELBO every thin iterations

For example:

fit <- phenopath(sim$y, sim$x,
                 maxiter = 1000, # 1000 iterations max
                 elbo_tol = 1e-2, # consider model converged when change in ELBO < 0.02%
                 thin = 20 # calculate ELBO every 20 iterations
                 )

Technical

sessionInfo()
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Etc/UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] SummarizedExperiment_1.37.0 Biobase_2.67.0             
##  [3] GenomicRanges_1.59.1        GenomeInfoDb_1.43.2        
##  [5] IRanges_2.41.2              S4Vectors_0.45.2           
##  [7] BiocGenerics_0.53.3         generics_0.1.3             
##  [9] MatrixGenerics_1.19.0       matrixStats_1.4.1          
## [11] phenopath_1.31.0            forcats_1.0.0              
## [13] tidyr_1.3.1                 ggplot2_3.5.1              
## [15] dplyr_1.1.4                 BiocStyle_2.35.0           
## 
## loaded via a namespace (and not attached):
##  [1] gtable_0.3.6            xfun_0.49               bslib_0.8.0            
##  [4] lattice_0.22-6          vctrs_0.6.5             tools_4.4.2            
##  [7] tibble_3.2.1            pkgconfig_2.0.3         Matrix_1.7-1           
## [10] RColorBrewer_1.1-3      lifecycle_1.0.4         GenomeInfoDbData_1.2.13
## [13] compiler_4.4.2          farver_2.1.2            munsell_0.5.1          
## [16] codetools_0.2-20        htmltools_0.5.8.1       sys_3.4.3              
## [19] buildtools_1.0.0        sass_0.4.9              yaml_2.3.10            
## [22] pillar_1.10.0           crayon_1.5.3            jquerylib_0.1.4        
## [25] cachem_1.1.0            DelayedArray_0.33.3     abind_1.4-8            
## [28] nlme_3.1-166            tidyselect_1.2.1        digest_0.6.37          
## [31] purrr_1.0.2             maketools_1.3.1         labeling_0.4.3         
## [34] splines_4.4.2           fastmap_1.2.0           grid_4.4.2             
## [37] colorspace_2.1-1        cli_3.6.3               SparseArray_1.7.2      
## [40] magrittr_2.0.3          S4Arrays_1.7.1          withr_3.0.2            
## [43] scales_1.3.0            UCSC.utils_1.3.0        rmarkdown_2.29         
## [46] XVector_0.47.1          httr_1.4.7              evaluate_1.0.1         
## [49] knitr_1.49              mgcv_1.9-1              rlang_1.1.4            
## [52] Rcpp_1.0.13-1           glue_1.8.0              BiocManager_1.30.25    
## [55] jsonlite_1.8.9          R6_2.5.1

References

Blei, David M, Alp Kucukelbir, and Jon D McAuliffe. 2016. “Variational Inference: A Review for Statisticians.” arXiv Preprint arXiv:1601.00670.