Introduction to MFA

Introduction

mfa is an R package for fitting a Bayesian mixture of factor analysers to infer developmental trajectories with bifurcations from single-cell gene expression data. It is able to jointly infer pseudotimes, branching, and genes differentially regulated across branches using a generative, Bayesian hierarchical model. Inference is performed using fast Gibbs sampling.

Installation

mfa can be installed in one of two ways:

From Bioconductor

if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install("mfa")
library(mfa)

From Github

This requires the devtools package to be installed first:

install.packages("devtools") # If not already installed
devtools::install_github("kieranrcampbell/mfa")
library(mfa)

An example on synthetic data

Generating synthetic data

We first create some synthetic data for 100 cells and 40 genes by calling the mfa function create_synthetic. This returns a list with the gene expression matrix, the true pseudotimes and branch allocations, and the parameter values used in the simulation:

synth <- create_synthetic(C = 100, G = 40)
str(synth)
## List of 7
##  $ X          : num [1:100, 1:40] 13.5 11.9 11.1 2.8 15.5 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:100] "cell1" "cell2" "cell3" "cell4" ...
##   .. ..$ : chr [1:40] "feature1" "feature2" "feature3" "feature4" ...
##  $ branch     : int [1:100] 0 0 1 1 0 0 0 0 0 0 ...
##  $ pst        : num [1:100] 0.654 0.589 0.644 0.283 0.907 ...
##  $ k          : num [1:40, 1:2] 6.93 -8.43 -7.42 9.14 -7.49 ...
##  $ phi        : num [1:40, 1:2] 7.17 8.09 6.91 9.6 5.5 ...
##  $ delta      : num [1:40, 1:2] 0.383 0.0128 0.3851 0.4239 0.3837 ...
##  $ p_transient: num 0

We can then run PCA on the expression matrix and put the first two components into a tidy format:

library(dplyr)    # provides %>%, mutate() and as_tibble()
library(ggplot2)  # used for the plots below
df_synth <- as_tibble(prcomp(synth$X)$x[, 1:2]) %>% 
  mutate(pseudotime = synth$pst,
         branch = factor(synth$branch))

and have a look at a PCA representation, coloured by both pseudotime and branch allocation:

ggplot(df_synth, aes(x = PC1, y = PC2, color = pseudotime)) + geom_point()

ggplot(df_synth, aes(x = PC1, y = PC2, color = branch)) + geom_point()

Calling mfa

The input to mfa is either an ExpressionSet (e.g. from using the scater package) or a cell-by-gene expression matrix. If an ExpressionSet is provided then the values in the exprs slot are used for gene expression.

We invoke mfa with a call to the mfa(...) function. Depending on the size of the dataset and number of MCMC iterations used, this may take some time:

m <- mfa(synth$X)
print(m)
## MFA fit with
##  100 cells and 40 genes
##  ( 2000 iterations )

Particular care must be taken with the initialisation of the pseudotimes: by default they are initialised to the first principal component, but if the researcher suspects (for example, from plotting marker genes) that the trajectory corresponds to a different PC, this can be set using the pc_initialise argument.
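One way to check which principal component tracks the trajectory is to correlate each PC with a known marker gene. The sketch below uses only base R on simulated data. The matrix X, the choice of gene1 as the marker, and the use of absolute correlation as the criterion are illustrative assumptions and not part of mfa:

```r
set.seed(1)
# Simulated cell-by-gene matrix standing in for real expression data (hypothetical).
n_cells <- 100
n_genes <- 20
latent <- runif(n_cells)  # hidden ordering of the cells
X <- sapply(seq_len(n_genes), function(g) 5 * latent + rnorm(n_cells, sd = 0.5))
colnames(X) <- paste0("gene", seq_len(n_genes))

# Scores of the cells on the first three principal components.
pcs <- prcomp(X)$x[, 1:3]

# Absolute correlation of each PC with a marker gene (here gene1, an assumption).
marker_cor <- abs(cor(pcs, X[, "gene1"]))[, 1]
best_pc <- unname(which.max(marker_cor))
print(round(marker_cor, 2))
print(best_pc)

# With mfa loaded, the chosen component could then be passed on, e.g.
# m <- mfa(X, pc_initialise = best_pc)
```

In real data it is worth checking several marker genes rather than one, since a single gene may be noisy.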

MCMC diagnostics

As in any MCMC analysis, basic care is needed to make sure the samples have converged to something resembling the stationary distribution (see e.g. Cowles and Carlin (1996) for a full discussion).

For a quick check, mfa provides two functions, plot_mfa_trace and plot_mfa_autocorr, which plot the trace and autocorrelation of the posterior log-likelihood:

plot_mfa_trace(m)

plot_mfa_autocorr(m)
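To make concrete what these plots are checking, here is a minimal base R sketch on a simulated chain. The vector lp is a hypothetical stand-in for a log-likelihood trace, and the two checks shown (autocorrelation decay and a comparison of the two halves of the chain) are informal heuristics rather than anything mfa computes itself:

```r
set.seed(42)
# Simulated, autocorrelated chain standing in for a log-likelihood trace
# (hypothetical; with a fitted model one might inspect m$traces$lp_trace instead).
n_iter <- 1000
lp <- as.numeric(arima.sim(model = list(ar = 0.5), n = n_iter))

# Autocorrelation at lags 0..20; values that decay quickly towards zero
# suggest the chain is mixing well.
ac <- acf(lp, lag.max = 20, plot = FALSE)$acf[, 1, 1]
print(round(ac[1:6], 2))

# Crude stationarity check: the two halves of the chain should have similar means.
first_half  <- mean(lp[seq_len(n_iter / 2)])
second_half <- mean(lp[(n_iter / 2 + 1):n_iter])
print(c(first_half = first_half, second_half = second_half))
```

Slowly decaying autocorrelation or clearly different half-chain means would suggest running more iterations, a longer burn-in, or more thinning.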

Plotting results

We can extract posterior mean estimates along with credible intervals using the summary function:

ms <- summary(m)
print(head(ms))
## # A tibble: 6 × 5
##   pseudotime branch branch_certainty pseudotime_lower pseudotime_upper
##        <dbl> <fct>             <dbl>            <dbl>            <dbl>
## 1     -1.17  1                     1           -1.35            -0.973
## 2     -0.854 1                     1           -0.990           -0.611
## 3     -0.366 2                     1           -0.464           -0.104
## 4      0.819 1                     1            0.462            0.979
## 5     -1.68  1                     1           -1.95            -1.55 
## 6     -0.926 1                     1           -1.14            -0.752

This has five columns:

  • pseudotime The MAP pseudotime estimate
  • branch The MAP branch estimate
  • branch_certainty The proportion of MCMC traces (after burn-in) for which the cell was assigned to the MAP branch
  • pseudotime_lower and pseudotime_upper The lower and upper bounds of the 95% highest-posterior-density credible interval for the pseudotime
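The MAP branch and branch_certainty columns can be understood with a small base R sketch. The matrix branch_samples below is simulated and hypothetical; it only illustrates the definition given above (the proportion of post-burn-in samples in which a cell is assigned to its most frequent branch) and is not the code mfa uses internally:

```r
set.seed(7)
# Hypothetical post-burn-in branch samples: rows are MCMC iterations, columns are cells.
# Cell 1 is almost always in branch 1; cell 2 is split more evenly.
n_iter <- 500
branch_samples <- cbind(
  cell1 = sample(1:2, n_iter, replace = TRUE, prob = c(0.95, 0.05)),
  cell2 = sample(1:2, n_iter, replace = TRUE, prob = c(0.60, 0.40))
)

# Most frequently sampled branch for each cell (the MAP branch).
map_branch <- apply(branch_samples, 2, function(z) which.max(tabulate(z, nbins = 2)))

# Proportion of samples that agree with the MAP branch.
certainty <- sapply(seq_len(ncol(branch_samples)),
                    function(i) mean(branch_samples[, i] == map_branch[i]))
names(certainty) <- colnames(branch_samples)

print(map_branch)
print(round(certainty, 2))
```

With two branches the certainty can never fall below 0.5, so values close to 0.5 mark cells whose branch assignment is essentially undetermined.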

We can compare the inferred pseudotimes to the true values:

ggplot(data.frame(true = synth$pst, inferred = ms$pseudotime,
                  branch = factor(synth$branch)),
       aes(x = true, y = inferred, color = branch)) +
  geom_point() +
  xlab('True pseudotime') + ylab('Inferred pseudotime') +
  scale_color_discrete(name = 'True\nbranch')

Similarly, we can plot the PCA representation coloured by MAP branch:

mutate(df_synth, inferred_branch = ms[['branch']]) %>% 
  ggplot(aes(x = PC1, y = PC2, color = inferred_branch)) +
  geom_point() +
  scale_color_discrete(name = 'Inferred\nbranch')

Finding genes that bifurcate

A distinctive feature of this model is that, through an ARD-like (automatic relevance determination) prior structure on the loading matrices, we can automatically infer which genes are involved in the bifurcation process. For a quick-and-dirty look we can use the plot_chi function, where larger values of inverse-chi imply the gene is associated with the bifurcation:

plot_chi(m)

To calculate the MAP values for chi we can call the calculate_chi function, which returns a tibble with the feature names and values:

posterior_chi_df <- calculate_chi(m)
head(posterior_chi_df)
## # A tibble: 6 × 2
##   feature  chi_map
##   <chr>      <dbl>
## 1 feature1   1.54 
## 2 feature2   0.619
## 3 feature3   1.01 
## 4 feature4   0.530
## 5 feature5   1.38 
## 6 feature6   0.677
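A natural next step is to rank the genes by this value. The sketch below re-enters the six rows printed above as a plain data frame so that it runs without mfa. It assumes that chi_map is on the same inverse-chi scale as plot_chi, so that larger values indicate stronger association with the bifurcation; the output above does not confirm this, so check the direction before relying on the ranking:

```r
# The six rows printed above, re-entered by hand as a plain data frame
# (a stand-in for the tibble returned by calculate_chi).
chi_df <- data.frame(
  feature = paste0("feature", 1:6),
  chi_map = c(1.54, 0.619, 1.01, 0.530, 1.38, 0.677)
)

# Sort so that the largest values come first
# (assumed here to be the genes most associated with the bifurcation).
ranked <- chi_df[order(chi_df$chi_map, decreasing = TRUE), ]
print(ranked)
```

The same order() call should work directly on posterior_chi_df, since tibbles support base R row indexing.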

Advanced usage

The mfa class

A call to mfa(...) returns an mfa object that contains all the information about the dataset and the MCMC inference performed. Note that it does not contain a copy of the original data. We can see the structure by calling str on an mfa object:

str(m, max.level = 1)
## List of 10
##  $ traces       :List of 10
##  $ iter         : num 2000
##  $ thin         : num 1
##  $ burn         : num 1000
##  $ b            : num 2
##  $ collapse     : logi FALSE
##  $ N            : int 100
##  $ G            : int 40
##  $ feature_names: chr [1:40] "feature1" "feature2" "feature3" "feature4" ...
##  $ cell_names   : chr [1:100] "cell1" "cell2" "cell3" "cell4" ...
##  - attr(*, "class")= chr "mfa"

This contains the following slots:

  • traces - the raw MCMC traces (discussed in following section)
  • iter - the number of MCMC iterations
  • thin - the thinning of the MCMC chain
  • burn - the number of MCMC iterations thrown away as burn-in
  • b - the number of branches modelled
  • collapse - whether collapsed Gibbs sampling was implemented
  • N - the number of cells
  • G - the number of features (e.g. genes)
  • feature_names - the names of the features (e.g. genes)
  • cell_names - the names of the cells

Accessing MCMC traces

MCMC traces can be accessed through the traces slot of an mfa object. This gives a list with an element for each variable, along with the log-likelihood:

print(names(m$traces))
##  [1] "tau_trace"          "gamma_trace"        "pst_trace"         
##  [4] "theta_trace"        "lambda_theta_trace" "chi_trace"         
##  [7] "eta_trace"          "k_trace"            "c_trace"           
## [10] "lp_trace"

For non-branch-specific variables this is simply a matrix. For example, the trace for the variable τ is just an iteration-by-gene matrix:

str(m$traces$tau_trace)
##  num [1:1000, 1:40] 13.1 11.9 11.3 13.5 14.1 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : NULL
##   ..$ : chr [1:40] "tau[1]" "tau[2]" "tau[3]" "tau[4]" ...

We can easily get the posterior mean by calling colMeans. More sophisticated posterior density estimation can be performed using the MCMCglmm package, such as posterior.mode(...) for MAP estimation (though in practice this is often similar to the posterior mean). We can estimate posterior intervals using the HPDinterval(...) function from the coda package (note that traces must be converted to coda mcmc objects, e.g. with mcmc(...), before calling either of these).
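As a rough illustration of these summaries, the sketch below works on a simulated matrix with the same iteration-by-gene layout as tau_trace; the values are hypothetical. It computes posterior means and equal-tailed intervals in base R, and HPD intervals only if the coda package happens to be installed:

```r
set.seed(3)
# Simulated iteration-by-gene trace standing in for m$traces$tau_trace (hypothetical values).
tau_trace <- matrix(rgamma(1000 * 4, shape = 20, rate = 2), nrow = 1000,
                    dimnames = list(NULL, paste0("tau[", 1:4, "]")))

# Posterior means, one per gene.
post_mean <- colMeans(tau_trace)
print(round(post_mean, 2))

# Equal-tailed 95% intervals from base R quantiles (these are not HPD intervals).
eq_tail <- t(apply(tau_trace, 2, quantile, probs = c(0.025, 0.975)))
print(round(eq_tail, 2))

# HPD intervals need the trace wrapped as a coda mcmc object first.
if (requireNamespace("coda", quietly = TRUE)) {
  hpd <- coda::HPDinterval(coda::mcmc(tau_trace), prob = 0.95)
  print(round(hpd, 2))
}
```

For roughly symmetric posteriors the equal-tailed and HPD intervals will be close; they differ more for skewed parameters.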

Some variables are branch dependent, meaning the traces returned are arrays (or tensors in fashionable speak) that have dimension iteration x gene x branch. An example is the k variable:

str(m$traces$k_trace)
##  num [1:1000, 1:40, 1:2] -0.984 -1.06 -1.072 -1.01 -1.081 ...

To get posterior means (or modes, or intervals) we then need to use the apply function to iterate over the branches. To find the posterior means of k, we call:

pmean_k <- apply(m$traces$k_trace, 3, colMeans)
str(pmean_k)
##  num [1:40, 1:2] -0.96 0.726 0.953 -0.97 0.967 ...

This returns a gene-by-branch matrix of posterior estimates.
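The same idea extends to intervals. The following base R sketch uses a simulated iteration x gene x branch array as a hypothetical stand-in for k_trace, and uses equal-tailed quantile intervals rather than HPD intervals to avoid extra dependencies. Comparing the intervals between branches is an illustrative heuristic, not a procedure taken from mfa:

```r
set.seed(11)
# Simulated iteration x gene x branch array standing in for m$traces$k_trace (hypothetical).
k_trace <- array(rnorm(1000 * 5 * 2), dim = c(1000, 5, 2))
k_trace[, , 2] <- k_trace[, , 2] + 5  # shift branch 2 so the branches clearly differ

# Posterior means: average over iterations (dimension 1), keeping gene and branch.
pmean_k <- apply(k_trace, c(2, 3), mean)

# Equal-tailed 95% intervals: the result has dimension bounds (2) x gene x branch.
ci_k <- apply(k_trace, c(2, 3), quantile, probs = c(0.025, 0.975))

# Genes whose branch 2 lower bound lies above the branch 1 upper bound.
separated <- ci_k[1, , 2] > ci_k[2, , 1]
print(dim(ci_k))
print(separated)
```

Here apply(k_trace, c(2, 3), mean) gives the same gene-by-branch matrix as apply(k_trace, 3, colMeans); the c(2, 3) form generalises more directly to functions such as quantile that return more than one value.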

Technical

sessionInfo()
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Etc/UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] dplyr_1.1.4      ggplot2_3.5.1    mfa_1.29.0       BiocStyle_2.35.0
## 
## loaded via a namespace (and not attached):
##  [1] gtable_0.3.6        tensorA_0.36.2.1    xfun_0.49          
##  [4] bslib_0.8.0         Biobase_2.67.0      GGally_2.2.1       
##  [7] lattice_0.22-6      vctrs_0.6.5         tools_4.4.2        
## [10] generics_0.1.3      parallel_4.4.2      tibble_3.2.1       
## [13] fansi_1.0.6         pkgconfig_2.0.3     Matrix_1.7-1       
## [16] RColorBrewer_1.1-3  ggmcmc_1.5.1.1      lifecycle_1.0.4    
## [19] cubature_2.1.1      compiler_4.4.2      farver_2.1.2       
## [22] MatrixModels_0.5-3  mcmc_0.9-8          munsell_0.5.1      
## [25] codetools_0.2-20    SparseM_1.84-2      quantreg_5.99.1    
## [28] htmltools_0.5.8.1   sys_3.4.3           buildtools_1.0.0   
## [31] sass_0.4.9          yaml_2.3.10         pillar_1.9.0       
## [34] jquerylib_0.1.4     tidyr_1.3.1         MASS_7.3-61        
## [37] cachem_1.1.0        nlme_3.1-166        ggstats_0.7.0      
## [40] tidyselect_1.2.1    digest_0.6.37       purrr_1.0.2        
## [43] maketools_1.3.1     labeling_0.4.3      splines_4.4.2      
## [46] fastmap_1.2.0       grid_4.4.2          colorspace_2.1-1   
## [49] cli_3.6.3           magrittr_2.0.3      survival_3.7-0     
## [52] utf8_1.2.4          ape_5.8             corpcor_1.6.10     
## [55] withr_3.0.2         scales_1.3.0        MCMCglmm_2.36      
## [58] rmarkdown_2.29      coda_0.19-4.1       evaluate_1.0.1     
## [61] knitr_1.49          rlang_1.1.4         MCMCpack_1.7-1     
## [64] Rcpp_1.0.13-1       glue_1.8.0          BiocManager_1.30.25
## [67] BiocGenerics_0.53.3 jsonlite_1.8.9      R6_2.5.1           
## [70] plyr_1.8.9

References

Cowles, Mary Kathryn, and Bradley P Carlin. 1996. “Markov Chain Monte Carlo Convergence Diagnostics: A Comparative Review.” Journal of the American Statistical Association 91 (434): 883–904.