mfa is an R package for fitting a Bayesian mixture of
factor analysers to infer developmental trajectories with bifurcations
from single-cell gene expression data. It is able to jointly infer
pseudotimes, branching, and genes differentially regulated across
branches using a generative, Bayesian hierarchical model. Inference is
performed using fast Gibbs sampling.
mfa can be installed in one of two ways:
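The two routes are not spelled out in the text. A minimal sketch, assuming the usual pair for a Bioconductor package: the released build via BiocManager, or a development build from GitHub. The GitHub repository path shown is an assumption and should be checked against the package's own page before use.

```r
# Option 1: released version from Bioconductor
# (assumes mfa is part of the Bioconductor release matching your R version)
if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}
BiocManager::install("mfa")

# Option 2: development version from GitHub.
# The repository path below is an assumption, not taken from the text.
# install.packages("devtools")
# devtools::install_github("kieranrcampbell/mfa")
```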
We first create some synthetic data for 100 cells and 40 genes
by calling the mfa function create_synthetic.
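A sketch of a call that would produce a dataset of this size. The argument names `C` (cells) and `G` (genes) are recalled from the package documentation rather than shown in this text, so check `?create_synthetic`; the seed is arbitrary and only makes the sketch repeatable.

```r
library(mfa)

set.seed(123L)  # arbitrary seed, not part of the original example

# Simulate a bifurcating trajectory: 100 cells (rows) by 40 genes (columns).
# Argument names C and G are assumed from ?create_synthetic.
synth <- create_synthetic(C = 100, G = 40)

# Inspect the components of the returned list
str(synth)
```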
This returns a list with the gene expression matrix, the true pseudotimes
and branch allocations, and the parameter values used in the simulation:
## List of 7
## $ X : num [1:100, 1:40] 13.5 11.9 11.1 2.8 15.5 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:100] "cell1" "cell2" "cell3" "cell4" ...
## .. ..$ : chr [1:40] "feature1" "feature2" "feature3" "feature4" ...
## $ branch : int [1:100] 0 0 1 1 0 0 0 0 0 0 ...
## $ pst : num [1:100] 0.654 0.589 0.644 0.283 0.907 ...
## $ k : num [1:40, 1:2] 6.93 -8.43 -7.42 9.14 -7.49 ...
## $ phi : num [1:40, 1:2] 7.17 8.09 6.91 9.6 5.5 ...
## $ delta : num [1:40, 1:2] 0.383 0.0128 0.3851 0.4239 0.3837 ...
## $ p_transient: num 0
## NULL
We can then run PCA on the expression matrix and put the result into a tidy format:
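# Project the cells onto the first two principal components and attach the true labels.
# as_tibble() replaces the deprecated as_data_frame(); dplyr provides it along with %>% and mutate().
df_synth <- as_tibble(prcomp(synth$X)$x[, 1:2]) %>%
  mutate(pseudotime = synth$pst,
         branch = factor(synth$branch))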
and have a look at a PCA representation, coloured by both pseudotime and branch allocation:
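The plots themselves are not reproduced here. A minimal sketch of how they could be drawn with ggplot2, assuming the PCA columns keep the default `prcomp` names `PC1` and `PC2`; the plot object names are illustrative only.

```r
library(mfa)
library(dplyr)
library(ggplot2)

# Setup repeated so that this snippet runs on its own
# (C and G argument names assumed from ?create_synthetic)
set.seed(123L)
synth <- create_synthetic(C = 100, G = 40)
df_synth <- as_tibble(prcomp(synth$X)$x[, 1:2]) %>%
  mutate(pseudotime = synth$pst,
         branch = factor(synth$branch))

# Same PCA embedding twice: coloured by true pseudotime, then by true branch
p_pst <- ggplot(df_synth, aes(x = PC1, y = PC2, color = pseudotime)) +
  geom_point()
p_branch <- ggplot(df_synth, aes(x = PC1, y = PC2, color = branch)) +
  geom_point()

print(p_pst)
print(p_branch)
```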
The input to mfa is either an ExpressionSet (e.g. from using the
package scater) or a cell-by-gene expression matrix. If an ExpressionSet
is provided then the values in the exprs slot are used for
gene expression.
We invoke mfa with a call to the mfa(...) function. Depending on the
size of the dataset and number of MCMC iterations used, this may take
some time:
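A minimal sketch of the call on the synthetic data, assuming the package defaults are used and that the cell-by-gene matrix is passed as the first argument. The synthetic data are recreated so the snippet runs on its own.

```r
library(mfa)

# Setup repeated so that this snippet runs on its own
# (C and G argument names assumed from ?create_synthetic)
set.seed(123L)
synth <- create_synthetic(C = 100, G = 40)

# Fit the model with default settings on the cell-by-gene matrix
m <- mfa(synth$X)

# Printing the fit gives a brief summary (cells, genes, iterations)
print(m)
```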
## MFA fit with
## 100 cells and 40 genes
## ( 2000 iterations )
Particular care must be taken with the initialisation of the
pseudotimes: by default they are initialised to the first principal
component, though if the researcher suspects (based on plotting marker
genes) that the trajectory corresponds to a different PC, this can be
set using the pc_initialise argument.
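For instance, if marker genes suggested that the trajectory runs along the second principal component, the call might look like the sketch below. It assumes `pc_initialise` takes the index of the component; the object name `m_pc2` is illustrative.

```r
library(mfa)

# Setup repeated so that this snippet runs on its own
# (C and G argument names assumed from ?create_synthetic)
set.seed(123L)
synth <- create_synthetic(C = 100, G = 40)

# Initialise the pseudotimes from PC2 instead of the default PC1
m_pc2 <- mfa(synth$X, pc_initialise = 2)
print(m_pc2)
```

On this synthetic dataset there is no particular reason to prefer PC2; the value is chosen only to show the argument in use.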
As in any MCMC analysis, basic care is needed to make sure the samples have converged to something resembling the stationary distribution (see e.g. Cowles and Carlin (1996) for a full discussion).
For a quick check of convergence, mfa provides two functions,
plot_mfa_trace and plot_mfa_autocorr, which plot the trace and the
autocorrelation of the posterior log-likelihood:
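A sketch of how the two diagnostics might be called, assuming each takes the fitted mfa object as its only required argument. The plot object names are illustrative.

```r
library(mfa)

# Setup repeated so that this snippet runs on its own
# (C and G argument names assumed from ?create_synthetic)
set.seed(123L)
synth <- create_synthetic(C = 100, G = 40)
m <- mfa(synth$X)

# Trace of the posterior log-likelihood across retained iterations
p_trace <- plot_mfa_trace(m)
print(p_trace)

# Autocorrelation of the same quantity at increasing lags
p_autocorr <- plot_mfa_autocorr(m)
print(p_autocorr)
```

A trace with no visible trend and an autocorrelation that decays quickly towards zero are the usual informal signs that the chain has mixed.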
We can extract posterior mean estimates along with credible intervals
using the summary function:
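A minimal sketch of the call, storing the result as `ms` (the name used by the comparison plot further on) and showing the first few rows.

```r
library(mfa)

# Setup repeated so that this snippet runs on its own
# (C and G argument names assumed from ?create_synthetic)
set.seed(123L)
synth <- create_synthetic(C = 100, G = 40)
m <- mfa(synth$X)

# One row per cell: point estimates plus interval bounds
ms <- summary(m)
head(ms)
```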
## # A tibble: 6 × 5
## pseudotime branch branch_certainty pseudotime_lower pseudotime_upper
## <dbl> <fct> <dbl> <dbl> <dbl>
## 1 -1.17 1 1 -1.35 -0.973
## 2 -0.854 1 1 -0.990 -0.611
## 3 -0.366 2 1 -0.464 -0.104
## 4 0.819 1 1 0.462 0.979
## 5 -1.68 1 1 -1.95 -1.55
## 6 -0.926 1 1 -1.14 -0.752
This has five columns:
pseudotime - the MAP pseudotime estimate
branch - the MAP branch estimate
branch_certainty - the proportion of MCMC samples (after burn-in) for which the cell was assigned to the MAP branch
pseudotime_lower and pseudotime_upper - the lower and upper bounds of the 95% highest-probability-density posterior credible interval
We can compare the inferred pseudotimes to the true values:
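# Inferred versus true pseudotime for each cell, coloured by true branch.
# qplot() is deprecated as of ggplot2 3.4.0; ggplot() + geom_point() is the current equivalent.
qplot(synth$pst, ms$pseudotime, color = factor(synth$branch)) +
  xlab('True pseudotime') + ylab('Inferred pseudotime') +
  scale_color_discrete(name = 'True\nbranch')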
And we can equivalently plot the PCA representation coloured by MAP branch:
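A sketch of that plot, assuming the `branch` column of the summary holds the MAP branch per cell in the same order as the rows of the expression matrix. The names `df_map` and `inferred_branch` are illustrative.

```r
library(mfa)
library(dplyr)
library(ggplot2)

# Setup repeated so that this snippet runs on its own
# (C and G argument names assumed from ?create_synthetic)
set.seed(123L)
synth <- create_synthetic(C = 100, G = 40)
m <- mfa(synth$X)
ms <- summary(m)

# PCA embedding of the cells with the MAP branch attached
df_map <- as_tibble(prcomp(synth$X)$x[, 1:2]) %>%
  mutate(inferred_branch = ms$branch)

p_map <- ggplot(df_map, aes(x = PC1, y = PC2, color = inferred_branch)) +
  geom_point() +
  scale_color_discrete(name = "Inferred\nbranch")
print(p_map)
```

Branch labels are arbitrary, so inferred branch 1 need not correspond to true branch 0; only the grouping of cells is comparable.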
A unique part of this model is that through an ARD-like prior
structure on the loading matrices we can automatically infer which genes
are involved in the bifurcation process. For a quick-and-dirty look we
can use the plot_chi function, where larger values of
inverse-chi imply the gene is associated with the bifurcation:
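A sketch of the call, assuming `plot_chi` needs only the fitted object. The plot object name is illustrative.

```r
library(mfa)

# Setup repeated so that this snippet runs on its own
# (C and G argument names assumed from ?create_synthetic)
set.seed(123L)
synth <- create_synthetic(C = 100, G = 40)
m <- mfa(synth$X)

# Per-gene inverse-chi values; larger values point to bifurcation-associated genes
p_chi <- plot_chi(m)
print(p_chi)
```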
To calculate the MAP values for chi we can call the
calculate_chi function, which returns a data_frame
with the feature names and values:
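A sketch of the call, again assuming the fitted object is the only required argument. The name `posterior_chi_df` is illustrative.

```r
library(mfa)

# Setup repeated so that this snippet runs on its own
# (C and G argument names assumed from ?create_synthetic)
set.seed(123L)
synth <- create_synthetic(C = 100, G = 40)
m <- mfa(synth$X)

# One row per feature with its MAP chi value
posterior_chi_df <- calculate_chi(m)
head(posterior_chi_df)
```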
## # A tibble: 6 × 2
## feature chi_map
## <chr> <dbl>
## 1 feature1 1.54
## 2 feature2 0.619
## 3 feature3 1.01
## 4 feature4 0.530
## 5 feature5 1.38
## 6 feature6 0.677
mfa class
A call to mfa(...) returns an mfa object
that contains all the information about the dataset and the MCMC
inference performed. Note that it does not contain a copy of
the original data. We can see the structure by calling str
on an mfa object:
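A minimal sketch; `max.level = 1` is a standard `str` argument that keeps the listing to the top-level elements and avoids expanding every trace.

```r
library(mfa)

# Setup repeated so that this snippet runs on its own
# (C and G argument names assumed from ?create_synthetic)
set.seed(123L)
synth <- create_synthetic(C = 100, G = 40)
m <- mfa(synth$X)

# Top-level structure only; the traces element is itself a list
str(m, max.level = 1)
```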
## List of 10
## $ traces :List of 10
## $ iter : num 2000
## $ thin : num 1
## $ burn : num 1000
## $ b : num 2
## $ collapse : logi FALSE
## $ N : int 100
## $ G : int 40
## $ feature_names: chr [1:40] "feature1" "feature2" "feature3" "feature4" ...
## $ cell_names : chr [1:100] "cell1" "cell2" "cell3" "cell4" ...
## - attr(*, "class")= chr "mfa"
This contains the following slots:
traces - the raw MCMC traces (discussed in following section)
iter - the number of MCMC iterations
thin - the thinning of the MCMC chain
burn - the number of MCMC iterations thrown away as burn-in
b - the number of branches modelled
collapse - whether collapsed Gibbs sampling was implemented
N - the number of cells
G - the number of features (e.g. genes)
feature_names - the names of the features (e.g. genes)
cell_names - the names of the cells
MCMC traces can be accessed through the traces slot of
an mfa object. This gives a list with an element for each
variable, along with the log-likelihood:
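A minimal sketch that lists the available traces by name.

```r
library(mfa)

# Setup repeated so that this snippet runs on its own
# (C and G argument names assumed from ?create_synthetic)
set.seed(123L)
synth <- create_synthetic(C = 100, G = 40)
m <- mfa(synth$X)

# Names of the per-variable traces stored in the fit
trace_names <- names(m$traces)
print(trace_names)
```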
## [1] "tau_trace" "gamma_trace" "pst_trace"
## [4] "theta_trace" "lambda_theta_trace" "chi_trace"
## [7] "eta_trace" "k_trace" "c_trace"
## [10] "lp_trace"
For non-branch-specific variables this is simply a matrix. For example, the trace for the variable τ (tau_trace) is just an iteration-by-gene matrix:
## num [1:1000, 1:40] 13.1 11.9 11.3 13.5 14.1 ...
## - attr(*, "dimnames")=List of 2
## ..$ : NULL
## ..$ : chr [1:40] "tau[1]" "tau[2]" "tau[3]" "tau[4]" ...
We can easily get the posterior mean by calling colMeans. More
elaborate posterior density estimation can be performed using the
MCMCglmm package, such as posterior.mode(...) for MAP estimation
(though in practice this is often similar to the posterior mean). We
can estimate posterior intervals using the HPDinterval(...) function
from the coda package (note that traces must be converted to coda
objects, e.g. with mcmc(...), before calling either of these).
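The summaries described above can be sketched on a simulated matrix with the same layout as an iteration-by-gene trace, so that the snippet needs only coda. The gamma draws are a stand-in, not model output; with a real fit, `m$traces$tau_trace` would take their place.

```r
library(coda)  # provides mcmc() and HPDinterval()

set.seed(1L)

# Stand-in for m$traces$tau_trace: 1000 iterations (rows) by 40 genes (columns)
tau_trace <- matrix(rgamma(1000 * 40, shape = 20, rate = 2), nrow = 1000,
                    dimnames = list(NULL, paste0("tau[", 1:40, "]")))

# Posterior mean per gene
tau_mean <- colMeans(tau_trace)

# Convert to a coda object, then take 95% HPD intervals (one row per gene)
tau_mcmc <- mcmc(tau_trace)
tau_hpd <- HPDinterval(tau_mcmc, prob = 0.95)

head(cbind(mean = tau_mean, tau_hpd))

# If MCMCglmm is installed, posterior modes are available via
# MCMCglmm::posterior.mode(tau_mcmc)
```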
Some variables are branch dependent, meaning the traces returned are
arrays (or tensors in fashionable speak) that have dimension
iteration x gene x branch. An example is the trace for the k variable (k_trace):
## num [1:1000, 1:40, 1:2] -0.984 -1.06 -1.072 -1.01 -1.081 ...
To get posterior means (or modes, or intervals) we then need to use
the apply function to iterate over the branches. To find
the posterior means of k, we then call
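A sketch of the pattern in base R, using a simulated iteration x gene x branch array as a stand-in for the real trace. With a fitted object the equivalent call would be `apply(m$traces$k_trace, 3, colMeans)`.

```r
set.seed(1L)

n_iter <- 1000
n_gene <- 40
n_branch <- 2

# Stand-in for m$traces$k_trace (simulated draws, not model output)
k_trace <- array(rnorm(n_iter * n_gene * n_branch),
                 dim = c(n_iter, n_gene, n_branch))

# For each branch (third margin), average over iterations for every gene
k_mean <- apply(k_trace, 3, colMeans)
str(k_mean)

# The same pattern works for other summaries, e.g. posterior medians
k_median <- apply(k_trace, 3, function(x) apply(x, 2, median))
```

Iterating over the third margin hands `colMeans` one iteration-by-gene matrix per branch, which is why the result has one column per branch.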
## num [1:40, 1:2] -0.96 0.726 0.953 -0.97 0.967 ...
This returns a gene-by-branch matrix of posterior estimates.
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] dplyr_1.1.4 ggplot2_3.5.1 mfa_1.29.0 BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] gtable_0.3.6 tensorA_0.36.2.1 xfun_0.49
## [4] bslib_0.8.0 Biobase_2.67.0 GGally_2.2.1
## [7] lattice_0.22-6 vctrs_0.6.5 tools_4.4.2
## [10] generics_0.1.3 parallel_4.4.2 tibble_3.2.1
## [13] fansi_1.0.6 pkgconfig_2.0.3 Matrix_1.7-1
## [16] RColorBrewer_1.1-3 ggmcmc_1.5.1.1 lifecycle_1.0.4
## [19] cubature_2.1.1 compiler_4.4.2 farver_2.1.2
## [22] MatrixModels_0.5-3 mcmc_0.9-8 munsell_0.5.1
## [25] codetools_0.2-20 SparseM_1.84-2 quantreg_5.99.1
## [28] htmltools_0.5.8.1 sys_3.4.3 buildtools_1.0.0
## [31] sass_0.4.9 yaml_2.3.10 pillar_1.9.0
## [34] jquerylib_0.1.4 tidyr_1.3.1 MASS_7.3-61
## [37] cachem_1.1.0 nlme_3.1-166 ggstats_0.7.0
## [40] tidyselect_1.2.1 digest_0.6.37 purrr_1.0.2
## [43] maketools_1.3.1 labeling_0.4.3 splines_4.4.2
## [46] fastmap_1.2.0 grid_4.4.2 colorspace_2.1-1
## [49] cli_3.6.3 magrittr_2.0.3 survival_3.7-0
## [52] utf8_1.2.4 ape_5.8 corpcor_1.6.10
## [55] withr_3.0.2 scales_1.3.0 MCMCglmm_2.36
## [58] rmarkdown_2.29 coda_0.19-4.1 evaluate_1.0.1
## [61] knitr_1.49 rlang_1.1.4 MCMCpack_1.7-1
## [64] Rcpp_1.0.13-1 glue_1.8.0 BiocManager_1.30.25
## [67] BiocGenerics_0.53.3 jsonlite_1.8.9 R6_2.5.1
## [70] plyr_1.8.9