This vignette contains a detailed tutorial on how to train a MOFA model using R. A concise template script can be found here. Many more examples of applying MOFA to various multi-omics data sets can be found here.
MOFA (like factor analysis models in general) is useful for uncovering variation in complex data sets that contain multiple sources of heterogeneity. This requires a relatively large sample size (at least ~15 samples). In addition, MOFA needs the multi-modal measurements to be derived from the same samples. It is fine if some samples are missing a data modality, but there has to be a substantial degree of matched measurements.
Proper normalisation of the data is critical. The model can handle three types of data: continuous (modelled with a Gaussian likelihood), small counts (modelled with a Poisson likelihood) and binary measurements (modelled with a Bernoulli likelihood). Because non-Gaussian likelihoods give suboptimal results, we recommend applying data transformations to obtain continuous measurements. For example, for count-based data such as RNA-seq or ATAC-seq we recommend size factor normalisation followed by variance stabilisation (e.g. a log transformation).
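As a rough illustration of the normalisation recommended above, the following base R sketch applies a simple library-size normalisation followed by a log transformation. The toy matrix and the particular size factor definition (library size divided by the mean library size) are assumptions made for this example; dedicated packages implement more refined estimators.

```r
# Toy count matrix: 4 features (rows) x 3 samples (columns).
counts <- matrix(
  c( 10,  20,  40,
      0,   5,  10,
    100, 200, 400,
      3,   6,  12),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("gene_", 1:4), paste0("sample_", 1:3))
)

# Size factors: library size of each sample relative to the mean library size.
lib_size <- colSums(counts)
size_factors <- lib_size / mean(lib_size)

# Divide each sample (column) by its size factor, then log-transform
# with a pseudocount of 1 so that zero counts remain finite.
norm_counts <- sweep(counts, 2, size_factors, "/")
log_norm <- log2(norm_counts + 1)
```

The resulting log_norm matrix is continuous and could be used as a view with a Gaussian likelihood.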
It is strongly recommended that you select highly variable features (HVGs) per assay before fitting the model. This ensures faster training and a more robust inference procedure. Also, for data modalities that have very different dimensionalities we suggest a stronger feature selection for the bigger views, with the aim of reducing the feature imbalance between data modalities.
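One simple way to carry out the feature selection described above is to rank features by their variance across samples and keep the top ones. The sketch below does this in base R; select_top_variable is a hypothetical helper written for this example (it is not part of MOFA2), and ranking by raw variance is only one of several possible criteria.

```r
set.seed(1)

# Toy normalised matrix: 500 features (rows) x 50 samples (columns).
mat <- matrix(
  rnorm(500 * 50), nrow = 500,
  dimnames = list(paste0("feature_", 1:500), paste0("sample_", 1:50))
)

# Hypothetical helper: keep the n_top features with the largest variance,
# preserving the original row order of the retained features.
select_top_variable <- function(x, n_top) {
  feature_var <- apply(x, 1, var, na.rm = TRUE)
  keep <- order(feature_var, decreasing = TRUE)[seq_len(min(n_top, nrow(x)))]
  x[sort(keep), , drop = FALSE]
}

mat_hvf <- select_top_variable(mat, n_top = 100)
dim(mat_hvf)
```

For views with very different numbers of features, a smaller n_top could be chosen for the larger view to reduce the imbalance.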
To create a MOFA object you need to specify three dimensions: samples, features and view(s). Optionally, a group can also be specified for each sample (no group structure by default). MOFA objects can be created from a wide range of input formats, including:
A list of matrices, where each entry corresponds to one view. Samples are stored in columns and features in rows.
Let’s simulate some data to start with:
data <- make_example_data(
n_views = 2,
n_samples = 200,
n_features = 1000,
n_factors = 10
)[[1]]
lapply(data,dim)
## $view_1
## [1] 1000 200
##
## $view_2
## [1] 1000 200
Create the MOFA object:
Plot the data overview
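The two steps named above might look like the following sketch. It regenerates the simulated data so that it is self-contained, and it is wrapped in a requireNamespace() check (an addition for this example) so that it only runs when the MOFA2 package is installed.

```r
if (requireNamespace("MOFA2", quietly = TRUE)) {
  library(MOFA2)

  # Simulated data: a list of matrices, one per view (features x samples).
  data <- make_example_data(
    n_views = 2,
    n_samples = 200,
    n_features = 1000,
    n_factors = 10
  )[[1]]

  # Create the (untrained) MOFA object from the list of matrices.
  MOFAobject <- create_mofa(data)

  # Visualise which samples have measurements in which view.
  plot_data_overview(MOFAobject)
}
```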
In case you are using the multi-group functionality, the groups can be specified using the groups argument as a vector with the group ID for each sample. Keep in mind that the multi-group functionality is a rather advanced option that we discourage for beginners. For more details on how the multi-group inference works, read the FAQ section and check this vignette.
N = ncol(data[[1]])
groups = c(rep("A",N/2), rep("B",N/2))
MOFAobject <- create_mofa(data, groups=groups)
Plot the data overview
A long data.frame with columns sample, feature, view, group (optional), value might be the best format for complex data sets with multiple omics and potentially multiple groups of data. Also, there is no need to add rows that correspond to missing data:
## sample feature view value
## <char> <char> <char> <num>
## 1: sample_0_group_1 feature_0_view_0 view_0 2.08
## 2: sample_1_group_1 feature_0_view_0 view_0 0.01
## 3: sample_2_group_1 feature_0_view_0 view_0 -0.11
## 4: sample_3_group_1 feature_0_view_0 view_0 -0.82
## 5: sample_4_group_1 feature_0_view_0 view_0 -1.13
## 6: sample_5_group_1 feature_0_view_0 view_0 -0.25
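A long table like the one above can be assembled from per-view matrices with a few lines of base R. In this sketch the toy matrices, the view names, and the to_long helper are all hypothetical; the point is only to show the column layout and that rows for missing values are simply dropped.

```r
# Two toy views stored as feature x sample matrices; one value is missing.
views <- list(
  rna = matrix(
    c(1.2, NA, 0.3, -0.7), nrow = 2,
    dimnames = list(c("gene_a", "gene_b"), c("s1", "s2"))
  ),
  prot = matrix(
    c(0.5, 0.1, -1.0, 2.2), nrow = 2,
    dimnames = list(c("prot_a", "prot_b"), c("s1", "s2"))
  )
)

# Hypothetical helper: reshape one matrix into long format.
# as.vector() unrolls the matrix column by column, so samples are
# repeated "each" and features are repeated "times".
to_long <- function(mat, view_name) {
  data.frame(
    sample  = rep(colnames(mat), each = nrow(mat)),
    feature = rep(rownames(mat), times = ncol(mat)),
    view    = view_name,
    value   = as.vector(mat),
    stringsAsFactors = FALSE
  )
}

long_df <- do.call(rbind, Map(to_long, views, names(views)))

# Rows that correspond to missing measurements are not needed.
long_df <- long_df[!is.na(long_df$value), ]
rownames(long_df) <- NULL
```

A group column could be added in the same way if the samples belong to several groups.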
Create the MOFA object
## Creating MOFA object from a data.frame...
## Untrained MOFA model with the following characteristics:
## Number of views: 2
## Views names: view_0 view_1
## Number of features (per view): 1000 1000
## Number of groups: 1
## Groups names: single_group
## Number of samples (per group): 100
##
Plot data overview
Default data options:
## $scale_views
## [1] FALSE
##
## $scale_groups
## [1] FALSE
##
## $center_groups
## [1] TRUE
##
## $use_float32
## [1] TRUE
##
## $views
## [1] "view_0" "view_1"
##
## $groups
## [1] "single_group"
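In the model options, the ARD prior on the factors (ard_factors) is enabled by default if using multiple groups, and the ARD prior on the weights (ard_weights) is enabled by default if using multiple views. Only change the default model options if you are familiar with the underlying mathematical model.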
## $likelihoods
## view_0 view_1
## "gaussian" "gaussian"
##
## $num_factors
## [1] 10
##
## $spikeslab_factors
## [1] FALSE
##
## $spikeslab_weights
## [1] FALSE
##
## $ard_factors
## [1] FALSE
##
## $ard_weights
## [1] TRUE
Default training options:
## $maxiter
## [1] 1000
##
## $convergence_mode
## [1] "fast"
##
## $drop_factor_threshold
## [1] -1
##
## $verbose
## [1] FALSE
##
## $startELBO
## [1] 1
##
## $freqELBO
## [1] 5
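The option lists printed above are the ones passed to prepare_mofa in the next step. A minimal sketch of how they can be retrieved is shown here; it rebuilds a small MOFA object so that it is self-contained, and the requireNamespace() guard is an addition for this example so the code only runs when MOFA2 is installed.

```r
if (requireNamespace("MOFA2", quietly = TRUE)) {
  library(MOFA2)

  data <- make_example_data(
    n_views = 2,
    n_samples = 200,
    n_features = 1000,
    n_factors = 10
  )[[1]]
  MOFAobject <- create_mofa(data)

  # Retrieve the three sets of default options as named lists.
  data_opts  <- get_default_data_options(MOFAobject)
  model_opts <- get_default_model_options(MOFAobject)
  train_opts <- get_default_training_options(MOFAobject)

  # Individual entries can be overridden before calling prepare_mofa(),
  # for example the number of factors.
  model_opts$num_factors <- 10

  str(train_opts)
}
```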
Prepare the MOFA object
MOFAobject <- prepare_mofa(
object = MOFAobject,
data_options = data_opts,
model_options = model_opts,
training_options = train_opts
)
Train the MOFA model. Remember that in this step the MOFA2 R package connects with the mofapy2 Python package using reticulate. This is the source of most problems when running MOFA. See our FAQ section if you have issues. The output is saved in the file specified as outfile. If none is specified, the output is saved in a temporary location.
outfile = file.path(tempdir(),"model.hdf5")
MOFAobject.trained <- run_mofa(MOFAobject, outfile, use_basilisk=TRUE)
## Warning: Output file /tmp/RtmpPX9kPf/model.hdf5 already exists, it will be replaced
## Connecting to the mofapy2 package using basilisk.
## Set 'use_basilisk' to FALSE if you prefer to manually set the python binary using 'reticulate'.
If everything is successful, you should observe an output analogous to the following:
######################################
## Training the model with seed 1 ##
######################################
Iteration 1: time=0.03, ELBO=-52650.68, deltaELBO=837116.802 (94.082647669%), Factors=10
(...)
Iteration 9: time=0.04, ELBO=-50114.43, deltaELBO=23.907 (0.002686924%), Factors=10
#######################
## Training finished ##
#######################
Saving model in /tmp/RtmpPX9kPf/model.hdf5...
This finishes the tutorial on how to train a MOFA object from R. To continue with the downstream analysis, follow this tutorial.
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] data.table_1.16.2 pheatmap_1.0.12 lubridate_1.9.3 forcats_1.0.0
## [5] stringr_1.5.1 dplyr_1.1.4 purrr_1.0.2 readr_2.1.5
## [9] tidyr_1.3.1 tibble_3.2.1 ggplot2_3.5.1 tidyverse_2.0.0
## [13] MOFA2_1.17.0 BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.2.1 farver_2.1.2 filelock_1.0.3
## [4] fastmap_1.2.0 GGally_2.2.1 digest_0.6.37
## [7] timechange_0.3.0 lifecycle_1.0.4 magrittr_2.0.3
## [10] compiler_4.4.2 rlang_1.1.4 sass_0.4.9
## [13] tools_4.4.2 utf8_1.2.4 yaml_2.3.10
## [16] corrplot_0.95 ggsignif_0.6.4 knitr_1.49
## [19] S4Arrays_1.7.1 labeling_0.4.3 reticulate_1.40.0
## [22] DelayedArray_0.33.2 plyr_1.8.9 RColorBrewer_1.1-3
## [25] abind_1.4-8 HDF5Array_1.35.1 Rtsne_0.17
## [28] withr_3.0.2 BiocGenerics_0.53.3 sys_3.4.3
## [31] grid_4.4.2 stats4_4.4.2 fansi_1.0.6
## [34] ggpubr_0.6.0 colorspace_2.1-1 Rhdf5lib_1.29.0
## [37] scales_1.3.0 cli_3.6.3 mvtnorm_1.3-2
## [40] rmarkdown_2.29 crayon_1.5.3 generics_0.1.3
## [43] reshape2_1.4.4 tzdb_0.4.0 cachem_1.1.0
## [46] rhdf5_2.51.0 splines_4.4.2 zlibbioc_1.52.0
## [49] parallel_4.4.2 BiocManager_1.30.25 XVector_0.47.0
## [52] matrixStats_1.4.1 basilisk_1.19.0 vctrs_0.6.5
## [55] Matrix_1.7-1 carData_3.0-5 jsonlite_1.8.9
## [58] dir.expiry_1.15.0 car_3.1-3 IRanges_2.41.1
## [61] hms_1.1.3 S4Vectors_0.45.2 rstatix_0.7.2
## [64] ggrepel_0.9.6 irlba_2.3.5.1 Formula_1.2-5
## [67] maketools_1.3.1 jquerylib_0.1.4 glue_1.8.0
## [70] codetools_0.2-20 ggstats_0.7.0 cowplot_1.1.3
## [73] uwot_0.2.2 RcppAnnoy_0.0.22 stringi_1.8.4
## [76] gtable_0.3.6 munsell_0.5.1 pillar_1.9.0
## [79] basilisk.utils_1.19.0 htmltools_0.5.8.1 rhdf5filters_1.19.0
## [82] R6_2.5.1 evaluate_1.0.1 lattice_0.22-6
## [85] backports_1.5.0 png_0.1-8 broom_1.0.7
## [88] bslib_0.8.0 Rcpp_1.0.13-1 nlme_3.1-166
## [91] SparseArray_1.7.2 mgcv_1.9-1 xfun_0.49
## [94] MatrixGenerics_1.19.0 buildtools_1.0.0 pkgconfig_2.0.3