MMUPHin
is an R package implementing meta-analysis
methods for microbial community profiles. It has interfaces for: a)
covariate-controlled batch and study effect adjustment, b) meta-analytic
differential abundance testing, and meta-analytic discovery of c)
discrete (cluster-based) or d) continuous unsupervised population
structure.
Overall, MMUPHin
enables the normalization and
combination of multiple microbial community studies. It can then help in
identifying microbes, genes, or pathways that are differential with
respect to combined phenotypes. Finally, it can find clusters or
gradients of sample types that reproduce consistently among studies.
MMUPHin is a Bioconductor package and can be installed via the following command.
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("MMUPHin")
This vignette is intended to provide working examples for all four
functionalities of MMUPHin
.
MMUPHin can be loaded into R with the first library()
command shown below. We also load some utility packages for data
manipulation and visualization.
As input, MMUPHin
requires a properly formatted
collection of microbial community studies, with both feature abundances
and accompanying metadata. Here we use the five published colorectal
cancer (CRC) stool metagenomic studies, incorporated in Thomas et al. (2019). Data for the studies are
already conveniently packaged and accessible through the Biocondcutor
package curatedMetagenomicData,
though additional wranglings are needed to format input for
MMUPHin
.
Importantly, MMUPHin
asks that feature abundances be
provided as a feature-by-sample matrix, and the metadata be provided as
a data frame. The two objects shoud agree on sample IDs, that is,
colname
of the feature abundance matrix and
rowname
of the metadata data frame must agree. Many popular
’omic data classes in R already enforce this standard, such as
ExpressionSet
from Biobase,
or phyloseq
from phyloseq.
To minimize users’ efforts in loading data to run the examples, we have properly formatted the five studies for easy access. The feature abundances and metadata can be loaded with the following code chunk. For the interested user, the commented out scripts were used for accessing data directly from curatedMetagenomicData and formatting. It might be worthwhile to read through these as they perform many of the common tasks for preprocessing microbial feature abundance data in R, including sample/feature subsetting, normalization, filtering, etc.
## FengQ_2015.metaphlan_bugs_list.stool:SID31004
## s__Faecalibacterium_prausnitzii 0.11110668
## s__Streptococcus_salivarius 0.09660736
## s__Ruminococcus_sp_5_1_39BFAA 0.09115385
## s__Subdoligranulum_unclassified 0.05806767
## s__Bacteroides_stercoris 0.05685503
## subjectID body_site
## FengQ_2015.metaphlan_bugs_list.stool:SID31004 SID31004 stool
## study_condition
## FengQ_2015.metaphlan_bugs_list.stool:SID31004 CRC
## disease
## FengQ_2015.metaphlan_bugs_list.stool:SID31004 CRC;fatty_liver;hypertension
## age_category
## FengQ_2015.metaphlan_bugs_list.stool:SID31004 adult
##
## FengQ_2015.metaphlan_bugs_list.stool
## 107
## HanniganGD_2017.metaphlan_bugs_list.stool
## 55
## VogtmannE_2016.metaphlan_bugs_list.stool
## 104
## YuJ_2015.metaphlan_bugs_list.stool
## 128
## ZellerG_2014.metaphlan_bugs_list.stool
## 157
adjust_batch
adjust_batch
aims to correct for technical study/batch
effects in microbial feature abundances. It takes as input the
feature-by-sample abundance matrix, and performs batch effect adjustment
given provided batch and optional covariate variables. It returns the
batch-adjusted abundance matrix. Check help(adjust_batch)
for additional details and options. This figure from the paper (figure
2b) shows the desired effect of the batch adjustment: batch effects are
removed, making the effect of the biological variable of interest easier
to detect.
Here we use adjust_batch
to correct for differences in
the five studies, while controlling for the effect of CRC versus control
(variable study_condition
in CRC_meta
).
adjust_batch
returns a list of multiple outputs. The
part we are most interested in now is the adjusted abundance table,
which we assign to CRC_abd_adj
.
fit_adjust_batch <- adjust_batch(feature_abd = CRC_abd,
batch = "studyID",
covariates = "study_condition",
data = CRC_meta,
control = list(verbose = FALSE))
CRC_abd_adj <- fit_adjust_batch$feature_abd_adj
One way to evaluate the effect of batch adjustment is to assess the
total variability in microbial profiles attributable to study
differences, before and after adjustment. This is commonly known as a
PERMANOVA test (Tang, Chen, and Alekseyenko
2016), and can be performed with the adonis
function
in vegan. The
inputs to adonis()
are indices of microbial compositional
dissimilarity.
## This is vegan 2.6-8
D_before <- vegdist(t(CRC_abd))
D_after <- vegdist(t(CRC_abd_adj))
set.seed(1)
fit_adonis_before <- adonis2(D_before ~ studyID, data = CRC_meta)
fit_adonis_after <- adonis2(D_after ~ studyID, data = CRC_meta)
print(fit_adonis_before)
## Permutation test for adonis under reduced model
## Permutation: free
## Number of permutations: 999
##
## adonis2(formula = D_before ~ studyID, data = CRC_meta)
## Df SumOfSqs R2 F Pr(>F)
## Model 4 13.36 0.07991 11.855 0.001 ***
## Residual 546 153.82 0.92009
## Total 550 167.18 1.00000
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Permutation test for adonis under reduced model
## Permutation: free
## Number of permutations: 999
##
## adonis2(formula = D_after ~ studyID, data = CRC_meta)
## Df SumOfSqs R2 F Pr(>F)
## Model 4 4.939 0.03044 4.2855 0.001 ***
## Residual 546 157.298 0.96956
## Total 550 162.237 1.00000
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We can see that, before study effect adjustment, study differences can expalin a total of NA% of the variability in microbial abundance profiles, whereas after adjustment this was reduced to NA%, though the effect of study is significant in both cases.
lm_meta
One of the most common meta-analysis goals is to combine association
effects across batches/studies to identify consistent overall effects.
lm_meta
provides a straightforward interface to this task,
by first performing regression analysis in individual batches/studies
using the well-validated Maaslin2
package, and then aggregating results with established fixed/mixed
effect models, realized via the vegan package.
Here, we use lm_meta
to test for consistently differential
abundant species between CRC and control samples across the five
studies, while controlling for demographic covariates (gender, age,
BMI). Again, lm_meta returns a list of more than one components.
meta_fits provides the final meta-analytical testing results. See
help(lm_meta) for the meaning of other components.
fit_lm_meta <- lm_meta(feature_abd = CRC_abd_adj,
batch = "studyID",
exposure = "study_condition",
covariates = c("gender", "age", "BMI"),
data = CRC_meta,
control = list(verbose = FALSE))
## Warning in lm_meta(feature_abd = CRC_abd_adj, batch = "studyID", exposure = "study_condition", : Covariate gender is missing or has only one non-missing value in the following batches; will be excluded from model for these batches:
## HanniganGD_2017.metaphlan_bugs_list.stool, YuJ_2015.metaphlan_bugs_list.stool
## Warning in lm_meta(feature_abd = CRC_abd_adj, batch = "studyID", exposure = "study_condition", : Covariate age is missing or has only one non-missing value in the following batches; will be excluded from model for these batches:
## HanniganGD_2017.metaphlan_bugs_list.stool, YuJ_2015.metaphlan_bugs_list.stool
## Warning in lm_meta(feature_abd = CRC_abd_adj, batch = "studyID", exposure = "study_condition", : Covariate BMI is missing or has only one non-missing value in the following batches; will be excluded from model for these batches:
## HanniganGD_2017.metaphlan_bugs_list.stool, YuJ_2015.metaphlan_bugs_list.stool
You’ll note that this command produces a few error messages. This is because for one of the studies the specified covariates were not measured. As the message states, these covariates are not included when analyzing data from that study. We also note that batch correction is standard but not necessary for the analysis here.
We can visualize the significant (FDR q < 0.05) species associated with CRC in these studies/samples. Comparing them with Figure 1b of Thomas et al. (2019), we can see that many of the significant species do agree.
discrete_discover
Clustering analysis of microbial profiles can help identify
meaningful discrete population subgroups (Ravel
et al. 2011), but must be carried out carefully with validations
to ensure that the identified structures are consistent (Koren et al. 2013).
discrete_discover
provides the functionality to use
prediction strength (Tibshirani and Walther
2005) to evaluate discrete clustering structures within
individual microbial studies, as well as a “generalized prediction
strength” to evaluate their reproducibility in other studies. These
jointly provide meta-analytical evidence for (or against) identifying
discrete population structures in microbial profiles. Check
help(discrete_discover)
to see more details on the method
and additional options. We can contrast discrete structure (clusters)
with continuous structure (which will be overviewed in the next
section):
The gut microbiome is known to form gradients rather than discrete
clusters (Koren et al. 2013). Here we use
discrete_discover
to evaluate clustering structures among
control samples in the five stool studies. It’s worth noting that this
function takes a distance/dissimilarity matrix as an input, rather than
the abundance table.
control_meta <- subset(CRC_meta, study_condition == "control")
control_abd_adj <- CRC_abd_adj[, rownames(control_meta)]
D_control <- vegdist(t(control_abd_adj))
fit_discrete <- discrete_discover(D = D_control,
batch = "studyID",
data = control_meta,
control = list(k_max = 8,
verbose = FALSE))
The internal_mean
and internal_sd
components of fit_discrete
are matrices that provide
internal evaluation statistics (prediction strength) for each batch
(column) and evaluated number of clusters (row). Similarly,
external_mean
and external_sd
provide external
evaluation statistics ( generalized prediction strenght). Evidence for
existence of discrete structures would be a “peaking” of the mean
statistics at a particular cluster number. Here, for easier examination
of such a pattern, we visualize the results for the largest study,
ZellerG_2014. Note that visualization for all studies are by default
automatically generated and saved to the output file
“diagnostic_discrete.pdf”.
study_id = "ZellerG_2014.metaphlan_bugs_list.stool"
internal <- data.frame(K = 2:8,
statistic = fit_discrete$internal_mean[, study_id],
se = fit_discrete$internal_se[, study_id],
type = "internal")
external <- data.frame(K = 2:8,
statistic = fit_discrete$external_mean[, study_id],
se = fit_discrete$external_se[, study_id],
type = "external")
rbind(internal, external) %>%
ggplot(aes(x = K, y = statistic, color = type)) +
geom_point(position = position_dodge(width = 0.5)) +
geom_line(position = position_dodge(width = 0.5)) +
geom_errorbar(aes(ymin = statistic - se, ymax = statistic + se),
position = position_dodge(width = 0.5), width = 0.5) +
ggtitle("Evaluation of discrete structure in control stool microbiome (ZellerG_2014)")
The decreasing trend for both the internal and external statistics along with number of clusters (K) suggests that discrete structures cannot be well-established. To provide a positive example, we examine the two vaginal microbiome studies provided by curatedMetagenomicData, as the vaginal microbiome is known to have distinct subtypes (Ravel et al. 2011). Again, we pre-formatted these datasets for easy access, but you can recreate them from curatedMetagenomicData with the commented out scripts.
data("vaginal_abd", "vaginal_meta")
D_vaginal <- vegdist(t(vaginal_abd))
fit_discrete_vag <- discrete_discover(D = D_vaginal,
batch = "studyID",
data = vaginal_meta,
control = list(verbose = FALSE,
k_max = 8))
hmp_id = "HMP_2012.metaphlan_bugs_list.vagina"
data.frame(K = 2:8,
statistic = fit_discrete_vag$internal_mean[, hmp_id],
se = fit_discrete_vag$internal_se[, hmp_id]) %>%
ggplot(aes(x = K, y = statistic)) +
geom_point() +
geom_line() +
geom_errorbar(aes(ymin = statistic - se,
ymax = statistic + se),
width = 0.5) +
ggtitle("Evaluation of discrete structure in vaginal microbiome (HMP_2012)")
We can see that for the vaginal microbiome,
discrete_discover
suggests the existence of five clusters.
Here we examine only the internal metrics of HMP_2012 as the other study
(FerrettiP_2018) has only 19 samples.
continuous_discover
Population structure in the microbiome can manifest as gradients
rather than discrete clusters, such as dominant phyla trade-off or
disease-associated dysbiosis. continuous_discover
provide
functionality to identify such structures as well as to validate them
with meta-analysis. We again evaluate these continuous structures in
control samples of the five studies.
Much like adjust_batch and lm_meta, continuous_discover also takes as
input feature-by-sample abundances. The control
argument
offers many tuning parameters and here we set one of them,
var_perc_cutoff
, to 0.5, which asks the method to include
top principal components within each batch that in total explain at
least 50% of the total variability in the batch. See
help(continuous_discover)
for more details on the tuning
parameters and their interpretations.
fit_continuous <- continuous_discover(feature_abd = control_abd_adj,
batch = "studyID",
data = control_meta,
control = list(var_perc_cutoff = 0.5,
verbose = FALSE))
We can visualize the identified continuous structure scores in at least two ways: first, to examine their top contributing microbial features, to get an idea of what the score is characterizing, and second, to overlay the continuous scores with an ordination visualization. Here we perform these visualizations on the first identified continuous score.
loading <- data.frame(feature = rownames(fit_continuous$consensus_loadings),
loading1 = fit_continuous$consensus_loadings[, 1])
loading %>%
arrange(-abs(loading1)) %>%
slice(1:20) %>%
arrange(loading1) %>%
mutate(feature = factor(feature, levels = feature)) %>%
ggplot(aes(x = feature, y = loading1)) +
geom_bar(stat = "identity") +
coord_flip() +
ggtitle("Features with top loadings")
mds <- cmdscale(d = D_control)
colnames(mds) <- c("Axis1", "Axis2")
as.data.frame(mds) %>%
mutate(score1 = fit_continuous$consensus_scores[, 1]) %>%
ggplot(aes(x = Axis1, y = Axis2, color = score1)) +
geom_point() +
coord_fixed()
From ordination we see that the first continuos score indeed represent strong variation across these stool samples. From the top loading features we can see that this score strongly represents a Bacteroidetes (the Bacteroides species) versus Firmicutes (the Ruminococcus species) tradeoff.
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] vegan_2.6-8 lattice_0.22-6 permute_0.9-7 ggplot2_3.5.1
## [5] dplyr_1.1.4 magrittr_2.0.3 MMUPHin_1.21.0 BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] gtable_0.3.6 biglm_0.9-3 xfun_0.49
## [4] bslib_0.8.0 numDeriv_2016.8-1.1 mathjaxr_1.6-0
## [7] vctrs_0.6.5 tools_4.4.2 generics_0.1.3
## [10] stats4_4.4.2 flexmix_2.3-19 parallel_4.4.2
## [13] getopt_1.20.4 tibble_3.2.1 DEoptimR_1.1-3-1
## [16] cluster_2.1.8 pkgconfig_2.0.3 logging_0.10-108
## [19] Matrix_1.7-1 lifecycle_1.0.4 compiler_4.4.2
## [22] farver_2.1.2 stringr_1.5.1 textshaping_0.4.1
## [25] munsell_0.5.1 class_7.3-22 htmltools_0.5.8.1
## [28] sys_3.4.3 buildtools_1.0.0 sass_0.4.9
## [31] hash_2.2.6.3 yaml_2.3.10 pillar_1.10.0
## [34] crayon_1.5.3 jquerylib_0.1.4 tidyr_1.3.1
## [37] prabclus_2.3-4 MASS_7.3-61 diptest_0.77-1
## [40] cachem_1.1.0 fpc_2.2-13 mclust_6.1.1
## [43] metadat_1.2-0 nlme_3.1-166 robustbase_0.99-4-1
## [46] tidyselect_1.2.1 digest_0.6.37 mvtnorm_1.3-2
## [49] stringi_1.8.4 kernlab_0.9-33 purrr_1.0.2
## [52] labeling_0.4.3 maketools_1.3.1 splines_4.4.2
## [55] pcaPP_2.0-5 cowplot_1.1.3 fastmap_1.2.0
## [58] grid_4.4.2 colorspace_2.1-1 cli_3.6.3
## [61] metafor_4.6-0 withr_3.0.2 scales_1.3.0
## [64] rmarkdown_2.29 nnet_7.3-19 igraph_2.1.2
## [67] Maaslin2_1.21.0 modeltools_0.2-23 ragg_1.3.3
## [70] png_0.1-8 pbapply_1.7-2 evaluate_1.0.1
## [73] knitr_1.49 mgcv_1.9-1 rlang_1.1.4
## [76] glue_1.8.0 DBI_1.2.3 optparse_1.7.5
## [79] BiocManager_1.30.25 jsonlite_1.8.9 R6_2.5.1
## [82] systemfonts_1.1.0
adjust_batch
lm_meta
discrete_discover
continuous_discover