MaAsLin 3 is the next generation of MaAsLin (Microbiome Multivariable Associations with Linear Models). This comprehensive R package efficiently determines multivariable associations between clinical metadata and microbial meta-omics features. Relative to MaAsLin 2, MaAsLin 3 introduces the ability to quantify and test for both abundance and prevalence associations while accounting for compositionality. By incorporating generalized linear models, MaAsLin 3 accommodates most modern epidemiological study designs including cross-sectional and longitudinal studies.
If you use the MaAsLin 3 software, please cite our manuscript:
William A. Nickols, Thomas Kuntz, Jiaxian Shen, Sagun Maharjan, Himel Mallick, Eric A. Franzosa, Kelsey N. Thompson, Jacob T. Nearing, Curtis Huttenhower. MaAsLin 3: Refining and extending generalized multivariable linear models for meta-omic association discovery. bioRxiv 2024.12.13.628459; doi: https://doi.org/10.1101/2024.12.13.628459
The user manual can be found here. If using
vignettes, users should start with the
maaslin3_tutorial.Rmd
vignette and then refer to the
maaslin3_manual.Rmd
vignette as necessary. If you have
questions, please direct them to the MaAsLin
Forum
R is a programming language specializing in statistical computing and graphics. You can use R just the same as any other programming languages, but it is most useful for statistical analyses, with well-established packages for common tasks such as linear modeling, ’omic data analysis, machine learning, and visualization.
You can download and install the free R software environment here. Note that you should download the latest release - this will ensure the R version you have is compatible with MaAsLin 3.
RStudio is a freely available IDE (integrated development environment) for R. It is a “wrapper” around R with some additional functionalities that make programming in R a bit easier. Feel free to download RStudio and explore its interface - but it is not required for this tutorial.
If you already have R installed, then it is possible that the R
version you have does not support MaAsLin 3. The easiest way to check
this is to launch R/RStudio, and in the console (“Console” pane in
RStudio), type in the following command (without the >
symbol):
The first line of output message should indicate your current R version. For example:
> sessionInfo()
R version 4.4.1 (2024-06-14)
For MaAsLin 3 to install, you will need R >= 4.4. If your version
is older than that, please refer to section Installing R for the first
time to download the latest R. Note that RStudio users will need to
have RStudio “switch” to this later version once it is installed. This
should happen automatically for Windows and Mac OS users when you
relaunch RStudio. For Linux users you might need to bind the correct R
executable. Either way, once you have the correct version installed,
launch the software and use sessionInfo()
to make sure that
you indeed have R >= 4.4.
The latest version of MaAsLin 3 can be installed from BiocManager.
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("biobakery/maaslin3")
To compile vignettes yourself, specify
dependencies = TRUE
.
To run MaAsLin 3, the user must supply a table of per-sample feature abundances (with zeros still included), a table of per-sample metadata, and a model specifying how the metadata should relate to the feature prevalence (how likely the feature is to be present or absent) and abundance (how much of the feature is there if it’s there). MaAsLin 3 will return a table of associations including an effect size and p-value for each feature-metadatum association and a folder of visuals including a summary plot and diagnostic plots for significant associations.
MaAsLin3 requires two input files.
The data file can contain samples not included in the metadata file (along with the reverse case). For both cases, those samples not included in both files will be removed from the analysis. Also, the samples do not need to be in the same order in the two files.
Samples with NA values for the metadata will be excluded when fitting the models. It is recommended to drop these samples in advance under a missing-at-random assumption or use multiple imputation to run the complete dataset.
Example input files can be found in the inst/extdata
folder of the MaAsLin 3 source. The files provided were generated from
the Human Microbiome Project 2 (HMP2) data which can be downloaded from
https://ibdmdb.org/.
HMP2_taxonomy.tsv
: a tab-delimited file with samples as
rows and species as columns. It is a subset of the full HMP2 taxonomy
that includes just some of the the species abundances.HMP2_metadata.tsv
: a tab-delimited file with samples as
rows and metadata as columns. It is a subset of the full HMP2 metadata
that includes just some of the fields.# Read abundance table
taxa_table_name <- system.file("extdata", "HMP2_taxonomy.tsv",
package = "maaslin3")
taxa_table <- read.csv(taxa_table_name, sep = '\t')
# Read metadata table
metadata_name <- system.file("extdata", "HMP2_metadata.tsv",
package = "maaslin3")
metadata <- read.csv(metadata_name, sep = '\t')
# Factor the categorical variables to test IBD against healthy controls
metadata$diagnosis <-
factor(metadata$diagnosis, levels = c('nonIBD', 'UC', 'CD'))
metadata$dysbiosis_state <-
factor(metadata$dysbiosis_state, levels =
c('none', 'dysbiosis_UC', 'dysbiosis_CD'))
metadata$antibiotics <-
factor(metadata$antibiotics, levels = c('No', 'Yes'))
taxa_table[1:5, 1:5]
## Phocaeicola_vulgatus Faecalibacterium_prausnitzii
## CSM5FZ3N_P 0.4265226 0.060255109
## CSM5FZ3R_P 0.5369584 0.007396904
## CSM5FZ3T_P 0.5911821 0.000000000
## CSM5FZ3V_P 0.2661378 0.029680329
## CSM5FZ3X_P 0.6601039 0.003596740
## Bacteroides_uniformis Prevotella_copri_clade_A Bacteroides_stercoris
## CSM5FZ3N_P 0.2692411314 0.000000e+00 0.000000000
## CSM5FZ3R_P 0.2526048469 0.000000e+00 0.008390958
## CSM5FZ3T_P 0.0000000000 0.000000e+00 0.000000000
## CSM5FZ3V_P 0.4004265260 0.000000e+00 0.000000000
## CSM5FZ3X_P 0.0008804283 1.308102e-05 0.001335669
## participant_id site_name week_num reads diagnosis
## CSM5FZ3N_P C3001 Cedars-Sinai 0 9961743 CD
## CSM5FZ3R_P C3001 Cedars-Sinai 2 16456391 CD
## CSM5FZ3T_P C3002 Cedars-Sinai 0 10511448 CD
## CSM5FZ3V_P C3001 Cedars-Sinai 6 17808965 CD
## CSM5FZ3X_P C3002 Cedars-Sinai 2 13160893 CD
MaAsLin 3 is run by specifying the abundance table
(input_data
), the metadata table
(input_metadata
), the output directory
(output
), and a model. The model can come from a formula or
vectors of terms. In either case, variable names should not have spaces
or unusual characters.
formula
parameter should be set
to any formula that satisfies the lme4
specifications:
fixed effects, random effects (section
4.2), interaction terms (section 4.3),
polynomial terms, and more can all be included. If categorical variables
are included as fixed effects, each level will be tested against the
first factor level. In addition, ordered predictors (section 4.4) and group predictors (section 4.5) can be included by
including group(variable_name)
and
ordered(variable_name)
in the formula. When the data
involve paired or matched case-control samples, conditional logistic
regression can be run by specifying the grouping variable with
strata(variable_name)
.fixed_effects
,
random_effects
, group_effects
,
ordered_effects
, and strata_effects
.
Categorical variables should either be ordered as factors beforehand, or
reference
should be provided as a string of
‘variable,reference’ semi-colon delimited for multiple variables (e.g.,
variable_1,reference_1;variable_2,reference_2
). NOTE:
adding a space between the variable and level might result in the wrong
reference level being used.Because MaAsLin 3 identifies prevalence (presence/absence) associations, sample read depth (number of reads) should be included as a covariate if available. Deeper sequencing will likely increase feature detection in a way that could spuriously correlate with metadata of interest when read depth is not included in the model.
The following command runs MaAsLin 3 on the HMP2 data, running a
multivariable regression model to test for the association between
microbial species abundance and prevalence versus IBD diagnosis and dysbiosis
scores while controlling for antibiotic usage, age, and the sampling
depth (reads). The abundance associations come from fitting multivariate
linear models to the log base 2 relative abundances of features in
samples in which the feature is present. The prevalence associations
come from fitting multivariate logistic models to the presence/absence
data for a feature across all samples. Outputs are directed to a folder
called hmp2_output
under the current working directory
(output = "hmp2_output"
).
To show one of each type of plot (see below), the maximum number of
significant associations to plot has been increased with
max_pngs=100
(default 30
). All other
parameters beyond the first four are the default parameters but are
still included for clarity:
normalization = 'TSS'
) with a log
transformation ('transform = 'LOG'
, base 2) afterwards will
be used. These are almost always the recommended choices (and all
MaAsLin 3 evaluations were performed with these options), but other
normalizations and transformations are allowed (see
normalization
and transform
in
?maaslin3
).augment = TRUE
). This is
almost always recommended, though it can be turned off (see
augment
in the manual).standardize = TRUE
so that the coefficients will be on the
same scale (improving visualization).max_significance = 0.1
)
determines which associations are significant, but this can also be
changed (see max_significance
in the manual).median_comparison_abundance
). Since prevalence
coefficients do not have the same compositional properties, they are
instead compared against 0 (median_comparison_prevalence
).
See below for a discussion of these parameters.cores = 1
).The abundance, prevalence, and variance filtering parameters are not included, and they are 0 by default. Different from other differential abundance tools, low prevalence features need not be filtered out since the prevalence modeling in MaAsLin 3 already accounts for high proportions of zeros. However, filtering low prevalence features might improve power.
set.seed(1)
fit_out <- maaslin3(input_data = taxa_table,
input_metadata = metadata,
output = 'hmp2_output',
formula = '~ diagnosis + dysbiosis_state +
antibiotics + age + reads',
normalization = 'TSS',
transform = 'LOG',
augment = TRUE,
standardize = TRUE,
max_significance = 0.1,
median_comparison_abundance = TRUE,
median_comparison_prevalence = FALSE,
max_pngs = 10,
cores = 1)
By default, many lines of information are printed while the models
fit, but the verbosity can be changed with the verbosity
argument (e.g., verbosity = 'WARN'
).
MaAsLin 3 can also take as input a SummarizedExperiment
or TreeSummarizedExperiment
object from BioConductor:
se <- SummarizedExperiment(
assays = list(taxa_table = t(taxa_table)),
colData = metadata
)
fit_out <- maaslin3(input_data = se,
output = 'hmp2_output',
formula = '~ diagnosis + dysbiosis_state +
antibiotics + age + reads',
normalization = 'TSS',
transform = 'LOG',
augment = TRUE,
standardize = TRUE,
max_significance = 0.1,
median_comparison_abundance = TRUE,
median_comparison_prevalence = FALSE,
max_pngs = 10,
cores = 1)
When median_comparison_abundance
or
median_comparison_prevalence
are on, the coefficients for a
metadatum will be tested against the median coefficient for that
metadatum (median across the features). Otherwise, the coefficients will
be tested against 0. For abundance associations, this is designed to
account for compositionality, the idea that testing against zero
on the relative scale is not equivalent to testing against zero on the
absolute scale. The user manual contains a more extensive
discussion of when to use median_comparison_abundance
and
median_comparison_prevalence
, but in summary:
median_comparison_abundance
is TRUE
by
default and should be used to make inference on the absolute scale when
using relative abundance data. When
median_comparison_abundance
is TRUE
, only the
p-values and q-values change; the coefficients returned are still the
relative abundance coefficients unless subtract_median
is
TRUE
in which case the median will be subtracted from the
coefficients.median_comparison_abundance
should be
FALSE
when (1) testing for significant relative
associations, (2) testing for absolute abundance associations under the
assumption that the total absolute abundance is not changing, or (3)
testing for significant absolute associations when supplying spike-in or
total abundances with unscaled_abundance
.median_comparison_prevalence
is FALSE
by
default.The main output from MaAsLin 3 is the list of significant
associations in significant_results.tsv
. This lists all
associations with joint or individual q-values that pass MaAsLin 3’s
significance threshold, ordered by increasing individual q-value. The
format is:
feature
and metadata
are the feature and
metadata names.value
and name
are the value of the
metadata and variable name from the model.coef
and stderr
are the fit coefficient
and standard error from the model. In abundance models, a one-unit
change in the metadatum variable corresponds to a 2coef fold change in the relative
abundance of the feature. In prevalence models, a one-unit change in the
metadatum variable corresponds to a coef change in the log-odds of a feature
being present.pval_individual
and qval_individual
are
the p-value and false discovery rate (FDR) corrected q-value of the
individual association. FDR correction is performed over all p-values
without errors in the abundance and prevalence modeling together.pval_joint
and qval_joint
are the p-value
and q-value of the joint prevalence and abundance association. The
p-value comes from plugging in the minimum of the association’s
abundance and prevalence p-values into the Beta(1,2) CDF. It is
interpreted as the probability that either the abundance or prevalence
association would be as extreme as observed if there was neither an
abundance nor prevalence association between the feature and
metadatum.model
specifies whether the association is abundance or
prevalence.N
and N_not_zero
are the total number of
data points and the total number of non-zero data points for the
feature.feature | metadata | value | name | coef | stderr | pval_individual | qval_individual | pval_joint | qval_joint | model | N | N_not_zero |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Phocaeicola_sartorii | reads | reads | reads | 1.1 | 0.0787 | 5.42e-44 | 1.06e-40 | 1.08e-43 | 1.14e-40 | prevalence | 1530 | 424 |
Faecalibacterium_prausnitzii | dysbiosis_state | dysbiosis_CD | dysbiosis_statedysbiosis_CD | -3.49 | 0.282 | 3.54e-35 | 3.46e-32 | 7.09e-35 | 3.74e-32 | prevalence | 1530 | 1370 |
Bifidobacterium_longum | age | age | age | -0.672 | 0.0594 | 1.01e-29 | 4.94e-27 | 2.02e-29 | 7.12e-27 | prevalence | 1530 | 843 |
Clostridium_sp_AM49_4BH | diagnosis | CD | diagnosisCD | -1.81 | 0.163 | 1.11e-28 | 4.34e-26 | 2.22e-28 | 5.87e-26 | prevalence | 1530 | 344 |
Phocaeicola_dorei | reads | reads | reads | 0.66 | 0.0634 | 2.56e-25 | 8.35e-23 | 5.13e-25 | 1.08e-22 | prevalence | 1530 | 769 |
GGB16040_SGB9347 | age | age | age | 0.96 | 0.0944 | 2.65e-24 | 7.39e-22 | 5.29e-24 | 9.31e-22 | prevalence | 1530 | 114 |
Firmicutes_bacterium_AF16_15 | diagnosis | UC | diagnosisUC | -1.67 | 0.166 | 6.77e-24 | 1.65e-21 | 1.35e-23 | 2.04e-21 | prevalence | 1530 | 827 |
Eubacterium_rectale | dysbiosis_state | dysbiosis_CD | dysbiosis_statedysbiosis_CD | -2.36 | 0.237 | 2.47e-23 | 5.36e-21 | 4.94e-23 | 6.51e-21 | prevalence | 1530 | 1210 |
Phascolarctobacterium_faecium | age | age | age | 0.574 | 0.0578 | 3.19e-23 | 6.24e-21 | 6.39e-23 | 7.5e-21 | prevalence | 1530 | 503 |
Dysosmobacter_welbionis | dysbiosis_state | dysbiosis_CD | dysbiosis_statedysbiosis_CD | -2.4 | 0.242 | 3.69e-23 | 6.56e-21 | 7.38e-23 | 7.8e-21 | prevalence | 1530 | 1190 |
Blautia_faecis | dysbiosis_state | dysbiosis_CD | dysbiosis_statedysbiosis_CD | -2.42 | 0.249 | 2.66e-22 | 4.33e-20 | 5.31e-22 | 5.1e-20 | prevalence | 1530 | 1120 |
Fusicatenibacter_saccharivorans | dysbiosis_state | dysbiosis_CD | dysbiosis_statedysbiosis_CD | -2.35 | 0.243 | 4.52e-22 | 6.79e-20 | 9.03e-22 | 7.95e-20 | prevalence | 1530 | 1120 |
Lachnospira_sp_NSJ_43 | diagnosis | CD | diagnosisCD | -1.65 | 0.177 | 8.94e-21 | 1.16e-18 | 1.79e-20 | 1.45e-18 | prevalence | 1530 | 296 |
Bacteroides_ovatus | dysbiosis_state | dysbiosis_CD | dysbiosis_statedysbiosis_CD | -2.13 | 0.228 | 1.09e-20 | 1.33e-18 | 2.18e-20 | 1.65e-18 | prevalence | 1530 | 1180 |
Clostridium_sp_AM49_4BH | diagnosis | UC | diagnosisUC | -1.57 | 0.17 | 2.51e-20 | 2.89e-18 | 5.02e-20 | 3.54e-18 | prevalence | 1530 | 344 |
Bacteroides_faecis | reads | reads | reads | 0.6 | 0.0657 | 7.46e-20 | 8.1e-18 | 1.49e-19 | 9.84e-18 | prevalence | 1530 | 434 |
Dysosmobacter_welbionis | reads | reads | reads | 0.719 | 0.0798 | 2.1e-19 | 2.05e-17 | 4.2e-19 | 2.61e-17 | prevalence | 1530 | 1190 |
Roseburia_sp_AF02_12 | diagnosis | CD | diagnosisCD | -1.38 | 0.154 | 2.44e-19 | 2.27e-17 | 4.87e-19 | 2.86e-17 | prevalence | 1530 | 363 |
Clostridiales_bacterium | dysbiosis_state | dysbiosis_CD | dysbiosis_statedysbiosis_CD | -2.46 | 0.275 | 3.56e-19 | 3.16e-17 | 7.12e-19 | 3.96e-17 | prevalence | 1530 | 985 |
Alistipes_putredinis | diagnosis | UC | diagnosisUC | -1.42 | 0.159 | 5.06e-19 | 4.3e-17 | 1.01e-18 | 5.35e-17 | prevalence | 1530 | 873 |
MaAsLin 3 generates two types of output files explained below: data
and visualizations. In addition, the object returned from
maaslin3
(fit_out
above) contains all the data
and results (see ?maaslin_fit
).
significant_results.tsv
all_results.tsv
features
models_linear.rds
and models_logistic.rds
linear
for linear models, logistic
for
logistic models).save_models
is set to
TRUE.residuals_linear.rds
and
residuals_logstic.rds
fitted_linear.rds
and fitted_logistic.rds
ranef_linear.rds
and ranef_logistic.rds
maaslin3.log
summary_plot.pdf
max_significance
,
and two stars indicate the individual q-value is below
max_significance
divided by 10. (Make sure to set
include=T
to view the plots when knitting the Rmd.)association_plots/[metadatum]/[association]/ [metadatum]_[feature]_[association].png
max_pngs
.max_pngs = 250
above and set eval=TRUE
and
include=TRUE
in this chunkAt the top right of each association plot is the name of the significant association in the results file, the FDR corrected q-value for the individual association, the number of samples in the dataset, and the number of samples with non-zero abundances for the feature. In the plots with categorical metadata variables, the reference category is on the left, and the significant q-values and coefficients in the top right are in the order of the values specified above. Because the displayed coefficients correspond to the full fit model with (possibly) scaled metadata variables, the marginal association plotted might not match the coefficient displayed. However, the plots are intended to provide an interpretable visual while usually agreeing with the full model.
There are a few common issues to check for in the results:
error
column of
all_results.tsv
. Often, these indicate model fitting
failures or poor fits that should not be trusted, but sometimes the
warnings can be benign, and the model fit might still be reasonable.
Users should check associations of interest if they produce errors.
warn_prevalence=TRUE
, a note will be produced in
the error column indicating when prevalence associations are likely
induced by abundance associations. These should be checked visually with
the diagnosic visuals.association_plots
directory.
Once maaslin3
has been run once,
maaslin_plot_results_from_output
can be run to (re-)create
the plots. This allows the user to plot the associations even without
having the R object returned by maaslin_fit
or
maaslin3
(e.g., if fitting models through the command
line). It is recommended to fit models with simple variables names that
are robust to formula plotting and then convert these into proper names
for plotting. Likewise, heatmap_vars
and
coef_plot_vars
can be specified at any point, but it is
usually easier to see how the names come out first and then choose which
metadata variables will go in the coefficient plot and which will go in
the heatmap afterwards with
maaslin_plot_results_from_output
.
# This section is necessary for updating the
# summary plot and the association plots
# Rename results file with clean titles
all_results <- read.csv('hmp2_output/all_results.tsv', sep='\t')
all_results <- all_results %>%
mutate(metadata = case_when(metadata == 'age' ~ 'Age',
metadata == 'antibiotics' ~ 'Abx',
metadata == 'diagnosis' ~ 'Diagnosis',
metadata == 'dysbiosis_state' ~ 'Dysbiosis',
metadata == 'reads' ~ 'Read depth'),
value = case_when(value == 'dysbiosis_CD' ~ 'CD',
value == 'dysbiosis_UC' ~ 'UC',
value == 'Yes' ~ 'Used', # Antibiotics
value == 'age' ~ 'Age',
value == 'reads' ~ 'Read depth',
TRUE ~ value),
feature = gsub('_', ' ', feature) %>%
gsub(pattern = 'sp ', replacement = 'sp. '))
# Write results
write.table(all_results, 'hmp2_output/all_results.tsv', sep='\t')
# Set the new heatmap and coefficient plot variables and order them
heatmap_vars = c('Dysbiosis UC', 'Diagnosis UC',
'Abx Used', 'Age', 'Read depth')
coef_plot_vars = c('Dysbiosis CD', 'Diagnosis CD')
# This section is necessary for updating the association plots
taxa_table_copy <- taxa_table
colnames(taxa_table_copy) <- gsub('_', ' ', colnames(taxa_table_copy)) %>%
gsub(pattern = 'sp ', replacement = 'sp. ')
# Rename the features in the norm transformed data file
data_transformed <-
read.csv('hmp2_output/features/data_transformed.tsv', sep='\t')
colnames(data_transformed) <-
gsub('_', ' ', colnames(data_transformed)) %>%
gsub(pattern = 'sp ', replacement = 'sp. ')
write.table(data_transformed,
'hmp2_output/features/data_transformed.tsv',
sep='\t', row.names = FALSE)
# Rename the metadata like in the outputs table
metadata_copy <- metadata
colnames(metadata_copy) <-
case_when(colnames(metadata_copy) == 'age' ~ 'Age',
colnames(metadata_copy) == 'antibiotics' ~ 'Abx',
colnames(metadata_copy) == 'diagnosis' ~ 'Diagnosis',
colnames(metadata_copy) == 'dysbiosis_state' ~ 'Dysbiosis',
colnames(metadata_copy) == 'reads' ~ 'Read depth',
TRUE ~ colnames(metadata_copy))
metadata_copy <- metadata_copy %>%
mutate(Dysbiosis = case_when(Dysbiosis == 'dysbiosis_UC' ~ 'UC',
Dysbiosis == 'dysbiosis_CD' ~ 'CD',
Dysbiosis == 'none' ~ 'None') %>%
factor(levels = c('None', 'UC', 'CD')),
Abx = case_when(Abx == 'Yes' ~ 'Used',
Abx == 'No' ~ 'Not used') %>%
factor(levels = c('Not used', 'Used')),
Diagnosis = case_when(Diagnosis == 'nonIBD' ~ 'non-IBD',
TRUE ~ Diagnosis) %>%
factor(levels = c('non-IBD', 'UC', 'CD')))
# Recreate the plots
scatter_plots <- maaslin_plot_results_from_output(
output = 'hmp2_output',
metadata = metadata_copy,
normalization = "TSS",
transform = "LOG",
median_comparison_abundance = TRUE,
median_comparison_prevalence = FALSE,
max_significance = 0.1,
max_pngs = 20)
In the new summary plot below, we can see that the feature names are cleaned up, the metadata names are cleaned up, the set of metadata variables used in the coefficient plot is different, and the metadata used in the heatmap is reordered.
In the association plots, the taxa and metadata have been renamed to be consistent with the results file from earlier.
Most microbiome methodologies have historically focused on relative abundances (proportions out of 1). However, some experimental protocols can enable estimation of absolute abundances (cell count/concentration). MaAsLin 3 can be used with two types of absolute abundance estimation: spike-ins and total abundance scaling. In a spike-in procedure, a small, known quantity of a microbe that otherwise would not be present in the sample is added, and the sequencing procedure is carried out as usual. Then, the absolute abundance of a microbe already in the community is estimated as: $$\textrm{Absolute abundance other microbe}=\frac{\textrm{Relative abundance other microbe}}{\textrm{Relative abundance spike-in microbe}}\cdot (\textrm{Absolute abundance spike-in microbe})$$ Alternatively, the total microbial abundance of a sample can be determined (e.g., with qPCR of a marker gene or by cell counting). Then, the absolute abundance of a microbe in the community is estimated as: Absolute abundance microbe = (Total absolute abundance) ⋅ (Relative abundance microbe)
The following section shows a synthetic spike-in procedure with simulated data from SparseDOSSA2. The abundance table is like any other abundance table input to MaAsLin 3 except that ‘Feature101’ is the spike-in (which must be present in every sample). The scaling factor data frame has ‘Feature101’ as its only column name with the absolute abundance of the spike-in for each sample (rownames). If the same quantity of the spike-in was added to all samples, this column could be entirely 1 (or any other arbitrary value). When using a spike-in procedure, the scaling factor data frame should have a single column named the same as one of the features in the abundance table, which will be used as the spike-in.
# Abundance table
taxa_table_name <- system.file("extdata", "abundance_spike_in_ex.tsv",
package = "maaslin3")
spike_in_taxa_table <- read.csv(taxa_table_name, sep = '\t')
# Metadata table
metadata_name <- system.file("extdata", "metadata_spike_in_ex.tsv",
package = "maaslin3")
spike_in_metadata <- read.csv(metadata_name, sep = '\t')
for (col in c('Metadata_1', 'Metadata_2', 'Metadata_5')) {
spike_in_metadata[,col] <- factor(spike_in_metadata[,col])
}
# Spike-in table
unscaled_name <- system.file("extdata",
"scaling_factors_spike_in_ex.tsv",
package = "maaslin3")
spike_in_unscaled <- read.csv(unscaled_name, sep = '\t')
spike_in_taxa_table[c(1:5, 101),1:5]
## Sample1 Sample2 Sample3 Sample4 Sample5
## Feature1 0 0 0 0 0
## Feature2 0 0 0 0 0
## Feature3 554 0 1737 0 0
## Feature4 0 0 0 0 0
## Feature5 0 0 0 0 0
## Feature101 15464 4788 5155 2052 11540
## Metadata_1 Metadata_2 Metadata_3 Metadata_4 Metadata_5 ID
## Sample1 1 1 -0.41011453 0.6352468 0 Subject1
## Sample2 0 1 -1.35105918 0.2362369 0 Subject2
## Sample3 1 1 -0.48798215 -0.5458514 0 Subject3
## Sample4 0 0 -0.07268457 -0.9903428 1 Subject4
## Sample5 1 0 0.27201198 1.1333223 1 Subject5
## Feature101
## Sample1 418
## Sample2 183
## Sample3 38
## Sample4 3230
## Sample5 243
The following code fits the absolute abundance model. Note that the
normalization must be set to TSS
since the procedure
involves scaling the relative abundance ratios. Also,
median_comparison_abundance
is now set to
FALSE
since we want to test the absolute coefficients
against zero. The data frame of absolute abundances is included as the
unscaled_abundance
parameter, and the spike-in strategy
will automatically be detected based on the column name.
fit_out <- maaslin3(
input_data = spike_in_taxa_table,
input_metadata = spike_in_metadata,
output = 'spike_in_demo',
formula = '~ Metadata_1 + Metadata_2 + Metadata_3 +
Metadata_4 + Metadata_5',
normalization = 'TSS',
transform = 'LOG',
median_comparison_abundance = FALSE,
unscaled_abundance = spike_in_unscaled)
The absolute abundance coefficients can be inspected:
rownames(fit_out$fit_data_abundance$results) <- NULL
head(fit_out$fit_data_abundance$results, 20) %>%
dplyr::mutate_if(is.numeric, .funs =
function(x){(as.character(signif(x, 3)))}) %>%
knitr::kable() %>%
kableExtra::kable_styling("striped", position = 'center') %>%
kableExtra::scroll_box(width = "800px", height = "400px")
feature | metadata | value | name | coef | stderr | pval_individual | qval_individual | pval_joint | qval_joint | error | model | N | N_not_zero |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Feature58 | Metadata_2 | 1 | Metadata_21 | -4.9 | 0.519 | 7.63e-13 | 5.48e-10 | 7.63e-13 | 3.2e-10 | NA | linear | 100 | 58 |
Feature88 | Metadata_5 | 1 | Metadata_51 | 7.86 | 0.819 | 2.99e-12 | 1.08e-09 | 5.99e-12 | 1.26e-09 | NA | linear | 100 | 49 |
Feature10 | Metadata_1 | 1 | Metadata_11 | -4.19 | 0.669 | 3.96e-07 | 5.29e-05 | 3.96e-07 | 4.16e-05 | NA | linear | 100 | 40 |
Feature97 | Metadata_4 | Metadata_4 | Metadata_4 | -2.93 | 0.475 | 6.2e-06 | 0.000446 | 1.24e-05 | 0.000521 | NA | linear | 100 | 25 |
Feature56 | Metadata_4 | Metadata_4 | Metadata_4 | 5.93 | 1 | 1.74e-05 | 0.00104 | 1.74e-05 | 0.000665 | NA | linear | 100 | 23 |
Feature83 | Metadata_3 | Metadata_3 | Metadata_3 | 5.45 | 1.07 | 7.55e-05 | 0.00388 | 0.000151 | 0.00423 | NA | linear | 100 | 24 |
Feature49 | Metadata_1 | 1 | Metadata_11 | -5.45 | 1.02 | 8.22e-05 | 0.00394 | 8.22e-05 | 0.00246 | NA | linear | 100 | 21 |
Feature49 | Metadata_5 | 1 | Metadata_51 | -5.6 | 1.23 | 0.000374 | 0.0117 | 0.000748 | 0.0131 | NA | linear | 100 | 21 |
Feature64 | Metadata_3 | Metadata_3 | Metadata_3 | -3.23 | 0.828 | 0.000612 | 0.0169 | 0.00122 | 0.0198 | NA | linear | 100 | 32 |
Feature84 | Metadata_1 | 1 | Metadata_11 | -4.54 | 0.919 | 0.000801 | 0.0206 | 0.0016 | 0.0232 | NA | linear | 100 | 15 |
Feature39 | Metadata_4 | Metadata_4 | Metadata_4 | -3.66 | 0.943 | 0.000867 | 0.0215 | 8.83e-07 | 6.18e-05 | NA | linear | 100 | 27 |
Feature67 | Metadata_1 | 1 | Metadata_11 | -2.32 | 0.661 | 0.00115 | 0.0275 | 0.0023 | 0.0321 | NA | linear | 100 | 46 |
Feature79 | Metadata_4 | Metadata_4 | Metadata_4 | -1.01 | 0.314 | 0.0045 | 0.0924 | 0.00898 | 0.122 | NA | linear | 100 | 25 |
Feature79 | Metadata_3 | Metadata_3 | Metadata_3 | -1.07 | 0.336 | 0.00471 | 0.0941 | 0.0094 | 0.123 | NA | linear | 100 | 25 |
Feature37 | Metadata_4 | Metadata_4 | Metadata_4 | 1 | 0.344 | 0.00525 | 0.102 | 0.0105 | 0.133 | NA | linear | 100 | 62 |
Feature25 | Metadata_1 | 1 | Metadata_11 | -5.25 | 0.0484 | 0.00587 | 0.109 | 0.0117 | 0.142 | NA | linear | 100 | 7 |
Feature44 | Metadata_1 | 1 | Metadata_11 | 3.73 | 0.957 | 0.00594 | 0.109 | 0.0118 | 0.142 | NA | linear | 100 | 13 |
Feature15 | Metadata_3 | Metadata_3 | Metadata_3 | 1.93 | 0.601 | 0.00676 | 0.119 | 0.0135 | 0.149 | NA | linear | 100 | 19 |
Feature42 | Metadata_2 | 1 | Metadata_21 | 4.08 | 1.02 | 0.00711 | 0.119 | 0.0142 | 0.149 | NA | linear | 100 | 12 |
Feature56 | Metadata_3 | Metadata_3 | Metadata_3 | 1.86 | 0.608 | 0.00712 | 0.119 | 0.0142 | 0.149 | NA | linear | 100 | 23 |
The following section shows a synthetic total abundance scaling procedure with simulated data from SparseDOSSA2. The abundance table is like any other abundance table input to MaAsLin 3 without any extra features. The scaling factor data frame has ‘total’ as its only column name with the total absolute abundance for each sample (rownames).
# Abundance table
taxa_table_name <- system.file("extdata", "abundance_total_ex.tsv",
package = "maaslin3")
total_scaling_taxa_table <- read.csv(taxa_table_name, sep = '\t')
# Metadata table
metadata_name <- system.file("extdata", "metadata_total_ex.tsv",
package = "maaslin3")
total_scaling_metadata <- read.csv(metadata_name, sep = '\t')
for (col in c('Metadata_1', 'Metadata_3', 'Metadata_5')) {
spike_in_metadata[,col] <- factor(spike_in_metadata[,col])
}
# Total abundance table
unscaled_name <- system.file("extdata", "scaling_factors_total_ex.tsv",
package = "maaslin3")
total_scaling_unscaled <- read.csv(unscaled_name, sep = '\t')
total_scaling_taxa_table[1:5, 1:5]
## Sample1 Sample2 Sample3 Sample4 Sample5
## Feature1 11837 0 0 8 68
## Feature2 0 0 0 0 1
## Feature3 0 0 0 0 0
## Feature4 0 0 1 0 0
## Feature5 0 0 0 0 0
## Metadata_1 Metadata_2 Metadata_3 Metadata_4 Metadata_5 ID
## Sample1 0 1.2410844 0 0.4341439 1 Subject1
## Sample2 1 0.0176896 1 1.4894610 1 Subject2
## Sample3 1 -1.7370341 1 -0.7973592 1 Subject3
## Sample4 1 0.1500793 0 3.2816068 0 Subject4
## Sample5 0 -0.2819085 0 -0.6731906 0 Subject5
## total
## Sample1 1352.276
## Sample2 1720.637
## Sample3 64194.918
## Sample4 50256.129
## Sample5 1132.367
The following code fits the absolute abundance model. As before,
TSS
must be set to TRUE
,
median_comparison_abundance
is set to FALSE
,
and the data frame of absolute abundances is included as the
unscaled_abundance
parameter. The total abundance scaling
procedure will be detected based on the column name.
fit_out <- maaslin3(
input_data = total_scaling_taxa_table,
input_metadata = total_scaling_metadata,
output = 'total_scaling_demo',
formula = '~ Metadata_1 + Metadata_2 + Metadata_3 +
Metadata_4 + Metadata_5',
normalization = 'TSS',
transform = 'LOG',
median_comparison_abundance = FALSE,
unscaled_abundance = total_scaling_unscaled)
The absolute abundance coefficients can be inspected:
rownames(fit_out$fit_data_abundance$results) <- NULL
head(fit_out$fit_data_abundance$results, n = 20) %>%
dplyr::mutate_if(is.numeric,
.funs = function(x){(as.character(signif(x, 3)))}) %>%
knitr::kable() %>%
kableExtra::kable_styling("striped", position = 'center') %>%
kableExtra::scroll_box(width = "800px", height = "400px")
feature | metadata | value | name | coef | stderr | pval_individual | qval_individual | pval_joint | qval_joint | error | model | N | N_not_zero |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Feature94 | Metadata_1 | Metadata_1 | Metadata_1 | 2.83 | 0.374 | 3.32e-11 | 2.3e-08 | 6.64e-11 | 2.76e-08 | NA | linear | 100 | 94 |
Feature80 | Metadata_3 | Metadata_3 | Metadata_3 | 2.86 | 0.316 | 3.32e-10 | 1.15e-07 | 3.32e-10 | 6.89e-08 | NA | linear | 100 | 37 |
Feature80 | Metadata_5 | Metadata_5 | Metadata_5 | 2.22 | 0.296 | 1.92e-08 | 4.44e-06 | 3.85e-08 | 5.32e-06 | NA | linear | 100 | 37 |
Feature66 | Metadata_2 | Metadata_2 | Metadata_2 | 3.83 | 0.56 | 1.92e-07 | 1.9e-05 | 1.92e-07 | 1.14e-05 | NA | linear | 100 | 34 |
Feature80 | Metadata_1 | Metadata_1 | Metadata_1 | -2.09 | 0.311 | 1.69e-07 | 1.9e-05 | 1.69e-07 | 1.14e-05 | NA | linear | 100 | 37 |
Feature4 | Metadata_1 | Metadata_1 | Metadata_1 | 2.87 | 0.434 | 1.53e-06 | 0.000106 | 3.06e-06 | 0.000127 | NA | linear | 100 | 27 |
Feature45 | Metadata_3 | Metadata_3 | Metadata_3 | -1.82 | 0.408 | 0.000144 | 0.00712 | 0.000288 | 0.00796 | NA | linear | 100 | 32 |
Feature101 | Metadata_1 | Metadata_1 | Metadata_1 | 1.12 | 0.287 | 0.000178 | 0.0077 | 0.000178 | 0.00568 | NA | linear | 100 | 100 |
Feature8 | Metadata_2 | Metadata_2 | Metadata_2 | 3.33 | 0.656 | 0.000168 | 0.0077 | 0.000337 | 0.00874 | NA | linear | 100 | 20 |
Feature5 | Metadata_4 | Metadata_4 | Metadata_4 | 4.92 | 1.23 | 0.00205 | 0.0678 | 0.00205 | 0.0406 | NA | linear | 100 | 17 |
Feature45 | Metadata_2 | Metadata_2 | Metadata_2 | -2.08 | 0.645 | 0.00341 | 0.102 | 0.00109 | 0.025 | NA | linear | 100 | 32 |
Feature68 | Metadata_3 | Metadata_3 | Metadata_3 | -2.27 | 0.717 | 0.00569 | 0.146 | 0.0113 | 0.168 | NA | linear | 100 | 23 |
Feature75 | Metadata_5 | Metadata_5 | Metadata_5 | -3.29 | 1.08 | 0.00645 | 0.16 | 0.0128 | 0.184 | NA | linear | 100 | 26 |
Feature86 | Metadata_1 | Metadata_1 | Metadata_1 | 0.974 | 0.337 | 0.00671 | 0.16 | 0.0134 | 0.185 | NA | linear | 100 | 39 |
Feature44 | Metadata_4 | Metadata_4 | Metadata_4 | -0.565 | 0.218 | 0.011 | 0.23 | 0.011 | 0.168 | NA | linear | 100 | 100 |
Feature79 | Metadata_2 | Metadata_2 | Metadata_2 | 1.9 | 0.724 | 0.013 | 0.257 | 0.0258 | 0.335 | NA | linear | 100 | 40 |
Feature33 | Metadata_4 | Metadata_4 | Metadata_4 | -0.986 | 0.391 | 0.0152 | 0.291 | 0.0302 | 0.379 | NA | linear | 100 | 53 |
Feature75 | Metadata_3 | Metadata_3 | Metadata_3 | 2.71 | 1.09 | 0.0215 | 0.381 | 0.00702 | 0.127 | NA | linear | 100 | 26 |
Feature91 | Metadata_2 | Metadata_2 | Metadata_2 | -1.21 | 0.511 | 0.0263 | 0.444 | 0.0518 | 0.597 | NA | linear | 100 | 29 |
Feature20 | Metadata_2 | Metadata_2 | Metadata_2 | -5.14 | 1.67 | 0.0272 | 0.445 | 0.0537 | 0.602 | NA | linear | 100 | 11 |
When using a purely computational approach to estimate absolute
abundance (e.g., Nishijima et al. Cell 2024 with the function
MLP
), the abundance table can be estimated and then run
with the normalization = 'NONE'
to avoid scaling the
results.
Certain studies have a natural “grouping” of sample observations,
such as by subject in longitudinal designs or by family in family
designs. It is important for statistic analysis to address the
non-independence between samples belonging to the same group. MaAsLin 3
provides a simple interface for this with random effects, where the user
can specify the grouping variable to run a mixed
effect model. This grouping variable can be provided with the
random_effects
parameter or specified in the model with
(1|grouping_variable)
. This will add in a “random
intercept,” a per-group adjustment to the model intercept. More
complicated random effects can be specified in the formula in accordance
with lme4
formula parsing. For example, HMP2 has a
longitudinal design where the same subject (column
participant_id
) can have multiple samples. We thus use
participant_id
as a random effect grouping variable in the
model.
# Subset to only CD cases for time; taxa are subset automatically
fit_out <- maaslin3(
input_data = taxa_table,
input_metadata = metadata[metadata$diagnosis == 'CD',],
output = 'random_effects_output',
formula = '~ dysbiosis_state + antibiotics +
age + reads + (1|participant_id)',
plot_summary_plot = FALSE,
plot_associations = FALSE)
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
## boundary (singular) fit: see help('isSingular')
feature | metadata | value | name | coef | stderr | pval_individual | qval_individual | pval_joint | qval_joint | model | N | N_not_zero |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Phocaeicola_sartorii | reads | reads | reads | 1.56 | 0.185 | 3.11e-17 | 3.38e-14 | 6.22e-17 | 3.62e-14 | prevalence | 685 | 205 |
Faecalibacterium_prausnitzii | dysbiosis_state | dysbiosis_CD | dysbiosis_statedysbiosis_CD | -4.89 | 0.692 | 1.68e-12 | 9.11e-10 | 3.36e-12 | 9.77e-10 | prevalence | 685 | 573 |
Bacteroides_eggerthii | reads | reads | reads | 1.61 | 0.246 | 5.55e-11 | 2.01e-08 | 1.11e-10 | 2.15e-08 | prevalence | 685 | 119 |
Blautia_faecis | dysbiosis_state | dysbiosis_CD | dysbiosis_statedysbiosis_CD | -2.56 | 0.4 | 1.5e-10 | 4.06e-08 | 2.99e-10 | 4.35e-08 | prevalence | 685 | 469 |
Bacteroides_uniformis | dysbiosis_state | dysbiosis_CD | dysbiosis_statedysbiosis_CD | -3.48 | 0.556 | 4.2e-10 | 7.6e-08 | 8.4e-10 | 8.15e-08 | prevalence | 685 | 553 |
Dysosmobacter_welbionis | dysbiosis_state | dysbiosis_CD | dysbiosis_statedysbiosis_CD | -2.87 | 0.458 | 3.74e-10 | 7.6e-08 | 7.48e-10 | 8.15e-08 | prevalence | 685 | 524 |
Fusicatenibacter_saccharivorans | dysbiosis_state | dysbiosis_CD | dysbiosis_statedysbiosis_CD | -2.49 | 0.401 | 5.13e-10 | 7.95e-08 | 1.03e-09 | 8.53e-08 | prevalence | 685 | 484 |
Parabacteroides_distasonis | dysbiosis_state | dysbiosis_CD | dysbiosis_statedysbiosis_CD | -3.27 | 0.533 | 8.06e-10 | 1.09e-07 | 1.61e-09 | 1.17e-07 | prevalence | 685 | 484 |
Clostridium_fessum | dysbiosis_state | dysbiosis_CD | dysbiosis_statedysbiosis_CD | -3.59 | 0.592 | 1.28e-09 | 1.4e-07 | 2.57e-09 | 1.5e-07 | prevalence | 685 | 391 |
Clostridium_sp_AF34_10BH | dysbiosis_state | dysbiosis_CD | dysbiosis_statedysbiosis_CD | -3.3 | 0.543 | 1.19e-09 | 1.4e-07 | 2.38e-09 | 1.5e-07 | prevalence | 685 | 377 |
Eubacterium_rectale | dysbiosis_state | dysbiosis_CD | dysbiosis_statedysbiosis_CD | -2.82 | 0.469 | 1.86e-09 | 1.83e-07 | 3.72e-09 | 1.97e-07 | prevalence | 685 | 514 |
Phocaeicola_dorei | reads | reads | reads | 0.828 | 0.139 | 2.94e-09 | 2.66e-07 | 5.88e-09 | 2.85e-07 | prevalence | 685 | 310 |
Clostridiales_bacterium | dysbiosis_state | dysbiosis_CD | dysbiosis_statedysbiosis_CD | -3.88 | 0.668 | 6.33e-09 | 5.29e-07 | 1.27e-08 | 5.67e-07 | prevalence | 685 | 407 |
Clostridium_sp_AT4 | dysbiosis_state | dysbiosis_CD | dysbiosis_statedysbiosis_CD | 3.07 | 0.593 | 1.89e-08 | 1.47e-06 | 3.79e-08 | 1.47e-06 | abundance | 685 | 229 |
Escherichia_coli | dysbiosis_state | dysbiosis_CD | dysbiosis_statedysbiosis_CD | 1.96 | 0.407 | 3.73e-08 | 2.53e-06 | 3.73e-08 | 1.47e-06 | abundance | 685 | 327 |
Anaerostipes_hadrus | dysbiosis_state | dysbiosis_CD | dysbiosis_statedysbiosis_CD | -2.31 | 0.421 | 3.7e-08 | 2.53e-06 | 7.39e-08 | 2.69e-06 | prevalence | 685 | 468 |
Roseburia_inulinivorans | dysbiosis_state | dysbiosis_CD | dysbiosis_statedysbiosis_CD | -2.25 | 0.413 | 4.85e-08 | 3.1e-06 | 9.7e-08 | 3.32e-06 | prevalence | 685 | 349 |
Bacteroides_faecis | reads | reads | reads | 1.05 | 0.194 | 5.74e-08 | 3.46e-06 | 1.15e-07 | 3.71e-06 | prevalence | 685 | 203 |
Bacteroides_thetaiotaomicron | dysbiosis_state | dysbiosis_CD | dysbiosis_statedysbiosis_CD | -3.28 | 0.608 | 6.72e-08 | 3.84e-06 | 1.34e-07 | 4.12e-06 | prevalence | 685 | 453 |
Ruthenibacterium_lactatiformans | reads | reads | reads | 0.62 | 0.115 | 7.66e-08 | 4.16e-06 | 1.53e-07 | 4.46e-06 | prevalence | 685 | 336 |
Note that no participant_id
terms are included in the
outputs; the random intercepts are just used to control for grouping. If
you are interested in testing the effect of time in a longitudinal
study, the time point variable should be included as a fixed effect in
your MaAsLin 3 formula.
Mixed effects models take longer to fit because of their complexity,
and the fitting may fail more often because of the extra terms. However,
it is important to account for this non-independence, and MaAsLin 3 is
still able to fit complex models with thousands of samples and thousands
of features in a few hours (and increasing the CPUs used with
cores
can help). However, random effects are not the
solution to every correlated data problem. In particular,
logistic random intercept models may produce poor model fits
when there are fewer than 5 observations per group. In these
scenarios, per-group fixed effects or, for advanced users, conditional
logistic regression (see strata_effects
in the manual)
should be used. For example, in a before-and-after treatment study with
one sample before and one sample after, the participant IDs and a
pre/post indicator should be included as fixed effects or a
strata
variable, but only the pre/post indicator should be
retained for analysis afterwards.
One of the benefits of MaAsLin 3’s formula mode is the ability to
include interaction terms. Mathematically, an interaction term
corresponds to the product of two columns in the design matrix. When a
continuous variable is interacted with a categorical term, the
interaction term corresponds to the change in the continuous variable’s
slope between the categories. For two categorical variables interacted,
see below; it is better explained through an example. In the formula,
interactions can be specified with the :
symbol to include
only the interaction term(s) or the *
symbol to include
both the interaction term and the non-interacted terms (i.e.,
a*b
adds the terms a
, b
, and
a:b
).
metadata <- metadata %>%
mutate(dysbiosis_general = ifelse(dysbiosis_state != 'none',
'dysbiosis', 'none')) %>%
mutate(dysbiosis_general = factor(dysbiosis_general, levels =
c('none', 'dysbiosis')))
fit_out <- maaslin3(
input_data = taxa_table,
input_metadata = metadata,
output = 'interaction_output',
formula = '~ diagnosis + diagnosis:dysbiosis_general +
antibiotics + age + reads')
full_results <- rbind(fit_out$fit_data_abundance$results,
fit_out$fit_data_prevalence$results)
full_results <- full_results %>%
dplyr::arrange(qval_joint) %>%
dplyr::filter(metadata == "diagnosis")
rownames(full_results) <- NULL
head(full_results, n = 20) %>%
dplyr::mutate_if(is.numeric,
.funs = function(x){(as.character(signif(x, 3)))}) %>%
knitr::kable() %>%
kableExtra::kable_styling("striped", position = 'center') %>%
kableExtra::scroll_box(width = "800px", height = "400px")
feature | metadata | value | name | coef | stderr | pval_individual | qval_individual | pval_joint | qval_joint | error | model | N | N_not_zero |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Faecalibacterium_prausnitzii | diagnosis | CD:dysbiosis_generaldysbiosis | diagnosisCD:dysbiosis_generaldysbiosis | -2.18 | 0.309 | 1.18e-06 | 1.03e-05 | 7.2e-35 | 3.8e-32 | NA | linear | 1530 | 1370 |
Faecalibacterium_prausnitzii | diagnosis | CD:dysbiosis_generaldysbiosis | diagnosisCD:dysbiosis_generaldysbiosis | -3.48 | 0.282 | 3.6e-35 | 3.52e-32 | 7.2e-35 | 3.8e-32 | NA | logistic | 1530 | 1370 |
Clostridium_sp_AM49_4BH | diagnosis | CD | diagnosisCD | -0.68 | 0.344 | 0.108 | 0.197 | 2.31e-28 | 6.1e-26 | NA | linear | 1530 | 344 |
Clostridium_sp_AM49_4BH | diagnosis | CD | diagnosisCD | -1.81 | 0.163 | 1.15e-28 | 4.51e-26 | 2.31e-28 | 6.1e-26 | NA | logistic | 1530 | 344 |
Firmicutes_bacterium_AF16_15 | diagnosis | UC | diagnosisUC | -0.47 | 0.246 | 0.0924 | 0.174 | 1.4e-23 | 2.12e-21 | NA | linear | 1530 | 827 |
Firmicutes_bacterium_AF16_15 | diagnosis | UC | diagnosisUC | -1.67 | 0.166 | 7.01e-24 | 1.71e-21 | 1.4e-23 | 2.12e-21 | NA | logistic | 1530 | 827 |
Eubacterium_rectale | diagnosis | CD:dysbiosis_generaldysbiosis | diagnosisCD:dysbiosis_generaldysbiosis | -0.602 | 0.386 | 0.395 | 0.533 | 5.15e-23 | 6.79e-21 | NA | linear | 1530 | 1210 |
Eubacterium_rectale | diagnosis | CD:dysbiosis_generaldysbiosis | diagnosisCD:dysbiosis_generaldysbiosis | -2.36 | 0.237 | 2.57e-23 | 5.59e-21 | 5.15e-23 | 6.79e-21 | NA | logistic | 1530 | 1210 |
Dysosmobacter_welbionis | diagnosis | CD:dysbiosis_generaldysbiosis | diagnosisCD:dysbiosis_generaldysbiosis | 0.7 | 0.342 | 0.0316 | 0.075 | 7.7e-23 | 8.13e-21 | NA | linear | 1530 | 1190 |
Dysosmobacter_welbionis | diagnosis | CD:dysbiosis_generaldysbiosis | diagnosisCD:dysbiosis_generaldysbiosis | -2.39 | 0.241 | 3.85e-23 | 6.84e-21 | 7.7e-23 | 8.13e-21 | NA | logistic | 1530 | 1190 |
Blautia_faecis | diagnosis | CD:dysbiosis_generaldysbiosis | diagnosisCD:dysbiosis_generaldysbiosis | 1.03 | 0.474 | 0.0211 | 0.0532 | 5.45e-22 | 5.23e-20 | NA | linear | 1530 | 1120 |
Blautia_faecis | diagnosis | CD:dysbiosis_generaldysbiosis | diagnosisCD:dysbiosis_generaldysbiosis | -2.42 | 0.249 | 2.72e-22 | 4.44e-20 | 5.45e-22 | 5.23e-20 | NA | logistic | 1530 | 1120 |
Fusicatenibacter_saccharivorans | diagnosis | CD:dysbiosis_generaldysbiosis | diagnosisCD:dysbiosis_generaldysbiosis | 0.311 | 0.4 | 0.263 | 0.393 | 9.33e-22 | 8.21e-20 | NA | linear | 1530 | 1120 |
Fusicatenibacter_saccharivorans | diagnosis | CD:dysbiosis_generaldysbiosis | diagnosisCD:dysbiosis_generaldysbiosis | -2.35 | 0.243 | 4.67e-22 | 7.01e-20 | 9.33e-22 | 8.21e-20 | NA | logistic | 1530 | 1120 |
Lachnospira_sp_NSJ_43 | diagnosis | CD | diagnosisCD | -0.403 | 0.401 | 0.437 | 0.577 | 1.86e-20 | 1.51e-18 | NA | linear | 1530 | 296 |
Lachnospira_sp_NSJ_43 | diagnosis | CD | diagnosisCD | -1.65 | 0.177 | 9.28e-21 | 1.21e-18 | 1.86e-20 | 1.51e-18 | NA | logistic | 1530 | 296 |
Bacteroides_ovatus | diagnosis | CD:dysbiosis_generaldysbiosis | diagnosisCD:dysbiosis_generaldysbiosis | 0.278 | 0.429 | 0.316 | 0.454 | 2.25e-20 | 1.7e-18 | NA | linear | 1530 | 1180 |
Bacteroides_ovatus | diagnosis | CD:dysbiosis_generaldysbiosis | diagnosisCD:dysbiosis_generaldysbiosis | -2.13 | 0.228 | 1.13e-20 | 1.37e-18 | 2.25e-20 | 1.7e-18 | NA | logistic | 1530 | 1180 |
Clostridium_sp_AM49_4BH | diagnosis | UC | diagnosisUC | 0.136 | 0.357 | 0.877 | 0.931 | 5.16e-20 | 3.64e-18 | NA | linear | 1530 | 344 |
Clostridium_sp_AM49_4BH | diagnosis | UC | diagnosisUC | -1.57 | 0.169 | 2.58e-20 | 2.97e-18 | 5.16e-20 | 3.64e-18 | NA | logistic | 1530 | 344 |
In the model above, we have converted dysbiosis_CD
and
dysbiosis_UC
into dysbiosis
in the variable
dysbiosis_general
and then specified an interaction between
diagnosis
and dysbiosis_general
with
diagnosis:dysbiosis_general
. Since diagnosis
has 3 levels itself (nonIBD
, UC
,
CD
), this will produce five terms (name
column) for each feature:
diagnosisCD
is the difference between non-dysbiotic CD
and non-IBDdiagnosisUC
is the difference between non-dysbiotic UC
and non-IBDdiagnosisCD:dysbiosis_generaldysbiosis
(the interaction
of diagnosisCD
and dysbiosis_generaldysbiosis
)
is the difference between non-dysbiotic CD and dysbiotic CDdiagnosisnonIBD:dysbiosis_generaldysbiosis
(the
interaction of diagnosisnonIBD
and
dysbiosis_generaldysbiosis
) is the difference between
dysbiotic and non-dysbiotic non-IBD. Note that these coefficients are
all NA since no subjects labeled non-IBD were marked as being
dysbiotic.diagnosisUC:dysbiosis_generaldysbiosis
(the interaction
of diagnosisUC
and dysbiosis_generaldysbiosis
)
is the difference between non-dysbiotic UC and dysbiotic UCSince we have effectively fit the original model in a different way,
this output table is equivalent to the original output table but with
new names in the columns metadata
, value
, and
name
based on the interaction terms.
Another feature of MaAsLin 3 is the ability to test for
level-versus-level differences in ordered predictors with contrast
tests. Ordered predictors are categorical variables with a natural
ordering such as cancer stage, consumption frequency of a dietary
factor, or dosage group. Here, we assess how microbial abundances and
prevalences are associated with eating red meat by including
ordered(red_meat)
in the formula. This will perform a
contrast test for whether there is a difference between each pair of
subsequent levels (e.g., “Yesterday, 3 or more times” versus “Yesterday,
1 to 2 times”) rather than whether there are differences between the
levels and the baseline (e.g., “Yesterday, 3 or more times” versus “Not
in the last 7 days”). The coefficient, standard error, and p-value all
correspond to the difference between the level in value
and
the previous level. Ordered predictors should only be included as fixed
effects (i.e., no ordered predictors as random effects etc.).
# Put the red meat consumption responses in order
metadata <- metadata %>%
mutate(red_meat = ifelse(
red_meat == 'No, I did not consume these products in the last 7 days',
'Not in the last 7 days',
red_meat) %>%
factor(levels = c('Not in the last 7 days',
'Within the past 4 to 7 days',
'Within the past 2 to 3 days',
'Yesterday, 1 to 2 times',
'Yesterday, 3 or more times'))
)
# Create the model with only non-IBD subjects
fit_out <- maaslin3(
input_data = taxa_table,
input_metadata = metadata[metadata$diagnosis == 'nonIBD',],
output = 'ordered_outputs',
formula = '~ ordered(red_meat) + antibiotics + age + reads',
plot_summary_plot = TRUE,
plot_associations = TRUE,
heatmap_vars = c('red_meat Within the past 4 to 7 days',
'red_meat Within the past 2 to 3 days',
'red_meat Yesterday, 1 to 2 times',
'red_meat Yesterday, 3 or more times'),
max_pngs = 30)
If we look at the resulting heatmap, we can identify the Alistipes shahii prevalence association as potentially interesting since it increases in prevalence with more meat consumption in all but one level-versus-level comparison.
feature | metadata | value | name | coef | stderr | pval_individual | qval_individual | pval_joint | qval_joint | error | model | N | N_not_zero |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Alistipes_shahii | red_meat | Within the past 4 to 7 days | red_meatWithin the past 4 to 7 days | 1.47 | 0.353 | 3.16e-05 | 0.00113 | 6.31e-05 | 0.00141 | NA | logistic | 405 | 266 |
Alistipes_shahii | red_meat | Within the past 2 to 3 days | red_meatWithin the past 2 to 3 days | -0.67 | 0.316 | 0.0341 | 0.194 | 0.067 | 0.23 | NA | logistic | 405 | 266 |
Alistipes_shahii | red_meat | Yesterday, 1 to 2 times | red_meatYesterday, 1 to 2 times | 1.07 | 0.291 | 0.000241 | 0.0064 | 0.000483 | 0.00788 | NA | logistic | 405 | 266 |
Alistipes_shahii | red_meat | Yesterday, 3 or more times | red_meatYesterday, 3 or more times | 3.29 | 3.83 | 0.391 | 0.68 | 0.011 | 0.0688 | NA | logistic | 405 | 266 |
MaAsLin 3 can also test for group-wise differences in categorical
predictors using an ANOVA or ANOVA-like procedure. Group-wise predictors
are categorical variables with or without an ordering such as race,
country, or consumption frequency of a dietary factor. When performing a
group-wise difference test, we are not comparing any two particular
levels; rather, we are investigating whether abundances and prevalences
are the same for all levels. Therefore, no coefficient is returned, only
a p-value. Here, we test whether there are differences in microbial
abundances and prevalences across people with different levels of red
meat consumption by including group(red_meat)
. Group-wise
predictors should only be included as fixed effects (i.e., no group-wise
predictors as random effects etc.).
fit_out <- maaslin3(
input_data = taxa_table,
input_metadata = metadata[metadata$diagnosis == 'nonIBD',],
output = 'group_outputs',
formula = '~ group(red_meat) + antibiotics + age + reads',
plot_summary_plot = TRUE,
plot_associations = TRUE,
heatmap_vars = c('red_meat Within the past 4 to 7 days',
'red_meat Within the past 2 to 3 days',
'red_meat Yesterday, 1 to 2 times',
'red_meat Yesterday, 3 or more times'),
max_pngs = 200)
If we look at the same Alistipes shahii association from before, we see that no coefficient or standard error is returned for a group predictor, but the q-value is very low. This corroborates the observation earlier that there are differences in Alistipes shahii prevalence with red meat consumption.
feature | metadata | value | name | coef | stderr | pval_individual | qval_individual | pval_joint | qval_joint | error | model | N | N_not_zero |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Alistipes_shahii | red_meat | NA | red_meat | NA | NA | 5.86e-09 | 5.88e-07 | 1.17e-08 | 7.29e-07 | NA | logistic | 405 | 266 |
We can also perform contrast testing in MaAsLin 3 to check whether a linear combination of the coefficients is significantly different from some constant on the right hand size (RHS). Specify a contrast matrix with column names matching the coefficient names and rows with the contrasts of interest. For each row CT, the hypothesis H0 : CTβ⃗ = RHS will be tested. If no RHS is specified, a median test will be performed, and the median test RHS will be returned.
Below, we test whether the UC and CD diagnosis coefficients are the same and whether the UC and CD dysbiosis coefficients are the same. The contrast matrix can easily be extended to test any groups pairwise.
fit_out <- maaslin3(input_data = taxa_table,
input_metadata = metadata,
output = 'contrast_test_output',
formula = '~ diagnosis + dysbiosis_state +
antibiotics + age + reads',
plot_summary_plot = FALSE,
plot_associations = FALSE,
cores = 1)
contrast_mat <- matrix(c(1, -1, 0, 0, 0, 0, 1, -1),
ncol = 4, nrow = 2, byrow = TRUE)
colnames(contrast_mat) <- c("diagnosisUC",
"diagnosisCD",
"dysbiosis_statedysbiosis_UC",
"dysbiosis_statedysbiosis_CD")
rownames(contrast_mat) <- c("diagnosis_test", "dysbiosis_test")
contrast_mat
## diagnosisUC diagnosisCD dysbiosis_statedysbiosis_UC
## diagnosis_test 1 -1 0
## dysbiosis_test 0 0 1
## dysbiosis_statedysbiosis_CD
## diagnosis_test 0
## dysbiosis_test -1
contrast_out <- maaslin_contrast_test(fits = fit_out$fit_data_abundance$fits,
contrast_mat = contrast_mat,
median_comparison = TRUE)
head(contrast_out, n = 20) %>%
knitr::kable() %>%
kableExtra::kable_styling("striped", position = 'center') %>%
kableExtra::scroll_box(width = "800px", height = "400px")
feature | test | coef | rhs | stderr | pval_individual | error | model |
---|---|---|---|---|---|---|---|
Phocaeicola_vulgatus | diagnosis_test | 0.1089748 | 0.1089748 | 0.2443056 | 1.0000000 | NA | abundance |
Phocaeicola_vulgatus | dysbiosis_test | 0.3658063 | 0.3961497 | 0.9026559 | 0.9731889 | NA | abundance |
Faecalibacterium_prausnitzii | diagnosis_test | 0.3506263 | 0.5303588 | 0.1259597 | 0.1538356 | NA | abundance |
Faecalibacterium_prausnitzii | dysbiosis_test | 2.2499071 | 3.8075289 | 0.4748916 | 0.0010644 | NA | abundance |
Bacteroides_uniformis | diagnosis_test | -0.1714861 | 0.0637664 | 0.1709981 | 0.1691449 | NA | abundance |
Bacteroides_uniformis | dysbiosis_test | 2.4290291 | 4.2928010 | 0.6875146 | 0.0068034 | NA | abundance |
Prevotella_copri_clade_A | diagnosis_test | 1.9979479 | 3.8654625 | 0.7520264 | 0.0136002 | NA | abundance |
Prevotella_copri_clade_A | dysbiosis_test | 0.8652342 | 1.3337903 | 2.6151949 | 0.8579358 | NA | abundance |
Bacteroides_stercoris | diagnosis_test | 0.1819607 | 0.2510259 | 0.2750066 | 0.8017688 | NA | abundance |
Bacteroides_stercoris | dysbiosis_test | 2.9598746 | 5.4796559 | 1.4948726 | 0.0922512 | NA | abundance |
Phocaeicola_dorei | diagnosis_test | 0.6128319 | 1.1089574 | 0.4890767 | 0.3107083 | NA | abundance |
Phocaeicola_dorei | dysbiosis_test | 0.3971967 | 0.3971967 | 1.6972670 | 1.0000000 | NA | abundance |
Bacteroides_ovatus | diagnosis_test | 0.6408062 | 1.0930804 | 0.1844777 | 0.0143660 | NA | abundance |
Bacteroides_ovatus | dysbiosis_test | 1.3045581 | 2.1694292 | 0.8593975 | 0.3144465 | NA | abundance |
Bacteroides_fragilis | diagnosis_test | -0.3317458 | 0.0755821 | 0.2672160 | 0.1278297 | NA | abundance |
Bacteroides_fragilis | dysbiosis_test | -0.1072644 | 0.3718968 | 0.7859776 | 0.5422799 | NA | abundance |
Eubacterium_rectale | diagnosis_test | 0.3292258 | 0.5147960 | 0.1685224 | 0.2710468 | NA | abundance |
Eubacterium_rectale | dysbiosis_test | 1.8318224 | 3.1782115 | 0.8070358 | 0.0955136 | NA | abundance |
Alistipes_putredinis | diagnosis_test | -0.8184856 | 0.0051087 | 0.2225484 | 0.0002286 | NA | abundance |
Alistipes_putredinis | dysbiosis_test | 2.6075856 | 4.7650816 | 1.3397762 | 0.1076884 | NA | abundance |
MaAsLin 3 can also be run with a command line interface. For example, the first HMP2 analysis can be performed with the following command (the slashes may need to be removed):
./R/maaslin3.R \
inst/extdata/HMP2_taxonomy.tsv \
inst/extdata/HMP2_metadata.tsv \
command_line_output \
--formula='~ diagnosis + dysbiosis_state + antibiotics + age + reads' \
--reference='diagnosis,nonIBD;dysbiosis_state,none;antibiotics,No'
./R/maaslin3.R
).inst/extdata/HMP2_taxonomy.tsv
is the path to your data
(or features) fileinst/extdata/HMP2_metadata.tsv
is the path to your
metadata filecommand_line_output
is the path to the folder to write
the outputTool | Models differential abundance | Models differential prevalence |
Models allowed | Accounts for compositionality |
Can use experimental absolute abundance information |
---|---|---|---|---|---|
MaAsLin 3 | Yes | Yes, for any model type | Any lme4 model (general mixed models), level-vs-level differences, group-wise differences, feature-specific covariates, contrast tests |
Yes | Yes |
MaAsLin 2 | Yes | Only insofar as it affects abundance | Fixed and random effects (subset of general mixed models) |
No | No |
ANCOM-BC2 | Yes | Only when completely absent in a group | Any lme4 model (general mixed models), level-vs-level differences, group-wise differences, contrast tests |
Yes | No |
ALDEx2 | Yes | Only insofar as it affects abundance | Fixed effects |
Yes | No |
LinDA | Yes | Only insofar as it affects abundance | Any lme4 model (general mixed models), group-wise differences, contrast tests |
Yes | No |
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] kableExtra_1.4.0 knitr_1.49 ggplot2_3.5.1 dplyr_1.1.4
## [5] maaslin3_0.99.3
##
## loaded via a namespace (and not attached):
## [1] Rdpack_2.6.2 pbapply_1.7-2
## [3] sandwich_3.1-1 rlang_1.1.5
## [5] magrittr_2.0.3 multcomp_1.4-26
## [7] matrixStats_1.5.0 compiler_4.4.2
## [9] mgcv_1.9-1 systemfonts_1.2.0
## [11] vctrs_0.6.5 stringr_1.5.1
## [13] pkgconfig_2.0.3 crayon_1.5.3
## [15] fastmap_1.2.0 XVector_0.47.2
## [17] labeling_0.4.3 rmarkdown_2.29
## [19] UCSC.utils_1.3.1 nloptr_2.1.1
## [21] purrr_1.0.2 xfun_0.50
## [23] cachem_1.1.0 GenomeInfoDb_1.43.2
## [25] jsonlite_1.8.9 DelayedArray_0.33.4
## [27] BiocParallel_1.41.0 parallel_4.4.2
## [29] R6_2.5.1 bslib_0.8.0
## [31] stringi_1.8.4 RColorBrewer_1.1-3
## [33] boot_1.3-31 numDeriv_2016.8-1.1
## [35] GenomicRanges_1.59.1 jquerylib_0.1.4
## [37] Rcpp_1.0.14 SummarizedExperiment_1.37.0
## [39] zoo_1.8-12 IRanges_2.41.2
## [41] Matrix_1.7-1 splines_4.4.2
## [43] tidyselect_1.2.1 rstudioapi_0.17.1
## [45] abind_1.4-8 yaml_2.3.10
## [47] TreeSummarizedExperiment_2.15.0 codetools_0.2-20
## [49] lattice_0.22-6 tibble_3.2.1
## [51] lmerTest_3.1-3 Biobase_2.67.0
## [53] treeio_1.31.0 withr_3.0.2
## [55] evaluate_1.0.3 survival_3.8-3
## [57] getopt_1.20.4 xml2_1.3.6
## [59] Biostrings_2.75.3 pillar_1.10.1
## [61] MatrixGenerics_1.19.1 stats4_4.4.2
## [63] reformulas_0.4.0 generics_0.1.3
## [65] S4Vectors_0.45.2 munsell_0.5.1
## [67] scales_1.3.0 tidytree_0.4.6
## [69] minqa_1.2.8 glue_1.8.0
## [71] lazyeval_0.2.2 maketools_1.3.1
## [73] tools_4.4.2 ggnewscale_0.5.0
## [75] sys_3.4.3 lme4_1.1-36
## [77] buildtools_1.0.0 fs_1.6.5
## [79] mvtnorm_1.3-3 grid_4.4.2
## [81] optparse_1.7.5 tidyr_1.3.1
## [83] ape_5.8-1 rbibutils_2.3
## [85] colorspace_2.1-1 SingleCellExperiment_1.29.1
## [87] nlme_3.1-166 GenomeInfoDbData_1.2.13
## [89] patchwork_1.3.0 cli_3.6.3
## [91] S4Arrays_1.7.1 viridisLite_0.4.2
## [93] svglite_2.1.3 gtable_0.3.6
## [95] yulab.utils_0.1.9 logging_0.10-108
## [97] sass_0.4.9 digest_0.6.37
## [99] BiocGenerics_0.53.3 SparseArray_1.7.4
## [101] TH.data_1.1-3 farver_2.1.2
## [103] htmltools_0.5.8.1 lifecycle_1.0.4
## [105] httr_1.4.7 MASS_7.3-64
# Clean-up
unlink('hmp2_output', recursive = TRUE)
unlink('spike_in_demo', recursive = TRUE)
unlink('total_scaling_demo', recursive = TRUE)
unlink('random_effects_output', recursive = TRUE)
unlink('interaction_output', recursive = TRUE)
unlink('ordered_outputs', recursive = TRUE)
unlink('group_outputs', recursive = TRUE)
unlink('contrast_test_output', recursive = TRUE)