MaAsLin 3 is the next generation of MaAsLin (Microbiome Multivariable Associations with Linear Models). This comprehensive R package efficiently determines multivariable associations between clinical metadata and microbial meta-omics features. Relative to MaAsLin 2, MaAsLin 3 introduces the ability to quantify and test for both abundance and prevalence associations while accounting for compositionality. By incorporating generalized linear models, MaAsLin 3 accommodates most modern epidemiological study designs including cross-sectional and longitudinal studies.
If you use the MaAsLin 3 software, please cite our manuscript:
William A. Nickols, Thomas Kuntz, Jiaxian Shen, Sagun Maharjan, Himel Mallick, Eric A. Franzosa, Kelsey N. Thompson, Jacob T. Nearing, Curtis Huttenhower. MaAsLin 3: Refining and extending generalized multivariable linear models for meta-omic association discovery. bioRxiv 2024.12.13.628459; doi: https://doi.org/10.1101/2024.12.13.628459
Check out the MaAsLin 3
tutorial for an overview of analysis options and some example runs.
If using vignettes, users should start with the
maaslin3_tutorial.Rmd
vignette and then refer to the
maaslin3_manual.Rmd
vignette as necessary.
If you have questions, please direct them to the MaAsLin 3 Forum
MaAsLin3 is an R package that can be run on the command line or as an R function.
MaAsLin3 can be run from the command line or as an R function. Both methods require the same arguments, have the same options, and use the same default settings. To run MaAsLin 3, the user must supply a table of per-sample feature abundances (with zeros still included), a table of per-sample metadata, and a model specifying how the metadata should relate to the feature prevalence (how likely the feature is to be present or absent) and abundance (how much of the feature is there if it’s there). MaAsLin 3 will return a table of associations including an effect size and p-value for each feature-metadatum association and a folder of visuals including a summary plot and diagnostic plots for significant associations.
MaAsLin3 requires two input files.
The data file can contain samples not included in the metadata file (along with the reverse case). For both cases, those samples not included in both files will be removed from the analysis. Also, the samples do not need to be in the same order in the two files.
To run MaAsLin 3, it is also necessary to specify a model. The model can come from a formula or vectors of terms. In either case, variable names should not have spaces or unusual characters.
formula
parameter should be set
to any formula that satisfies the lme4
specifications:
fixed effects, random effects, interaction terms, polynomial terms, and
more can all be included. If categorical variables are included as fixed
effects, each level will be tested against the first factor level. In
addition, ordered predictors, group predictors, and strata variables can
be included by including group(variable_name)
,
ordered(variable_name)
, and
strata(variable_name)
respectively in the formula. Ordered
and group predictors should stand alone in the formula (i.e., no group
predictors in random effects). Only one strata variable can be
included.fixed_effects
,
random_effects
, group_effects
,
ordered_effects
, and strata_effects
.
Categorical variables should either be ordered as factors beforehand, or
reference
should be provided as a string of
‘variable,reference’ semi-colon delimited for multiple variables (e.g.,
variable_1,reference_1;variable_2,reference_2
). NOTE:
adding a space between the variable and level might result in the wrong
reference level being used.Because MaAsLin 3 identifies prevalence (presence/absence) associations, sample read depth (number of reads) should be included as a covariate if available. Deeper sequencing will likely increase feature detection in a way that could spuriously correlate with metadata of interest when read depth is not included in the model.
MaAsLin 3 generates two types of output files explained below: data
and visualizations. In addition, the object returned from
maaslin3
contains all the data and results (see
?maaslin_fit
).
all_results.tsv
feature
and metadata
are the feature and
metadata names.value
and name
are the value of the
metadata and variable name from the model.coef
and stderr
are the fit coefficient
and standard error from the model. In abundance models, a one-unit
change in the metadatum variable corresponds to a 2coef fold change in the relative
abundance of the feature. In prevalence models, a one-unit change in the
metadatum variable corresponds to a coef change in the log-odds of a feature
being present.pval_individual
is the p-value of the individual
association.qval_individual
is the false discovery rate (FDR)
corrected q-value of the individual association. FDR correction is
performed over all p-values without errors in the abundance and
prevalence modeling together.pval_joint
and qval_joint
are the p-value
and q-value of the joint prevalence and abundance association. The
p-value comes from plugging in the minimum of the association’s
abundance and prevalence p-values into the Beta(1,2) CDF. It is
interpreted as the probability that either the abundance or prevalence
association would be as extreme as observed if there was neither an
abundance nor prevalence association between the feature and
metadatum.error
lists any errors from the model fitting.model
specifies whether the association is abundance or
prevalence.N
and N_not_zero
are the total number of
data points and the total number of non-zero data points for the
feature.significant_results.tsv
features
models_linear.rds
and models_logistic.rds
linear
for linear models, logistic
for
logistic models).save_models
is set to
TRUE.residuals_linear.rds
and
residuals_logstic.rds
fitted_linear.rds
and fitted_logistic.rds
ranef_linear.rds
and ranef_logistic.rds
maaslin3.log
summary_plot.pdf
max_significance
,
and two stars indicate the individual q-value is below
max_significance
divided by 10.association_plots/[metadatum]/[association]/ [metadatum]_[feature]_[association].png
max_pngs
.At the top right of each association plot is the name of the significant association in the results file, the FDR corrected q-value for the individual association, the number of samples in the dataset, and the number of samples with non-zero abundances for the feature. In the plots with categorical metadata variables, the reference category is on the left, and the significant q-values and coefficients in the top right are in the order of the values specified above. Because the displayed coefficients correspond to the full fit model with (possibly) scaled metadata variables, the marginal association plotted might not match the coefficient displayed. However, the plots are intended to provide an interpretable visual while usually agreeing with the full model.
error
column of
all_results.tsv
. Often, these indicate model fitting
failures or poor fits that should not be trusted, but sometimes the
warnings can be benign, and the model fit might still be reasonable.
Users should check associations of interest if they produce errors.association_plots
directory.
Example input files can be found in the inst/extdata
folder of the MaAsLin 3 source. The files provided were generated from
the Human Microbiome Project 2 (HMP2) data which can be downloaded from
https://ibdmdb.org/.
HMP2_taxonomy.tsv
: a tab-delimited file with samples as
rows and species as columns. It is a subset of the full HMP2 taxonomy
that includes just some of the the species abundances.HMP2_metadata.tsv
: a tab-delimited file with samples as
rows and metadata as columns. It is a subset of the full HMP2 metadata
that includes just some of the fields.The following code identifies associations between patient metadata and microbial species in the HMP2 cohort.
# Read abundance table
taxa_table_name <- system.file("extdata", "HMP2_taxonomy.tsv",
package = "maaslin3")
taxa_table <- read.csv(taxa_table_name, sep = '\t')
# Read metadata table
metadata_name <- system.file("extdata", "HMP2_metadata.tsv",
package = "maaslin3")
metadata <- read.csv(metadata_name, sep = '\t')
metadata$diagnosis <-
factor(metadata$diagnosis, levels = c('nonIBD', 'UC', 'CD'))
metadata$dysbiosis_state <-
factor(metadata$dysbiosis_state, levels = c('none', 'dysbiosis_UC',
'dysbiosis_CD'))
metadata$antibiotics <-
factor(metadata$antibiotics, levels = c('No', 'Yes'))
# Fit models
set.seed(1)
fit_out <- maaslin3(input_data = taxa_table,
input_metadata = metadata,
output = 'hmp2_output',
formula = '~ diagnosis + dysbiosis_state +
antibiotics + age + reads',
normalization = 'TSS',
transform = 'LOG',
augment = TRUE,
standardize = TRUE,
max_significance = 0.1,
median_comparison_abundance = TRUE,
median_comparison_prevalence = FALSE,
max_pngs = 20,
cores = 1)
The maaslin3
wrapper function runs all steps at once,
but equivalent results would be produced from running one step at a
time:
maaslin_log_arguments(input_data = taxa_table,
input_metadata = metadata,
output = 'hmp2_output',
formula = '~ diagnosis + dysbiosis_state +
antibiotics + age + reads',
normalization = 'TSS',
transform = 'LOG',
augment = TRUE,
standardize = TRUE,
max_significance = 0.1,
median_comparison_abundance = TRUE,
median_comparison_prevalence = FALSE,
max_pngs = 20,
cores = 1)
read_data_list <- maaslin_read_data(taxa_table,
metadata)
read_data_list <- maaslin_reorder_data(read_data_list$data,
read_data_list$metadata)
data <- read_data_list$data
metadata <- read_data_list$metadata
formulas <- maaslin_check_formula(data,
metadata,
'~ diagnosis + dysbiosis_state +
antibiotics + age + reads')
formula <- formulas$formula
random_effects_formula <- formulas$random_effects_formula
normalized_data = maaslin_normalize(data,
output = 'hmp2_output',
normalization = 'TSS')
filtered_data = maaslin_filter(normalized_data,
output = 'hmp2_output')
transformed_data = maaslin_transform(filtered_data,
output = 'hmp2_output',
transform = 'LOG')
standardized_metadata = maaslin_process_metadata(metadata,
formula,
standardize = TRUE)
maaslin_results = maaslin_fit(filtered_data,
transformed_data,
standardized_metadata,
formula,
random_effects_formula,
data = data,
median_comparison_abundance = TRUE,
median_comparison_prevalence = FALSE,
cores = 1)
maaslin_write_results(output = 'hmp2_output',
maaslin_results$fit_data_abundance,
maaslin_results$fit_data_prevalence,
random_effects_formula,
max_significance = 0.1)
# Invisible to avoid dumping plots
invisible(maaslin_plot_results(output = 'hmp2_output',
transformed_data,
metadata,
maaslin_results$fit_data_abundance,
maaslin_results$fit_data_prevalence,
normalization = "TSS",
transform = "LOG",
median_comparison_abundance = TRUE,
median_comparison_prevalence = FALSE,
max_significance = 0.1,
max_pngs = 20))
Once maaslin3
has been run once,
maaslin_plot_results_from_output
can be run to (re-)create
the plots. This allows the user to plot the associations even without
having the R object returned by maaslin_fit
or
maaslin3
(e.g., if fitting models through the command
line). It is recommended to fit models with simple variables names that
are robust to formula plotting and then convert these into proper names
for plotting. Likewise, heatmap_vars
and
coef_plot_vars
can be specified at any point, but it is
usually easier to see how the names come out first and then choose which
metadata variables will go in the coefficient plot and which will go in
the heatmap afterwards with
maaslin_plot_results_from_output
.
# This section is necessary for updating
# the summary plot and the association plots
# Rename results file with clean titles
all_results <- read.csv('hmp2_output/all_results.tsv', sep='\t')
all_results <- all_results %>%
mutate(metadata = case_when(metadata == 'age' ~ 'Age',
metadata == 'antibiotics' ~ 'Abx',
metadata == 'diagnosis' ~ 'Diagnosis',
metadata == 'dysbiosis_state' ~ 'Dysbiosis',
metadata == 'reads' ~ 'Read depth'),
value = case_when(value == 'dysbiosis_CD' ~ 'CD',
value == 'dysbiosis_UC' ~ 'UC',
value == 'Yes' ~ 'Used', # Antibiotics
value == 'age' ~ 'Age',
value == 'reads' ~ 'Read depth',
TRUE ~ value),
feature = gsub('_', ' ', feature) %>%
gsub(pattern = 'sp ', replacement = 'sp. '))
# Write results
write.table(all_results, 'hmp2_output/all_results.tsv', sep='\t')
# Set the new heatmap and coefficient plot variables and order them
heatmap_vars = c('Dysbiosis UC', 'Diagnosis UC',
'Abx Used', 'Age', 'Read depth')
coef_plot_vars = c('Dysbiosis CD', 'Diagnosis CD')
# This section is necessary for updating the association plots
colnames(taxa_table) <- gsub('_', ' ', colnames(taxa_table)) %>%
gsub(pattern = 'sp ', replacement = 'sp. ')
# Rename the features in the norm transformed data file
data_transformed <-
read.csv('hmp2_output/features/data_transformed.tsv', sep='\t')
colnames(data_transformed) <-
gsub('_', ' ', colnames(data_transformed)) %>%
gsub(pattern = 'sp ', replacement = 'sp. ')
write.table(data_transformed,
'hmp2_output/features/data_transformed.tsv',
sep='\t', row.names = FALSE)
# Rename the metadata like in the outputs table
colnames(metadata) <-
case_when(colnames(metadata) == 'age' ~ 'Age',
colnames(metadata) == 'antibiotics' ~ 'Abx',
colnames(metadata) == 'diagnosis' ~ 'Diagnosis',
colnames(metadata) == 'dysbiosis_state' ~ 'Dysbiosis',
colnames(metadata) == 'reads' ~ 'Read depth',
TRUE ~ colnames(metadata))
metadata <- metadata %>%
mutate(Dysbiosis = case_when(Dysbiosis == 'dysbiosis_UC' ~ 'UC',
Dysbiosis == 'dysbiosis_CD' ~ 'CD',
Dysbiosis == 'none' ~ 'None') %>%
factor(levels = c('None', 'UC', 'CD')),
Abx = case_when(Abx == 'Yes' ~ 'Used',
Abx == 'No' ~ 'Not used') %>%
factor(levels = c('Not used', 'Used')),
Diagnosis = case_when(Diagnosis == 'nonIBD' ~ 'non-IBD',
TRUE ~ Diagnosis) %>%
factor(levels = c('non-IBD', 'UC', 'CD')))
# Recreate the plots
scatter_plots <- maaslin_plot_results_from_output(
output = 'hmp2_output',
metadata,
normalization = "TSS",
transform = "LOG",
median_comparison_abundance = TRUE,
median_comparison_prevalence = FALSE,
max_significance = 0.1,
max_pngs = 20)
In the new summary plot below, we can see that the feature names are
cleaned up, the metadata names are cleaned up, the set of metadata
variables used in the coefficient plot is different, and the metadata
used in the heatmap is reordered. (Make sure to set
include=T
to view the plots when knitting the Rmd.)
In the association plots below, the taxa and metadata have been
renamed to be consistent with the results file from earlier. To
see the plots, set max_pngs = 250
above and set
eval=TRUE
in this chunk.
We can also perform contrast testing in MaAsLin 3 to check whether a linear combination of the coefficients is significantly different from some constant. Specify a contrast matrix with column names matching the coefficient names and rows with the contrasts of interest. A separate contrast test will be performed for each row.
contrast_mat <- matrix(c(1, 1, 0, 0, 0, 0, 1, 1),
ncol = 4, nrow = 2, byrow = TRUE)
colnames(contrast_mat) <- c("diagnosisUC",
"diagnosisCD",
"dysbiosis_statedysbiosis_UC",
"dysbiosis_statedysbiosis_CD")
rownames(contrast_mat) <- c("diagnosis_test", "dysbiosis_test")
contrast_mat
maaslin_contrast_test(fits = fit_out$fit_data_abundance$fits,
contrast_mat = contrast_mat)
MaAsLin 3 can also be run with a command line interface. For example, the first HMP2 analysis can be performed with the following command (the slashes may need to be removed):
./R/maaslin3.R \
inst/extdata/HMP2_taxonomy.tsv \
inst/extdata/HMP2_metadata.tsv \
command_line_output \
--formula='~ diagnosis + dysbiosis_state + antibiotics + age + reads' \
--reference='diagnosis,nonIBD;dysbiosis_state,none;antibiotics,No'
./R/maaslin3.R
).inst/extdata/HMP2_taxonomy.tsv
is the path to your data
(or features) fileinst/extdata/HMP2_metadata.tsv
is the path to your
metadata filecommand_line_output
is the path to the folder to write
the outputFrom the command line, the following command will print the list of MaAsLin 3 options and default settings:
$ ./R/maaslin3.R --help
When running MaAsLin 3 in R, the manual page for each function (e.g.,
?maaslin3
) will show the available options and default
settings. For both, the options and settings are as follows:
input_data
: A data frame of feature abundances or read
counts or a filepath to a tab-delimited file with abundances. It should
be formatted with features as columns and samples as rows (or the
transpose). The column and row names should be the feature names and
sample names respectively.input_metadata
: A data frame of per-sample metadata or
a filepath to a tab-delimited file with metadata. It should be formatted
with variables as columns and samples as rows (or the transpose). The
column and row names should be the variable names and sample names
respectively.output
: The output folder to write results.formula
: A formula in lme4
format. Random
effects, interactions, and functions of the metadata can be included
(note that these functions will be applied after standardization if
standardize = TRUE
). Group, ordered, and strata variables
can be specified as: group(grouping_variable)
,
ordered(ordered_variable)
, and
strata(strata_variable)
. The other variable options below
will not be considered if a formula is set.fixed_effects
: A vector of variable names to be
included as fixed effects.
lm
(linear) or
glm
(logistic).reference
: For a variable with more than two levels
supplied with fixed_effects
, the factor to use as a
reference provided as a string of ‘variable,reference’ semi-colon
delimited for multiple variables.random_effects
: A vector of variable names to be
included as random intercepts. Random intercept models may
produce poor model fits when there are fewer than 5 observations per
group. In these scenarios, per-group fixed effects should be
used and subsequently filtered out. (See strata_effects
as
well.)
lmer
(linear) and
glmer
(logistic), and the significance tests come from
lmerTest
and glmer
respectively.group_effects
: A factored categorical variable to be
included for group testing. An ANOVA-style test will be performed to
assess whether any of the variable’s levels are significant, and no
coefficients or individual p-values will be returned.
anova
function’s
LRT
option (logistic fixed and mixed effects), the
anova
function’s F test (linear fixed effects), or
lmerTest::contest
(linear mixed effects).ordered_effects
: A factored categorical variable to be
included. Consecutive levels will be tested for significance against
each other with contrast tests, and the resulting associations will
correspond to effect sizes, standard errors, and significances of each
level versus the previous.
multcomp::glht
(fixed
effects and logistic mixed effects) and lmerTest::contest
(linear mixed effects).strata_effects
: A single grouping variable to be
included in matched case-control studies. If a strata variable is
included, no random effects can be included. When a strata variable is
included, a conditional logistic regression will be run to account for
the strata. The abundance model will be run with a random intercept in
place of the strata. Strata can include more than two observations per
group. Only variables that differ within the groups can be tested. In
general, strata effects are not recommended except for advanced users.
Fixed or random intercepts are recommended instead.Particularly for use in metatranscriptomics workflows, a table of feature-specific covariates can be included. A feature’s covariates will be included like a fixed effect metadatum when fitting the model for that feature. The covariate’s name does not need to be included in the formula.
feature_specific_covariate
: A table of feature-specific
covariates or a filepath to a tab-delimited file with feature-specific
covariates. It should be formatted with features as columns and samples
as rows (or the transpose). The row names and column names should be the
same as those of the input_data
: the column and row names
should be the feature names and sample names respectively.feature_specific_covariate_name
: The name for the
feature-specific covariates when fitting the models.feature_specific_covariate_record
: Whether to keep the
feature-specific covariates in the outputs when calculating p-values,
writing results, and displaying plots.min_abundance
(default 0
): Features with
abundances of at least min_abundance
in
min_prevalence
of the samples will be included for
analysis. The threshold is applied before normalization and
transformation.min_prevalence
(default 0
): See
above.zero_threshold
(default 0
): Abundances
less than or equal to zero_threshold
will be treated as
zeros. This is primarily to be used when the abundance table has likely
low-abundance false positives.min_variance
(default 0
): Features with
abundance variances less than or equal to min_variance
will
be dropped. This is primarily used for dropping features that are
entirely zero.max_significance
(default 0.1
): The FDR
corrected q-value threshold for significance used in selecting which
associations to write as significant and to plot.normalization
(default TSS
): The
normalization to apply to the features before transformation and
analysis. The option TSS
(total-sum scaling) is
recommended, but CLR
(centered log ratio) and
NONE
can also be used.transform
(default LOG
, base 2): The
transformation to apply to the features after normalization and before
analysis. The option LOG
is recommended, but
PLOG
(pseudo-log with a pseudo-count of half the dataset
minimum non-zero abundance replacing zeros, particularly for
metabolomics data) and NONE
can also be used.correction
(default BH
): The correction to
obtain FDR-corrected q-values from raw p-values. Any valid options for
p.adjust
can be used.standardize
(default TRUE
): Whether to
apply z-scores to continuous metadata variables so they are on the same
scale. This is recommended in order to compare coefficients across
metadata variables, but note that functions of the metadata specified in
the formula
will apply after standardization.warn_prevalence
(default TRUE
): Warn when
prevalence associations are likely induced by abundance associations.
This requires re-fitting the linear models on the TSS log-transformed
data. A prevalence coefficient is flagged if its corresponding abundance
coefficient is significantly different from 0, of the same sign, and
larger in absolute value.augment
(default TRUE
): To avoid linear
separability in the logistic regression, at each input data point, add
an extra 0 and an extra 1 observation weighted as the number of
predictors divided by two times the number of data points. This is
almost always recommended to avoid linear separability while having a
minor effect on fit coefficients otherwise.evaluate_only
(default NULL
): To fit only
the abundance or only the prevalence models, evaluate_only
can be set to abundance
or prevalence
.Most microbiome methodologies have historically focused on relative abundances (proportions out of 1). However, some experimental protocols can enable estimation of absolute abundances (cell count/concentration). MaAsLin 3 can be used with two types of absolute abundance estimation: spike-ins and total abundance scaling. In a spike-in procedure, a small, known quantity of a microbe that otherwise would not be present in the sample is added, and the sequencing procedure is carried out as usual. Then, the absolute abundance of a microbe already in the community is estimated as: $$\textrm{Absolute abundance other microbe}=\frac{\textrm{Relative abundance other microbe}}{\textrm{Relative abundance spike-in microbe}}\cdot (\textrm{Absolute abundance spike-in microbe})$$ Alternatively, the total microbial abundance of a sample can be determined (e.g., with qPCR of a marker gene or by cell counting). Then, the absolute abundance of a microbe in the community is estimated as: Absolute abundance microbe = (Total absolute abundance) ⋅ (Relative abundance microbe)
unscaled_abundance
: A data frame with a single column
of absolute abundances or a filepath to such a tab-delimited file. The
row names should match the names of the samples in
input_data
and input_metadata
. When using
spike-ins, the single column should have the same name as one of the
features in input_data
, and the values should correspond to
the absolute quantity of the spike-in. For example, if the same spike-in
quantity is used in each sample, the entire column can be set to 1. When
using total abundance scaling, the single column should have the name
‘total’, and the values should correspond to the total abundance of each
sample. In both cases, median_comparison_abundance
should
be set to FALSE
since the spike-in or total abundance
normalization accounts for compositionality.Alternatively, if the input_data
abundances have already
been scaled to be absolute abundances, the user should set
normalization = NONE
and
median_comparison_abundance = FALSE
and not include
anything for unscaled_abundance
. Then, the absolute
abundances will be log transformed, and models will be fit on those
values directly.
When median_comparison_abundance
or
median_comparison_prevalence
are on, the coefficients for a
metadatum will be tested against the median coefficient for that
metadatum (median across the features). Otherwise, the coefficients will
be tested against 0. For abundance associations, this is designed to
account for compositionality, the idea that if only one feature has a
positive association with a metadatum on the absolute scale (cell
count), the other features will have apparent negative associations with
that metadatum on the relative scale (proportion of the community)
because relative abundances must sum to 1. More generally, associations
on the relative scale are not necessarily the same as the associations
on the absolute scale in magnitude or sign, so testing against
zero on the relative scale is not equivalent to testing against zero on
the absolute scale. When testing associations on the relative
scale, the coefficients should be tested against 0 (median comparison
off). However, since these tests do not correspond to tests for
associations on the absolute scale, we instead use a test against the
median, which can enable some inference on the absolute scale. There are
two interpretations of this test for absolute abundance
associations:
By contrast, sparsity should be less affected by compositionality
since a feature should still be present even if another increases or
decreases in abundance. (Note that, because the read depth is finite,
this might not always be true in practice.) Therefore,
median_comparison_prevalence
is off by default, but it can
be turned on if the user is interested in testing whether a particular
prevalence association is significantly different from the typical
prevalence association.
In both cases, if the tested coefficient is within
median_comparison_[abundance/prevalence]_threshold
of the
median, it will automatically receive a p-value of 1. This is based on
the idea that the association might be statistically significantly
different but not substantially different from the median and therefore
is likely still a result of compositionality.
To conclude:
median_comparison_abundance
is TRUE
by
default and should be used to make inference on the absolute scale when
using relative abundance data. When
median_comparison_abundance
is TRUE
, only the
p-values and q-values change. The coefficients returned are still the
relative abundance coefficients unless subtract_median
is
set to TRUE
in which case the median will be
subtracted.median_comparison_abundance
should be
FALSE
when (1) testing for significant relative
associations, (2) testing for absolute abundance associations under the
assumption that the total absolute abundance is not changing, or (3)
testing for significant absolute associations when supplying spike-in or
total abundances with unscaled_abundance
.median_comparison_prevalence
is FALSE
by
default.median_comparison_abundance
(default
TRUE
): Test abundance coefficients against a null value
corresponding to the median coefficient for a metadata variable across
the features. Otherwise, test against 0. This is recommended for
relative abundance data but should not be used for absolute abundance
data.median_comparison_prevalence
(default
FALSE
): Test prevalence coefficients against a null value
corresponding to the median coefficient for a metadata variable across
the features. Otherwise, test against 0. This is only recommended if the
analyst is interested in how feature prevalence associations compare to
each other or if there is likely strong compositionality-induced
sparsity.median_comparison_abundance_threshold
(default
0
): Coefficients within
median_comparison_abundance_threshold
of the median
association will automatically be counted as insignificant (p-value set
to 1) since they likely represent compositionality-induced associations.
This threshold will be divided by the metadata variable’s standard
deviation if the metadatum is continuous to ensure the threshold applies
to the right scale.median_comparison_prevalence_threshold
(default
0
): Same as
median_comparison_abundance_threshold
but applied to the
prevalence associations.subtract_median
(default FALSE
): Subtract
the median from the coefficients.plot_summary_plot
(default TRUE
): Generate
a summary plot of significant associations.summary_plot_first_n
(default 25
): Include
the top summary_plot_first_n
features with significant
associations.coef_plot_vars
: Vector of variable names to be used in
the coefficient plot section of the summary plot. Continuous variables
should match the metadata column name, and categorical variables should
be of the form "[variable] [level]"
.By default, the (up to) two metadata variables with the most
significant associations will be plotted in the coefficient plot, and
the rest will be plotted in the heatmap. Because predicting the output
variable names can be tricky, it is recommended to first run
maaslin3
without setting coef_plot_vars
or
heatmap_vars
, look at the names of the variables in the
summary plot, and then rerun with
maaslin_plot_results_from_output
after updating
coef_plot_vars
and heatmap_vars
with the
desired variables. * heatmap_vars
: Vector of variable names
to be used in the heatmap section of the summary plot. Continuous
variables should match the metadata column name, and categorical
variables should be of the form "[variable] [level]"
. *
plot_associations
(default TRUE
): Whether to
generate plots for significant associations. * max_pngs
(default 30
): The top max_pngs
significant
associations will be plotted.
cores
(default 1
): How many cores to use
when fitting models. (Using multiple cores will likely be faster only
for large datasets or complex models.)save_models
(default FALSE
): Whether to
return the fit models and save them to an RData file.verbosity
(default 'FINEST'
): The level of
verbosity for the logging
package.save_plots_rds
(default FALSE
): Whether to
save the plots as RDS files.Tool | Models differential abundance | Models differential prevalence |
Models allowed | Accounts for compositionality |
Can use experimental absolute abundance information |
---|---|---|---|---|---|
MaAsLin 3 | Yes | Yes, for any model type | Any lme4 model (general mixed models), level-vs-level differences, group-wise differences, feature-specific covariates, contrast tests |
Yes | Yes |
MaAsLin 2 | Yes | Only insofar as it affects abundance | Fixed and random effects (subset of general mixed models) |
No | No |
ANCOM-BC2 | Yes | Only when completely absent in a group | Any lme4 model (general mixed models), level-vs-level differences, group-wise differences, contrast tests |
Yes | No |
ALDEx2 | Yes | Only insofar as it affects abundance | Fixed effects |
Yes | No |
LinDA | Yes | Only insofar as it affects abundance | Any lme4 model (general mixed models), group-wise differences, contrast tests |
Yes | No |
maaslin3.R: command not found
. How do I fix this?
Error in library(maaslin3): there is no package called 'maaslin3'
.
How do I fix this?
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] knitr_1.49 ggplot2_3.5.1 dplyr_1.1.4 maaslin3_0.99.3
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.2.1 farver_2.1.2
## [3] Biostrings_2.75.3 fastmap_1.2.0
## [5] SingleCellExperiment_1.29.1 lazyeval_0.2.2
## [7] TH.data_1.1-3 digest_0.6.37
## [9] lifecycle_1.0.4 survival_3.8-3
## [11] tidytree_0.4.6 magrittr_2.0.3
## [13] compiler_4.4.2 rlang_1.1.5
## [15] sass_0.4.9 tools_4.4.2
## [17] yaml_2.3.10 labeling_0.4.3
## [19] S4Arrays_1.7.1 DelayedArray_0.33.4
## [21] RColorBrewer_1.1-3 TreeSummarizedExperiment_2.15.0
## [23] abind_1.4-8 multcomp_1.4-26
## [25] BiocParallel_1.41.0 withr_3.0.2
## [27] purrr_1.0.2 BiocGenerics_0.53.3
## [29] sys_3.4.3 grid_4.4.2
## [31] stats4_4.4.2 colorspace_2.1-1
## [33] scales_1.3.0 MASS_7.3-64
## [35] logging_0.10-108 SummarizedExperiment_1.37.0
## [37] optparse_1.7.5 cli_3.6.3
## [39] mvtnorm_1.3-3 rmarkdown_2.29
## [41] crayon_1.5.3 treeio_1.31.0
## [43] generics_0.1.3 httr_1.4.7
## [45] getopt_1.20.4 pbapply_1.7-2
## [47] ape_5.8-1 cachem_1.1.0
## [49] splines_4.4.2 parallel_4.4.2
## [51] XVector_0.47.2 matrixStats_1.5.0
## [53] vctrs_0.6.5 yulab.utils_0.1.9
## [55] Matrix_1.7-1 sandwich_3.1-1
## [57] jsonlite_1.8.9 patchwork_1.3.0
## [59] IRanges_2.41.2 S4Vectors_0.45.2
## [61] maketools_1.3.1 ggnewscale_0.5.0
## [63] jquerylib_0.1.4 tidyr_1.3.1
## [65] glue_1.8.0 codetools_0.2-20
## [67] gtable_0.3.6 GenomeInfoDb_1.43.2
## [69] GenomicRanges_1.59.1 UCSC.utils_1.3.1
## [71] munsell_0.5.1 tibble_3.2.1
## [73] pillar_1.10.1 htmltools_0.5.8.1
## [75] GenomeInfoDbData_1.2.13 R6_2.5.1
## [77] evaluate_1.0.3 lattice_0.22-6
## [79] Biobase_2.67.0 bslib_0.8.0
## [81] Rcpp_1.0.14 SparseArray_1.7.4
## [83] nlme_3.1-166 xfun_0.50
## [85] MatrixGenerics_1.19.1 fs_1.6.5
## [87] zoo_1.8-12 buildtools_1.0.0
## [89] pkgconfig_2.0.3