The Microbiome Batch-Effect Correction Suite aims to provide a
toolkit for stringent assessment and correction of batch-effects in
microbiome data sets. To that end, the package offers wrapper-functions
to summarize study-design and data, e.g., PCA, Heatmap and Mosaic-plots,
and to estimate the proportion of variance that can be attributed to the
batch effect. The mbecsCorrection()
function acts as a
wrapper for various batch effect correcting algorithms (BECA) and in
conjunction with the aforementioned tools, it can be used to compare the
effectiveness of correction methods on particular sets of data. All
functions of this package are accessible on their own or within the
preliminary and comparative report pipelines respectively.
The MBECS
package relies on the following packages to
work:
Package | Version | Date | Repository |
---|---|---|---|
phyloseq | 1.51.0 | 2021-11-29 | https://bioc.r-universe.dev |
magrittr | 2.0.3 | NULL | RSPM |
cluster | 2.1.6 | 2023-11-30 | RSPM |
dplyr | 1.1.4 | NULL | RSPM |
ggplot2 | 3.5.1 | NULL | RSPM |
gridExtra | 2.3 | NULL | RSPM |
limma | 3.63.2 | 2024-11-10 | https://bioc.r-universe.dev |
lme4 | 1.1.35.5 | NULL | RSPM |
lmerTest | 3.1.3 | NULL | RSPM |
pheatmap | 1.0.12 | 2018-12-26 | RSPM |
rmarkdown | 2.29 | NULL | RSPM |
ruv | 0.9.7.1 | 2019-08-30 | RSPM |
sva | 3.55.0 | NULL | https://bioc.r-universe.dev |
tibble | 3.2.1 | NULL | RSPM |
tidyr | 1.3.1 | NULL | RSPM |
vegan | 2.6.8 | NULL | RSPM |
methods | 4.4.2 | NULL | NULL |
stats | 4.4.2 | NULL | NULL |
utils | 4.4.2 | NULL | NULL |
The main application of this package is microbiome data. It is common
practice to use the phyloseq
(McMurdie et al. 2021) package for
analyses of this type of data. To that end, the MBECS
package extends the phyloseq
class in order to provide its
functionality. The user can utilize objects of class
phyloseq
or a list
object that contains an
abundance table as well as meta data. The package contains a dummy
data-set of artificially generated data to illustrate this process. To
start an analysis, the user requires the mbecProcessInput()
function.
Load the package via the library()
function.
Use the data()
function to load the provided mockup
data-sets at this point.
For an input that consists of an abundance table and meta-data, both
tables require sample names as either row or column names. They need to
be passed in a list
object with the abundance matrix as
first element. The mbecProcessInput()
function will handle
the correct orientation and return an object of class
MbecData
.
The optional argument required.col
may be used to ensure
that all covariate columns that should be there are available. For the
dummy-data these are “group”, “batch” and “replicate”.
The start is the same if the data is already of class
phyloseq
. The dummy.ps
object contains the
same data as dummy.list
, but it is of class
phyloseq
. Create an MbecData
object from
phyloseq
input.
The optional argument required.col
may be used to ensure
that all covariate columns that should be there are available. For the
dummy-data these are “group”, “batch” and “replicate”.
The most common normalizing transformations in microbiome analysis
are total sum scaling (TSS) and centered log-ratio transformation (CLR).
Hence, the MBECS
package offers these two methods. The
resulting matrices will be stored in their respective
slots (tss, clr)
in the MbecData
object, while
the original abundance table will remain unchanged.
Use mbecTransform()
to apply total sum scaling to the
data.
Apply centered log-ratio transformation to the data. Due to the
sparse nature of compositional microbiome data, the parameter
offset
may be used to add a small offset to the abundance
matrix in order to facilitate the CLR transformation.
The function mbecReportPrelim()
will provide the user
with an overview of experimental setup and the significance of the batch
effect. To that end it is required to declare the covariates that are
related to batch effect and group effect respectively. In addition it
provides the option to select the abundance table to use here. The CLR
transformed abundances are the default and the function will calculate
them if they are not present in the input. Technically, the user can
start the analysis at this point because the function incorporates the
functionality of the aforementioned processing functions.
The parameter model.vars
is a character vector with two
elements. The first denotes the covariate column that describes the
batch effect and the second one should be used for the presumed
biological effect of interest, e.g., the group effect in case/control
studies. The type
parameter selects which abundance table
is to be used “otu”, “clr”, “tss” .
The package acts as a wrapper for six different batch effect correcting algorithms (BECA).
Remove Unwanted Variation 3 (ruv3
)
Batch Mean Centering (bmc
)
ComBat (bat
)
Remove Batch Effect (rbe
)
Percentile Normalization (pn
)
mbecCorrections
can be used to manually select
untransformed or centered log ratio transformed abundances.Support Vector Decomposition (svd
)
The user has the choice between two functions
mbecCorrection()
and mbecRunCorrections()
, the
latter one acts as a wrapper that can apply multiple correction methods
in a single run. If the input to
mbecRunCorrections()
is missing CLR and TSS transformed
abundance matrices, they will be created with default settings, i.e.,
offsets for zero values will be set to 1/#features.
The function mbecCorrection()
will apply a single
correction algorithm selected by the parameter method
and
return an object that contains the resulting corrected abundance matrix
in its cor slot
with the respective name.
mbec.obj <- mbecCorrection(mbec.obj, model.vars=c("batch","group"),
method = "bat", type = "clr")
#> Applying ComBat (sva) for batch-correction.
#> Found2batches
#> Adjusting for1covariate(s) or covariate level(s)
#> Standardizing Data across genes
#> Fitting L/S model and finding priors
#> Finding nonparametric adjustments
#> Adjusting the Data
The function mbecRunCorrections()
will apply all
correction algorithms selected by the parameter method
and
return an object that contains all respective corrected abundance
matrices in the cor
slot. In this example there will be
three in total, named like the methods that created them.
mbec.obj <- mbecRunCorrections(mbec.obj, model.vars=c("batch","group"),
method=c("ruv3","rbe","bmc","pn","svd"),
type = "clr")
#> No negative control features provided.
#> Using pseudo-negative controls.
#> Applying Remove Unwanted Variantion v3 (RUV-III).
#> Applying 'removeBatchEffect' (limma) for batch-correction.
#> Applying Batch Mean-Centering (BMC) for batch-correction.
#> Applying Percentile Normalization (PN).
#> Warning in mbecPN(input.obj, model.vars, type = eval(type)): Abundances contain
#> zero values. Adding small uniform offset.
#> Group A is considered control group, i.e., reference
#> for normalization procedure. To change reference please 'relevel()'
#> grouping factor accordingly.
#> Applying Singular Value Decomposition (SVD) for batch-correction.
The mbecReportPost()
function will provide the user with
a comparative report that shows how the chosen batch effect correction
algorithms changed the data-set compared to the initial values.
The parameter model.vars
is a character vector with two
elements. The first denotes the covariate column that describes the
batch effect and the second one should be used for the presumed
biological effect of interest, e.g., the group effect in case/control
studies. The type
parameter selects which abundance table
is to be used “otu”, “clr”, “tss” .
Because the MbecData
class extends the
phyloseq
class, all functions from phyloseq
can be used as well. They do however only apply to the
otu_table
slot and will return an object of class
phyloseq
, i.e., any transformations or corrections will be
lost. To retrieve an object of class phyloseq
that contains
the otu_table
of corrected counts, for downstream analyses,
the user can employ the mbecGetPhyloseq()
function. As
before, the arguments type
and label
are used
to specify which abundance table should be used in the returned
object.
To retrieve the CLR transformed counts, set type
accordingly.
If the batch-mean-centering corrected counts show the best results,
select “cor” as type
and set the label
to “bmc”.
Relative Log-Expression plots can be created with the
mbecRLE()
function.
Produce RLE-plot on CLR transformed values. The
return.data
parameter can be set to TRUE to retrieve the data for inspection
or use in your own plotting function.
PCA plots can be created with the mbecPCA()
function.
Produce PCA-plot on CLR transformed values. The principal components
can be selected with the pca.axes
parameter. The vector of
length two works like this c(x-axis, y-axis). The return data parameter can
be set to TRUE to retrieve the data
for inspection or use in your own plotting function.
pca.plot <- mbecPCA(input.obj=mbec.obj, model.vars = "group", type="clr",
pca.axes=c(1,2), return.data = FALSE)
Plot with two grouping variables, determining color and shape of the output.
Box plots of the most variable features can be create with the
mbecBox()
function.
Produce Box-plots of the most variable features on CLR transformed
values. The method
parameter determines if “ALL” or only the “TOP” n features are plotted. The
n
parameter selects the number of features to plot. The
function will return a list
of plot objects to be used.
box.plot <- mbecBox(input.obj=mbec.obj, method = "TOP", n = 2,
model.var = "batch", type="clr", return.data = FALSE)
# Take a look.
plot(box.plot$OTU109)
Setting method
to a vector of feature names will select
exactly these features for plotting.
Pheatmap is used to produce heat-maps of the most variable features
with the mbecHeat()
function.
Produce a heatmap of the most variable features on CLR transformed
values. The method
parameter determines if “ALL” or only the “TOP” n features are plotted. The
n
parameter selects the number of features to plot. The
parameters center
and scale
can be used to
de-/activate centering and scaling prior to plotting. The
model.vars
parameter denotes all covariates of
interest.
The functions aim to determine the amount of variance that can be attributed to selected covariates of interest and visualize the results.
Use the function mbecModelVariance()
with parameter
method
set to “lm” to
fit a linear model of the form y ~ group + batch
to every
feature. Covariates of interest can be selected with the
model.vars
parameter and the function will construct a
model-formula. The parameters type
and label
can be used to select the desired abundance matrix to work on This
defaults to CLR transformed values.
Plot the resulting data with the mbecVarianceStatsPlot()
function and show the proportion of variance with regards to the
covariates in a box-plot.
Use the function mbecModelVariance()
with parameter
method
set to “lmm” to
fit a linear mixed model of the form y ~ group + (1|batch)
to every feature. Covariates of interest can be selected with the
model.vars
parameter and the function will construct a
model-formula. The parameters type
and label
can be used to select the desired abundance matrix to work on This
defaults to CLR transformed values.
Plot the resulting data with the mbecVarianceStatsPlot()
function and show the proportion of variance with regards to the
covariates in a box-plot.
Use the function mbecModelVariance()
with parameter
method
set to “rda” to
calculate the percentage of variance that can be attributed to the
covariates of interest with partial Redundancy Analysis. Covariates of
interest can be selected with the model.vars
parameter. The
parameters type
and label
can be used to
select the desired abundance matrix to work on This defaults to CLR
transformed values.
Plot the resulting data with the mbecRDAStatsPlot()
function and show the percentage of variance with regards to the
covariates in a bar-plot.
Use the function mbecModelVariance()
with parameter
method
set to “pvca”
to calculate the percentage of variance that can be attributed to the
covariates of interest using Principal Variance Components Analysis.
Covariates of interest can be selected with the model.vars
parameter. The parameters type
and label
can
be used to select the desired abundance matrix to work on This defaults
to CLR transformed values.
Plot the resulting data with the mbecPVCAStatsPlot()
function and show the percentage of variance with regards to the
covariates in a bar-plot.
Evaluate how good the samples fit to the clustering that is implied by the covariates of interest.
Use the function mbecModelVariance()
with parameter
method
set to “s.coef”
to evaluate the clustering that is implied by the covariates of interest
with the Silhouette Coefficient. Covariates of interest can be selected
with the model.vars
parameter. The parameters
type
and label
can be used to select the
desired abundance matrix to work on This defaults to CLR transformed
values.
Plot the resulting data with the mbecSCOEFStatsPlot()
function and show the silhouette coefficients in a dot-plot.
#> R version 4.4.2 (2024-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] MBECS_1.11.0
#>
#> loaded via a namespace (and not attached):
#> [1] DBI_1.2.3 gridExtra_2.3 permute_0.9-7
#> [4] rlang_1.1.4 magrittr_2.0.3 ade4_1.7-22
#> [7] matrixStats_1.4.1 compiler_4.4.2 RSQLite_2.3.8
#> [10] mgcv_1.9-1 png_0.1-8 vctrs_0.6.5
#> [13] reshape2_1.4.4 sva_3.55.0 stringr_1.5.1
#> [16] pkgconfig_2.0.3 crayon_1.5.3 fastmap_1.2.0
#> [19] XVector_0.47.0 labeling_0.4.3 utf8_1.2.4
#> [22] rmarkdown_2.29 nloptr_2.1.1 UCSC.utils_1.3.0
#> [25] purrr_1.0.2 bit_4.5.0 xfun_0.49
#> [28] zlibbioc_1.52.0 cachem_1.1.0 GenomeInfoDb_1.43.2
#> [31] jsonlite_1.8.9 biomformat_1.35.0 blob_1.2.4
#> [34] rhdf5filters_1.19.0 Rhdf5lib_1.29.0 BiocParallel_1.41.0
#> [37] parallel_4.4.2 cluster_2.1.6 R6_2.5.1
#> [40] bslib_0.8.0 stringi_1.8.4 RColorBrewer_1.1-3
#> [43] limma_3.63.2 genefilter_1.89.0 boot_1.3-31
#> [46] jquerylib_0.1.4 Rcpp_1.0.13-1 iterators_1.0.14
#> [49] knitr_1.49 IRanges_2.41.1 Matrix_1.7-1
#> [52] splines_4.4.2 igraph_2.1.1 tidyselect_1.2.1
#> [55] yaml_2.3.10 vegan_2.6-8 codetools_0.2-20
#> [58] lattice_0.22-6 tibble_3.2.1 plyr_1.8.9
#> [61] withr_3.0.2 Biobase_2.67.0 KEGGREST_1.47.0
#> [64] evaluate_1.0.1 survival_3.7-0 Biostrings_2.75.1
#> [67] pillar_1.9.0 phyloseq_1.51.0 MatrixGenerics_1.19.0
#> [70] foreach_1.5.2 stats4_4.4.2 generics_0.1.3
#> [73] S4Vectors_0.45.2 ggplot2_3.5.1 munsell_0.5.1
#> [76] scales_1.3.0 minqa_1.2.8 xtable_1.8-4
#> [79] glue_1.8.0 pheatmap_1.0.12 maketools_1.3.1
#> [82] tools_4.4.2 sys_3.4.3 data.table_1.16.2
#> [85] lme4_1.1-35.5 annotate_1.85.0 locfit_1.5-9.10
#> [88] buildtools_1.0.0 XML_3.99-0.17 rhdf5_2.51.0
#> [91] grid_4.4.2 tidyr_1.3.1 ape_5.8
#> [94] AnnotationDbi_1.69.0 edgeR_4.5.0 colorspace_2.1-1
#> [97] nlme_3.1-166 GenomeInfoDbData_1.2.13 cli_3.6.3
#> [100] ruv_0.9.7.1 fansi_1.0.6 dplyr_1.1.4
#> [103] gtable_0.3.6 sass_0.4.9 digest_0.6.37
#> [106] BiocGenerics_0.53.3 farver_2.1.2 memoise_2.0.1
#> [109] htmltools_0.5.8.1 multtest_2.63.0 lifecycle_1.0.4
#> [112] httr_1.4.7 statmod_1.5.0 bit64_4.5.2
#> [115] MASS_7.3-61