| Title: | PLSDA-batch |
|---|---|
| Description: | A novel framework to correct for batch effects prior to any downstream analysis in microbiome data based on Projection to Latent Structures Discriminant Analysis. The main method is named “PLSDA-batch”. It first estimates treatment and batch variation with latent components, then subtracts batch-associated components from the data whilst preserving biological variation of interest. PLSDA-batch is highly suitable for microbiome data as it is non-parametric, multivariate and allows for ordination and data visualisation. Combined with centered log-ratio transformation for addressing uneven library sizes and compositional structure, PLSDA-batch addresses all characteristics of microbiome data that existing correction methods have ignored so far. Two other variants are proposed for 1/ unbalanced batch x treatment designs that are commonly encountered in studies with small sample sizes, and for 2/ selection of discriminative variables amongst treatment groups to avoid overfitting in classification problems. These two variants have widened the scope of applicability of PLSDA-batch to different data settings. |
| Authors: | Yiwen (Eva) Wang [aut, cre] (ORCID: <https://orcid.org/0000-0002-7067-9093>), Kim-Anh Le Cao [aut] |
| Maintainer: | Yiwen (Eva) Wang <[email protected]> |
| License: | GPL-3 |
| Version: | 2.1.0 |
| Built: | 2026-05-30 07:17:42 UTC |
| Source: | https://github.com/bioc/PLSDAbatch |
This study explored the microbial indicators that could improve the efficacy of anaerobic digestion (AD) bioprocess and prevent its failure. The samples were treated with two different ranges of phenol concentration (effect of interest) and processed at five different dates (batch effect). This study includes a clear and strong batch effect with an approx. balanced batch x treatment design.
data('AD_data')data('AD_data')
A list containing three TreeSummarizedExperiment objects
FullData, EgData and CorrectData:
A TreeSummarizedExperiment object containing the counts of 75 samples and 567 OTUs. The meta data information of each sample is stored in the rowData, while the taxonomy of each OTU is stored in the colData.
A TreeSummarizedExperiment object containing the values of 75
samples and 231 OTUs filtered and centered log ratio transformed from the
FullData with raw counts.The rowData includes Y.trt and
Y.bat. Y.trt is the effect of interest, which is a factor of
phenol concentrations for each sample in the AD study; Y.bat is the
batch effect, which is a factor of sample processing dates for each sample.
The taxonomy of each OTU is stored in the colData. The rowTree is built based
on the Y.bat.
A TreeSummarizedExperiment object containing seven datasets before or after batch effect correction using different methods. Each assay includes 75 samples and 231 OTUs.
None.
The raw data were provided by Dr. Olivier Chapleur and published at the referenced article. Filtering and normalisation described in our package vignette.
Chapleur O, Madigou C, Civade R, Rodolphe Y, Mazéas L, Bouchez T (2016). “Increasing concentrations of phenol progressively affect anaerobic digestion of cellulose and associated microbial communities.” Biodegradation, 27(1), 15–27.
This function evaluates the degree of mixing samples from different batches in the batch corrected data. It is based on the dissimilarity matrix from Principal Component Analysis.
alignment_score(data, batch, var = 0.95, k = NULL, ncomp = 20)alignment_score(data, batch, var = 0.95, k = NULL, ncomp = 20)
data |
A numeric matrix. Samples are in rows, while variables are in
columns. |
batch |
A factor or a class vector for the batch grouping information (categorical outcome variable). The length should be equal to the number of samples in the data. |
var |
The proportion of data variance explained by
the principal components,
ranging from |
k |
Integer, the number of nearest neighbours.
By default |
ncomp |
Integer, the number of components for
principal component analysis.
Default value is |
A numeric alignment score that ranges from 0 to 1,
representing poor to perfect
performance of mixing the samples from different batches.
Yiwen Wang, Kim-Anh Le Cao
Butler A, Hoffman P, Smibert P, Papalexi E, Satija R (2018). “Integrating single-cell transcriptomic data across different conditions, technologies, and species.” Nature biotechnology, 36(5), 411–420.
Scatter_Density, box_plot,
density_plot and partVar_plot as the other
methods for batch effect detection and batch effect removal assessment.
if (requireNamespace("SummarizedExperiment", quietly = TRUE)) { data("sponge_data") X <- SummarizedExperiment::assays(sponge_data)$Clr_value # centered log ratio transformed data batch <- SummarizedExperiment::rowData(sponge_data)$Y.bat # batch information names(batch) <- rownames(sponge_data) alignment_score(data = X, batch = batch, var = 0.95, k = 3, ncomp = 20) }if (requireNamespace("SummarizedExperiment", quietly = TRUE)) { data("sponge_data") X <- SummarizedExperiment::assays(sponge_data)$Clr_value # centered log ratio transformed data batch <- SummarizedExperiment::rowData(sponge_data)$Y.bat # batch information names(batch) <- rownames(sponge_data) alignment_score(data = X, batch = batch, var = 0.95, k = 3, ncomp = 20) }
This function draws side-by-side box plots for each batch.
box_plot( df, title = NULL, batch.legend.title = "Batch", ylab = "Value", color.set = NULL, x.angle = 0, x.hjust = 0.5, x.vjust = 0.5 )box_plot( df, title = NULL, batch.legend.title = "Batch", ylab = "Value", color.set = NULL, x.angle = 0, x.hjust = 0.5, x.vjust = 0.5 )
df |
A data frame with at least two columns. The first column contains the numeric values to be plotted on the y-axis, and the second column contains the batch information (categorical). Additional columns, if present, are ignored. |
title |
Character, the plot title. |
batch.legend.title |
Character, the legend title for the batch groups. |
ylab |
Character, the y-axis title. |
color.set |
A character vector specifying the colours to use for the
batch groups. Colours can be given as hexadecimal codes or any values
understood by |
x.angle |
Numeric, angle of the x-axis labels in degrees,
in the range from |
x.hjust |
Numeric, horizontal justification of the x-axis labels,
in the range from |
x.vjust |
Numeric, vertical justification of the x-axis labels,
in the range from |
A ggplot object representing the box plots.
Yiwen Wang, Kim-Anh Le Cao
Scatter_Density, density_plot,
alignment_score and partVar_plot as other
methods for batch effect detection and batch effect removal assessment.
if (requireNamespace("SummarizedExperiment", quietly = TRUE)) { data("AD_data") # centered log-ratio transformed data ad.clr <- SummarizedExperiment::assays(AD_data$EgData)$Clr_value ad.batch <- SummarizedExperiment::rowData(AD_data$EgData)$Y.bat names(ad.batch) <- rownames(AD_data$EgData) ad.df <- data.frame( value = ad.clr[, 1], batch = ad.batch ) box_plot(df = ad.df, title = "OTU 12", x.angle = 30) # using a custom colour set colorlist <- rainbow(10) box_plot( df = ad.df, title = "OTU 12", color.set = colorlist, x.angle = 30 ) }if (requireNamespace("SummarizedExperiment", quietly = TRUE)) { data("AD_data") # centered log-ratio transformed data ad.clr <- SummarizedExperiment::assays(AD_data$EgData)$Clr_value ad.batch <- SummarizedExperiment::rowData(AD_data$EgData)$Y.bat names(ad.batch) <- rownames(AD_data$EgData) ad.df <- data.frame( value = ad.clr[, 1], batch = ad.batch ) box_plot(df = ad.df, title = "OTU 12", x.angle = 30) # using a custom colour set colorlist <- rainbow(10) box_plot( df = ad.df, title = "OTU 12", color.set = colorlist, x.angle = 30 ) }
This function takes one or more colors and decreases their brightness
by subtracting a constant amount from their RGB values. The adjustment
is performed in the RGB space, and the resulting values are truncated
to stay within [0, 1]. Alpha channels are not preserved and
the returned colors are fully opaque.
darken(colvec, amount = 0.15)darken(colvec, amount = 0.15)
colvec |
A vector of colors specified as character strings (for example, hexadecimal colors or any color name recognized by R). |
amount |
A scalar numeric value indicating how much brightness
to subtract. Must be between |
A character vector of darkened colors in hexadecimal RGB format.
Yiwen Wang, Kim-Anh Le Cao
darken("#336699") darken(c("red", "blue"), amount = 0.1) darken(pb_color(seq_len(3)), amount = 0.2)darken("#336699") darken(c("red", "blue"), amount = 0.1) darken(pb_color(seq_len(3)), amount = 0.2)
This function removes the variance of a given component comp from the
input matrix X.
It is mainly used internally in PLSDA_batch.
deflate_mtx(X, comp)deflate_mtx(X, comp)
X |
A numeric matrix to be deflated. It assumes that samples are
on the rows and variables are on the columns. |
comp |
A numeric vector or single-column matrix representing the component to be deflated out from the matrix. |
A deflated matrix with the same dimension as the input matrix.
Yiwen Wang, Kim-Anh Le Cao
Barker M, Rayens W (2003). “Partial least squares for discrimination.” Journal of Chemometrics: A Journal of the Chemometrics Society, 17(3), 166–173.
NULLNULL
This function draws overlapping density plots for each batch.
density_plot( df, title = NULL, batch.legend.title = "Batch", xlab = "Value", color.set = NULL, title.hjust = 0.5 )density_plot( df, title = NULL, batch.legend.title = "Batch", xlab = "Value", color.set = NULL, title.hjust = 0.5 )
df |
A data frame with at least two columns. The first column contains the numeric values to be plotted on the x-axis, and the second column contains the batch information (categorical). Additional columns, if present, are ignored. |
title |
Character, the plot title. |
batch.legend.title |
Character, the legend title for the batch groups. |
xlab |
Character, the x-axis title. |
color.set |
A character vector specifying the colours to use for the
batch groups. Colours can be given as hexadecimal codes or any values
understood by |
title.hjust |
Numeric, horizontal justification of the plot title,
in the range from |
A ggplot object representing the density plots.
Yiwen Wang, Kim-Anh Le Cao
Scatter_Density, box_plot,
alignment_score and partVar_plot as other
methods for batch effect detection and batch effect removal assessment.
if (requireNamespace("SummarizedExperiment", quietly = TRUE)) { data("AD_data") # centered log-ratio transformed data ad.clr <- SummarizedExperiment::assays(AD_data$EgData)$Clr_value ad.batch <- SummarizedExperiment::rowData(AD_data$EgData)$Y.bat names(ad.batch) <- rownames(AD_data$EgData) ad.df <- data.frame( value = ad.clr[, 1], batch = ad.batch ) density_plot(df = ad.df, title = "OTU 12") # using a custom colour set colorlist <- rainbow(10) density_plot( df = ad.df, title = "OTU 12", color.set = colorlist ) }if (requireNamespace("SummarizedExperiment", quietly = TRUE)) { data("AD_data") # centered log-ratio transformed data ad.clr <- SummarizedExperiment::assays(AD_data$EgData)$Clr_value ad.batch <- SummarizedExperiment::rowData(AD_data$EgData)$Y.bat names(ad.batch) <- rownames(AD_data$EgData) ad.df <- data.frame( value = ad.clr[, 1], batch = ad.batch ) density_plot(df = ad.df, title = "OTU 12") # using a custom colour set colorlist <- rainbow(10) density_plot( df = ad.df, title = "OTU 12", color.set = colorlist ) }
This function takes one or more colors and increases their brightness
by adding a constant amount to their RGB values. The adjustment is
performed in the RGB space, and the resulting values are truncated
to stay within [0, 1]. Alpha channels are not preserved and
the returned colors are fully opaque.
lighten(colvec, amount = 0.15)lighten(colvec, amount = 0.15)
colvec |
A vector of colors specified as character strings (for example, hexadecimal colors or any color name recognized by R). |
amount |
A scalar numeric value indicating how much brightness
to add. Must be between |
A character vector of lightened colors in hexadecimal RGB format.
Yiwen Wang, Kim-Anh Le Cao
lighten("#336699") lighten(c("red", "blue"), amount = 0.1) lighten(pb_color(seq_len(3)), amount = 0.2)lighten("#336699") lighten(c("red", "blue"), amount = 0.1) lighten(pb_color(seq_len(3)), amount = 0.2)
This function fits linear regression models (either a linear model or a
linear mixed model) to each microbial variable, including treatment and batch
covariates as specified. For each variable, two models are fitted:
(i) a model including only the treatment effect (trt.only);
(ii) a model including both treatment and batch effects (trt.batch).
A selected criterion (e.g., AIC, BIC, RMSE, R2) is used to choose the
better model for each variable, and only the p-value of the selected model
is returned.
linear_regres( data, trt, batch.fix = NULL, batch.fix2 = NULL, batch.random = NULL, type = "linear model", p.adjust.method = "fdr", criterion = "AIC", return.model = TRUE )linear_regres( data, trt, batch.fix = NULL, batch.fix2 = NULL, batch.random = NULL, type = "linear model", p.adjust.method = "fdr", criterion = "AIC", return.model = TRUE )
data |
A data frame containing microbial variables. Rows correspond to samples and columns to features. |
trt |
A factor or class vector representing the treatment groups.
This argument is mandatory and is coerced to a factor internally. The
p-values correspond to the global treatment effect extracted from
|
batch.fix |
A factor or vector representing a batch effect treated
as a fixed effect. Required when |
batch.fix2 |
A second fixed batch effect. Can only be used when
|
batch.random |
A factor or vector representing a batch effect treated
as a random effect. Required when |
type |
Either |
p.adjust.method |
Method for p-value adjustment. One of:
|
criterion |
A character string indicating the model selection
criterion used to choose between the treatment-only model and the
treatment+batch model for each microbial variable. One of
A larger R2 indicates a better model. For all other criteria
( |
return.model |
Logical. If |
A list containing:
type |
The type of model used ("linear model" or "linear mixed model"). |
model |
A list of fitted |
raw.p |
A vector of p-values corresponding to the selected model
(based on |
adj.p |
Adjusted p-values. |
p.adjust.method |
The p-value adjustment method used. |
criterion |
The criterion used to select between the two models. |
best.model |
For each feature, either |
raw.R2 |
For |
adj.R2 |
Adjusted R2 values (linear model only). For mixed models,
this field contains |
cond.R2 |
Conditional R2 for mixed models: first column corresponds
to the treatment-only linear model, second column to the mixed model
including batch.random. For linear models, this field is |
RMSE |
The root mean squared error for both models (two columns:
|
RSE |
Residual standard error for both models. |
AIC |
AIC values for both models. |
BIC |
BIC values for both models. |
For each microbial variable, two models are always fitted:
trt.only: y ~ trt
trt.batch: y ~ trt + batch (or with random effects)
The selected model is determined solely by criterion. Only the
p-value corresponding to the selected model is returned in raw.p.
Yiwen Wang, Kim-Anh Le Cao
if (requireNamespace("SummarizedExperiment", quietly = TRUE)) { data("AD_data") # centered log ratio transformed data ad.clr <- SummarizedExperiment::assays(AD_data$EgData)$Clr_value ad.batch <- SummarizedExperiment::rowData(AD_data$EgData)$Y.bat # batch information ad.trt <- SummarizedExperiment::rowData(AD_data$EgData)$Y.trt # treatment information names(ad.batch) <- names(ad.trt) <- rownames(AD_data$EgData) ad.lm <- linear_regres( data = ad.clr, trt = ad.trt, batch.fix = ad.batch, type = "linear model", criterion = "AIC" ) ad.p.adj <- data.frame(best_model = ad.lm$best.model, adjust_p = ad.lm$adj.p) head(ad.p.adj) head(ad.lm$AIC) table(ad.lm$best.model) }if (requireNamespace("SummarizedExperiment", quietly = TRUE)) { data("AD_data") # centered log ratio transformed data ad.clr <- SummarizedExperiment::assays(AD_data$EgData)$Clr_value ad.batch <- SummarizedExperiment::rowData(AD_data$EgData)$Y.bat # batch information ad.trt <- SummarizedExperiment::rowData(AD_data$EgData)$Y.trt # treatment information names(ad.batch) <- names(ad.trt) <- rownames(AD_data$EgData) ad.lm <- linear_regres( data = ad.clr, trt = ad.trt, batch.fix = ad.batch, type = "linear model", criterion = "AIC" ) ad.p.adj <- data.frame(best_model = ad.lm$best.model, adjust_p = ad.lm$adj.p) head(ad.p.adj) head(ad.lm$AIC) table(ad.lm$best.model) }
This function draws a partitioned variance plot explained by different sources.
partVar_plot( prop.df, text.cex = 3, x.angle = 60, x.hjust = 1, title = NULL, color.set = NULL )partVar_plot( prop.df, text.cex = 3, x.angle = 60, x.hjust = 1, title = NULL, color.set = NULL )
prop.df |
A data frame that contains the proportion of variance explained by different sources. |
text.cex |
Numeric, the size of text on the plot.
Use the size rule of |
x.angle |
Numeric, angle of x axis, in the range of
|
x.hjust |
Numeric, horizontal justification of x axis,
in the range of |
title |
Character, the plot title. |
color.set |
A vector of characters, indicating the set of colors to use. The colors are represented by hexadecimal color code. |
A ggplot object.
Yiwen Wang, Kim-Anh Le Cao
Scatter_Density, box_plot,
density_plot and alignment_score as the other
methods for batch effect detection and batch effect removal assessment.
if (requireNamespace("vegan", quietly = TRUE) && requireNamespace("SummarizedExperiment", quietly = TRUE)) { ## First example data("AD_data") # centered log ratio transformed data ad.clr <- SummarizedExperiment::assays(AD_data$EgData)$Clr_value ad.batch <- SummarizedExperiment::rowData(AD_data$EgData)$Y.bat # batch information ad.trt <- SummarizedExperiment::rowData(AD_data$EgData)$Y.trt # treatment information names(ad.batch) <- names(ad.trt) <- rownames(AD_data$EgData) ad.factors.df <- data.frame(trt = ad.trt, batch = ad.batch) rda.res <- vegan::varpart(ad.clr, ~trt, ~batch, data = ad.factors.df, scale = TRUE ) ad.prop.df <- data.frame( Treatment = NA, Batch = NA, Intersection = NA, Residuals = NA ) ad.prop.df[1, ] <- rda.res$part$indfract$Adj.R.squared ad.prop.df <- ad.prop.df[, c(1, 3, 2, 4)] partVar_plot(prop.df = ad.prop.df) ## Second example # a list of data corrected from different methods ad.corrected.list <- SummarizedExperiment::assays(AD_data$CorrectData) ad.prop.df <- data.frame( Treatment = NA, Batch = NA, Intersection = NA, Residuals = NA ) for (i in seq_len(length(ad.corrected.list))) { rda.res <- vegan::varpart(ad.corrected.list[[i]], ~trt, ~batch, data = ad.factors.df, scale = TRUE ) ad.prop.df[i, ] <- rda.res$part$indfract$Adj.R.squared } rownames(ad.prop.df) <- names(ad.corrected.list) ad.prop.df <- ad.prop.df[, c(1, 3, 2, 4)] partVar_plot(prop.df = ad.prop.df) }if (requireNamespace("vegan", quietly = TRUE) && requireNamespace("SummarizedExperiment", quietly = TRUE)) { ## First example data("AD_data") # centered log ratio transformed data ad.clr <- SummarizedExperiment::assays(AD_data$EgData)$Clr_value ad.batch <- SummarizedExperiment::rowData(AD_data$EgData)$Y.bat # batch information ad.trt <- SummarizedExperiment::rowData(AD_data$EgData)$Y.trt # treatment information names(ad.batch) <- names(ad.trt) <- rownames(AD_data$EgData) ad.factors.df <- data.frame(trt = ad.trt, batch = ad.batch) rda.res <- vegan::varpart(ad.clr, ~trt, ~batch, data = ad.factors.df, scale = TRUE ) ad.prop.df <- data.frame( Treatment = NA, Batch = NA, Intersection = NA, Residuals = NA ) ad.prop.df[1, ] <- rda.res$part$indfract$Adj.R.squared ad.prop.df <- ad.prop.df[, c(1, 3, 2, 4)] partVar_plot(prop.df = ad.prop.df) ## Second example # a list of data corrected from different methods ad.corrected.list <- SummarizedExperiment::assays(AD_data$CorrectData) ad.prop.df <- data.frame( Treatment = NA, Batch = NA, Intersection = NA, Residuals = NA ) for (i in seq_len(length(ad.corrected.list))) { rda.res <- vegan::varpart(ad.corrected.list[[i]], ~trt, ~batch, data = ad.factors.df, scale = TRUE ) ad.prop.df[i, ] <- rda.res$part$indfract$Adj.R.squared } rownames(ad.prop.df) <- names(ad.corrected.list) ad.prop.df <- ad.prop.df[, c(1, 3, 2, 4)] partVar_plot(prop.df = ad.prop.df) }
This function returns a discrete color palette used in
PLSDAbatch plots. The palette combines hues from
color.mixo and hue_pal
and contains at most 25 distinct colors.
pb_color(num.vector)pb_color(num.vector)
num.vector |
An integer vector of indices specifying which colors
to extract from the internal palette. Valid values are between
|
A character vector of hexadecimal color codes.
Yiwen Wang, Kim-Anh Le Cao
lighten, darken,
color.mixo,
hue_pal
pb_color(seq_len(5)) pb_color(c(1, 3, 5))pb_color(seq_len(5)) pb_color(c(1, 3, 5))
This function corrects for batch effects in case-control microbiome studies. Briefly, the relative abundance of microbial variables (i.e. bacterial taxa) in case (i.e. disease) samples is converted to percentiles of the corresponding variables in control (i.e. healthy) samples within each batch, before pooling data across batches. Pooled batches must share comparable case and control cohort definitions.
percentile_norm(data, batch, trt, ctrl.grp)percentile_norm(data, batch, trt, ctrl.grp)
data |
A data frame or matrix that contains the microbial variables to be corrected for batch effects. Samples are in rows and variables are in columns. |
batch |
A factor or vector for the batch grouping information
(categorical outcome variable). Its length must match the number of rows
in |
trt |
A factor or vector for the treatment grouping information
(categorical outcome variable). Its length must match the number of rows
in |
ctrl.grp |
Character, the label of control samples (i.e. healthy) in
|
A data frame with batch effects corrected by percentile normalisation.
Yiwen Wang, Kim-Anh Le Cao
Gibbons SM, Duvallet C, Alm EJ (2018). “Correcting for batch effects in case-control microbiome studies.” PLoS Computational Biology, 14(4), e1006102.
linear_regres and PLSDA_batch as
other methods for batch effect management.
if (requireNamespace("SummarizedExperiment", quietly = TRUE)) { data("AD_data") # centered log ratio transformed data ad.clr <- SummarizedExperiment::assays(AD_data$EgData)$Clr_value ad.batch <- SummarizedExperiment::rowData(AD_data$EgData)$Y.bat # batch information ad.trt <- SummarizedExperiment::rowData(AD_data$EgData)$Y.trt # treatment information names(ad.batch) <- names(ad.trt) <- rownames(AD_data$EgData) ad.PN <- percentile_norm( data = ad.clr, batch = ad.batch, trt = ad.trt, ctrl.grp = "0-0.5" ) }if (requireNamespace("SummarizedExperiment", quietly = TRUE)) { data("AD_data") # centered log ratio transformed data ad.clr <- SummarizedExperiment::assays(AD_data$EgData)$Clr_value ad.batch <- SummarizedExperiment::rowData(AD_data$EgData)$Y.bat # batch information ad.trt <- SummarizedExperiment::rowData(AD_data$EgData)$Y.trt # treatment information names(ad.batch) <- names(ad.trt) <- rownames(AD_data$EgData) ad.PN <- percentile_norm( data = ad.clr, batch = ad.batch, trt = ad.trt, ctrl.grp = "0-0.5" ) }
This function estimates latent dimensions from the explanatory matrix
X. The latent dimensions are maximally associated with the outcome
matrix Y. It is a built-in function of PLSDA_batch.
PLSDA( X, Y, ncomp = 2, keepX = NULL, mode = c("regression", "canonical"), tol = 1e-06, max.iter = 500 )PLSDA( X, Y, ncomp = 2, keepX = NULL, mode = c("regression", "canonical"), tol = 1e-06, max.iter = 500 )
X |
A numeric matrix that is centered and scaled as an explanatory
matrix. |
Y |
A dummy matrix that is centered and scaled as an outcome matrix. |
ncomp |
Integer, the number of dimensions to include in the model. |
keepX |
A numeric vector of length |
mode |
Character, either |
tol |
Numeric, convergence stopping value. |
max.iter |
Integer, the maximum number of iterations. |
PLSDA returns a list that contains the following components:
original_data |
The original explanatory matrix |
defl_data |
The centered and scaled deflated matrices ( |
latent_comp |
The latent components calculated with estimated latent dimensions. |
loadings |
The estimated latent dimensions. |
iters |
Number of iterations of the algorthm for each component. |
exp_var |
The amount of data variance explained per component (note
that contrary to |
Yiwen Wang, Kim-Anh Le Cao
Barker M, Rayens W (2003). “Partial least squares for discrimination.” Journal of Chemometrics: A Journal of the Chemometrics Society, 17(3), 166–173.
NULLNULL
This function removes batch variation from the input data given the batch
grouping information and the number of associated components with
PLSDA-batch. For sparse PLSDA-batch, the number of variables to keep for each
treatment related component is needed (keepX.trt). For weighted
PLSDA-batch, the balance should be set to FALSE, and it cannot
deal with the nested batch x treatment design.
PLSDA_batch( X, Y.trt = NULL, Y.bat, ncomp.trt = 2, ncomp.bat = 2, keepX.trt = NULL, keepX.bat = NULL, max.iter = 500, tol = 1e-06, near.zero.var = TRUE, balance = TRUE, mode = c("regression", "canonical") )PLSDA_batch( X, Y.trt = NULL, Y.bat, ncomp.trt = 2, ncomp.bat = 2, keepX.trt = NULL, keepX.bat = NULL, max.iter = 500, tol = 1e-06, near.zero.var = TRUE, balance = TRUE, mode = c("regression", "canonical") )
X |
A numeric matrix as an explanatory matrix.
|
Y.trt |
A factor or a class vector for the treatment grouping
information (categorical outcome variable). Without the input of
|
Y.bat |
A factor or a class vector for the batch grouping information (categorical outcome variable). |
ncomp.trt |
Integer, the number of treatment associated dimensions to include in the model. |
ncomp.bat |
Integer, the number of batch associated dimensions to include in the model. |
keepX.trt |
A numeric vector of length |
keepX.bat |
A numeric vector of length |
max.iter |
Integer, the maximum number of iterations. |
tol |
Numeric, convergence stopping value. |
near.zero.var |
Logical, should be set to |
balance |
Logical, should be set to |
mode |
Character, either |
PLSDA_batch returns a list that contains
the following components:
X |
The original explanatory matrix |
X.nobatch |
The batch corrected matrix with the same dimension as the input matrix. |
X.notrt |
The matrix from which treatment variation is removed. |
Y |
The original outcome variables |
latent_var.trt |
The treatment associated latent components calculated with corresponding latent dimensions. |
latent_var.bat |
The batch associated latent components calculated with corresponding latent dimensions. |
loadings.trt |
The estimated treatment associated latent dimensions. |
loadings.bat |
The estimated batch associated latent dimensions. |
tol |
The tolerance used in the iterative algorithm, convergence stopping value. |
max.iter |
The maximum number of iterations. |
iter.trt |
Number of iterations of the algorthm for each treatment associated component. |
iter.bat |
Number of iterations of the algorthm for each batch associated component. |
explained_variance.trt |
The amount of data variance explained per treatment associated component. |
explained_variance.bat |
The amount of data variance explained per batch associated component. |
weight |
The sample weights, all |
Yiwen Wang, Kim-Anh Le Cao
Wang Y, LêCao K (2020). “Managing batch effects in microbiome data.” Briefings in bioinformatics, 21(6), 1954–1970.
Wang Y, Lê Cao K (2023). “PLSDA-batch: a multivariate framework to correct for batch effects in microbiome data.” Briefings in Bioinformatics, 24(2), bbac622.
linear_regres and percentile_norm as
the other methods for batch effect management.
if (requireNamespace("SummarizedExperiment", quietly = TRUE)) { ## First example ## PLSDA-batch data("AD_data") X <- SummarizedExperiment::assays(AD_data$EgData)$Clr_value # centered log ratio transformed data Y.trt <- SummarizedExperiment::rowData(AD_data$EgData)$Y.trt # treatment information Y.bat <- SummarizedExperiment::rowData(AD_data$EgData)$Y.bat # batch information names(Y.bat) <- names(Y.trt) <- rownames(AD_data$EgData) ad_plsda_batch <- PLSDA_batch(X, Y.trt, Y.bat, ncomp.trt = 1, ncomp.bat = 4, mode = "regression" ) ad_X.corrected <- ad_plsda_batch$X.nobatch # batch corrected data ## Second example ## sparse PLSDA-batch ad_splsda_batch <- PLSDA_batch(X, Y.trt, Y.bat, ncomp.trt = 1, keepX.trt = 100, ncomp.bat = 4, mode = "regression" ) ## Third example ## weighted PLSDA-batch ad_wplsda_batch <- PLSDA_batch(X, Y.trt, Y.bat, ncomp.trt = 1, ncomp.bat = 4, balance = FALSE, mode = "regression" ) }if (requireNamespace("SummarizedExperiment", quietly = TRUE)) { ## First example ## PLSDA-batch data("AD_data") X <- SummarizedExperiment::assays(AD_data$EgData)$Clr_value # centered log ratio transformed data Y.trt <- SummarizedExperiment::rowData(AD_data$EgData)$Y.trt # treatment information Y.bat <- SummarizedExperiment::rowData(AD_data$EgData)$Y.bat # batch information names(Y.bat) <- names(Y.trt) <- rownames(AD_data$EgData) ad_plsda_batch <- PLSDA_batch(X, Y.trt, Y.bat, ncomp.trt = 1, ncomp.bat = 4, mode = "regression" ) ad_X.corrected <- ad_plsda_batch$X.nobatch # batch corrected data ## Second example ## sparse PLSDA-batch ad_splsda_batch <- PLSDA_batch(X, Y.trt, Y.bat, ncomp.trt = 1, keepX.trt = 100, ncomp.bat = 4, mode = "regression" ) ## Third example ## weighted PLSDA-batch ad_wplsda_batch <- PLSDA_batch(X, Y.trt, Y.bat, ncomp.trt = 1, ncomp.bat = 4, balance = FALSE, mode = "regression" ) }
This function prefilters microbiome count data to remove samples or microbial variables with very low abundance.
PreFL(data, keep.spl = 10, keep.var = 0.01)PreFL(data, keep.spl = 10, keep.var = 0.01)
data |
A numeric matrix or data frame with samples in rows and variables in columns. |
keep.spl |
Numeric, the minimum total count of a sample to be kept.
Samples with a total count smaller than |
keep.var |
Numeric, the minimum percentage (between |
PreFL returns a list that contains the following components:
data.filter |
The filtered data matrix. |
sample.idx |
The indices of samples kept. |
var.idx |
The indices of variables kept. |
zero.prob.before |
The proportion of zeros in the input data. |
zero.prob.after |
The proportion of zeros after filtering. |
Yiwen Wang, Kim-Anh Le Cao
Le Cao K, Costello M, Lakis VA, Bartolo F, Chua X, Brazeilles R, Rondeau P (2016). “MixMC: a multivariate statistical framework to gain insight into microbial communities.” PloS One, 11(8), e0160169.
if (requireNamespace("SummarizedExperiment", quietly = TRUE)) { data("AD_data") ad.count <- SummarizedExperiment::assays(AD_data$FullData)$Count ad.filter.res <- PreFL(data = ad.count) ad.filter <- ad.filter.res$data.filter ad.zero.before <- ad.filter.res$zero.prob.before ad.zero.after <- ad.filter.res$zero.prob.after }if (requireNamespace("SummarizedExperiment", quietly = TRUE)) { data("AD_data") ad.count <- SummarizedExperiment::assays(AD_data$FullData)$Count ad.filter.res <- PreFL(data = ad.count) ad.filter <- ad.filter.res$data.filter ad.zero.before <- ad.filter.res$zero.prob.before ad.zero.after <- ad.filter.res$zero.prob.after }
This function draws a sample scatter plot for two selected components, together with marginal density plots along each axis. It is generic in the sense that it only requires a matrix or dataframe of component scores, and can therefore be used with PCA, PLS or any other multivariate method that returns component scores.
Scatter_Density( components, comp = c(1, 2), expl.var = NULL, batch = NULL, trt = NULL, xlim = NULL, ylim = NULL, color.set = NULL, shape.set = NULL, batch.legend.title = "Batch", trt.legend.title = "Treatment", density.lwd = 0.2, title = NULL, title.cex = 1.5, legend.cex = 0.7, legend.title.cex = 0.75 )Scatter_Density( components, comp = c(1, 2), expl.var = NULL, batch = NULL, trt = NULL, xlim = NULL, ylim = NULL, color.set = NULL, shape.set = NULL, batch.legend.title = "Batch", trt.legend.title = "Treatment", density.lwd = 0.2, title = NULL, title.cex = 1.5, legend.cex = 0.7, legend.title.cex = 0.75 )
components |
A numeric matrix or data frame containing component scores. Samples are in rows and components in columns. |
comp |
Integer vector of length two indicating which components
to plot on the x and y axes respectively, for example |
expl.var |
Optional numeric vector giving the proportion of variance
explained by each component. If provided, it should have length at least
|
batch |
Optional factor or vector giving batch grouping information
(categorical outcome). If provided, its length must match the number of
samples in |
trt |
Optional factor or vector giving treatment grouping
information (categorical outcome). If provided, its length must match
the number of samples in |
xlim |
Optional numeric vector of length two giving the x-axis
limits for the scatter plot. If |
ylim |
Optional numeric vector of length two giving the y-axis
limits for the scatter plot. If |
color.set |
Optional character vector of colours (hexadecimal codes)
used to represent batch levels. If |
shape.set |
Optional numeric vector of plotting characters
used to represent treatment levels. If |
batch.legend.title |
Character string giving the legend title for batch. |
trt.legend.title |
Character string giving the legend title for treatment. |
density.lwd |
Numeric value giving the line width for the density curves in the marginal plots. |
title |
Character string giving the main title. |
title.cex |
Numeric value controlling the relative size of the main title. |
legend.cex |
Numeric value controlling the relative size of legend text. |
legend.title.cex |
Numeric value controlling the relative size of legend titles. |
A grob object containing the combined scatter and density plots.
Yiwen Wang, Kim-Anh Le Cao
box_plot, density_plot,
alignment_score and partVar_plot as other
methods for batch effect detection and batch effect removal assessment.
if (requireNamespace("mixOmics", quietly = TRUE) && requireNamespace("SummarizedExperiment", quietly = TRUE)) { ## Example using a PCA object from mixOmics data("AD_data") ## centered log-ratio transformed data ad.clr <- SummarizedExperiment::assays(AD_data$EgData)$Clr_value ad.pca <- mixOmics::pca(ad.clr, ncomp = 3, scale = TRUE) ad.batch <- SummarizedExperiment::rowData(AD_data$EgData)$Y.bat # batch information ad.trt <- SummarizedExperiment::rowData(AD_data$EgData)$Y.trt # treatment information names(ad.batch) <- names(ad.trt) <- rownames(AD_data$EgData) ## components and explained variance extracted from the PCA object comp.mat <- ad.pca$variates$X expl.var <- ad.pca$prop_expl_var$X ## Scatter plot of the first two components Scatter_Density( components = comp.mat, comp = c(1, 2), expl.var = expl.var, batch = ad.batch, trt = ad.trt ) ## Scatter plot of components 1 and 3, with a user-defined colour set cols <- rainbow(10) Scatter_Density( components = comp.mat, comp = c(1, 3), expl.var = expl.var, batch = ad.batch, trt = ad.trt, color.set = cols ) }if (requireNamespace("mixOmics", quietly = TRUE) && requireNamespace("SummarizedExperiment", quietly = TRUE)) { ## Example using a PCA object from mixOmics data("AD_data") ## centered log-ratio transformed data ad.clr <- SummarizedExperiment::assays(AD_data$EgData)$Clr_value ad.pca <- mixOmics::pca(ad.clr, ncomp = 3, scale = TRUE) ad.batch <- SummarizedExperiment::rowData(AD_data$EgData)$Y.bat # batch information ad.trt <- SummarizedExperiment::rowData(AD_data$EgData)$Y.trt # treatment information names(ad.batch) <- names(ad.trt) <- rownames(AD_data$EgData) ## components and explained variance extracted from the PCA object comp.mat <- ad.pca$variates$X expl.var <- ad.pca$prop_expl_var$X ## Scatter plot of the first two components Scatter_Density( components = comp.mat, comp = c(1, 2), expl.var = expl.var, batch = ad.batch, trt = ad.trt ) ## Scatter plot of components 1 and 3, with a user-defined colour set cols <- rainbow(10) Scatter_Density( components = comp.mat, comp = c(1, 3), expl.var = expl.var, batch = ad.batch, trt = ad.trt, color.set = cols ) }
This study investigated the relationship between metabolite concentration and microbial abundance of specific sponge tissues. The samples were collected from two types of tissues (Ectosome vs. Choanosome) and processed on two separate denaturing gradient gels in electrophoresis. This study includes relative abundance data only and a completely balanced batch x treatment design.
data('sponge_data')data('sponge_data')
A TreeSummarizedExperiment object containing the relative
abundance (Tss_value) and centered log ratio transformed values (Clr_value)
of 32 samples and 24 OTUs. The rowData includes Y.trt and
Y.bat. Y.trt is the effect of interest, which is a factor of
sponge tissues for each sample in the sponge study; Y.bat is the
batch effect, which is a factor of electrophoresis gels where each sample
processed. The rowTree is built based on the Y.bat.
None.
The raw data were downloaded from the referenced article. Filtering and normalisation described in https://evayiwenwang.github.io/PLSDAbatch_workflow/.
Sacristán-Soriano O, Banaigs B, Casamayor EO, Becerro MA (2011). “Exploring the links between natural products and bacterial assemblages in the sponge Aplysina aerophoba.” Appl. Environ. Microbiol., 77(3), 862–870.