Simultaneously generating multiple omic measurements (e.g. transcriptomics, metabolomics, proteomics or epigenomics) of the same molecular system for one particular study is not always possible. As a consequence, researchers sometimes combine compatible data generated in different labs or in different batches. In such cases, data will usually be affected by an unwanted effect associated to the experimentation event (lab, batch, technology, etc.) that, especially for high throughput molecular assays, may result in important levels of noise contaminating the biological signal. This unwanted source of variation is commonly known as ``batch effect’’ and is very frequently seen as the first source of variability in the omic dataset, standing out over the experimental conditions under study.
Removing batch effects becomes then necessary in order to obtain meaningful results from statistical analyses. Provided that the omic experiment has been designed in such a way that batch effects are not confounded with the effects of interest (treatment, disease, cell type, etc.), the so-called Batch Effect Correction Algorithms (BECAs) can be used to remove, or at least mitigate, systematic biases.Therefore these methods are extremely useful to combine data from different laboratories or measured at different times. One of these BECAs is the ARSyN method [1], which relies on the ANOVA-Simultaneous Components Analysis (ASCA) framework to decompose the omic signal into experimental effects and other unwanted effects. ARSyN applies Principal Component Analysis (PCA) to estimate the systematic variation due to batch effect and then removes it from the original data.
BECAs have been traditionally applied to remove batch effects from omic data of the same type, as for example gene expression. However, while removing batch effects from a single omic data type with an appropriate experimental design is relatively straightforward, it can become unapproachable when dealing with multiomic datasets. In the multiomic scenario, each omic modality may have been measured by a different lab or at a different moment in time, and so it is obtained within a different batch. When this is the case, the batch effect will be confounded with the ``omic type effect’’ and will be impossible to remove from the data. However, in some scenarios, the multiomic batch effect can be corrected. MultiBaC is the first BECA dealing with batch effect correction in multiomic datasets. MultiBaC can remove batch effects across different omics generated within separate batches provided that at least one common omic data type is included in all the batches.
The MultiBaC package includes two BECAs: the ARSyN method for correcting batch effect from a single omic data type and the MultiBac method, which deals with the batch effect problem on multi-omic assays.
ARSyN (ASCA Removal of Systematic Noise) is a method that combines Analysis of Variance (ANOVA) and Principal Component Analysis (PCA) for the identification of structured variation from the estimated ANOVA models for experimental and unwanted effects on an omic data matrix. ARSyN can remove undesired signals to obtain noise-filtered data for further analysis [1]. In MultiBaC package, ARSyN has been adapted for filtering the noise associated to identified or unidentified batch effects. This adaptation has been called ARSyNbac.
In the ARSyN method, the ANOVA model separates the signal identified with each one of the factors involved in the experimental design from the residuals. The algorithm can be applied on multi-factorial experimental designs. One of the factors in the model can be the batch each sample belongs to, if this information is known. In such case, the ANOVA model is applied to separate the batch effect from the remaining effects and residuals. The PCA analysis will hence detect the possible existence of a structured variation due to the batch effect, that is identified with the principal components explaining a given proportion of the total variation in the data, which can be set by the user.
However, ARSyN can also be applied when the batch factor is not known since the PCA on the residual matrix can detect correlated structure associated to a source of variation not included in the experimental design. We alert of this signal in the residuals when the first eigenvalues of the PCA are noticeably higher than the rest, because if there is not any structure the eigenvalues will be approximately equal. In this case the selection of components is controlled by the beta argument. Components that represent more than beta times the average variability are identified as systematic noise and removed from the original data.
Nueda MJ, Ferrer A, Conesa A.(2012). ARSyN: A method for the identification and removal of systematic noise in multifactorial time course microarray experiments. Biostatistics 13:553-66.
The yeast expression example data sets were collected from the Gene Expression Omnibus (GEO) database and from three different laboratories (batches). In all of them, the effect of glucose starvation in yeast was analyzed. Lab A is the Department of Biochemistry and Molecular Biology from Universitat de Valencia (accession number GSE11521) [4]; Lab B is the Department of Molecular and Cellular Biology from Harvard University (accession number GSE56622) [5]; and Lab C is the Department of Biology from Johns Hopkins University (accession number GSE43747) [6].
After a proper data pre-processing for each case, a voom transformation (with limma R package) was applied when necessary. Finally TMM normalization was performed on the whole set of samples from all labs. A reduced dataset was obtained by selecting 200 omic variables from each data matrix and just 3 samples from lab A. This yeast multiomic reduced dataset is included in MutiBaC package to illustrate the usage of the package. The gene expression matrices can be loaded by using the data(“multiyeast”) instruction.
The three studies used equivalent yeast strains and experimental conditions but, as shown in Figure @ref(fig:init) , the main effect on expression is due to data belonging to different labs, which are the batches in this case.
The MultiBaC package uses MultiAssayExperiment objects, a type of Bioconductor container for multiomic studies, that can be created from a list of matrices or data.frame objects. These matrices must have features in rows and samples in columns as shown next for one of the data matrices from the yeast example (A.rna). It is important that all data matrices share the variable space. If the number of omic variables and order are not the same, the createMbac function will select the common variables. Hence, it is mandatory that rows are named with the same type of identifiers.
## A.rna_Glu+_1 A.rna_Glu+_2 A.rna_Glu+_3 A.rna_Glu-_1 A.rna_Glu-_2
## YOR324C 7.174264 6.976815 7.482661 8.020596 7.736636
## YGL104C 4.239493 4.284775 3.100898 4.957403 3.673252
## YOR142W 8.819824 8.496966 9.026971 10.374525 10.294006
## YOR052C 6.721211 7.011932 7.557519 8.504503 8.586738
## YGR038W 5.878483 5.894121 6.468361 7.856822 7.806318
## YER087W 5.131089 5.295029 5.750657 2.920250 5.496136
## A.rna_Glu-_3
## YOR324C 7.922425
## YGL104C 1.587184
## YOR142W 10.979224
## YOR052C 8.221168
## YGR038W 8.108436
## YER087W 2.879254
A MultiAssayExperiment object needs to be created for each batch (lab in this example). The mbac new data structure is a list of MultiAssayExperiment objects and can be easily generated with the createMbac function in the package. The resulting mbac object will be the ARSyNbac input.
data_RNA<- createMbac (inputOmics = list(A.rna, B.rna, C.rna),
batchFactor = c("A", "B", "C"),
experimentalDesign = list("A" = c("Glu+", "Glu+",
"Glu+", "Glu-",
"Glu-", "Glu-"),
"B" = c("Glu+", "Glu+",
"Glu-", "Glu-"),
"C" = c("Glu+", "Glu+",
"Glu-", "Glu-")),
omicNames = "RNA")
These are the arguments for the createMbac function:
The mbac R structure generated by the createMbac function is an S3 object that initially contains just one slot, the ListOfBatches object. This mbac structure will incorporate more elements as they are created when running ARSyNbac or MultiBaC functions. These new slots are CorrectedData, PLSmodels, ARSyNmodels or InnerRelation and are described next:
In addition to plot, other method is supplied for visualizing mbac objects: summary, which show the structure of the object.
## [1] "Object of class mbac: It contains 3 different bacthes and 1 omic type(s)."
RNA | |
---|---|
A | TRUE |
B | TRUE |
C | TRUE |
The function to remove batch effects or unwanted noise from a single omic data matrix in the MultiBaC package is the ARSyNbac function, which allows for the following arguments:
ARSyNbac (mbac, batchEstimation = TRUE, filterNoise = TRUE,
Interaction=FALSE, Variability = 0.90, beta = 2,
modelName = "Model 1",
showplot = TRUE)
Therefore, the ARSyNbac function offers three types of analysis: ARSyNbac batch effect correction, when the batch information is provided, ARSyNbac noise reduction, if the batch information is unknown, and the combination of both modes when there is a known source of batch effect and another possible unknown source of unwanted variability. In the following sections we explain how to proceed with each one of them.
When the batch is identified in the batchFactor argument of the mbac input object, its effect can be estimated and removed by choosing batchEstimation = TRUE (considering one source of batch effect only, filterNoise = FALSE). Moreover, a possible interaction between the experimental factors and the batch factor can be studied by setting Interaction=TRUE.
par(mfrow = c(1,2))
arsyn_1 <- ARSyNbac(data_RNA, modelName = "RNA", Variability = 0.95,
batchEstimation = TRUE, filterNoise = FALSE, Interaction = FALSE)
plot(arsyn_1, typeP="pca.cor", bty = "L",
pch = custom_pch, cex = 3, col.per.group = custom_col,
legend.text = c("Color: Batch", names(data_RNA$ListOfBatches),
"Fill: Cond.", unique(colData(data_RNA$ListOfBatches$A)$tfactor)),
args.legend = list("x" = "topright",
"pch" = c(NA, 15, 15, 15,
NA, 15, 0),
"col" = c(NA, "brown2", "dodgerblue", "forestgreen",
NA, 1, 1),
"bty" = "n",
"cex" = 1.2))
According to the left plot in Figure @ref(fig:arsyn1), two principal components (PCs) have been selected to explain at least 95% of the total variability of batch effect. PCA of corrected data with this analysis is shown in the right panel. Now the main source of variability in the data (PC1) is given by the experimental condition, while samples are not clustered by batches anymore.
par(mfrow = c(1,2))
arsyn_2 <- ARSyNbac(data_RNA, modelName = "RNA", Variability = 0.95,
batchEstimation = TRUE, filterNoise = FALSE, Interaction = TRUE)
plot(arsyn_2, typeP="pca.cor", bty = "L",
pch = custom_pch, cex = 3, col.per.group = custom_col,
legend.text = c("Color: Batch", names(data_RNA$ListOfBatches),
"Fill: Cond.", unique(colData(data_RNA$ListOfBatches$A)$tfactor)),
args.legend = list("x" = "topright",
"pch" = c(NA, 15, 15, 15,
NA, 15, 0),
"col" = c(NA, "brown2", "dodgerblue", "forestgreen",
NA, 1, 1),
"bty" = "n",
"cex" = 1.2))
In Figure @ref(fig:arsyn2) (right panel) all the points with negative PC1 correspond to Glu- samples, and the positive PC1 to Glu+ samples, as happened when not including the interaction in the model (Figure @ref(fig:arsyn1), right panel). However, in this second model, PC1 explains a higher percentage of the variability in the data, indicating a better batch effect correction. In general, we recomend the use of the default argument (Interaction=FALSE), as including part of the signal as batch effect could lead to a dilution of the effect of the signal of interest. However, the interaction between batch and experimental condition is sometimes strong and we should consider to include it in the model in order to get a better correction of the data.
The PCA plots shown in Figures @ref(fig:init), @ref(fig:arsyn1) and @ref(fig:arsyn2) have been created by the customized plot function in MultiBaC package. More details about this function can be found at the “Visualization of ARSyN and MultiBaC results” section, where a complete description of the arguments in plot is given.
When batch is not identified, ARSyNbac analyses the existence of systematic noise in the residuals by setting batchEstimation = FALSE and filterNoise = TRUE.
par(mfrow = c(1,2))
arsyn_3 <- ARSyNbac(data_RNA, modelName = "RNA", beta = 0.5,
batchEstimation = FALSE, filterNoise = TRUE)
plot(arsyn_3, typeP="pca.cor", bty = "L",
pch = custom_pch, cex = 3, col.per.group = custom_col,
legend.text = c("Color: Batch", names(data_RNA$ListOfBatches),
"Fill: Cond.", unique(colData(data_RNA$ListOfBatches$A)$tfactor)),
args.legend = list("x" = "topright",
"pch" = c(NA, 15, 15, 15,
NA, 15, 0),
"col" = c(NA, "brown2", "dodgerblue", "forestgreen",
NA, 1, 1),
"bty" = "n",
"cex" = 1.2))
In Figure @ref(fig:arsyn3) we can see that even though the batch is considered unidentified (batchEstimation=FALSE), ARSyN has removed the noise from the data by estimating the main source of unwanted variation. In this noise reduction mode, we can modulate the magnitude of the residual noise removal with the beta parameter. Basically, components that represent more than beta times the average variability are identified as systematic noise in residuals (3 components were selected in this model). Thus a greater beta value leads to the selection of a lower number of components in the residuals.
When the source of the batch effect is known but there might be an extra unknown source of unwanted variability, ARSyNbac can perform both previous ways by setting batchEstimation = TRUE and filterNoise = TRUE. Note that this mode could also be useful if the known batch effect does not represent the main source of noise in our data.
par(mfrow = c(1,2))
arsyn_4 <- ARSyNbac(data_RNA, modelName = "RNA", beta = 0.5,
batchEstimation = TRUE, filterNoise = TRUE,
Interaction = TRUE,
Variability = 0.95)
plot(arsyn_4, typeP="pca.cor", bty = "L",
pch = custom_pch, cex = 3, col.per.group = custom_col,
legend.text = c("Color: Batch", names(data_RNA$ListOfBatches),
"Fill: Cond.", unique(colData(data_RNA$ListOfBatches$A)$tfactor)),
args.legend = list("x" = "topright",
"pch" = c(NA, 15, 15, 15,
NA, 15, 0),
"col" = c(NA, "brown2", "dodgerblue", "forestgreen",
NA, 1, 1),
"bty" = "n",
"cex" = 1.2))
In Figure @ref(fig:arsyn4) we can see that performing both modalities together, ARSyNbac reaches its maximum PC1 variance explained (specially compared to @ref(fig:arsyn2)). In this third mode, we can modulate the magnitude of the residual noise removal with the beta parameter and the batch effect associated explained variance. In this example, as shown in @ref(fig:arsyn3), known batch effect represents the main (and almost unique) source of unwanted variatio. Thus, in other scenarios with more than one batch source, the result with this third mode might be very different form the other two previous ways of operation.
Multiomic data integration has become a popular approach to understand how biological systems work. However, generating this kind of datasets is still costly and time consuming. Consequently, it is quite common that not all the samples or omic data types are produced at the same time or in the same lab, but in different batches. In addition, when research groups cannot produce their own multiomic datasets, they usually collect them from different public repositories and, therefore, from different studies or laboratories (again, from different batches). Thus, in both situations, batch effects need to be previously removed from such datasets for successful data integration. Methods to correct batch effects on a single data type cannot be applied to correct batch effects across omics and, hence, we developed the MultiBaC strategy, which corrects batch effects from multiomic datasets distributed across different labs or data acquisition events.
However, there are some requirements for the multi-omic data set in order to remove across-omics batch effects with MultiBaC:
There must be, at least, one common omic data type in all the batches. We may have, for instance, gene expression data measurements in all the batches, and then other different omic data types in each batch. It is not necessary that the commom omic (e.g. gene expression) is measured with the same technology. We could have microarray expression data in one batch and RNA-seq data in another batch, for example.
The omic feature identifiers must be the same for all the common data matrices. We cannot use, for instance, Ensembl identifiers in one batch and RefSeq in another batch. It is not necessary to have the same number of omic features, e.g. genes, in all the batches. MultiBac will extract the common identifiers from all the common data type matrices to perform the analysis.
Within the same batch, the experimental design must be the same for all the omic in that batch, that is, the same experimental groups, with the same number of replicates obtained from the same individuals. All these samples must be in the same order in all the omic data matrices.
Ugidos M, Tarazona S, Prats-Montalb'an JM, Ferrer A, Conesa A.(2020). MultiBaC: A strategy to remove batch effects between different omic data types. Statistical Methods in Medical Research.
The yeast multiomic dataset comes from the same studies that the yeast expression data described in ARSyNbac section. While the three labs produced gene expression data (RNA), each of them generated a different additional omic data type. Lab A collected transcription rates (GRO, with accession number GSE1002) [4]. Lab B obtained translation rates (RIBO, with accession number GSE56622) [5]. Finally, Lab C measured global PAR-CLIP data (PAR-CLIP, with accession number GSE43747) [6]. Therefore, labs have one shared (RNA) and one distinct (GRO, RIBO and PAR-CLIP, respectively) data types. This distributed multiomic scenario represents the type of correction problem MultiBaC addresses. A scheme of the data structure is shown in Figure @ref(fig:design).
This yeast multiomic dataset is included in the MutiBaC package to illustrate the usage of the package. The six matrices can be loaded by using the data(“multiyeast”) instruction.
As commented before, the MultiBaC package uses MultiAssayExperiment objects, a type of Bioconductor container for multiomic studies, that can be created from a list of matrices or data.frame objects. These matrices must have features in rows and samples in columns (see example in ARSyNbac section). Since MultiBaC computes regression models between omics from the same batch, it is important that matrices from the same batch have the same experimental design: the same number of samples and in the same order. MultiBaC relates the commonOmic information from the different batches as well. Thus, it is also important that commonOmic matrices share the variable space. In this case, if the number of omic variables and order are not the same, MultiBaC will take the common variables for the model. Hence, it is mandatory that rows are named with the same type of identifiers.
The mbac object that will be the MultiBaC function input can be easily generated with the createMbac function in the package:
my_mbac <- createMbac (inputOmics = list(A.rna, A.gro,
B.rna, B.ribo,
C.rna, C.par),
batchFactor = c("A", "A",
"B", "B",
"C", "C"),
experimentalDesign = list("A" = c("Glu+", "Glu+",
"Glu+", "Glu-",
"Glu-", "Glu-"),
"B" = c("Glu+", "Glu+",
"Glu-", "Glu-"),
"C" = c("Glu+", "Glu+",
"Glu-", "Glu-")),
omicNames = c("RNA", "GRO",
"RNA", "RIBO",
"RNA", "PAR"))
More details about the createMbac function can be found in ARSyNbac input data Section. Note that we do not need to indicate which is the common omic (in commonOmic argument) since there is only one omic in common (RNA) for all the batches (labs) and the function detects it as the common omic.
Once the mbac object has been created with the createMbac function, it is used as the input data for MultiBaC function (mbac argument), which is the wrapper function for correction of multiomic batch effects.
MultiBaC (mbac,
test.comp = NULL, scale = FALSE,
center = TRUE, crossval = NULL,
Interaction = FALSE,
Variability = 0.90,
showplot = TRUE,
showinfo = TRUE)
The arguments of the MultiBaC function correspond to the different steps of the MultiBaC method:
The usage of MultiBaC function on the yeast example data is shown bellow:
my_final_mbac <- MultiBaC (my_mbac,
test.comp = NULL, scale = FALSE,
center = TRUE, crossval = NULL,
Interaction = TRUE,
Variability = 0.9,
showplot = TRUE,
showinfo = TRUE)
## Step 1: Create PLS models
## - Model for batch A
## - Model for batch B
## - Model for batch C
## Step 2: Generating missing omics
## Step 3: Batch effect correction using ARSyNbac
## [1] "Table: Inner correlation between scores for each PLS model computed."
##
##
## A: GRO B: RIBO C: PAR
## ------------- ---------- ---------- ----------
## Component: 1 0.9471967 0.9987790 0.9396986
## Component: 2 0.9277123 0.9991239 0.9998556
## Component: 3 0.9992129 1.0000000 1.0000000
## Component: 4 0.9999867 NA NA
## Component: 5 1.0000000 NA NA
By default (showinfo = TRUE), the table containing the inner correlations of PLS models is displayed in propmt. Moreover, the default plots (showplot = TRUE) are “Q2 plot” and “Explained variance plot” (see Figure @ref(fig:main-multibac2)), which contain important information about MultiBaC performance. The “Q2 plot” represents the PLS model prediction capability given by the Q2 score. The x axis indicates the number of components extracted for the PLS models and the y axis the Q2 value. The performance of the MultiBaC method will be better for higher Q2 values, since a high Q2 indicates a good PLS prediction of the missing omics and hence will result in a better estimation of the batch effect. Note that, depending on the rank of the matrices, each PLS model could have a different maximum number of components. The “Explained variance plot” provides the batch effect related variability explained using the ASCA decomposition that ARSyN method provides. The x axis indicates the number of components extracted for the ASCA model and y axis reflects the percentage of explained variance. In this case, a higher explained variance leads to a better batch effect estimation. In both plots, the number of components selected for each model is indicated with a triangule symbol. In the “Q2 plot”, the selected number of components is the one that maximize the Q2 value while in the “Explained variance plot”, this number is the minimum number of components that reaches a explained variability (y axis) higher than the Variability argument of the function (gray dashed line).
All the different steps performed by the MultiBaC wrapper function can be independently computed with specific functions, as described next, from the initial mbac object. The MultiBaC strategy can be divided into three main steps that will be described next in detail: 1) PLS model fitting, 2) Prediction of missing omics, and 3) Batch effect correction.
The genModelList function produces the PLS models between distinct and common omic data types. It computes the optimal number of components via a crossvalidation approach.
my_mbac_2 <- genModelList (my_mbac, test.comp = NULL,
scale = FALSE, center = TRUE,
crossval = NULL,
showinfo = TRUE)
## - Model for batch A
## - Model for batch B
## - Model for batch C
The arguments of the genModelList function are:
The output of genModelList is a mbac object with a new slot, PLSmodels, a list of the PLS models obtained with the ropls package. Each slot of the output corresponds to a batch in ListOfBatches. If one batch contains more than one non-common omic, the “batch” element in genModelList contains in turn as many elements as non-common omics in that batch, i.e. one PLS model per non-common omic.
The prediction of the initially missing omics is performed with the genMissingOmics function from the output of the genModelList function.
The result after running genMissingOmics is a list of MultiAssayExperiment structures. In this case, each batch contains all the omics introduced in MultiBaC. For instance, if two batches are being studying, “A” and “B”, given that “A” contains “RNA-seq” and “GRO-seq” data and “B” contains “RNA-seq” and “Metabolomics” data, after applying genMissingOmics function, batch “A” will contain “RNA-seq”, “GRO-seq” and also the predicted “Metabolomics” data.
my_finalwise_mbac <- batchCorrection(my_mbac_2,
multiBatchDesign = multiBatchDesign,
Interaction = TRUE,
Variability = 0.90)
As described before, ARSyN applies an ANOVA-like decomposition to the data matrix in order to estimate the batch effect and, next, a PCA is applied on each submatrix. The number of principal components for each PCA is adjusted by the Variability argument. The output of this function consists of two different objects: CorrectedData and ARSyNmodels. The first one has the same structure than ListOfBatches slot. However, in this case, all batches contain all the omics introduced in MultiBaC after correcting the batch effect on each omic data type separately. Note that we discard the predicted omic matrices and only use the corrected original matrices for further statistical analyses. The ARSyNmodels slot contains the ASCA decomposition models for each omic data type.
As mentioned before, the ARSyNbac and MultiBaC outputs are mbac type objects. Since the mbac class incorporates a plotting method, the plot function can by applied on mbac objects to graphically display additional information about the performance of the methods and the data characteristics. The plot function for mbac objects accepts several additional arguments:
plot (x, typeP = "def",
col.by.batch = TRUE,
col.per.group = NULL,
comp2plot = c(1,2),
legend.text = NULL,
args.legend = NULL, ...)
Description of the arguments:
While the plot function can generate all the plot types described above, each plot can also be independently generated by its corresponding function: Q2_plot (mbac), explained_varPlot (mbac), plot_pca (mbac, typeP = c(“pca.org”, “pca.cor”, “pca.both”), col.by.batch, col.per.group, comp2plot, legend.text, args.legend), batchEstPlot (mbac), or inner_relPlot (mbac, comp2plot = c(1,2)).
All these plots are useful to validate and understand MultiBaC or ARSyNbac performance. All of them can be used with a MultiBaC output, while those that show information related to the PLS models are not available for an ARSyNbac output (see typeP argument).
In addition to “Q2 plot” and “Explained variance plot” (Figure @ref(fig:main-multibac2)), which are the default plots, and were explained in previous sections, next sections are devoted to describe the rest of plots in MultiBaC package.
An important aspect to be validated in MultiBaC is the inner correlation between X and Y scores in PLS models. As we indicated before, MultiBaC function displays by default this information as numerical output but a visual representation can also be invoked.
## Hit <Return> to see next plot:
## Inner correlation of scores. Batch: A; Model for omic: GRO
## Warning in par(initpar): graphical parameter "cin" cannot be set
## Warning in par(initpar): graphical parameter "cra" cannot be set
## Warning in par(initpar): graphical parameter "csi" cannot be set
## Warning in par(initpar): graphical parameter "cxy" cannot be set
## Warning in par(initpar): graphical parameter "din" cannot be set
## Warning in par(initpar): graphical parameter "page" cannot be set
The inner correlation between scores of the PLS model that relates
both omic data types in batch “A” is shown in Figure @ref(fig:inner).
While we have only shown the plot for batch “A”, running plot
(typeP = “inner”)}, the inner correlation plots for all the PLS models
generated during MultiBaC performance are displayed using the tag
“Hit
This plot illustrates the magnitude of the estimated batch effects. Tipically, this plot is used before MultiBaC or ARSyNbac performance since it just requires a basic mbac returned by createMbac function.
Theoretical batch effect magnitudes for the yeast example are displayed in Figure @ref(fig:batchest). The violin plot shows the distribution of batch effect coefficients along the variable space (genes in case of RNA-seq data). Coefficients with higher values are the one that contribute the most to the existence of a batch effect, thus when the distribution is closer to zero, the batch effect is lower. MultiBaC correction performance has been tested and validated with small and medium batch effect magnitudes while it decreases at high magnitudes. ARSyN is not so much affected by the batch effect magnitude.
The goodness of ARSyNbac or MultiBaC correction can be assessed with the PCA plots before and after the correction. In the case of MultiBaC, this plot can only be generated when all the omic data matrices share the same variable space. In our yeast example, every omic data type was obtained as gene-related information, thus matrices can be merged by variables (genes) and the PCA is feasible.
An example of the usage of these PCA plots for the ARSyNbac output can be found in the ARSyN section. Here we illustrate how to generate and interpret them for MultiBaC correction. The PCA on the original data (Figure @ref(fig:pca-org)) and on the corrected data (Figure @ref(fig:pca-cor)) were obtained with the plot function by using either “typeP = pca.org” or “typeP = pca.cor”, respectively.
plot (my_final_mbac, typeP = "pca.org",
cex.axis = 1, cex.lab = 1, cex = 3, bty = "L",
cex.main = 1.2, pch = 19)
plot (my_final_mbac, typeP = "pca.cor",
cex.axis = 1, cex.lab = 1, cex = 3, bty = "L",
cex.main = 1.2, pch = 19)
By default, this function takes random colors to represent each group (batches by default). However, it would be useful to display the experimental factors information too. For that, we recommend the use of custom col.per.group and pch arguments. An example is shown in Figure @ref(fig:pcaplot2), using typeP = pca.both to show both PCA plots together. We could also plot more than two components indicating the desired number with comp2plot argument. The user can also include a custom legend by using two arguments: legend.text and args.legend. With legend.text we indicate the text labels of the legend and the rest of the legend arguments are collected in args.legend (x, y, pch, fill, col, bty, etc). If legend.text is not provided to the function, args.legend is not considered.
custom_col <- c("brown2", "dodgerblue", "forestgreen")
custom_pch <- c(19,19,19,1,1,1,15,15,15,0,0,0, # batch A
19,19,1,1,17,17,2,2, # batch B
19,19,1,1,18,18,5,5) # batch C
plot(my_final_mbac, typeP = "pca.both", col.by.batch = TRUE,
col.per.group = custom_col, comp2plot = 1:3,
cex.axis = 1.3, cex.lab = 1.2, cex = 3, bty = "L",
cex.main = 1.7, pch = custom_pch,
legend.text = c("Color", names(my_final_mbac$ListOfBatches),
"Shape", c("RNA", "GRO", "RIBO", "PAR"),
"Fill", levels(colData(my_final_mbac$ListOfBatches$A)$tfactor)),
args.legend = list("x" = "topright",
"pch" = c(NA, 15, 15, 15,
NA, 19, 15, 17, 18,
NA, 19, 1),
"col" = c(NA, "brown2", "dodgerblue", "forestgreen",
NA, rep(1, 4),
NA, 1, 1),
"bty" = "n",
"cex" = 2))
In this case, batch effect correction is observable in common data (RNA-seq, dots). Batch effect has been removed as common data is placed all together and after the correction, the components (especially the second and the third component) separate the common data based on the experimental condition instead of separating batches. As shown in the legend, point shape indicates the omic data type, however batch effect correction is only visible in common data.
Here is the output of sessionInfo() on the system on which this document was compiled:
## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] ropls_1.37.6 ggplot2_3.5.1
## [3] MultiAssayExperiment_1.31.5 SummarizedExperiment_1.35.5
## [5] Biobase_2.67.0 GenomicRanges_1.57.2
## [7] GenomeInfoDb_1.41.2 IRanges_2.39.2
## [9] S4Vectors_0.43.2 BiocGenerics_0.53.0
## [11] MatrixGenerics_1.17.1 matrixStats_1.4.1
## [13] MultiBaC_1.17.0 BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.2.1 farver_2.1.2 dplyr_1.1.4
## [4] fastmap_1.2.0 promises_1.3.0 digest_0.6.37
## [7] mime_0.12 lifecycle_1.0.4 ellipsis_0.3.2
## [10] statmod_1.5.0 magrittr_2.0.3 compiler_4.4.1
## [13] rlang_1.1.4 sass_0.4.9 tools_4.4.1
## [16] plotrix_3.8-4 utf8_1.2.4 yaml_2.3.10
## [19] calibrate_1.7.7 knitr_1.48 labeling_0.4.3
## [22] S4Arrays_1.5.11 htmlwidgets_1.6.4 pkgbuild_1.4.5
## [25] DelayedArray_0.33.1 pkgload_1.4.0 abind_1.4-8
## [28] miniUI_0.1.1.1 withr_3.0.2 purrr_1.0.2
## [31] sys_3.4.3 desc_1.4.3 grid_4.4.1
## [34] fansi_1.0.6 urlchecker_1.0.1 profvis_0.4.0
## [37] xtable_1.8-4 colorspace_2.1-1 MASS_7.3-61
## [40] scales_1.3.0 cli_3.6.3 rmarkdown_2.28
## [43] crayon_1.5.3 generics_0.1.3 remotes_2.5.0
## [46] rstudioapi_0.17.1 httr_1.4.7 BiocBaseUtils_1.9.0
## [49] sessioninfo_1.2.2 cachem_1.1.0 zlibbioc_1.51.2
## [52] BiocManager_1.30.25 XVector_0.45.0 vctrs_0.6.5
## [55] devtools_2.4.5 Matrix_1.7-1 jsonlite_1.8.9
## [58] qqman_0.1.9 maketools_1.3.1 limma_3.61.12
## [61] jquerylib_0.1.4 MultiDataSet_1.33.0 glue_1.8.0
## [64] gtable_0.3.6 later_1.3.2 UCSC.utils_1.1.0
## [67] munsell_0.5.1 tibble_3.2.1 pillar_1.9.0
## [70] pcaMethods_1.97.0 htmltools_0.5.8.1 GenomeInfoDbData_1.2.13
## [73] R6_2.5.1 rprojroot_2.0.4 evaluate_1.0.1
## [76] shiny_1.9.1 lattice_0.22-6 highr_0.11
## [79] memoise_2.0.1 httpuv_1.6.15 bslib_0.8.0
## [82] Rcpp_1.0.13 SparseArray_1.5.45 xfun_0.48
## [85] fs_1.6.4 buildtools_1.0.0 usethis_3.0.0
## [88] pkgconfig_2.0.3