Introduction

Aggregating single-cell RNA sequencing (scRNA-seq) datasets from multiple sources introduces technical variability, known as batch effects, arising from differences in operator handling, reagent lots, sequencing platforms, and experimental timing. Batch effects can distort downstream analyses and obscure the true biological variation, compromising the interpretability of the results.

Numerous batch correction methods have been proposed in the literature, based on different mathematical approaches. However, their performance is highly dependent on the intrinsic characteristics of the data.

BatChef is an R package that:

implements a variety of correction methods;
provides quantitative metrics to evaluate the performance of the correction methods, including the Wasserstein distance, Local Inverse Simpson’s Index (LISI), Average Silhouette Width (ASW), and Adjusted Rand Index (ARI).
can be used as a guideline to identify the most appropriate batch effects correction method based on data-specific characteristics.

Installation

if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}
BiocManager::install("BatChef")

Setup

To demonstrate, we simulate a dataset with 2,148 genes, 3,461 cells, 2 batches and 3 cell types, using the simulate_data() function. This function applies the same parameters as the Splatter package’s ones. Additionally, it normalizes the data, identifies highly variable genes and performs principal component analysis using the scrapper package.

The BatChef package allows to use three different input data: SingleCellExperiment, Seurat and AnnData. These can be simulated using the the output_format parameter in the simulate_data() function.

library(BatChef)
sce <- simulate_data(
  n_genes = 2148, batch_cells = c(1215, 2251), compute_hvgs = TRUE,
  batch_fac_loc = c(0.15, 0.15), batch_fac_scale = c(0.1, 0.1),
  mean_shape = 0.1, lib_loc = 10,
  group_prob = c(0.17, 0.46, 0.37),
  compute_pca = TRUE, pca_ncomp = 10, output_format = "SingleCellExperiment"
)
sce

## class: SingleCellExperiment 
## dim: 2148 3466 
## metadata(0):
## assays(2): counts logcounts
## rownames(2148): Gene1 Gene2 ... Gene2147 Gene2148
## rowData names(5): means variances fitted residuals hvg
## colnames(3466): Cell1 Cell2 ... Cell3465 Cell3466
## colData names(5): Cell Batch Group ExpLibSize sizeFactor
## reducedDimNames(1): PCA
## mainExpName: NULL
## altExpNames(0):

so <- simulate_data(
  n_genes = 2148, batch_cells = c(1210, 2251), compute_hvgs = TRUE,
  batch_fac_loc = c(0.1, 0.1), batch_fac_scale = c(0.1, 0.1),
  mean_shape = 0.4, lib_loc = 11.5,
  group_prob = c(0.17, 0.46, 0.37),
  compute_pca = TRUE, pca_ncomp = 10, output_format = "Seurat"
)

adata <- simulate_data(
  n_genes = 2148, batch_cells = c(1210, 2251), compute_hvgs = TRUE,
  batch_fac_loc = c(0.1, 0.1), batch_fac_scale = c(0.1, 0.1),
  mean_shape = 0.4, lib_loc = 11.5,
  group_prob = c(0.17, 0.46, 0.37),
  compute_pca = TRUE, pca_ncomp = 10, output_format = "AnnData"
)

Batch correction method prediction

Users can predict the optimal batch correction method for a given dataset via the suggested_method() function. This function employs a Support Vector Machine (SVM) algorithm to predict the most suitable batch correction method and visualize the dataset’s position within a two-dimensional embedding space. The embedding was constructed by analyzing 130 datasets and their associated characteristics.

pred <- suggested_method(input = sce, batch = "Batch")

## Optimal method: liger

pred

In this case, the optimal batch correction is LIGER for our data.

Batch effects correction

The batchCorrect() function allows to specify the desired correction method. It can have these values:

LimmaParams
CombatParams
Seuratv3Params
Seuratv5Params
FastMNNParams
HarmonyParams
ScMerge2Params
LigerParams

To illustrate the functionality of the package, batch correction was performed using the LIGER method. To improve computational efficiency, the analysis was restricted to the first 1,000 highly variable genes.

sce <- batchCorrect(input = sce, batch = "Batch", 
                    params = LigerParams(
                      features = rownames(sce)[SingleCellExperiment::rowData(sce)$hvg]
                      )
                    )

The batchCorrect() function returns an object of the same class as the input. In this example, the output is therefore a SingleCellExperiment object. The corrected gene expression matrix and/or the corrected dimensionality reduction are stored in the returned object.

sce

## class: SingleCellExperiment 
## dim: 2148 3466 
## metadata(0):
## assays(2): counts logcounts
## rownames(2148): Gene1 Gene2 ... Gene2147 Gene2148
## rowData names(5): means variances fitted residuals hvg
## colnames(3466): Cell1 Cell2 ... Cell3465 Cell3466
## colData names(5): Cell Batch Group ExpLibSize sizeFactor
## reducedDimNames(2): PCA liger
## mainExpName: NULL
## altExpNames(0):

Performance evaluation

To evaluate batch correction performance, the user can use the metrics() function to compute multiple quantitative metrics, including Normalized Mutual Information (NMI), Adjusted Rand Index (ARI), Local Inverse Simpson’s Index (LISI), Average Silhouette Width (ASW), and the Wasserstein distance.

The Wasserstein distance is computed on resampled data to account for the different numbers of cells in each batch. The parameter rep specifies how many times the calculation is repeated (in this case, 1).

red <- SingleCellExperiment::reducedDimNames(sce)
metrics <- lapply(red, function(x) {
  metrics(
    input = sce, batch = "Batch",
    group = "Group", reduction = x,
    rep = 1
  )
})
metrics <- do.call(rbind, metrics)
metrics

##   method wasserstein        iasw    ilisi       ari       nmi      casw clisi
## 1    PCA  8.19805631 0.331117445 1.000000 0.5989415 0.7608478 0.2944965     1
## 2  liger  0.02359227 0.005339738 1.766324 0.9369298 0.8981508 0.2050856     1

To interpret the results, we focused on the Wasserstein distance and the ARI score, which measure batch effect removal and the preservation of biological effects, respectively.

metrics[, c(1:2, 5)]

##   method wasserstein       ari
## 1    PCA  8.19805631 0.5989415
## 2  liger  0.02359227 0.9369298

A low Wasserstein distance indicates that the method effectively removes batch effects, while a high Adjusted Rand Index (ARI) indicates that the biological effects are well preserved. Overall, LIGER performs well on both metrics.

sessionInfo()

sessionInfo()

## R version 4.6.1 (2026-06-24)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 26.04 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.32.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Etc/UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] BatChef_1.1.1    BiocStyle_2.41.0
## 
## loaded via a namespace (and not attached):
##   [1] IRanges_2.47.2              nnet_7.3-20                
##   [3] goftest_1.2-3               Biostrings_2.81.3          
##   [5] HDF5Array_1.41.0            rstan_2.32.7               
##   [7] vctrs_0.7.3                 spatstat.random_3.5-0      
##   [9] digest_0.6.39               png_0.1-9                  
##  [11] shape_1.4.6.1               proxy_0.4-29               
##  [13] ggrepel_0.9.8               deldir_2.0-4               
##  [15] parallelly_1.47.0           batchelor_1.29.0           
##  [17] MASS_7.3-65                 reshape2_1.4.5             
##  [19] withr_3.0.3                 httpuv_1.6.17              
##  [21] BiocGenerics_0.59.7         ggrastr_1.0.2              
##  [23] scCustomize_3.3.0           xfun_0.59                  
##  [25] survival_3.8-6              memoise_2.0.1              
##  [27] proxyC_0.5.2                ggbeeswarm_0.7.3           
##  [29] janitor_2.2.1               Seqinfo_1.3.0              
##  [31] zoo_1.8-15                  GlobalOptions_0.1.4        
##  [33] gtools_3.9.5                V8_8.2.0                   
##  [35] pbapply_1.7-4               DEoptimR_1.2-0             
##  [37] Formula_1.2-5               sys_3.4.3                  
##  [39] KEGGREST_1.53.4             rematch2_2.1.2             
##  [41] promises_1.5.0              otel_0.2.0                 
##  [43] httr_1.4.8                  globals_0.19.1             
##  [45] fitdistrplus_1.2-6          cvTools_0.3.3              
##  [47] rhdf5filters_1.25.0         rhdf5_2.57.1               
##  [49] rstudioapi_0.19.0           units_1.0-1                
##  [51] miniUI_0.1.2                generics_0.1.4             
##  [53] dir.expiry_1.21.0           base64enc_0.1-6            
##  [55] curl_7.1.0                  S4Vectors_0.51.3           
##  [57] sfsmisc_1.1-24              ScaledMatrix_1.21.0        
##  [59] h5mread_1.5.0               polyclip_1.10-7            
##  [61] SparseArray_1.13.2          xtable_1.8-8               
##  [63] stringr_1.6.0               evaluate_1.0.5             
##  [65] S4Arrays_1.13.0             GenomicRanges_1.65.0       
##  [67] irlba_2.3.7                 filelock_1.0.3             
##  [69] colorspace_2.1-2            ROCR_1.0-12                
##  [71] harmony_2.0.5               reticulate_1.46.0          
##  [73] spatstat.data_3.1-9         magrittr_2.0.5             
##  [75] lmtest_0.9-40               snakecase_0.11.1           
##  [77] later_1.4.8                 buildtools_1.0.0           
##  [79] viridis_0.6.5               lattice_0.22-9             
##  [81] genefilter_1.95.0           spatstat.geom_3.8-1        
##  [83] future.apply_1.20.2         robustbase_0.99-7          
##  [85] XML_3.99-0.23               scattermore_1.2            
##  [87] scuttle_1.23.1              cowplot_1.2.0              
##  [89] matrixStats_1.5.0           RcppAnnoy_0.0.23           
##  [91] maketools_1.3.2             class_7.3-23               
##  [93] Hmisc_5.2-6                 pillar_1.11.1              
##  [95] StanHeaders_2.32.10         nlme_3.1-169               
##  [97] caTools_1.18.3              compiler_4.6.1             
##  [99] beachmat_2.29.0             RSpectra_0.16-2            
## [101] stringi_1.8.7               sf_1.1-1                   
## [103] tensor_1.5.1                SummarizedExperiment_1.43.0
## [105] lubridate_1.9.5             plyr_1.8.9                 
## [107] crayon_1.5.3                abind_1.4-8                
## [109] scater_1.41.1               locfit_1.5-9.12            
## [111] sp_2.2-1                    bit_4.6.0                  
## [113] dplyr_1.2.1                 codetools_0.2-20           
## [115] BiocSingular_1.29.0         bslib_0.11.0               
## [117] QuickJSR_1.10.0             splatter_1.37.1            
## [119] e1071_1.7-17                paletteer_1.7.0            
## [121] plotly_4.12.0               mime_0.13                  
## [123] splines_4.6.1               circlize_0.4.18            
## [125] Rcpp_1.1.1-1.1              fastDummies_1.7.6          
## [127] basilisk_1.25.0             sparseMatrixStats_1.25.0   
## [129] scrapper_1.7.3              blob_1.3.0                 
## [131] knitr_1.51                  reldist_1.7-2              
## [133] listenv_1.0.0               checkmate_2.3.4            
## [135] DelayedMatrixStats_1.35.0   pkgbuild_1.4.8             
## [137] tibble_3.3.1                Matrix_1.7-5               
## [139] statmod_1.5.2               startupmsg_1.0.0           
## [141] pkgconfig_2.0.3             tools_4.6.1                
## [143] cachem_1.1.0                aricode_1.1.0              
## [145] RSQLite_3.53.2              viridisLite_0.4.3          
## [147] DBI_1.3.0                   numDeriv_2016.8-1.1        
## [149] fastmap_1.2.0               rmarkdown_2.31             
## [151] scales_1.4.0                grid_4.6.1                 
## [153] ica_1.0-3                   Seurat_5.5.0               
## [155] sass_0.4.10                 patchwork_1.3.2            
## [157] ggprism_1.0.7               BiocManager_1.30.27        
## [159] dotCall64_1.2               transport_0.15-4           
## [161] RANN_2.6.2                  rpart_4.1.27               
## [163] farver_2.1.2                mgcv_1.9-4                 
## [165] yaml_2.3.12                 MatrixGenerics_1.25.0      
## [167] foreign_0.8-91              cli_3.6.6                  
## [169] purrr_1.2.2                 stats4_4.6.1               
## [171] lifecycle_1.0.5             uwot_0.2.4                 
## [173] M3Drop_1.39.0               Biobase_2.73.1             
## [175] RcppPlanc_2.0.15            mvtnorm_1.4-1              
## [177] bluster_1.23.0              backports_1.5.1            
## [179] annotate_1.91.0             BiocParallel_1.47.0        
## [181] distr_2.9.7                 timechange_0.4.0           
## [183] gtable_0.3.6                ggridges_0.5.7             
## [185] densEstBayes_1.0-2.2        progressr_0.19.0           
## [187] parallel_4.6.1              limma_3.69.2               
## [189] jsonlite_2.0.0              edgeR_4.11.3               
## [191] RcppHNSW_0.7.0              bitops_1.0-9               
## [193] ggplot2_4.0.3               bit64_4.8.2                
## [195] assertthat_0.2.1            Rtsne_0.17                 
## [197] spatstat.utils_3.2-3        BiocNeighbors_2.7.2        
## [199] SeuratObject_5.4.0          RcppParallel_5.1.11-2      
## [201] bdsmatrix_1.3-7             jquerylib_0.1.4            
## [203] metapod_1.21.0              dqrng_0.4.1                
## [205] loo_2.9.0                   spatstat.univar_3.2-0      
## [207] lazyeval_0.2.3              shiny_1.14.0               
## [209] ruv_0.9.7.1                 htmltools_0.5.9            
## [211] sctransform_0.4.3           glue_1.8.1                 
## [213] spam_2.11-4                 rliger_2.2.1               
## [215] ResidualMatrix_1.23.0       XVector_0.53.0             
## [217] scMerge_1.29.0              scran_1.41.1               
## [219] mclust_6.1.2                classInt_0.4-11            
## [221] gridExtra_2.3.1             sccore_1.0.7               
## [223] igraph_2.3.2                R6_2.6.1                   
## [225] sva_3.61.0                  tidyr_1.3.2                
## [227] SingleCellExperiment_1.35.1 gplots_3.3.0               
## [229] forcats_1.0.1               anndata_0.8.0              
## [231] zellkonverter_1.23.0        cluster_2.1.8.2            
## [233] bbmle_1.0.25.1              Rhdf5lib_2.1.0             
## [235] mcprogress_0.1.1            rstantools_2.6.0           
## [237] DelayedArray_0.39.3         tidyselect_1.2.1           
## [239] vipor_0.4.7                 htmlTable_2.5.0            
## [241] inline_0.3.21               AnnotationDbi_1.75.0       
## [243] future_1.70.0               leidenAlg_1.1.8            
## [245] rsvd_1.0.5                  KernSmooth_2.23-26         
## [247] S7_0.2.2                    data.table_1.18.4          
## [249] htmlwidgets_1.6.4           RColorBrewer_1.1-3         
## [251] rlang_1.2.0                 spatstat.sparse_3.2-0      
## [253] spatstat.explore_3.8-1      beeswarm_0.4.0