The batchelor
package provides the batchCorrect()
generic that dispatches
on its PARAM
argument. Users writing code using
batchCorrect()
can easily change one method for another by
simply modifying the class of object supplied as PARAM
. For
example:
B1 <- matrix(rnorm(10000), ncol=50) # Batch 1
B2 <- matrix(rnorm(10000), ncol=50) # Batch 2
# Switching easily between batch correction methods.
m.out <- batchCorrect(B1, B2, PARAM=ClassicMnnParam())
f.out <- batchCorrect(B1, B2, PARAM=FastMnnParam(d=20))
r.out <- batchCorrect(B1, B2, PARAM=RescaleParam(pseudo.count=0))
Developers of other packages can extend this further by adding their batch correction methods to this dispatch system. This improves interoperability across packages by allowing users to easily experiment with different methods.
You will need to Imports: batchelor, methods
in your
DESCRIPTION
file. You will also need to add
import(methods)
,
importFrom(batchelor, "batchCorrect")
and
importClassesFrom(batchelor, "BatchelorParam")
in your
NAMESPACE
file.
Obviously, you will also need to have a function that implements your batch correction method. For demonstration purposes, we will use an identity function that simply returns the input values1. This is implemented like so:
BatchelorParam
subclassWe need to define a new BatchelorParam
subclass that
instructs the batchCorrect()
generic to dispatch to our new
method. This is most easily done like so:
Note that BatchelorParam
itself is derived from a
SimpleList
and can be modified with standard list operators
like $
.
## NothingParam of length 0
## NothingParam of length 1
## names(1): some_value
If no parameters are set, the default values in the function will be used2. Additional slots can be specified in the class definition if there are important parameters that need to be manually specified by the user.
batchCorrect
methodThe batchCorrect()
generic looks like this:
## standardGeneric for "batchCorrect" defined from package "batchelor"
##
## function (..., batch = NULL, restrict = NULL, subset.row = NULL,
## correct.all = FALSE, assay.type = NULL, PARAM)
## standardGeneric("batchCorrect")
## <bytecode: 0x557a68c78680>
## <environment: 0x557a68c699d0>
## Methods may be defined for arguments: PARAM
## Use showMethods(batchCorrect) for currently available ones.
Any implemented method must accept one or more matrix-like objects
containing single-cell gene expression matrices in ...
.
Rows are assumed to be genes and columns are assumed to be cells. If
only one object is supplied, batch
must be specified to
indicate the batches to which each cell belongs.
Alternatively, one or more SingleCellExperiment
objects
can be supplied, containing the gene expression matrix in the
assay.type
assay. These should not be mixed with
matrix-like objects, i.e., if one object is a
SingleCellExperiment
, all objects should be
SingleCellExperiment
s.
The subset.row=
argument specifies a subset of genes on
which to perform the correction. The correct.all=
argument
specifies whether corrected values should be returned for all genes, by
“extrapolating” from the results for the genes that were used3. See the
Output section below for the expected output from each combination of
these settings.
The restrict=
argument allows users to compute the
correction using only a subset of cells in each batch (e.g., from cell
controls). The correction is then “extrapolated” to all cells in the
batch4,
such that corrected values are returned for all cells.
Any implemented method must return a
SingleCellExperiment
where the first assay contains
corrected gene expression values for all genes. Corrected values should
be returned for all genes if correct.all=TRUE
or
subset.row=NULL
. If correct.all=FALSE
and
subset.row
is not NULL
, values should only be
returned for the selected genes.
Cells should be reported in the same order that they are supplied in
...
. In cases with multiple batches, the cell identities
are simply concatenated from successive objects in their specified
order, i.e., all cells from the first object (in their provided order),
then all cells from the second object, and so on. If there is only a
single batch, the order of cells in that batch should be preserved.
The output object should have row names equal to the row names of the
input objects. Column names should be equivalent to the concatenated
column names of the input objects, unless all are NULL
, in
which case the column names in the output can be NULL
. In
situations where some input objects have column names, and others are
NULL
, those missing column names should be filled in with
empty strings. This represents the expected behaviour when
cbind
ing multiple matrices together.
Finally, the colData
slot should contain ‘batch’, a
vector specifying the batch of origin for each cell.
Finally, we define a method that calls our noCorrect
function while satisfying all of the above input/output requirements. To
be clear, it is not mandatory to lay out the code as shown below; this
is simply one way that all the requirements can be satisfied. We have
used some internal batchelor
functions for brevity - please contact us if you find these useful and
want them to be exported.
setMethod("batchCorrect", "NothingParam", function(..., batch = NULL,
restrict=NULL, subset.row = NULL, correct.all = FALSE,
assay.type = "logcounts", PARAM)
{
batches <- list(...)
checkBatchConsistency(batches)
# Pulling out information from the SCE objects.
is.sce <- checkIfSCE(batches)
if (any(is.sce)) {
batches[is.sce] <- lapply(batches[is.sce], assay, i=assay.type)
}
# Subsetting by 'batch', if only one object is supplied.
do.split <- length(batches)==1L
if (do.split) {
divided <- divideIntoBatches(batches[[1]], batch=batch, restrict=restrict)
batches <- divided$batches
restrict <- divided$restricted
}
# Subsetting by row.
# This is a per-gene "method", so correct.all=TRUE will ignore subset.row.
# More complex methods will need to handle this differently.
if (correct.all) {
subset.row <- NULL
} else if (!is.null(subset.row)) {
subset.row <- normalizeSingleBracketSubscript(originals[[1]], subset.row)
batches <- lapply(batches, "[", i=subset.row, , drop=FALSE)
}
# Don't really need to consider restrict!=NULL here, as this function
# doesn't do anything with the cells anyway.
output <- do.call(noCorrect, batches)
# Reordering the output for correctness if it was previously split.
if (do.split) {
d.reo <- divided$reorder
output <- output[,d.reo,drop=FALSE]
}
ncells.per.batch <- vapply(batches, FUN=ncol, FUN.VALUE=0L)
batch.names <- names(batches)
if (is.null(batch.names)) {
batch.names <- seq_along(batches)
}
SingleCellExperiment(list(corrected=output),
colData=DataFrame(batch=rep(batch.names, ncells.per.batch)))
})
And it works5:
## class: SingleCellExperiment
## dim: 200 100
## metadata(0):
## assays(1): corrected
## rownames: NULL
## rowData names(0):
## colnames: NULL
## colData names(1): batch
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
Remember to export both the new method and the
NothingParam
class and constructor.
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] scater_1.35.0 ggplot2_3.5.1
## [3] scran_1.35.0 scuttle_1.17.0
## [5] scRNAseq_2.20.0 batchelor_1.23.0
## [7] SingleCellExperiment_1.29.1 SummarizedExperiment_1.37.0
## [9] Biobase_2.67.0 GenomicRanges_1.59.1
## [11] GenomeInfoDb_1.43.2 IRanges_2.41.1
## [13] S4Vectors_0.45.2 BiocGenerics_0.53.3
## [15] generics_0.1.3 MatrixGenerics_1.19.0
## [17] matrixStats_1.4.1 knitr_1.49
## [19] BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] sys_3.4.3 jsonlite_1.8.9
## [3] magrittr_2.0.3 ggbeeswarm_0.7.2
## [5] GenomicFeatures_1.59.1 gypsum_1.3.0
## [7] farver_2.1.2 rmarkdown_2.29
## [9] BiocIO_1.17.1 zlibbioc_1.52.0
## [11] vctrs_0.6.5 memoise_2.0.1
## [13] Rsamtools_2.23.1 DelayedMatrixStats_1.29.0
## [15] RCurl_1.98-1.16 htmltools_0.5.8.1
## [17] S4Arrays_1.7.1 AnnotationHub_3.15.0
## [19] curl_6.0.1 BiocNeighbors_2.1.1
## [21] Rhdf5lib_1.29.0 SparseArray_1.7.2
## [23] rhdf5_2.51.0 sass_0.4.9
## [25] alabaster.base_1.7.2 bslib_0.8.0
## [27] alabaster.sce_1.7.0 httr2_1.0.7
## [29] cachem_1.1.0 ResidualMatrix_1.17.0
## [31] buildtools_1.0.0 GenomicAlignments_1.43.0
## [33] igraph_2.1.1 lifecycle_1.0.4
## [35] pkgconfig_2.0.3 rsvd_1.0.5
## [37] Matrix_1.7-1 R6_2.5.1
## [39] fastmap_1.2.0 GenomeInfoDbData_1.2.13
## [41] digest_0.6.37 colorspace_2.1-1
## [43] AnnotationDbi_1.69.0 dqrng_0.4.1
## [45] irlba_2.3.5.1 ExperimentHub_2.15.0
## [47] RSQLite_2.3.8 beachmat_2.23.2
## [49] labeling_0.4.3 filelock_1.0.3
## [51] fansi_1.0.6 httr_1.4.7
## [53] abind_1.4-8 compiler_4.4.2
## [55] withr_3.0.2 bit64_4.5.2
## [57] BiocParallel_1.41.0 viridis_0.6.5
## [59] DBI_1.2.3 HDF5Array_1.35.1
## [61] alabaster.ranges_1.7.0 alabaster.schemas_1.7.0
## [63] rappdirs_0.3.3 DelayedArray_0.33.2
## [65] rjson_0.2.23 bluster_1.17.0
## [67] tools_4.4.2 vipor_0.4.7
## [69] beeswarm_0.4.0 glue_1.8.0
## [71] restfulr_0.0.15 rhdf5filters_1.19.0
## [73] grid_4.4.2 Rtsne_0.17
## [75] cluster_2.1.6 gtable_0.3.6
## [77] ensembldb_2.31.0 metapod_1.15.0
## [79] BiocSingular_1.23.0 ScaledMatrix_1.15.0
## [81] utf8_1.2.4 XVector_0.47.0
## [83] ggrepel_0.9.6 BiocVersion_3.21.1
## [85] pillar_1.9.0 limma_3.63.2
## [87] dplyr_1.1.4 BiocFileCache_2.15.0
## [89] lattice_0.22-6 rtracklayer_1.67.0
## [91] bit_4.5.0 tidyselect_1.2.1
## [93] locfit_1.5-9.10 maketools_1.3.1
## [95] Biostrings_2.75.1 gridExtra_2.3
## [97] ProtGenerics_1.39.0 edgeR_4.5.0
## [99] xfun_0.49 statmod_1.5.0
## [101] UCSC.utils_1.3.0 lazyeval_0.2.2
## [103] yaml_2.3.10 evaluate_1.0.1
## [105] codetools_0.2-20 tibble_3.2.1
## [107] alabaster.matrix_1.7.2 BiocManager_1.30.25
## [109] cli_3.6.3 munsell_0.5.1
## [111] jquerylib_0.1.4 Rcpp_1.0.13-1
## [113] dbplyr_2.5.0 png_0.1-8
## [115] XML_3.99-0.17 parallel_4.4.2
## [117] blob_1.2.4 AnnotationFilter_1.31.0
## [119] sparseMatrixStats_1.19.0 bitops_1.0-9
## [121] viridisLite_0.4.2 alabaster.se_1.7.0
## [123] scales_1.3.0 crayon_1.5.3
## [125] rlang_1.1.4 KEGGREST_1.47.0
Not a very good correction, but that’s not the point right now.↩︎
Here there are none in noCorrect()
, but
presumably your function is more complex than that.↩︎
If your method cannot support this option, setting it to
TRUE
should raise an error.↩︎
Again, if your method cannot support this, any
non-NULL
value of restrict
should raise an
error.↩︎
In a strictly programming sense, as the method itself does no correction at all.↩︎