Title: | Single-Cell Batch Correction Methods |
---|---|
Description: | Implements a variety of methods for batch correction of single-cell (RNA sequencing) data. This includes methods based on detecting mutually nearest neighbors, as well as several efficient variants of linear regression of the log-expression values. Functions are also provided to perform global rescaling to remove differences in depth between batches, and to perform a principal components analysis that is robust to differences in the numbers of cells across batches. |
Authors: | Aaron Lun [aut, cre], Laleh Haghverdi [ctb] |
Maintainer: | Aaron Lun <[email protected]> |
License: | GPL-3 |
Version: | 1.23.0 |
Built: | 2024-12-29 04:10:16 UTC |
Source: | https://github.com/bioc/batchelor |
A generalization of applySCE
to apply a function to corresponding parts of multiple SingleCellExperiments,
each of which have one or more alternative Experiments.
applyMultiSCE( ..., FUN, WHICH = NULL, COMMON.ARGS = list(), MAIN.ARGS = list(), ALT.ARGS = list(), SIMPLIFY = TRUE )
applyMultiSCE( ..., FUN, WHICH = NULL, COMMON.ARGS = list(), MAIN.ARGS = list(), ALT.ARGS = list(), SIMPLIFY = TRUE )
... |
One or more SingleCellExperiment objects containing counts and size factors. Each object should contain the same number of rows, corresponding to the same genes in the same order. If multiple objects are supplied, each object is assumed to contain all and only cells from a single batch.
If a single object is supplied, Alternatively, one or more lists of SingleCellExperiments can be provided;
this is flattened as if the objects inside were passed directly to |
FUN |
Any function that accepts multiple SummarizedExperiment or SingleCellExperiment objects. |
WHICH |
A character or integer vector containing the names or positions of alternative Experiments to loop over.
Defaults to all alternative Experiments that are present in each SingleCellExperiment in |
COMMON.ARGS |
Further (named) arguments to pass to all calls to |
MAIN.ARGS |
A named list of arguments to pass to calls to |
ALT.ARGS |
A named list where each entry is named after an alternative Experiment and contains named arguments to use in |
SIMPLIFY |
Logical scalar indicating whether the output should be simplified to one or more SingleCellExperiments. |
This function is a generalization of applySCE
whereby corresponding Experiments from ...
are passed to FUN
.
To illustrate, if we passed objects x
, y
and z
in ...
:
We first call FUN
on the set of all main Experiments from ...
, obtaining a result equivalent to FUN(x, y, z)
(more on the other arguments later).
Then we call FUN
on the set of all first alternative Experiments.
This is equivalent to FUN(altExp(x), altExp(y), altExp(z))
.
Then we call FUN
on the set of all second alternative Experiments.
This is equivalent to FUN(altExp(x, 2), altExp(y, 2), altExp(z, 2))
.
And so on.
In effect, much like applySCE
is analogous to lapply
, applyMultiSCE
is analogous to mapply
.
This allows users to easily apply the same function to all the Experiments (main and alternative) in a list of SingleCellExperiment objects.
Arguments in COMMON.ARGS
(plus some extra arguments, see below) are passed to all calls to FUN
.
Arguments in MAIN.ARGS
are only used in the call to FUN
on the main Experiments.
Arguments in ALT.ARGS
are passed to the call to FUN
on the alternative Experiments of the same name.
For the last two, any arguments therein will override arguments of the same name in COMMON.ARGS
.
Arguments in ...
that are not SingleCellExperiments are actually treated as additional arguments for COMMON.ARGS
.
This is purely intended as a user convenience, to avoid the need to write COMMON.ARGS=list()
when specifying these arguments.
However, explicitly using COMMON.ARGS
is the safer approach and recommended for developers.
By default, looping is performed over alternative Experiments with names that are present across all entries of ...
.
Values of WHICH
should be unique if any simplification of the output is desired.
If MAIN.ARGS=NULL
, the main Experiment is ignored and the function is only applied to the alternative Experiments.
The default of SIMPLIFY=TRUE
is aims to make the output easier to manipulate.
If FUN
returns a SingleCellExperiment, the outputs across main and alternative Experiments are simplified into a single SingleCellExperiment.
If FUN
returns a list of SingleCellExperiments of the same length, the outputs are simplified into one list of SingleCellExperiments.
This assumes that WHICH
contains no more than one reference to each alternative Experiment in x
.
In most cases or when SIMPLIFY=FALSE
, a list is returned containing the output of FUN
applied to each corresponding Experiment across all ...
.
If MAIN.ARGS
is not NULL
, the first entry corresponds to the result generated from the main Experiments;
all other results are generated according to the entries specified in WHICH
and are named accordingly.
If SIMPLIFY=TRUE
and certain conditions are fulfilled, we can either return:
A single SingleCellExperiment, if all calls to FUN
return a SingleCellExperiment.
Here, the results of FUN
on the main/alternative Experiments in ...
are mapped to the main or alternative Experiments of the same name in the output.
A list of SingleCellExperiments, if all calls to FUN
return a list of SingleCellExperiments of the same length.
The altExps
of each output SingleCellExperiment contains the results from the corresponding call to FUN
on the alternative Experiments of the same name in ...
.
In both cases, the aim is to mirror the organization of Experiments in each entry of ...
.
Aaron Lun
applySCE
, for the simpler version that involves only one SingleCellExperiment object.
simplifyToSCE
, for the conditions required for simplification.
# Setting up some objects with alternative Experiments. d1 <- matrix(rnbinom(50000, mu=10, size=1), ncol=100) sce1 <- SingleCellExperiment(list(counts=d1)) sizeFactors(sce1) <- runif(ncol(d1)) altExp(sce1, "Spike") <- sce1 altExp(sce1, "Protein") <- sce1 d2 <- matrix(rnbinom(20000, mu=50, size=1), ncol=40) sce2 <- SingleCellExperiment(list(counts=d2)) sizeFactors(sce2) <- runif(ncol(d2)) altExp(sce2, "Spike") <- sce2 altExp(sce2, "Protein") <- sce2 # Applying a function over the main and alternative experiments. normed <- applyMultiSCE(sce1, sce2, FUN=multiBatchNorm) normed altExp(normed[[1]]) # contains log-normalized values regressed <- applyMultiSCE(normed, FUN=regressBatches) regressed altExp(regressed) # contains corrected expression values rescaled <- applyMultiSCE(normed, FUN=rescaleBatches) rescaled altExp(rescaled) # contains corrected expression values # We can also specify 'batch=' directly. combined <- cbind(sce1, sce2) batch <- rep(1:2, c(ncol(sce1), ncol(sce2))) normed <- applyMultiSCE(combined, batch=batch, FUN=multiBatchNorm) normed altExp(normed) # contains log-normalized values regressed <- applyMultiSCE(normed, batch=batch, FUN=regressBatches) regressed altExp(regressed) # contains corrected expression values rescaled <- applyMultiSCE(normed, batch=batch, FUN=rescaleBatches) rescaled altExp(rescaled) # contains corrected expression values
# Setting up some objects with alternative Experiments. d1 <- matrix(rnbinom(50000, mu=10, size=1), ncol=100) sce1 <- SingleCellExperiment(list(counts=d1)) sizeFactors(sce1) <- runif(ncol(d1)) altExp(sce1, "Spike") <- sce1 altExp(sce1, "Protein") <- sce1 d2 <- matrix(rnbinom(20000, mu=50, size=1), ncol=40) sce2 <- SingleCellExperiment(list(counts=d2)) sizeFactors(sce2) <- runif(ncol(d2)) altExp(sce2, "Spike") <- sce2 altExp(sce2, "Protein") <- sce2 # Applying a function over the main and alternative experiments. normed <- applyMultiSCE(sce1, sce2, FUN=multiBatchNorm) normed altExp(normed[[1]]) # contains log-normalized values regressed <- applyMultiSCE(normed, FUN=regressBatches) regressed altExp(regressed) # contains corrected expression values rescaled <- applyMultiSCE(normed, FUN=rescaleBatches) rescaled altExp(rescaled) # contains corrected expression values # We can also specify 'batch=' directly. combined <- cbind(sce1, sce2) batch <- rep(1:2, c(ncol(sce1), ncol(sce2))) normed <- applyMultiSCE(combined, batch=batch, FUN=multiBatchNorm) normed altExp(normed) # contains log-normalized values regressed <- applyMultiSCE(normed, batch=batch, FUN=regressBatches) regressed altExp(regressed) # contains corrected expression values rescaled <- applyMultiSCE(normed, batch=batch, FUN=rescaleBatches) rescaled altExp(rescaled) # contains corrected expression values
A common interface for single-cell batch correction methods.
batchCorrect( ..., batch = NULL, restrict = NULL, subset.row = NULL, correct.all = FALSE, assay.type = NULL, PARAM ) ## S4 method for signature 'ClassicMnnParam' batchCorrect( ..., batch = NULL, restrict = NULL, subset.row = NULL, correct.all = FALSE, assay.type = "logcounts", PARAM ) ## S4 method for signature 'FastMnnParam' batchCorrect( ..., batch = NULL, restrict = NULL, subset.row = NULL, correct.all = FALSE, assay.type = "logcounts", PARAM ) ## S4 method for signature 'RescaleParam' batchCorrect( ..., batch = NULL, restrict = NULL, subset.row = NULL, correct.all = FALSE, assay.type = "logcounts", PARAM ) ## S4 method for signature 'RegressParam' batchCorrect( ..., batch = NULL, restrict = NULL, subset.row = NULL, correct.all = FALSE, assay.type = "logcounts", PARAM ) ## S4 method for signature 'NoCorrectParam' batchCorrect( ..., batch = NULL, restrict = NULL, subset.row = NULL, correct.all = FALSE, assay.type = "logcounts", PARAM )
batchCorrect( ..., batch = NULL, restrict = NULL, subset.row = NULL, correct.all = FALSE, assay.type = NULL, PARAM ) ## S4 method for signature 'ClassicMnnParam' batchCorrect( ..., batch = NULL, restrict = NULL, subset.row = NULL, correct.all = FALSE, assay.type = "logcounts", PARAM ) ## S4 method for signature 'FastMnnParam' batchCorrect( ..., batch = NULL, restrict = NULL, subset.row = NULL, correct.all = FALSE, assay.type = "logcounts", PARAM ) ## S4 method for signature 'RescaleParam' batchCorrect( ..., batch = NULL, restrict = NULL, subset.row = NULL, correct.all = FALSE, assay.type = "logcounts", PARAM ) ## S4 method for signature 'RegressParam' batchCorrect( ..., batch = NULL, restrict = NULL, subset.row = NULL, correct.all = FALSE, assay.type = "logcounts", PARAM ) ## S4 method for signature 'NoCorrectParam' batchCorrect( ..., batch = NULL, restrict = NULL, subset.row = NULL, correct.all = FALSE, assay.type = "logcounts", PARAM )
... |
One or more matrix-like objects containing single-cell gene expression matrices. Alternatively, one or more SingleCellExperiment objects can be supplied. If multiple objects are supplied, each object is assumed to contain all and only cells from a single batch.
Objects of different types can be mixed together.
If a single object is supplied, |
batch |
A factor specifying the batch of origin for each cell if only one batch is supplied in |
restrict |
A list of length equal to the number of objects in |
subset.row |
A vector specifying the subset of genes to use for correction.
Defaults to |
correct.all |
A logical scalar indicating whether to return corrected expression values for all genes, even if |
assay.type |
A string or integer scalar specifying the assay to use for correction. Only used for SingleCellExperiment inputs. |
PARAM |
A BatchelorParam object specifying the batch correction method to dispatch to, and the parameters with which it should be run.
ClassicMnnParam will dispatch to |
Users can pass parameters to each method directly via ...
or via the constructors for PARAM
.
While there is no restriction on which parameters go where, we recommend only passing data-agnostic and method-specific parameters to PARAM
.
Data-dependent parameters - and indeed, the data themselves - should be passed in via ...
.
This means that different data sets can be used without modifying PARAM
, allowing users to switch to a different algorithm by only changing PARAM
.
A SingleCellExperiment where the first assay contains corrected gene expression values for all genes.
Corrected values should be returned for all genes if subset.row=NULL
or if correct.all=TRUE
;
otherwise they should only be returned for the genes in the subset.
Cells should be reported in the same order that they are supplied. In cases with multiple batches, the cell identities are simply concatenated from successive objects in their specified order, i.e., all cells from the first object (in their provided order), then all cells from the second object, and so on. For a single input object, cells should be reported in the same order as the input.
The colData
slot should contain batch
, a vector specifying the batch of origin for each cell.
Aaron Lun
BatchelorParam classes to determine dispatch.
correctExperiments
, to obtain corrected values while retaining the original expression data.
B1 <- matrix(rnorm(10000), ncol=50) # Batch 1 B2 <- matrix(rnorm(10000), ncol=50) # Batch 2 # Switching easily between batch correction methods. m.out <- batchCorrect(B1, B2, PARAM=ClassicMnnParam()) f.out <- batchCorrect(B1, B2, PARAM=FastMnnParam(d=20)) r.out <- batchCorrect(B1, B2, PARAM=RescaleParam(pseudo.count=0)) n.out <- batchCorrect(B1, B2, PARAM=NoCorrectParam())
B1 <- matrix(rnorm(10000), ncol=50) # Batch 1 B2 <- matrix(rnorm(10000), ncol=50) # Batch 2 # Switching easily between batch correction methods. m.out <- batchCorrect(B1, B2, PARAM=ClassicMnnParam()) f.out <- batchCorrect(B1, B2, PARAM=FastMnnParam(d=20)) r.out <- batchCorrect(B1, B2, PARAM=RescaleParam(pseudo.count=0)) n.out <- batchCorrect(B1, B2, PARAM=NoCorrectParam())
Using restriction
It is possible to compute the correction using only a subset of cells in each batch, and then extrapolate that correction to all other cells. This may be desirable in experimental designs where a control set of cells from the same source population were run on different batches. Any difference in the controls must be artificial in origin and can be directly removed without making further biological assumptions. Similarly, if certain cells are known to be of a batch-specific subpopulation, it may be desirable to exclude them to ensure that they are not inadvertently used during the batch correction.
restrict
argumentTo perform restriction, users should set restrict
to specify the subset of cells in each batch to be used for correction.
This should be set to a list of length equal to the number of objects passed to the ...
argument of the batch correction function.
Each element of this list should be a subsetting vector to be applied to the columns of the corresponding batch.
A NULL
element indicates that all the cells from a batch should be used.
In situations where one input object contains multiple batches, restrict
should simply a list containing a single subsetting vector for that object.
Correction functions that support restrict
will only use the restricted subset of cells in each batch to perform the correction.
For example, fastMNN
will only use the restricted cells to identify MNN pairs and the center of the orthogonalization.
However, it will apply the correction to all cells in each batch - hence the extrapolation.
This means that the output is always of the same dimensionality, regardless of whether restrict
is specified.
As a general rule, users can expect the corrected values in the restricted cells to be the same as if the inputs were directly subsetted to only contain those cells (see Examples).
This is appealing as it demonstrates that correction only uses information from the restricted subset of cells.
If batch correction functions do not follow this rule, they will explicitly state so, e.g., in ?fastMNN
.
Aaron Lun
rescaleBatches
, regressBatches
, fastMNN
and mnnCorrect
,
as examples of batch correction methods that support restriction.
means <- 2^rgamma(1000, 2, 1) A1 <- matrix(rpois(10000, lambda=means), ncol=50) # Batch 1 A2 <- matrix(rpois(10000, lambda=means*runif(1000, 0, 2)), ncol=50) # Batch 2 B1 <- log2(A1 + 1) B2 <- log2(A2 + 1) out <- regressBatches(B1, B2, restrict=list(1:10, 1:10)) assay(out)[,c(1:10, 50+1:10)] # Compare to actual subsetting: out.sub <- regressBatches(B1[,1:10], B2[,1:10]) assay(out.sub)
means <- 2^rgamma(1000, 2, 1) A1 <- matrix(rpois(10000, lambda=means), ncol=50) # Batch 1 A2 <- matrix(rpois(10000, lambda=means*runif(1000, 0, 2)), ncol=50) # Batch 2 B1 <- log2(A1 + 1) B2 <- log2(A2 + 1) out <- regressBatches(B1, B2, restrict=list(1:10, 1:10)) assay(out)[,c(1:10, 50+1:10)] # Compare to actual subsetting: out.sub <- regressBatches(B1[,1:10], B2[,1:10]) assay(out.sub)
Constructors and methods for the batchelor parameter classes.
ClassicMnnParam(...) FastMnnParam(...) RescaleParam(...) RegressParam(...) NoCorrectParam(...)
ClassicMnnParam(...) FastMnnParam(...) RescaleParam(...) RegressParam(...) NoCorrectParam(...)
... |
Named arguments to pass to individual methods upon dispatch.
These should not include arguments named in the |
BatchelorParam objects are intended to store method-specific parameter settings to pass to the batchCorrect
generic.
These values should refer to data-agnostic parameters; parameters that depend on data (or the data itself) should be specified directly in the batchCorrect
call.
The BatchelorParam classes are all derived from SimpleList objects and have the same available methods, e.g., [[
, $
.
These can be used to access or modify the object after construction.
Note that the BatchelorParam class itself is not useful and should not be constructed directly. Instead, users should use the constructors shown above to create instances of the desired subclass.
The constructors will return a BatchelorParam object of the specified subclass, containing parameter settings for the corresponding batch correction method.
Aaron Lun
batchCorrect
, where the BatchelorParam objects are used for dispatch to individual methods.
# Specifying the number of neighbors, dimensionality. fp <- FastMnnParam(k=20, d=10) fp # List-like behaviour: fp$k fp$k <- 10 fp$k
# Specifying the number of neighbors, dimensionality. fp <- FastMnnParam(k=20, d=10) fp # List-like behaviour: fp$k fp$k <- 10 fp$k
Utilities to check inputs into batch correction functions.
checkBatchConsistency(batches, cells.in.columns = TRUE) checkIfSCE(batches) checkRestrictions(batches, restrictions, cells.in.columns = TRUE)
checkBatchConsistency(batches, cells.in.columns = TRUE) checkIfSCE(batches) checkRestrictions(batches, restrictions, cells.in.columns = TRUE)
batches |
A list of batches, usually containing gene expression matrices or SingleCellExperiment objects. |
cells.in.columns |
A logical scalar specifying whether batches contain cells in the columns. |
restrictions |
A list of length equal to |
These functions are intended for internal use and other package developers.
checkBatchConsistency
will check whether the input batches
are consistent with respect to the size of the dimension containing features (i.e., not cells).
It will also verify that the dimension names are consistent, to avoid problems from variable ordering of rows/columns in the inputs.
checkRestrictions
will check whether restrictions
are consistent with the supplied batches
,
in terms of the length and names of the two lists.
It will also check that each batch contains at least one usable cell after restriction.
checkBatchConsistency
return an invisible NULL
if there are no errors.
checkIfSCE
will return a logical vector specifying whether each element of batches
is a SingleCellExperiment objects.
checkRestrictions
will return NULL
if restrictions=NULL
.
Otherwise, it will return a list by taking restrictions
and converting each non-NULL
element into an integer subsetting vector.
Aaron Lun
checkBatchConsistency(list(cbind(1:5), cbind(1:5, 2:6))) try( # fails checkBatchConsistency(list(cbind(1:5), cbind(1:4, 2:5))) )
checkBatchConsistency(list(cbind(1:5), cbind(1:5, 2:6))) try( # fails checkBatchConsistency(list(cbind(1:5), cbind(1:4, 2:5))) )
Perform MNN correction based on cluster centroids, using the corrected centroid coordinates to correct the per-cell expression values with a variable bandwidth Gaussian kernel.
clusterMNN( ..., batch = NULL, restrict = NULL, clusters, cluster.d = 50, cos.norm = TRUE, merge.order = NULL, auto.merge = FALSE, min.batch.skip = 0, subset.row = NULL, correct.all = FALSE, assay.type = "logcounts", BSPARAM = IrlbaParam(), BNPARAM = KmknnParam(), BPPARAM = SerialParam() )
clusterMNN( ..., batch = NULL, restrict = NULL, clusters, cluster.d = 50, cos.norm = TRUE, merge.order = NULL, auto.merge = FALSE, min.batch.skip = 0, subset.row = NULL, correct.all = FALSE, assay.type = "logcounts", BSPARAM = IrlbaParam(), BNPARAM = KmknnParam(), BPPARAM = SerialParam() )
... |
One or more log-expression matrices where genes correspond to rows and cells correspond to columns.
Alternatively, one or more SingleCellExperiment objects can be supplied containing a log-expression matrix in the If multiple objects are supplied, each object is assumed to contain all and only cells from a single batch.
If a single object is supplied, it is assumed to contain cells from all batches, so Alternatively, one or more lists of matrices or SingleCellExperiments can be provided;
this is flattened as if the objects inside each list were passed directly to |
batch |
A vector or factor specifying the batch of origin for all cells when only a single object is supplied in |
restrict |
A list of length equal to the number of objects in |
clusters |
A list of length equal to |
cluster.d |
Integer scalar indicating how many PCs should be used in clustering.
Only used if |
cos.norm |
A logical scalar indicating whether cosine normalization should be performed on the input data prior to PCA. |
merge.order |
An integer vector containing the linear merge order of batches in |
auto.merge |
Logical scalar indicating whether to automatically identify the “best” merge order. |
min.batch.skip |
Numeric scalar specifying the minimum relative magnitude of the batch effect, below which no correction will be performed at a given merge step. |
subset.row |
A vector specifying which features to use for correction. |
correct.all |
Logical scalar indicating whether corrected expression values should be computed for genes not in |
assay.type |
A string or integer scalar specifying the assay containing the log-expression values. Only used for SingleCellExperiment inputs. |
BSPARAM |
A BiocSingularParam object specifying the algorithm to use for PCA.
Only used if |
BNPARAM |
A BiocNeighborParam object specifying the nearest neighbor algorithm. |
BPPARAM |
A BiocParallelParam object specifying whether the PCA and nearest-neighbor searches should be parallelized. |
These functions are motivated by the scenario where each batch has been clustered separately
and each cluster has already been annotated with some meaningful biological state.
We want to identify which biological states match to each other across batches;
this is achieved by identifying mutual nearest neighbors based on the cluster centroids with reducedMNN
.
MNN pairs are identified with k=1
to ensure that each cluster has no more than one match in another batch.
This reduces the risk of inadvertently merging together different clusters from the same batch.
By comparison, higher values of k
may result in many-to-one mappings between batches
such that the correction will implicitly force different clusters together.
Using this guarantee of no-more-than-one mappings across batches, we can form meta-clusters by identifying all components of the resulting MNN graph. Each meta-cluster can be considered to represent some biological state (e.g., cell type), and all of its constituents are the matching clusters within each batch.
As an extra courtesy, clusterMNN
will also compute corrected values for each cell.
This is done by applying a Gaussian kernel to the correction vectors for the centroids,
where the bandwidth is proportional to the distance between that cell and the closest cluster centroid.
This yields a smooth correction function that avoids edge effects at cluster boundaries.
If clusters
is set to a BlusterParam object (see the bluster package),
a PCA is performed in each batch with the specified BSPARAM
.
The PCs are then used in clustering with clusterRows
to obtain a list of clusters.
This can be used to mimic per-cell batch correction in the absence of a priori clusters.
A SingleCellExperiment containing per-cell expression values where each row is a gene and each column is a cell.
This has the same format as the output of fastMNN
but with an additional cluster
field in the colData
containing the cluster identity of each cell.
The metadata
contains:
merge.info
, a DataFrame with the same format as the output of fastMNN
.
However, the pairs
and lost.var
refer to the cluster centroids, not the cells.
clusters
, a DataFrame with one row for each cluster across all batches in ...
.
This can be row-indexed by the values in pairs
to determine the identity of the clusters in each MNN pair.
An additional meta
column is provided that describes the meta-cluster to which each cluster belongs.
Aaron Lun
Lun ATL (2019). Cluster-based mutual nearest neighbors correction https://marionilab.github.io/FurtherMNN2018/theory/clusters.html
Lun ATL (2019). A discussion of the known failure points of the fastMNN algorithm. https://marionilab.github.io/FurtherMNN2018/theory/failure.html
reducedMNN
, which is used internally to perform the correction.
# Mocking up some data for multiple batches: means <- matrix(rnorm(3000), ncol=3) colnames(means) <- LETTERS[1:3] B1 <- means[,sample(LETTERS[1:3], 500, replace=TRUE)] B1 <- B1 + rnorm(length(B1)) B2 <- means[,sample(LETTERS[1:3], 500, replace=TRUE)] B2 <- B2 + rnorm(length(B2)) + rnorm(nrow(B2)) # batch effect. # Applying the correction with some made-up clusters: cluster1 <- kmeans(t(B1), centers=10)$cluster cluster2 <- kmeans(t(B2), centers=10)$cluster out <- clusterMNN(B1, B2, clusters=list(cluster1, cluster2)) rd <- reducedDim(out, "corrected") plot(rd[,1], rd[,2], col=out$batch) # Obtaining the clusters internally. out2 <- clusterMNN(B1, B2, clusters=bluster::NNGraphParam()) rd2 <- reducedDim(out2, "corrected") plot(rd2[,1], rd2[,2], col=out$batch)
# Mocking up some data for multiple batches: means <- matrix(rnorm(3000), ncol=3) colnames(means) <- LETTERS[1:3] B1 <- means[,sample(LETTERS[1:3], 500, replace=TRUE)] B1 <- B1 + rnorm(length(B1)) B2 <- means[,sample(LETTERS[1:3], 500, replace=TRUE)] B2 <- B2 + rnorm(length(B2)) + rnorm(nrow(B2)) # batch effect. # Applying the correction with some made-up clusters: cluster1 <- kmeans(t(B1), centers=10)$cluster cluster2 <- kmeans(t(B2), centers=10)$cluster out <- clusterMNN(B1, B2, clusters=list(cluster1, cluster2)) rd <- reducedDim(out, "corrected") plot(rd[,1], rd[,2], col=out$batch) # Obtaining the clusters internally. out2 <- clusterMNN(B1, B2, clusters=bluster::NNGraphParam()) rd2 <- reducedDim(out2, "corrected") plot(rd2[,1], rd2[,2], col=out$batch)
Convert low-dimensional corrected PCs to a SingleCellExperiment containing corrected expression values. This is a low-level function and most users should not need to call it.
convertPCsToSCE( corrected.df, pc.info, assay.name = "reconstructed", dimred.name = "corrected" )
convertPCsToSCE( corrected.df, pc.info, assay.name = "reconstructed", dimred.name = "corrected" )
corrected.df |
A DataFrame containing a nested matrix of low-dimensional |
pc.info |
A list containing PCA statistics, in particular a |
assay.name |
String specifying the name of the assay to use to store the corrected expression values. |
dimred.name |
String containing the name fo the |
The corrected expression values are obtained by simply taking the crossproduct of the corrected PCs with the rotation matrix. This reverses the original projection to PC space while retaining the effect of the correction. These values are best used for visualization; the low-dimensional corrected coordinates are more efficient for per-cell operations like clustering, while the original uncorrected expression values are safer to interpret for per-gene analyses.
A SingleCellExperiment containing a LowRankMatrix with the corrected per-gene expression values.
The colData
contains the batch identities, the rowData
contains the rotation matrix,
and the reducedDims
contains the low-dimensional corrected values (if dimred.name
is not NULL
).
All additional metadata from corrected.df
and pc.info
is stored in metadata
.
Aaron Lun
reducedMNN
, to compute corrected.df
; and multiBatchPCA
, to compute pc.info
.
fastMNN
, which uses this function to obtain low-rank corrected values.
B1 <- matrix(rnorm(10000), nrow=50) # Batch 1 B2 <- matrix(rnorm(10000), nrow=50) # Batch 2 # Equivalent to fastMNN(). cB1 <- cosineNorm(B1) cB2 <- cosineNorm(B2) pcs <- multiBatchPCA(cB1, cB2) mnn.out <- reducedMNN(pcs[[1]], pcs[[2]]) sce <- convertPCsToSCE(mnn.out, metadata(pcs)) sce
B1 <- matrix(rnorm(10000), nrow=50) # Batch 1 B2 <- matrix(rnorm(10000), nrow=50) # Batch 2 # Equivalent to fastMNN(). cB1 <- cosineNorm(B1) cB2 <- cosineNorm(B2) pcs <- multiBatchPCA(cB1, cB2) mnn.out <- reducedMNN(pcs[[1]], pcs[[2]]) sce <- convertPCsToSCE(mnn.out, metadata(pcs)) sce
Apply a correction to multiple SingleCellExperiment objects,
while also combining the assay data and column metadata for easy downstream use.
This augments the simpler batchCorrect
function, which returns only the corrected values.
correctExperiments( ..., batch = NULL, restrict = NULL, subset.row = NULL, correct.all = FALSE, assay.type = "logcounts", PARAM = FastMnnParam(), combine.assays = NULL, combine.coldata = NULL, include.rowdata = TRUE, add.single = TRUE )
correctExperiments( ..., batch = NULL, restrict = NULL, subset.row = NULL, correct.all = FALSE, assay.type = "logcounts", PARAM = FastMnnParam(), combine.assays = NULL, combine.coldata = NULL, include.rowdata = TRUE, add.single = TRUE )
... |
One or more SingleCellExperiment objects.
If multiple objects are supplied, each object is assumed to contain all and only cells from a single batch.
If a single object is supplied, Alternatively, one or more lists of SingleCellExperiments can be provided;
this is flattened so that it is as if the objects inside were passed directly to |
batch |
A factor specifying the batch of origin for each cell if only one batch is supplied in |
restrict |
A list of length equal to the number of objects in |
subset.row |
A vector specifying the subset of genes to use for correction.
Defaults to |
correct.all |
A logical scalar indicating whether to return corrected expression values for all genes, even if |
assay.type |
A string or integer scalar specifying the assay to use for correction. |
PARAM |
A BatchelorParam object specifying the batch correction method to dispatch to, and the parameters with which it should be run.
ClassicMnnParam will dispatch to |
combine.assays |
Character vector specifying the assays from each entry of |
combine.coldata |
Character vector specifying the column metadata fields from each entry of |
include.rowdata |
Logical scalar indicating whether the function should attempt to include |
add.single |
Logical scalar indicating whether merged fields should be added to the original SingleCellExperiment.
Only relevant when a single object is provided in |
This function makes it easy to retain information from the original SingleCellExperiment objects in the post-merge object. Operations like differential expression analyses can be easily performed on the uncorrected expression values, while common annotation can be leveraged in cell-based analyses like clustering.
All assays shared across the original objects are cbind
ed and added to the merged object.
This can be controlled with combine.assays
.
Any original assay with the same name as an assay in the output of batchCorrect
will be ignored with a warning.
Any column metadata fields that are shared will also be included in the merged object.
This can be controlled with combine.coldata
.
If any existing field has the same name as any colData
field produced by batchCorrect
,
it will be ignored in favor of the latter.
Row metadata from ...
is included in the merged object if include.rowdata=TRUE
.
In such cases, only non-conflicting row data fields are preserved,
i.e., fields with different names or identically named fields with the same values between objects in ...
.
Any conflicting fields are ignored with a warning.
rowRanges
are only preserved if they are identical (ignoring the mcols
) for all objects in ...
.
If a single SingleCellExperiment object was supplied in ...
, the default behavior is to prepend all assays
, reducedDims
, colData
, rowData
and metadata
fields from the merged object into the original (removing any original entries with names that overlap those of the merged object).
This is useful as it preserves all (non-overlapping) aspects of the original object, especially the reduced dimensions that cannot, in general, be sensibly combined across multiple objects.
Setting add.single=FALSE
will force the creation of a new SingleCellExperiment rather than prepending.
A SingleCellExperiment containing the merged expression values in the first assay
and a batch
column metadata field specifying the batch of origin for each cell,
as described in batchCorrect
.
Aaron Lun
batchCorrect
, which does the correction inside this function.
noCorrect
, for another method to combine uncorrected assay values.
sce1 <- scuttle::mockSCE() sce1 <- scuttle::logNormCounts(sce1) sce2 <- scuttle::mockSCE() sce2 <- scuttle::logNormCounts(sce2) f.out <- correctExperiments(sce1, sce2) colData(f.out) assayNames(f.out)
sce1 <- scuttle::mockSCE() sce1 <- scuttle::logNormCounts(sce1) sce2 <- scuttle::mockSCE() sce2 <- scuttle::logNormCounts(sce2) f.out <- correctExperiments(sce1, sce2) colData(f.out) assayNames(f.out)
Perform cosine normalization on the column vectors of an expression matrix.
cosineNorm( x, mode = c("matrix", "all", "l2norm"), subset.row = NULL, BPPARAM = SerialParam() )
cosineNorm( x, mode = c("matrix", "all", "l2norm"), subset.row = NULL, BPPARAM = SerialParam() )
x |
A gene expression matrix with cells as columns and genes as rows. |
mode |
A string specifying the output to be returned. |
subset.row |
A vector specifying which features to use to compute the L2 norm. |
BPPARAM |
A BiocParallelParam object specifying how parallelization is to be performed.
Only used when |
Cosine normalization removes scaling differences between expression vectors.
In the context of batch correction, this is usually applied to remove differences between batches that are normalized separately.
For example, fastMNN
uses this function on the log-expression vectors by default.
Technically, separate normalization introduces scaling differences in the normalized expression, which should manifest as a shift in the log-transformed expression. However, in practice, single-cell data will contain many small counts (where the log function is near-linear) or many zeroes (which remain zero when the pseudo-count is 1). In these applications, scaling differences due to separate normalization are better represented as scaling differences in the log-transformed values.
If applied to the raw count vectors, cosine normalization is similar to library size-related (i.e., L1) normalization. However, we recommend using dedicated methods for computing size factors to normalize raw count data.
While the default is to directly return the cosine-normalized matrix, it may occasionally be desirable to obtain the L2 norm,
e.g., to apply an equivalent normalization to other matrices.
This can be achieved by setting mode
accordingly.
The function will return a DelayedMatrix if x
is a DelayedMatrix.
This aims to delay the calculation of cosine-normalized values for very large matrices.
If mode="matrix"
, a double-precision matrix of the same dimensions as X
is returned, containing cosine-normalized values.
If mode="l2norm"
, a double-precision vector is returned containing the L2 norm for each cell.
If mode="all"
, a named list is returned containing the fields "matrix"
and "l2norm"
, which are as described above.
Aaron Lun
mnnCorrect
and fastMNN
, where this function gets used.
A <- matrix(rnorm(1000), nrow=10) str(cosineNorm(A)) str(cosineNorm(A, mode="l2norm"))
A <- matrix(rnorm(1000), nrow=10) str(cosineNorm(A)) str(cosineNorm(A, mode="l2norm"))
A variety of correction diagnostics that make use of clustering information, usually obtained by clustering on cells from all batches in the corrected data.
clusterAbundanceTest(x, batch) clusterAbundanceVar(x, batch, pseudo.count = 10)
clusterAbundanceTest(x, batch) clusterAbundanceVar(x, batch, pseudo.count = 10)
x |
A factor or vector specifying the assigned cluster for each cell in each batch in the corrected data. Alternatively, a matrix or table containing the number of cells in each cluster (row) and batch (column). |
batch |
A factor or vector specifying the batch of origin for each cell.
Ignored if |
pseudo.count |
A numeric scalar containing the pseudo-count to use for the log-transformation. |
For clusterAbundanceTest
, the null hypothesis for each cluster is that the distribution of cells across batches is proportional to the total number of cells in each batch.
We then use chisq.test
to test for deviations from the expected proportions, possibly indicative of imperfect mixing across batches.
This works best for technical replicates where the population composition should be identical across batches.
However, the interpretation of the p-value loses its meaning for experiments where there is more biological variability between batches.
For clusterAbundanceVar
, we compute log-normalized abundances for each cluster using normalizeCounts
.
We then compute the variance of the log-abundances across batches for each cluster.
Large variances indicate that there are strong relative differences in abundance across batches, indicative of either imperfect mixing or genuine batch-specific subpopulations.
The idea is to rank clusters by their variance to prioritize them for manual inspection to decide between these two possibilities.
We use a large pseudo.count
by default to avoid spuriously large variances when the counts are low.
For clusterAbundanceTest
, a named numeric vector of p-values from applying Pearson's chi-squared test on each cluster.
For clusterAbundanceVar
, a named numeric vector of variances of log-abundances across batches for each cluster.
Aaron Lun
set.seed(1000) means <- 2^rgamma(1000, 2, 1) A1 <- matrix(rpois(10000, lambda=means), ncol=50) # Batch 1 A2 <- matrix(rpois(10000, lambda=means*runif(1000, 0, 2)), ncol=50) # Batch 2 B1 <- log2(A1 + 1) B2 <- log2(A2 + 1) out <- fastMNN(B1, B2) cluster1 <- kmeans(t(B1), centers=10)$cluster cluster2 <- kmeans(t(B2), centers=10)$cluster merged.cluster <- kmeans(reducedDim(out, "corrected"), centers=10)$cluster # Low p-values indicate unexpected differences in abundance. clusterAbundanceTest(paste("Cluster", merged.cluster), out$batch) # High variances indicate differences in normalized abundance. clusterAbundanceVar(paste("Cluster", merged.cluster), out$batch)
set.seed(1000) means <- 2^rgamma(1000, 2, 1) A1 <- matrix(rpois(10000, lambda=means), ncol=50) # Batch 1 A2 <- matrix(rpois(10000, lambda=means*runif(1000, 0, 2)), ncol=50) # Batch 2 B1 <- log2(A1 + 1) B2 <- log2(A2 + 1) out <- fastMNN(B1, B2) cluster1 <- kmeans(t(B1), centers=10)$cluster cluster2 <- kmeans(t(B2), centers=10)$cluster merged.cluster <- kmeans(reducedDim(out, "corrected"), centers=10)$cluster # Low p-values indicate unexpected differences in abundance. clusterAbundanceTest(paste("Cluster", merged.cluster), out$batch) # High variances indicate differences in normalized abundance. clusterAbundanceVar(paste("Cluster", merged.cluster), out$batch)
Divide a single input object into multiple separate objects according to their batch of origin.
divideIntoBatches(x, batch, byrow = FALSE, restrict = NULL)
divideIntoBatches(x, batch, byrow = FALSE, restrict = NULL)
x |
A matrix-like object where one dimension corresponds to cells and another represents features. |
batch |
A factor specifying the batch to which each cell belongs. |
byrow |
A logical scalar indicating whether rows correspond to cells. |
restrict |
A subsetting vector specifying which cells should be used for correction. |
This function is intended for internal use and other package developers.
It splits a single input object into multiple batches, allowing developers to use the same code for the scenario where batch
is supplied with a single input.
A list containing:
batches
, a named list of matrix-like objects where each element corresponds to a level of batch
and contains all cells from that batch.
reorder
, an integer vector to be applied to the combined batches
to recover the ordering of cells in x
.
restricted
, a named list of integer vectors specifying which cells are to be used for correction.
Set to NULL
if the input restrict
was also NULL
.
Aaron Lun
X <- matrix(rnorm(1000), ncol=100) out <- divideIntoBatches(X, sample(3, 100, replace=TRUE)) names(out) # Recovering original order. Y <- do.call(cbind, out$batches) Z <- Y[,out$reorder] all.equal(Z, X) # should be TRUE.
X <- matrix(rnorm(1000), ncol=100) out <- divideIntoBatches(X, sample(3, 100, replace=TRUE)) names(out) # Recovering original order. Y <- do.call(cbind, out$batches) Z <- Y[,out$reorder] all.equal(Z, X) # should be TRUE.
Correct for batch effects in single-cell expression data using a fast version of the mutual nearest neighbors (MNN) method.
fastMNN( ..., batch = NULL, k = 20, prop.k = NULL, restrict = NULL, cos.norm = TRUE, ndist = 3, d = 50, deferred = TRUE, weights = NULL, get.variance = FALSE, merge.order = NULL, auto.merge = FALSE, min.batch.skip = 0, subset.row = NULL, correct.all = FALSE, assay.type = "logcounts", BSPARAM = IrlbaParam(), BNPARAM = KmknnParam(), BPPARAM = SerialParam() )
fastMNN( ..., batch = NULL, k = 20, prop.k = NULL, restrict = NULL, cos.norm = TRUE, ndist = 3, d = 50, deferred = TRUE, weights = NULL, get.variance = FALSE, merge.order = NULL, auto.merge = FALSE, min.batch.skip = 0, subset.row = NULL, correct.all = FALSE, assay.type = "logcounts", BSPARAM = IrlbaParam(), BNPARAM = KmknnParam(), BPPARAM = SerialParam() )
... |
One or more log-expression matrices where genes correspond to rows and cells correspond to columns.
Alternatively, one or more SingleCellExperiment objects can be supplied containing a log-expression matrix in the If multiple objects are supplied, each object is assumed to contain all and only cells from a single batch.
If a single object is supplied, it is assumed to contain cells from all batches, so Alternatively, one or more lists of matrices or SingleCellExperiments can be provided;
this is flattened as if the objects inside each list were passed directly to |
batch |
A vector or factor specifying the batch of origin for all cells when only a single object is supplied in |
k |
An integer scalar specifying the number of nearest neighbors to consider when identifying MNNs. |
prop.k |
A numeric scalar in (0, 1) specifying the proportion of cells in each dataset to use for mutual nearest neighbor searching.
If set, the number of nearest neighbors used for the MNN search in each batch is redefined as |
restrict |
A list of length equal to the number of objects in |
cos.norm |
A logical scalar indicating whether cosine normalization should be performed on the input data prior to PCA. |
ndist |
A numeric scalar specifying the threshold beyond which neighbours are to be ignored when computing correction vectors. Each threshold is defined as a multiple of the number of median distances. |
d |
Numeric scalar specifying the number of dimensions to use for dimensionality reduction in |
deferred |
Logical scalar indicating whether to defer centering/scaling, see |
weights , get.variance
|
Further arguments to pass to |
merge.order |
An integer vector containing the linear merge order of batches in |
auto.merge |
Logical scalar indicating whether to automatically identify the “best” merge order. |
min.batch.skip |
Numeric scalar specifying the minimum relative magnitude of the batch effect, below which no correction will be performed at a given merge step. |
subset.row |
A vector specifying which features to use for correction. |
correct.all |
Logical scalar indicating whether corrected expression values should be computed for genes not in |
assay.type |
A string or integer scalar specifying the assay containing the log-expression values. Only used for SingleCellExperiment inputs. |
BSPARAM |
A BiocSingularParam object specifying the algorithm to use for PCA in |
BNPARAM |
A BiocNeighborParam object specifying the nearest neighbor algorithm. |
BPPARAM |
A BiocParallelParam object specifying whether the PCA and nearest-neighbor searches should be parallelized. |
This function provides a variant of the mnnCorrect
function, modified for speed and more robust performance.
In particular:
It performs a multi-sample PCA via multiBatchPCA
and subsequently performs all calculations in the PC space.
This reduces computational work and provides some denoising for improved neighbour detection.
As a result, though, the corrected output cannot be interpreted on a gene level and is useful only for cell-level comparisons, e.g., clustering and visualization.
The correction vector for each cell is directly computed from its k
nearest neighbours in the same batch.
Specifically, only the k
nearest neighbouring cells that also participate in MNN pairs are used.
Each MNN-participating neighbour is weighted by distance from the current cell, using a tricube scheme with bandwidth equal to the median distance multiplied by ndist
.
This ensures that the correction vector only uses information from the closest cells, improving the fidelity of local correction.
Issues with “kissing” are avoided with a two-step procedure that removes variation along the batch effect vector. First, the average correction vector across all MNN pairs is computed. Cell coordinates are adjusted such that all cells in a single batch have the same position along this vector. The correction vectors are then recalculated with the adjusted coordinates (but the same MNN pairs).
The default setting of cos.norm=TRUE
provides some protection against differences in scaling between log-expression matrices from batches that are normalized separately
(see cosineNorm
for details).
However, if possible, we recommend using the output of multiBatchNorm
as input to fastMNN
.
This will equalize coverage on the count level before the log-transformation, which is a more accurate rescaling than cosine normalization on the log-values.
The batch
argument allows users to easily perform batch correction when all cells have already been combined into a single object.
This avoids the need to manually split the matrix or SingleCellExperiment object into separate objects for input into fastMNN
.
In this situation, the order of input batches is defined by the order of levels in batch
.
A SingleCellExperiment is returned where each row is a gene and each column is a cell. This contains:
A corrected
matrix in the reducedDims
slot, containing corrected low-dimensional coordinates for each cell.
This has number of columns equal to d
and number of rows equal to the total number of cells in ...
.
A batch
column in the colData
slot, containing the batch of origin for each row (i.e., cell) in corrected
.
A rotation
column the rowData
slot, containing the rotation matrix used for the PCA.
This has d
columns and number of rows equal to the number of genes to report (see the “Choice of genes” section).
A reconstructed
matrix in the assays
slot, containing the low-rank reconstruction of the expression matrix.
This can be interpreted as per-gene corrected log-expression values (after cosine normalization, if cos.norm=TRUE
) but should not be used for quantitative analyses.
This has number of rows equal to the number of input genes if subset.row=NULL
or correct.all=TRUE
,
otherwise each row corresponds to a gene in subset.row
.
Cells in the output object are always ordered in the same manner as supplied in ...
.
For a single input object, cells will be reported in the same order as they are arranged in that object.
In cases with multiple input objects, the cell identities are simply concatenated from successive objects,
i.e., all cells from the first object (in their provided order), then all cells from the second object, and so on.
This is true regardless of the value of merge.order
and auto.merge
,
which only affects the internal merge order of the batches.
Additionally, the metadata of the output object contains:
merge.info
, a DataFrame of diagnostic information about each merge step.
See the “Merge diagnostics” section for more details.
pca.info
, a list of metadata produced by multiBatchPCA
,
such as the variance explained when get.variance=TRUE
.
By default, batches are merged in the user-supplied order in ...
,
i.e., the first batch is merged with the second batch, the third batch is merged with the combined first-second batch, the fourth batch is merged with the combined first-second-third batch and so on.
We refer to this approach as a progressive merge.
When batch
is supplied for a single object in ...
, the ordering of batches in a progressive merge is determined by the ordering of factor levels in batch
.
If merge.order
is an integer vector, it is treated as an ordering permutation with which to perform a progressive merge.
For example, if merge.order=c(4,1,3,2)
, batches 4 and 1 in ...
are merged first;
batch 3 is merged with the combined 4+1 batch;
and then batch 2 is merged with the combined 4+1+3 batch.
This is often more convenient than changing the order manually in ...
,
which would alter the order of batches in the output corrected
matrix.
If merge.order
is a character vector, it is treated as an ordering permutation for named batches.
If merge.order
is a factor, it is coerced into a character vector and also treated as a permutation of names.
If merge.order
is a nested list, it is treated as a tree that specifies a hierarchical merge.
Each element of the list should either be a string or integer scalar, corresponding to a leaf node that specifies a batch;
or another list, corresponding to an internal node that should contain at least two children;
or an integer or character vector of length 2 or more, again corresponding to an internal node.
For example, list(list(1,2), list(3,4))
indicates that batch 1 should be merged with batch 2;
batch 3 should be merged with batch 4; and that, finally, the combined batches 1+2 and 3+4 should be merged.
More than two children per node are supported and will result in a progressive merge within that node.
For example, list(list(1,2,3), list(4,5,6))
will merge batch 1 with 2, then 1+2 with 3;
batch 4 with 5, and then 4+5 with 6; and finally, 1+2+3 with 4+5+6.
The same approach can be used for integer or character vectors, e.g., list(1:3, 4:6)
has the same effect as above.
Note that, while batches can be specified by name (character) or index (integer), users cannot use both in the same tree. Factors can be provided and are coerced to character vectors.
The merge order may occasionally be important as it determines the number of MNN pairs available at each merge step. MNN pairs results in greater stability of the batch vectors and increased likelihood of identifying shared subpopulations, which are important to the precision and accuracy of the MNN-based correction, respectively.
In a progressive merge, the reference increases in size at each step, ensuring that more cells are available to identify MNN pairs in later merges. We suggest setting the largest, most heterogeneous batch as the first reference, which favors detection of sufficient MNN pairs between the first and other batches. Conversely, if two small batches without shared populations are supplied first, the wrong MNN pairs will be detected and the result of the merge will be incorrect.
A merge tree is useful for merging together batches that are known to be more closely related (e.g., replicates) before attempting difficult merges involving more dissimilar batches. The idea is to increase the number of cells and thus MNN pairs prior to merging batches with few shared subpopulations. By comparison, performing the more difficult merges first is more likely to introduce errors whereby distinct subpopulations are incorrectly placed together, which is propagated to later steps as the initial merge is used as a reference for subsequent merges.
If auto.merge=TRUE
, merge steps are chosen to maximize the number of MNN pairs at each step.
The aim is to improve the stability of the correction by first merging more similar batches with more MNN pairs.
This can be somewhat time-consuming as MNN pairs need to be iteratively recomputed for all possible batch pairings.
The order of cells in the output is never affected by the setting of merge.order
.
It depends only on the order of objects in ...
and the order of cells within each object.
All genes are used with the default setting of subset.row=NULL
.
Users can set subset.row
to subset the inputs to highly variable genes or marker genes.
This improves the quality of the PCA and identification of MNN pairs by reducing the noise from irrelevant genes.
Users should not be too restrictive with subsetting, as high dimensionality is required to satisfy the orthogonality assumption in MNN detection.
By default, only the selected genes are used to compute rotation vectors and a low-rank corrected expression matrix.
However, setting correct.all=TRUE
will return rotation vectors spanning all genes in the supplied input data.
This ensures that corrected values are returned in reconstructed
for all input genes, e.g., in correctExperiments
.
Note that this setting will not alter the corrected low-dimension coordinates, nor the rotation values for the selected genes.
If d=NA
, any gene not in subset.row
will have reconstructed values of zero when correct.all=TRUE
.
Without a PCA, we cannot easily extrapolate the correction to other genes.
See ?"batchelor-restrict"
for a description of the restrict
argument.
Specifically, fastMNN
will only use the restricted subset of cells in each batch to identify MNN pairs and the center of the orthogonalization.
It will then extrapolate the correction to all cells in each batch.
Note that all cells are used to perform the PCA, regardless of whether restrict
is set.
This is generally desirable in applications where restrict
is useful.
For example, constructing the projection vectors with only control cells will not guarantee resolution of unique non-control populations in each batch.
However, this also means that the corrected values for the restricted cells will differ from the output when the inputs are directly subsetted to only contain the restricted cells.
If this is not desirable, users can perform the PCA manually and apply reducedMNN
instead.
We can consider fastMNN
's operation in terms of pairwise merge steps.
Each merge step involves two mutually exclusive sets of cells, a “left” set and “right” set.
Each set may consist of cells from different batches if those batches were merged in a previous step.
The merge will then create a new set of cells that combines the left and right sets.
Iteratively repeating this process with the newly formed sets will eventually merge all batches together.
The output metadata contains merge.info
, a DataFrame where each row corresponds to a merge step.
It contains the following fields:
left
, a List of integer or character vectors.
Each vector specifies the batches in the left set at a given merge step.
right
, a similar List of integer or character vectors.
Each vector specifies the batches in the right set at a given merge step.
pairs
, a List of DataFrames specifying which pairs of cells were identified as MNNs at each step.
In each DataFrame, each row corresponds to a single MNN pair and specifies the
paired cells that were in the left and right sets, respectively.
Note that the indices refer to those paired cells in the output ordering of cells,
i.e., users can identify the paired cells at each step by column-indexing the output of the fastMNN
function.
batch.size
, a numeric vector specifying the relative magnitude of the batch effect at each merge,
see “Orthogonalization details”.
skipped
, a logical vector indicating whether the correction was skipped
if the magnitude of the batch effect was below min.batch.skip
.
lost.var
, a numeric matrix specifying the percentage of variance lost due to orthogonalization at each merge step.
This is reported separately for each batch (columns, ordered according to the input order, not the merge order).
The threshold to define nearest neighbors is defined by k
, which is passed to findMutualNN
to identify MNN pairs.
The size of k
can be roughly interpreted as the anticipated minimum size of a shared subpopulation in each batch.
If a batch has fewer than k
cells of a shared subpopulation, there is an increased risk that its counterparts in other batches will form incorrect MNN pairs.
From the perspective of the algorithm, larger values allow for more MNN pairs to be obtained, which improves the stability of the correction vectors.
Larger values also increase robustness against non-orthogonality, by ignoring a certain level of biological variation when identifying pairs.
This can be used to avoid the kissing problem where MNN pairs are only detected on the “surface” of the distribution.
However, values of k
should not be too large, as this would result in MNN pairs being inappropriately identified between biologically distinct populations.
In practice, increasing k
will generally result in more aggressive merging as the algorithm is more generous in matching subpopulations across batches.
We suggest starting with the default k
and increasing it if one is confident that the same cell types are not adequately merged across batches.
This is better than starting with a large k
as incorrect merging is much harder to diagnose than insufficient merging.
An additional consideration is that the effect of any given k
will vary with the number of cells in each batch.
With more cells, a larger k
may be preferable to achieve better merging in the presence of non-orthogonality.
We can achieve this by setting prop.k
, e.g., prop.k=0.05
will set k
to 5% of the number of cells in each batch.
This allows the choice of k
to adapt to the size of each batch at each merge step and handles asymmetry in batch sizes (via the k1
and k2
arguments in findMutualNN
).
fastMNN
will compute the percentage of variance that is lost from each batch during orthogonalization at each merge step.
This represents the variance in each batch that is parallel to the average correction vectors (and hence removed during orthogonalization) at each merge step.
Large proportions suggest that there is biological structure that is parallel to the batch effect,
corresponding to violations of the assumption that the batch effect is orthogonal to the biological subspace.
A rule of thumb is that more than 10% of lost variance is cause for closer examination, though this is highly dependent on the context, e.g., a large lost proportion may be fine if the population structure is still approximately preserved while a small lost proportion may be unacceptable if an important rare subpopulation can no longer be distinguished.
In rare cases, orthogonalization may cause problems in the absence of a batch effect, resulting in large losses of variance.
To avoid this, fastMNN
will not perform any correction if the relative magnitude of the batch effect is less than min.batch.skip
.
The relative magnitude is defined as the L2 norm of the average correction vector divided by the root-mean-square of the L2 norms of the per-MNN pair correction vectors.
This will be large when the per-pair vectors are all pointing in the same direction,
and small when the per-pair vectors point in random directions due to the absence of a consistent batch effect.
If a large loss of variance is observed along with a small batch effect in a given merge step, users can set min.batch.skip
to simply skip correction in that step.
Aaron Lun
Haghverdi L, Lun ATL, Morgan MD, Marioni JC (2018). Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors. Nat. Biotechnol. 36(5):421
Lun ATL (2018). Further MNN algorithm development. https://MarioniLab.github.io/FurtherMNN2018/theory/description.html
cosineNorm
and multiBatchPCA
, to obtain the values to be corrected.
reducedMNN
, for a version of the function that operates in low-dimensional space.
mnnCorrect
, for the “classic” version of the MNN correction algorithm.
clusterMNN
, for the cluster-based version of this approach.
multiModalMNN
from the mumosa package, which extends this to data in the altExps
.
mnnDeltaVariance
, to obtain further per-gene diagnostics on the behavior of the correction.
B1 <- matrix(rnorm(10000, -1), ncol=50) # Batch 1 B2 <- matrix(rnorm(10000, 1), ncol=50) # Batch 2 out <- fastMNN(B1, B2) # Corrected values for use in clustering, etc. str(reducedDim(out)) # Extracting corrected expression values for gene 10. summary(assay(out)[10,])
B1 <- matrix(rnorm(10000, -1), ncol=50) # Batch 1 B2 <- matrix(rnorm(10000, 1), ncol=50) # Batch 2 out <- fastMNN(B1, B2) # Corrected values for use in clustering, etc. str(reducedDim(out)) # Extracting corrected expression values for gene 10. summary(assay(out)[10,])
Subset multiple batches so that they have the same number and order of rows.
intersectRows(..., subset.row = NULL, keep.all = FALSE)
intersectRows(..., subset.row = NULL, keep.all = FALSE)
... |
One or more matrix-like objects containing single-cell gene expression matrices. Alternatively, one or more SingleCellExperiment objects can be supplied. |
subset.row |
A vector specifying the subset of genes to retain.
Defaults to |
keep.all |
Logical scalar indicating whether the data should actually be subsetted to |
If any entry of ...
contains no row names or if the intersection is empty, an error is raised.
If all entries of ...
already have the same row names in the same order, this function is a no-op.
In such cases, subset.row
can be a character, integer or logical vector.
If entries of ...
do not have identical row names, subset.row
can only be a character vector.
Any other type will lead to an error as the interpretation of the subset is not well defined.
Setting keep.all=TRUE
option gives the same result as having subset.row=NULL
in the first place.
However, it is still useful for checking that subset.row
is a character vector.
This ensures that downstream applications can safely use subset.row
in conjunction with, e.g., correct.all=TRUE
for fastMNN
.
A list containing the row-subsetted contents of ...
, each of which have the same number and order of rows.
Aaron Lun
X <- rbind(A=c(1,2), B=c(3,4)) Y <- rbind(a=c(1,2), B=c(3,4)) intersectRows(X, Y) # Only B is retained. # Error is raised when no genes are retained: X <- rbind(A=c(1,2), B=c(3,4)) Y <- rbind(a=c(1,2), b=c(3,4)) try(intersectRows(X, Y)) # Error is raised for non-character subset.row # when row names are not identical: X <- rbind(A=c(1,2), B=c(3,4), C=c(5,6)) Y <- rbind(a=c(1,2), B=c(3,4), C=c(5,6)) intersectRows(X, Y) intersectRows(X, Y, subset.row="B") try(intersectRows(X, Y, subset.row=1)) # Setting keep.all=TRUE only checks but does not apply subset.row. intersectRows(X, Y, subset.row="B", keep.all=TRUE) try(intersectRows(X, Y, subset.row=1, keep.all=TRUE))
X <- rbind(A=c(1,2), B=c(3,4)) Y <- rbind(a=c(1,2), B=c(3,4)) intersectRows(X, Y) # Only B is retained. # Error is raised when no genes are retained: X <- rbind(A=c(1,2), B=c(3,4)) Y <- rbind(a=c(1,2), b=c(3,4)) try(intersectRows(X, Y)) # Error is raised for non-character subset.row # when row names are not identical: X <- rbind(A=c(1,2), B=c(3,4), C=c(5,6)) Y <- rbind(a=c(1,2), B=c(3,4), C=c(5,6)) intersectRows(X, Y) intersectRows(X, Y, subset.row="B") try(intersectRows(X, Y, subset.row=1)) # Setting keep.all=TRUE only checks but does not apply subset.row. intersectRows(X, Y, subset.row="B", keep.all=TRUE) try(intersectRows(X, Y, subset.row=1, keep.all=TRUE))
Correct for batch effects in single-cell expression data using the mutual nearest neighbors method.
mnnCorrect( ..., batch = NULL, restrict = NULL, k = 20, prop.k = NULL, sigma = 0.1, cos.norm.in = TRUE, cos.norm.out = TRUE, svd.dim = 0L, var.adj = TRUE, subset.row = NULL, correct.all = FALSE, merge.order = NULL, auto.merge = FALSE, assay.type = "logcounts", BSPARAM = ExactParam(), BNPARAM = KmknnParam(), BPPARAM = SerialParam() )
mnnCorrect( ..., batch = NULL, restrict = NULL, k = 20, prop.k = NULL, sigma = 0.1, cos.norm.in = TRUE, cos.norm.out = TRUE, svd.dim = 0L, var.adj = TRUE, subset.row = NULL, correct.all = FALSE, merge.order = NULL, auto.merge = FALSE, assay.type = "logcounts", BSPARAM = ExactParam(), BNPARAM = KmknnParam(), BPPARAM = SerialParam() )
... |
One or more log-expression matrices where genes correspond to rows and cells correspond to columns.
Alternatively, one or more SingleCellExperiment objects can be supplied containing a log-expression matrix in the If multiple objects are supplied, each object is assumed to contain all and only cells from a single batch.
If a single object is supplied, it is assumed to contain cells from all batches, so Alternatively, one or more lists of matrices or SingleCellExperiments can be provided;
this is flattened as if the objects inside each list were passed directly to |
batch |
A vector or factor specifying the batch of origin for all cells when only a single object is supplied in |
restrict |
A list of length equal to the number of objects in |
k |
An integer scalar specifying the number of nearest neighbors to consider when identifying MNNs. |
prop.k |
A numeric scalar in (0, 1) specifying the proportion of cells in each dataset to use for mutual nearest neighbor searching.
If set, the number of nearest neighbors used for the MNN search in each batch is redefined as |
sigma |
A numeric scalar specifying the bandwidth of the Gaussian smoothing kernel used to compute the correction vector for each cell. |
cos.norm.in |
A logical scalar indicating whether cosine normalization should be performed on the input data prior to calculating distances between cells. |
cos.norm.out |
A logical scalar indicating whether cosine normalization should be performed prior to computing corrected expression values. |
svd.dim |
An integer scalar specifying the number of dimensions to use for summarizing biological substructure within each batch. |
var.adj |
A logical scalar indicating whether variance adjustment should be performed on the correction vectors. |
subset.row |
A vector specifying which features to use for correction. |
correct.all |
A logical scalar specifying whether correction should be applied to all genes, even if only a subset is used for the MNN calculations. |
merge.order |
An integer vector containing the linear merge order of batches in |
auto.merge |
Logical scalar indicating whether to automatically identify the “best” merge order. |
assay.type |
A string or integer scalar specifying the assay containing the log-expression values. Only used for SingleCellExperiment inputs. |
BSPARAM |
A BiocSingularParam object specifying the algorithm to use for PCA in |
BNPARAM |
A BiocNeighborParam object specifying the nearest neighbor algorithm. |
BPPARAM |
A BiocParallelParam object specifying whether the PCA and nearest-neighbor searches should be parallelized. |
This function is designed for batch correction of single-cell RNA-seq data where the batches are partially confounded with biological conditions of interest. It does so by identifying pairs of mutual nearest neighbors (MNN) in the high-dimensional log-expression space. Each MNN pair represents cells in different batches that are of the same cell type/state, assuming that batch effects are mostly orthogonal to the biological manifold. Correction vectors are calculated from the pairs of MNNs and corrected (log-)expression values are returned for use in clustering and dimensionality reduction.
For each MNN pair, a pairwise correction vector is computed based on the difference in the log-expression profiles.
The correction vector for each cell is computed by applying a Gaussian smoothing kernel with bandwidth sigma
is the pairwise vectors.
This stabilizes the vectors across many MNN pairs and extends the correction to those cells that do not have MNNs.
The choice of sigma
determines the extent of smoothing - a value of 0.1 is used by default, corresponding to 10% of the radius of the space after cosine normalization.
A SingleCellExperiment object containing the corrected
assay.
This contains corrected expression values for each gene (row) in each cell (column) in each batch.
A batch
field is present in the column data, specifying the batch of origin for each cell.
Cells in the output object are always ordered in the same manner as supplied in ...
.
For a single input object, cells will be reported in the same order as they are arranged in that object.
In cases with multiple input objects, the cell identities are simply concatenated from successive objects,
i.e., all cells from the first object (in their provided order), then all cells from the second object, and so on.
The metadata of the SingleCellExperiment contains merge.info
, a DataFrame where each row corresponds to a merge step.
See “Merge diagnostics” for more information.
All genes are used with the default setting of subset.row=NULL
.
Users can set subset.row
to subset the inputs to highly variable genes or marker genes.
This may provide more meaningful identification of MNN pairs by reducing the noise from irrelevant genes.
Note that users should not be too restrictive with subsetting, as high dimensionality is required to satisfy the orthogonality assumption in MNN detection.
If subset.row
is specified and correct.all=TRUE
, corrected values are returned for all genes.
This is possible as subset.row
is only used to identify the MNN pairs and other cell-based distance calculations.
Correction vectors between MNN pairs can then be computed in for all genes in the supplied matrices.
Note that setting correct.all=TRUE
will not alter the corrected expression values for the subsetted genes.
The input expression values should generally be log-transformed, e.g., log-counts, see logNormCounts
for details.
They should also be normalized within each data set to remove cell-specific biases in capture efficiency and sequencing depth.
Users may also consider using the multiBatchNorm
functino to mitigate the effect of differences in sequencing depth between batches.
By default, a further cosine normalization step is performed on the supplied expression data to eliminate gross scaling differences between data sets.
When cos.norm.in=TRUE
, cosine normalization is performed on the matrix of expression values used to compute distances between cells.
This can be turned off when there are no scaling differences between data sets.
When cos.norm.out=TRUE
, cosine normalization is performed on the matrix of values used to calculate correction vectors (and on which those vectors are applied).
This can be turned off to obtain corrected values on the log-scale, similar to the input data.
The cosine normalization is achieved using the cosineNorm
function.
The function depends on a shared biological manifold, i.e., one or more cell types/states being present in multiple batches.
If this is not true, MNNs may be incorrectly identified, resulting in over-correction and removal of interesting biology.
Some protection can be provided by removing components of the correction vectors that are parallel to the biological subspaces in each batch.
The biological subspace in each batch is identified with a SVD on the expression matrix to obtain svd.dim
dimensions.
(By default, this option is turned off by setting svd.dim=0
.)
If var.adj=TRUE
, the function will adjust the correction vector to equalize the variances of the two data sets along the batch effect vector.
In particular, it avoids “kissing” effects whereby MNN pairs are identified between the surfaces of point clouds from different batches.
Naive correction would then bring only the surfaces into contact, rather than fully merging the clouds together.
The adjustment ensures that the cells from the two batches are properly intermingled after correction.
This is done by identifying each cell's position on the correction vector, identifying corresponding quantiles between batches,
and scaling the correction vector to ensure that the quantiles are matched after correction.
See ?"batchelor-restrict"
for a description of the restrict
argument.
Specifically, mnnCorrect
will only use the restricted subset of cells in each batch to identify MNN pairs (and to perform variance adjustment, if var.adj=TRUE
), and then apply the correction to all cells in each batch.
Each merge step combines two mutually exclusive sets of cells, a “left” set and “right” set. The metadata thus contains the following fields:
left
, a List of integer or character vectors.
Each vector specifies the batches in the left set at a given merge step.
right
, a similar List of integer or character vectors.
Each vector specifies the batches in the right set at a given merge step.
pairs
, a List of DataFrames specifying which pairs of cells were identified as MNNs at each step.
In each DataFrame, each row corresponds to a single MNN pair and specifies the
paired cells that were in the left and right sets, respectively.
Note that the indices refer to those paired cells in the output ordering of cells,
i.e., users can identify the paired cells at each step by column-indexing the output of the mnnCorrect
function.
The threshold to define nearest neighbors is defined by k
, which is passed to findMutualNN
to identify MNN pairs.
The size of k
can be roughly interpreted as the anticipated minimum size of a shared subpopulation in each batch.
If a batch has fewer than k
cells of a shared subpopulation, there is an increased risk that its counterparts in other batches will form incorrect MNN pairs.
From the perspective of the algorithm, larger values allow for more MNN pairs to be obtained, which improves the stability of the correction vectors.
Larger values also increase robustness against non-orthogonality, by ignoring a certain level of biological variation when identifying pairs.
This can be used to avoid the kissing problem where MNN pairs are only detected on the “surface” of the distribution.
However, values of k
should not be too large, as this would result in MNN pairs being inappropriately identified between biologically distinct populations.
In practice, increasing k
will generally result in more aggressive merging as the algorithm is more generous in matching subpopulations across batches.
We suggest starting with the default k
and increasing it if one is confident that the same cell types are not adequately merged across batches.
This is better than starting with a large k
as incorrect merging is much harder to diagnose than insufficient merging.
An additional consideration is that the effect of any given k
will vary with the number of cells in each batch.
With more cells, a larger k
may be preferable to achieve better merging in the presence of non-orthogonality.
We can achieve this by setting prop.k
, e.g., prop.k=0.05
will set k
to 5% of the number of cells in each batch.
This allows the choice of k
to adapt to the size of each batch at each merge step and handles asymmetry in batch sizes (via the k1
and k2
arguments in findMutualNN
).
By default, batches are merged in the user-supplied order in ...
,
i.e., the first batch is merged with the second batch, the third batch is merged with the combined first-second batch, the fourth batch is merged with the combined first-second-third batch and so on.
We refer to this approach as a progressive merge.
When batch
is supplied for a single object in ...
, the ordering of batches in a progressive merge is determined by the ordering of factor levels in batch
.
If merge.order
is an integer vector, it is treated as an ordering permutation with which to perform a progressive merge.
For example, if merge.order=c(4,1,3,2)
, batches 4 and 1 in ...
are merged first;
batch 3 is merged with the combined 4+1 batch;
and then batch 2 is merged with the combined 4+1+3 batch.
This is often more convenient than changing the order manually in ...
,
which would alter the order of batches in the output corrected
matrix.
If merge.order
is a character vector, it is treated as an ordering permutation for named batches.
If merge.order
is a factor, it is coerced into a character vector and also treated as a permutation of names.
If merge.order
is a nested list, it is treated as a tree that specifies a hierarchical merge.
Each element of the list should either be a string or integer scalar, corresponding to a leaf node that specifies a batch;
or another list, corresponding to an internal node that should contain at least two children;
or an integer or character vector of length 2 or more, again corresponding to an internal node.
For example, list(list(1,2), list(3,4))
indicates that batch 1 should be merged with batch 2;
batch 3 should be merged with batch 4; and that, finally, the combined batches 1+2 and 3+4 should be merged.
More than two children per node are supported and will result in a progressive merge within that node.
For example, list(list(1,2,3), list(4,5,6))
will merge batch 1 with 2, then 1+2 with 3;
batch 4 with 5, and then 4+5 with 6; and finally, 1+2+3 with 4+5+6.
The same approach can be used for integer or character vectors, e.g., list(1:3, 4:6)
has the same effect as above.
Note that, while batches can be specified by name (character) or index (integer), users cannot use both in the same tree. Factors can be provided and are coerced to character vectors.
The merge order may occasionally be important as it determines the number of MNN pairs available at each merge step. MNN pairs results in greater stability of the batch vectors and increased likelihood of identifying shared subpopulations, which are important to the precision and accuracy of the MNN-based correction, respectively.
In a progressive merge, the reference increases in size at each step, ensuring that more cells are available to identify MNN pairs in later merges. We suggest setting the largest, most heterogeneous batch as the first reference, which favors detection of sufficient MNN pairs between the first and other batches. Conversely, if two small batches without shared populations are supplied first, the wrong MNN pairs will be detected and the result of the merge will be incorrect.
A merge tree is useful for merging together batches that are known to be more closely related (e.g., replicates) before attempting difficult merges involving more dissimilar batches. The idea is to increase the number of cells and thus MNN pairs prior to merging batches with few shared subpopulations. By comparison, performing the more difficult merges first is more likely to introduce errors whereby distinct subpopulations are incorrectly placed together, which is propagated to later steps as the initial merge is used as a reference for subsequent merges.
If auto.merge=TRUE
, merge steps are chosen to maximize the number of MNN pairs at each step.
The aim is to improve the stability of the correction by first merging more similar batches with more MNN pairs.
This can be somewhat time-consuming as MNN pairs need to be iteratively recomputed for all possible batch pairings.
The order of cells in the output is never affected by the setting of merge.order
.
It depends only on the order of objects in ...
and the order of cells within each object.
Laleh Haghverdi, with modifications by Aaron Lun
Haghverdi L, Lun ATL, Morgan MD, Marioni JC (2018). Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors. Nat. Biotechnol. 36(5):421
fastMNN
for a faster equivalent.
B1 <- matrix(rnorm(10000), ncol=50) # Batch 1 B2 <- matrix(rnorm(10000), ncol=50) # Batch 2 out <- mnnCorrect(B1, B2) # corrected values
B1 <- matrix(rnorm(10000), ncol=50) # Batch 1 B2 <- matrix(rnorm(10000), ncol=50) # Batch 2 out <- mnnCorrect(B1, B2) # corrected values
Computes the variance of the paired MNN deltas
mnnDeltaVariance( ..., pairs, cos.norm = FALSE, subset.row = NULL, compute.all = FALSE, assay.type = "logcounts", BPPARAM = SerialParam(), trend.args = list() )
mnnDeltaVariance( ..., pairs, cos.norm = FALSE, subset.row = NULL, compute.all = FALSE, assay.type = "logcounts", BPPARAM = SerialParam(), trend.args = list() )
... |
One or more log-expression matrices where genes correspond to rows and cells correspond to columns.
Alternatively, one or more SingleCellExperiment objects can be supplied containing a log-expression matrix in the If multiple objects are supplied, each object is assumed to contain all and only cells from a single batch.
If a single object is supplied, it is assumed to contain cells from all batches, so Alternatively, one or more lists of matrices or SingleCellExperiments can be provided;
this is flattened as if the objects inside each list were passed directly to |
pairs |
A DataFrame or list of DataFrames containing MNN pairing information.
Each row of each DataFrame specifies an MNN pair; each DataFrame should have two columns containing the column indices of paired cells.
This is typically produced by |
cos.norm |
A logical scalar indicating whether cosine normalization should be performed on the expression values. |
subset.row |
A vector specifying which genes to use for the calculation.
This is typically the same as the |
compute.all |
Logical scalar indicating whether statistics should be returned for all genes when |
assay.type |
A string or integer scalar specifying the assay containing the log-expression values. Only used for SingleCellExperiment inputs. |
BPPARAM |
A BiocParallelParam object specifying whether the calculations should be parallelized. |
trend.args |
Named list of further arguments to pass to |
The “MNN delta” is defined as the difference in the uncorrected expression values of MNN-paired cells. We compute the variance of these values across all pairs for each gene; genes with highly variable deltas are strongly affected by the correction beyond a simple shift. By looking through the top genes, we may conclude that the correction did something unusual if, e.g., we find important markers in the top set. Conversely, if key genes do not have unusually high variances in their delta, we may gain some confidence in the reliability of the correction - at least with respect to the behavior of those genes.
We compute the variance of the deltas rather than using their magnitude, as a shift in expression space is not particularly concerning (and is, in fact, par for the course for batch correction). Rather, we are interested in identifying non-linear changes that might be indicative of correction errors, e.g., inappropriate merging of two different subpopulations. This would manifest as high variances for the relevant marker genes. Of course, whether or not this is actually an error can be a subjective decision - loss of some biological heterogeneity is often an acceptable cost for flexible correction.
To eliminate the mean-variance relationship, we fit a trend to the variances across all genes with respect to the mean log-expression in each MNN pair.
We then define an adjusted variance based on the residuals of the fitted curve;
this is a more appropriate measure to use for ranking affected genes, as it ensures that genes are not highly ranked due to their abundance alone.
Fitting is done using the fitTrendVar
function from the scran package - further arguments can be passed via trend.args
.
If a list is passed to pairs
, each entry of the list is assumed to correspond to a merge step.
The variance and trend fitting is done separately for each merge step, and the results are combined by computing the average variance across all steps.
This avoids considering the variance of the deltas across merge steps, which is not particularly concerning, e.g., if different batches require different translations.
A DataFrame with one row per gene in ...
(or as specified by subset.row
),
containing the following columns:
mean
, the mean of the mean log-expressions across all MNN pairs.
This may contain repeated contributions from the same cell if it is involved in many MNN pairs.
total
, the total variance of the deltas across all MNN pairs.
trend
, the fitted values of the trend in total
with respect to mean
.
adjusted
, the adjusted variance, i.e., the residuals of the fitted trend.
The metadata
contains the trend fitting statistics returned by fitTrendVar
.
If pairs
is a list of length greater than 1, the returned DataFrame will also contain per.block
, a nested DataFrames of nested DataFrames.
Each one of these contains statistics for the individual merge steps and has the same structure as that described above.
fastMNN
If cos.norm=TRUE
, cosine normalization is performed on each cell in ...
with cosineNorm
.
This mimics what is done inside fastMNN
for greater consistency with that function's internal calculations.
Some further scaling is performed so that the magnitude of the normalized values is mostly unaffected.
If subset.row
is specified, variances are only computed for the requested subset of genes, most typically the set of highly variable genes used in fastMNN
.
This also implies that the normalization and trend fitting is limited to the specified subset.
However, if compute.all=TRUE
, the scaling factor and fitted trend are extrapolated to compute adjusted variances for all other genes.
This is useful for picking up genes outside of the subset used in the correction.
In most applications, it is not necessary to have strict consistency with the internal values computed by fastMNN
.
Nonetheless, it may be helpful for diagnostics in more difficult situations.
Aaron Lun
fastMNN
and related functions, to obtain the MNN pairs in the first place.
means <- 2^rexp(200) * 10 B1 <- matrix(rpois(10000, means), ncol=50) # Batch 1 B2 <- matrix(rpois(10000, means), ncol=50) # Batch 2 # Spiking in some population structure. Each batch is split into two halves # corresponding to 2 subpopulations. These populations match up across batches # but the group in batch 1 also upregulates the first gene. B1[1:2,1:25] <- B1[1:2,1:25] * 50 B2[2,1:25] <- B2[2,1:25] * 50 # Quick log-transformation and correction. We'll omit the multiBatchNorm as the # sequencing depth is the same across batches anyway. library(scuttle) B1 <- normalizeCounts(B1) B2 <- normalizeCounts(B2) out <- fastMNN(B1, B2) # First gene should have high variance of deltas, as the upregulation in B1 is # removed to enable the alignment of subpopulations across batches. mnnDeltaVariance(B1, B2, pairs=metadata(out)$merge.info$pairs[[1]])
means <- 2^rexp(200) * 10 B1 <- matrix(rpois(10000, means), ncol=50) # Batch 1 B2 <- matrix(rpois(10000, means), ncol=50) # Batch 2 # Spiking in some population structure. Each batch is split into two halves # corresponding to 2 subpopulations. These populations match up across batches # but the group in batch 1 also upregulates the first gene. B1[1:2,1:25] <- B1[1:2,1:25] * 50 B2[2,1:25] <- B2[2,1:25] * 50 # Quick log-transformation and correction. We'll omit the multiBatchNorm as the # sequencing depth is the same across batches anyway. library(scuttle) B1 <- normalizeCounts(B1) B2 <- normalizeCounts(B2) out <- fastMNN(B1, B2) # First gene should have high variance of deltas, as the upregulation in B1 is # removed to enable the alignment of subpopulations across batches. mnnDeltaVariance(B1, B2, pairs=metadata(out)$merge.info$pairs[[1]])
Perform scaling normalization within each batch to provide comparable results to the lowest-coverage batch.
multiBatchNorm( ..., batch = NULL, norm.args = list(), min.mean = 1, subset.row = NULL, normalize.all = FALSE, preserve.single = TRUE, assay.type = "counts", BPPARAM = SerialParam() )
multiBatchNorm( ..., batch = NULL, norm.args = list(), min.mean = 1, subset.row = NULL, normalize.all = FALSE, preserve.single = TRUE, assay.type = "counts", BPPARAM = SerialParam() )
... |
One or more SingleCellExperiment objects containing counts and size factors. Each object should contain the same number of rows, corresponding to the same genes in the same order. If multiple objects are supplied, each object is assumed to contain all and only cells from a single batch.
If a single object is supplied, Alternatively, one or more lists of SingleCellExperiments can be provided;
this is flattened as if the objects inside were passed directly to |
batch |
A factor specifying the batch of origin for all cells when only a single object is supplied in |
norm.args |
A named list of further arguments to pass to |
min.mean |
A numeric scalar specifying the minimum (library size-adjusted) average count of genes to be used for normalization. |
subset.row |
A vector specifying which features to use for normalization. |
normalize.all |
A logical scalar indicating whether normalized values should be returned for all genes. |
preserve.single |
A logical scalar indicating whether to combine the results into a single matrix if only one object was supplied in |
assay.type |
A string specifying which assay values contains the counts. |
BPPARAM |
A BiocParallelParam object specifying whether calculations should be parallelized. |
When performing integrative analyses of multiple batches, it is often the case that different batches have large differences in sequencing depth. This function removes systematic differences in coverage across batches to simplify downstream comparisons. It does so by resaling the size factors using median-based normalization on the ratio of the average counts between batches. This is roughly equivalent to the between-cluster normalization described by Lun et al. (2016).
This function will adjust the size factors so that counts in high-coverage batches are scaled downwards to match the coverage of the most shallow batch.
The logNormCounts
function will then add the same pseudo-count to all batches before log-transformation.
By scaling downwards, we favour stronger squeezing of log-fold changes from the pseudo-count, mitigating any technical differences in variance between batches.
Of course, genuine biological differences will also be shrunk, but this is less of an issue for upregulated genes with large counts.
Only genes with library size-adjusted average counts greater than min.mean
will be used for computing the rescaling factors.
This improves precision and avoids problems with discreteness.
By default, we use min.mean=1
, which is usually satisfactory but may need to be lowered for very sparse datasets.
Users can also set subset.row
to restrict the set of genes used for computing the rescaling factors.
By default, normalized values will only be returned for genes specified in the subset.
Setting normalize.all=TRUE
will return normalized values for all genes.
A list of SingleCellExperiment objects with normalized log-expression values in the "logcounts"
assay (depending on values in norm.args
).
Each object contains cells from a single batch.
If preserve.single=TRUE
and ...
contains only one SingleCellExperiment, that object is returned with an additional "logcounts"
assay containing normalized log-expression values.
The order of cells is not changed.
For comparison, imagine if we ran logNormCounts
separately in each batch prior to correction.
Size factors will be computed within each batch, and batch-specific application in logNormCounts
will not account for scaling differences between batches.
In contrast, multiBatchNorm
will rescale the size factors so that they are comparable across batches.
This removes at least one difference between batches to facilitate easier correction.
cosineNorm
performs a similar role of equalizing the scale of expression values across batches.
However, the advantage of multiBatchNorm
is that its output is more easily interpreted -
the normalized values remain on the log-scale and differences can still be interpreted (roughly) as log-fold changes.
The output can then be fed into downstream analysis procedures (e.g., HVG detection) in the same manner as typical log-normalized values from logNormCounts
.
Aaron Lun
Lun ATL (2018). Further MNN algorithm development. https://MarioniLab.github.io/FurtherMNN2018/theory/description.html
mnnCorrect
and fastMNN
, for methods that can benefit from rescaling.
logNormCounts
for the calculation of log-transformed normalized expression values.
applyMultiSCE
, to apply this function over the altExps
in x
.
d1 <- matrix(rnbinom(50000, mu=10, size=1), ncol=100) sce1 <- SingleCellExperiment(list(counts=d1)) sizeFactors(sce1) <- runif(ncol(d1)) d2 <- matrix(rnbinom(20000, mu=50, size=1), ncol=40) sce2 <- SingleCellExperiment(list(counts=d2)) sizeFactors(sce2) <- runif(ncol(d2)) out <- multiBatchNorm(sce1, sce2) summary(sizeFactors(out[[1]])) summary(sizeFactors(out[[2]]))
d1 <- matrix(rnbinom(50000, mu=10, size=1), ncol=100) sce1 <- SingleCellExperiment(list(counts=d1)) sizeFactors(sce1) <- runif(ncol(d1)) d2 <- matrix(rnbinom(20000, mu=50, size=1), ncol=40) sce2 <- SingleCellExperiment(list(counts=d2)) sizeFactors(sce2) <- runif(ncol(d2)) out <- multiBatchNorm(sce1, sce2) summary(sizeFactors(out[[1]])) summary(sizeFactors(out[[2]]))
Perform a principal components analysis across multiple gene expression matrices to project all cells to a common low-dimensional space.
multiBatchPCA( ..., batch = NULL, d = 50, subset.row = NULL, weights = NULL, get.all.genes = FALSE, get.variance = FALSE, preserve.single = FALSE, assay.type = "logcounts", BSPARAM = IrlbaParam(), deferred = TRUE, BPPARAM = SerialParam() )
multiBatchPCA( ..., batch = NULL, d = 50, subset.row = NULL, weights = NULL, get.all.genes = FALSE, get.variance = FALSE, preserve.single = FALSE, assay.type = "logcounts", BSPARAM = IrlbaParam(), deferred = TRUE, BPPARAM = SerialParam() )
... |
Two or more matrices containing expression values (usually log-normalized). Each matrix is assumed to represent one batch and should contain the same number of rows, corresponding to the same genes in the same order. Alternatively, two or more SingleCellExperiment objects containing these matrices. Note the same restrictions described above for gene expression matrix inputs. Alternatively, one matrix or SingleCellExperiment can be supplied containing cells from all batches.
This requires Alternatively, one or more lists of matrices or SingleCellExperiments can be provided;
this is flattened as if the objects inside were passed directly to |
batch |
A factor specifying the batch identity of each cell in the input data.
Ignored if |
d |
An integer scalar specifying the number of dimensions to keep from the PCA.
Alternatively |
subset.row |
A vector specifying which features to use for correction. |
weights |
Numeric vector, logical scalar or list specifying the weighting scheme to use, see below for details. |
get.all.genes |
A logical scalar indicating whether the reported rotation vectors should include genes
that are excluded by a non- |
get.variance |
A logical scalar indicating whether to return the (weighted) variance explained by each PC. |
preserve.single |
A logical scalar indicating whether to combine the results into a single matrix if only one object was supplied in |
assay.type |
A string or integer scalar specifying the assay containing the expression values, if SingleCellExperiment objects are present in |
BSPARAM |
A BiocSingularParam object specifying the algorithm to use for PCA, see |
deferred |
A logical scalar used to overwrite the |
BPPARAM |
A BiocParallelParam object specifying whether the SVD should be parallelized. |
This function is roughly equivalent to cbind
ing all matrices in ...
and performing PCA on the merged matrix.
The main difference is that each sample is forced to contribute equally to the identification of the rotation vectors.
Specifically, the mean vector used for centering is defined as the grand mean of the mean vectors within each batch.
Each batch's contribution to the gene-gene covariance matrix is also divided by the number of cells in that batch.
Our approach is to effectively weight the cells in each batch to mimic the situation where all batches have the same number of cells. This ensures that the low-dimensional space can distinguish subpopulations in smaller batches. Otherwise, batches with a large number of cells would dominate the PCA, i.e., the definition of the mean vector and covariance matrix. This may reduce resolution of unique subpopulations in smaller batches that differ in a different dimension to the subspace of the larger batches.
It is usually recommended to set subset.row
to a subset of interesting features (e.g., highly variable genes).
This reduces computational time and eliminates uninteresting noise that could interfere with identification of the most relevant axes of variation.
Setting get.all.genes=TRUE
will report rotation vectors that span all genes, even when only a subset of genes are used for the PCA.
This is done by projecting all non-used genes into the low-dimensional “cell space” defined by the first d
components.
A List of numeric matrices is returned where each matrix corresponds to a batch and contains the first d
PCs (columns) for all cells in the batch (rows).
If preserve.single=TRUE
and ...
contains a single object, the List will only contain a single matrix.
This contains the first d
PCs (columns) for all cells in the same order as supplied in the single input object.
The metadata contains rotation
, a matrix of rotation vectors, which can be used to construct a low-rank approximation of the input matrices.
This has number of rows equal to the number of genes after any subsetting, except if get.all.genes=TRUE
, where the number of rows is equal to the genes before subsetting.
If get.variance=TRUE
, the metadata will also contain var.explained
, a numeric vector containing the weighted variance explained by each of d
PCs;
and var.total
, the total variance across all components after weighting.
By default, weights=NULL
or TRUE
will use the default weights,
i.e., the reciprocal of the number of cells in each batch.
This equalizes the contribution of batches with different numbers of cells as described above.
If weights=FALSE
, no weighting is performed.
This means that larger batches will drive the PCA,
which may be desirable when dealing with technical replicates where there is no concern about unique subpopulations in smaller batches.
If weights
is a numeric vector, it is expected to be of the same length
(and, if named, have the same names) as the entries in ...
.
Each entry of the vector is used to scale the default weight of the corresponding batch.
This allows users to fine-tune the contribution of each batch in situations with multiple levels of heterogeneity.
For example, consider merging data from multiple donors where each donor contains a variable number of batches.
In such cases, it may be more appropriate to ensure that each donor has equal weight, rather than each batch.
This is done by assigning a value of weights
to each replicate that is inversely proportional to the number of batches for the same donor - see Examples.
Alternatively, weights
can be a list representing a tree-like structure,
identical to the tree controlling the merge order in fastMNN
.
Here, weights are scaled so that each partition in the tree yields subtrees with identical scaled weights (summed across each subtree's children).
This allows us to easily adjust the weighting scheme for hierarchical batch structures like the replicate-study scenario described above.
Most of the parameters related to the PCA are controlled by BSPARAM
, which determines the choice and parameterization of SVD algorithms.
The default is to use IrlbaParam
though RandomParam
is often faster (at the cost of some accuracy).
Most choices of BSPARAM
will interactly sanely with BPPARAM
to achieve parallelization during the PCA itself.
With the default deferred=TRUE
, the per-gene centering and per-cell scaling will be deferred during matrix multiplication in approximate SVD algorithms.
This is much faster when the input matrices are sparse, as deferred operations avoids loss of sparsity at the cost of numerical precision.
If deferred=NULL
, the use of deferred scaling is determined by the setting within BSPARAM
itself - see ?bsdeferred
for details.
If d=NA
, no PCA is performed.
Instead, the centered matrices are transposed and returned directly, while the rotation matrix is set to an identity matrix.
This allows developers to easily switch between PCA-based approximations versus the underlying dataset in their functions by simply changing d
.
(In some settings, one can even interpret d=NA
as the maximum d
, as Euclidean distance calculations between cells will be identical to a full-rank PC matrix.)
Note that, in this mode, no projection will be done with get.all.genes=TRUE
; rather, genes not in subset.row
will simply have rotation values of zero.
If get.variance=TRUE
, the weighted variances of the individual genes are returned.
Aaron Lun
d1 <- matrix(rnorm(5000), ncol=100) d1[1:10,1:10] <- d1[1:10,1:10] + 2 # unique population in d1 d2 <- matrix(rnorm(2000), ncol=40) d2[11:20,1:10] <- d2[11:20,1:10] + 2 # unique population in d2 # PCA defaults to IRLBA, so we need to set the seed. set.seed(10) out <- multiBatchPCA(d1, d2, d=10) # Examining results. xlim <- range(c(out[[1]][,1], out[[2]][,1])) ylim <- range(c(out[[1]][,2], out[[2]][,2])) plot(out[[1]][,1], out[[1]][,2], col="red", xlim=xlim, ylim=ylim) points(out[[2]][,1], out[[2]][,2], col="blue") # Using the weighting scheme, assuming that 'd2' and 'd3' # are replicates and should contribute the same combined # weight as 'd1'. d3 <- d2 + 5 set.seed(10) out <- multiBatchPCA(d1, d2, d3, d=10, weights=c(1, 0.5, 0.5)) set.seed(10) alt <- multiBatchPCA(d1, d2, d3, d=10, weights=list(1, list(2, 3))) stopifnot(all.equal(out, alt)) # As they are the same.
d1 <- matrix(rnorm(5000), ncol=100) d1[1:10,1:10] <- d1[1:10,1:10] + 2 # unique population in d1 d2 <- matrix(rnorm(2000), ncol=40) d2[11:20,1:10] <- d2[11:20,1:10] + 2 # unique population in d2 # PCA defaults to IRLBA, so we need to set the seed. set.seed(10) out <- multiBatchPCA(d1, d2, d=10) # Examining results. xlim <- range(c(out[[1]][,1], out[[2]][,1])) ylim <- range(c(out[[1]][,2], out[[2]][,2])) plot(out[[1]][,1], out[[1]][,2], col="red", xlim=xlim, ylim=ylim) points(out[[2]][,1], out[[2]][,2], col="blue") # Using the weighting scheme, assuming that 'd2' and 'd3' # are replicates and should contribute the same combined # weight as 'd1'. d3 <- d2 + 5 set.seed(10) out <- multiBatchPCA(d1, d2, d3, d=10, weights=c(1, 0.5, 0.5)) set.seed(10) alt <- multiBatchPCA(d1, d2, d3, d=10, weights=list(1, list(2, 3))) stopifnot(all.equal(out, alt)) # As they are the same.
Provides a no-correction method that has the same interface as the correction functions. This allows users to easily swap function calls to examine the effect of correction.
noCorrect( ..., batch = NULL, subset.row = NULL, correct.all = FALSE, assay.type = "logcounts" )
noCorrect( ..., batch = NULL, subset.row = NULL, correct.all = FALSE, assay.type = "logcounts" )
... |
One or more log-expression matrices where genes correspond to rows and cells correspond to columns.
Alternatively, one or more SingleCellExperiment objects can be supplied containing a log-expression matrix in the If multiple objects are supplied, each object is assumed to contain all and only cells from a single batch.
If a single object is supplied, it is assumed to contain cells from all batches, so Alternatively, one or more lists of matrices or SingleCellExperiments can be provided;
this is flattened as if the objects inside each list were passed directly to |
batch |
A vector or factor specifying the batch of origin for all cells when only a single object is supplied in |
subset.row |
A vector specifying which features to use for correction. |
correct.all |
Logical scalar indicating whether corrected expression values should be computed for genes not in |
assay.type |
A string or integer scalar specifying the assay containing the log-expression values. Only used for SingleCellExperiment inputs. |
This function is effectively equivalent to cbind
ing the matrices together without any correction.
The aim is to provide a consistent interface that allows users to simply combine batches without additional operations.
This is often desirable as a negative control to see if the transformation is actually beneficial.
It also allows for convenient downstream analyses that are based on the uncorrected data, e.g., differential expression.
Setting correct.all=TRUE
is equivalent to forcing subset.row=NULL
,
given that no correction is being performed anyway.
In the case of a single object in ...
,
batch
has no effect beyond being stored in the colData
of the output.
A SingleCellExperiment is returned where each row is a gene and each column is a cell. This contains:
A merged
matrix in the assays
slot, containing the merged expression values from all elements of ...
.
A batch
column in the colData
slot, containing the batch of origin for each row (i.e., cell) in corrected
.
Aaron Lun
B1 <- matrix(rnorm(10000), ncol=50) # Batch 1 B2 <- matrix(rnorm(10000), ncol=50) # Batch 2 out <- noCorrect(B1, B2) # Same as combining the expression values. stopifnot(all(assay(out)==cbind(B1, B2))) # Specifies which cell came from which batch: str(out$batch)
B1 <- matrix(rnorm(10000), ncol=50) # Batch 1 B2 <- matrix(rnorm(10000), ncol=50) # Batch 2 out <- noCorrect(B1, B2) # Same as combining the expression values. stopifnot(all(assay(out)==cbind(B1, B2))) # Specifies which cell came from which batch: str(out$batch)
Quickly perform batch correction by intersecting the gene features, normalizing and log-transforming, modelling per-gene variances and identifying highly variable genes, and then applying the correction algorithm of choice.
quickCorrect( ..., batch = NULL, restrict = NULL, correct.all = TRUE, assay.type = "counts", PARAM = FastMnnParam(), multi.norm.args = list(), precomputed = NULL, model.var.args = list(), hvg.args = list(n = 5000) )
quickCorrect( ..., batch = NULL, restrict = NULL, correct.all = TRUE, assay.type = "counts", PARAM = FastMnnParam(), multi.norm.args = list(), precomputed = NULL, model.var.args = list(), hvg.args = list(n = 5000) )
... |
One or more matrix-like objects containing single-cell gene expression matrices. Alternatively, one or more SingleCellExperiment objects can be supplied. If multiple objects are supplied, each object is assumed to contain all and only cells from a single batch.
Objects of different types can be mixed together.
If a single object is supplied, |
batch |
A factor specifying the batch of origin for each cell if only one batch is supplied in |
restrict |
A list of length equal to the number of objects in |
correct.all |
A logical scalar indicating whether to return corrected expression values for all genes, even if |
assay.type |
A string or integer scalar specifying the assay to use for correction. Only used for SingleCellExperiment inputs. |
PARAM |
A BatchelorParam object specifying the batch correction method to dispatch to, and the parameters with which it should be run.
ClassicMnnParam will dispatch to |
multi.norm.args |
Named list of further arguments to pass to |
precomputed |
List of DataFrames containing precomputed variance modelling results.
This should be a list of the same length as the number of entries in |
model.var.args |
Named list of further arguments to pass to |
hvg.args |
Named list of further arguments to pass to |
This function wraps the sequence of typical steps required to obtain corrected expression values.'
Intersecting each batch to the universe of common features with intersectRows
.
Applying normalization and log-transformation to the batches with multiBatchNorm
.
Modelling the per-gene variance with modelGeneVar
.
If precomputed
is supplied, the precomputed results are used instead.
Identifying highly variable genes with getTopHVGs
.
These genes will be used in the correction, though corrected values for all genes can be returned by setting correct.all=TRUE
.
Applying the batch correction algorithm of choice with batchCorrect
, as specified by PARAM
.
The default of correct.all=TRUE
differs from that of other functions.
This is because the subsetting to HVGs is done internally here, and we avoid surprises by returning results for all genes in the input object(s).
In contrast, the other functions require explicit subsetting via subset.row=
and it is expected that users will set correct.all=
if all genes are desired.
A list containing:
dec
, a DataFrame containing the combined variance modelling results across all batches.
hvgs
, a character vector of the genes selected to use in the correction.
corrected
, a SingleCellExperiment containing the corrected values across all cells in all batches.
Aaron Lun
intersectRows
, for the intersection to obtain a universe of genes.
multiBatchNorm
, to perform the normalization.
modelGeneVar
and getTopHVGs
, to identify the top HVGs for use in correction.
batchCorrect
, to dispatch to the desired correction algorithm.
d1 <- matrix(rnbinom(50000, mu=10, size=1), ncol=100) sce1 <- SingleCellExperiment(list(counts=d1)) sizeFactors(sce1) <- runif(ncol(d1)) rownames(sce1) <- paste0("GENE", 1:500) d2 <- matrix(rnbinom(20000, mu=50, size=1), ncol=40) sce2 <- SingleCellExperiment(list(counts=d2)) sizeFactors(sce2) <- runif(ncol(d2)) rownames(sce2) <- paste0("GENE", 201:700) # Fire and forget: set.seed(1000) output <- quickCorrect(sce1, sce2) output$corrected
d1 <- matrix(rnbinom(50000, mu=10, size=1), ncol=100) sce1 <- SingleCellExperiment(list(counts=d1)) sizeFactors(sce1) <- runif(ncol(d1)) rownames(sce1) <- paste0("GENE", 1:500) d2 <- matrix(rnbinom(20000, mu=50, size=1), ncol=40) sce2 <- SingleCellExperiment(list(counts=d2)) sizeFactors(sce2) <- runif(ncol(d2)) rownames(sce2) <- paste0("GENE", 201:700) # Fire and forget: set.seed(1000) output <- quickCorrect(sce1, sce2) output$corrected
MNN correction in reduced dimensions
reducedMNN( ..., batch = NULL, k = 20, prop.k = NULL, restrict = NULL, ndist = 3, merge.order = NULL, auto.merge = FALSE, min.batch.skip = 0, BNPARAM = KmknnParam(), BPPARAM = SerialParam() )
reducedMNN( ..., batch = NULL, k = 20, prop.k = NULL, restrict = NULL, ndist = 3, merge.order = NULL, auto.merge = FALSE, min.batch.skip = 0, BNPARAM = KmknnParam(), BPPARAM = SerialParam() )
... |
One or more matrices of low-dimensional representations where rows are cells and columns are dimensions.
Each object should contain the same number of columns, corresponding to the same dimensions.
These should have been generated by a single call to If multiple objects are supplied, each object is assumed to contain all and only cells from a single batch.
If a single object is supplied, Alternatively, any number of lists of such objects.
this is flattened as if the objects inside each list were passed directly to |
batch |
A vector or factor specifying the batch of origin for all cells when only a single object is supplied in |
k |
An integer scalar specifying the number of nearest neighbors to consider when identifying MNNs. |
prop.k |
A numeric scalar in (0, 1) specifying the proportion of cells in each dataset to use for mutual nearest neighbor searching.
If set, the number of nearest neighbors used for the MNN search in each batch is redefined as |
restrict |
A list of length equal to the number of objects in |
ndist |
A numeric scalar specifying the threshold beyond which neighbours are to be ignored when computing correction vectors. Each threshold is defined as a multiple of the number of median distances. |
merge.order |
An integer vector containing the linear merge order of batches in |
auto.merge |
Logical scalar indicating whether to automatically identify the “best” merge order. |
min.batch.skip |
Numeric scalar specifying the minimum relative magnitude of the batch effect, below which no correction will be performed at a given merge step. |
BNPARAM |
A BiocNeighborParam object specifying the nearest neighbor algorithm. |
BPPARAM |
A BiocParallelParam object specifying whether the PCA and nearest-neighbor searches should be parallelized. |
reducedMNN
performs the same operations as fastMNN
but assumes that the PCA has already been performed.
This is useful as the PCA (via multiBatchPCA
) is often the most time-consuming step.
By performing the PCA once, reducedMNN
allows the MNN correction to be quickly repeated with different parameters.
reducedMNN
operates on the same principles as fastMNN
,
so users are referred to the documentation for the latter for more details on the effect of each of the arguments.
Obviously, any arguments pertaining to gene-based steps in fastMNN
are not relevant here.
Note that multiBatchPCA
will not perform cosine-normalization,
so it is the responsibility of the user to cosine-normalize each batch beforehand with cosineNorm
to recapitulate results of fastMNN
with cos.norm=TRUE
.
In addition, multiBatchPCA
must be run on all samples at once, to ensure that all cells are projected to the same low-dimensional space.
A DataFrame is returned where each row corresponds to a cell, containing:
corrected
, the matrix of corrected low-dimensional coordinates for each cell.
batch
, the Rle specifying the batch of origin for each row.
Cells in the output object are always ordered in the same manner as supplied in ...
.
The metadata on this object is the same as that in the output of fastMNN
.
Aaron Lun
multiBatchPCA
, to obtain the values to be corrected.
fastMNN
, for the version that operates on gene-expression values.
clusterMNN
, for an application on cluster centroids.
B1 <- matrix(rnorm(10000), nrow=50) # Batch 1 B2 <- matrix(rnorm(10000), nrow=50) # Batch 2 # Corrected values equivalent to fastMNN(). cB1 <- cosineNorm(B1) cB2 <- cosineNorm(B2) pcs <- multiBatchPCA(cB1, cB2) mnn.out <- reducedMNN(pcs[[1]], pcs[[2]]) mnn.out
B1 <- matrix(rnorm(10000), nrow=50) # Batch 1 B2 <- matrix(rnorm(10000), nrow=50) # Batch 2 # Corrected values equivalent to fastMNN(). cB1 <- cosineNorm(B1) cB2 <- cosineNorm(B2) pcs <- multiBatchPCA(cB1, cB2) mnn.out <- reducedMNN(pcs[[1]], pcs[[2]]) mnn.out
Fit a linear model to each gene regress out uninteresting factors of variation, returning a matrix of residuals.
regressBatches( ..., batch = NULL, design = NULL, keep = NULL, restrict = NULL, subset.row = NULL, correct.all = FALSE, d = NA, deferred = TRUE, assay.type = "logcounts", BSPARAM = IrlbaParam(), BPPARAM = SerialParam() )
regressBatches( ..., batch = NULL, design = NULL, keep = NULL, restrict = NULL, subset.row = NULL, correct.all = FALSE, d = NA, deferred = TRUE, assay.type = "logcounts", BSPARAM = IrlbaParam(), BPPARAM = SerialParam() )
... |
One or more log-expression matrices where genes correspond to rows and cells correspond to columns.
Alternatively, one or more SingleCellExperiment objects can be supplied containing a log-expression matrix in the If multiple objects are supplied, each object is assumed to contain all and only cells from a single batch.
If a single object is supplied, it is assumed to contain cells from all batches, so Alternatively, one or more lists of matrices or SingleCellExperiments can be provided;
this is flattened as if the objects inside each list were passed directly to |
batch |
A vector or factor specifying the batch of origin for all cells when only a single object is supplied in |
design |
A numeric design matrix with number of rows equal to the total number of cells,
specifying the experimental factors to remove.
Each row corresponds to a cell in the order supplied in |
keep |
Integer vector specifying the coefficients of |
restrict |
A list of length equal to the number of objects in |
subset.row |
A vector specifying which features to use for correction. |
correct.all |
Logical scalar indicating whether corrected expression values should be computed for genes not in |
d |
Numeric scalar specifying the number of dimensions to use for PCA via |
deferred |
Logical scalar indicating whether to defer centering/scaling, see |
assay.type |
A string or integer scalar specifying the assay containing the log-expression values. Only used for SingleCellExperiment inputs. |
BSPARAM |
A BiocSingularParam object specifying the algorithm to use for PCA in |
BPPARAM |
A BiocParallelParam object specifying whether the PCA should be parallelized. |
This function fits a linear model to the log-expression values for each gene and returns the residuals.
By default, the model is parameterized as a one-way layout with the batch of origin,
so the residuals represent the expression values after correcting for the batch effect.
The novelty of this function is that it returns a ResidualMatrix in as the "corrected"
assay.
This avoids explicitly computing the residuals, which would result in a loss of sparsity or similar problems.
Rather, residuals are either computed as needed or are never explicitly computed at all (e.g., during matrix multiplication).
This means that regressBatches
is faster and lighter than naive regression or even rescaleBatches
.
More complex designs should be explicitly specified with the design
argument, e.g., to regress out a covariate.
This can be any full-column-rank matrix that is typically constructed with model.matrix
.
If design
is specified with a single object in ...
, batch
is ignored.
If design
is specified with multiple objects, regression is applied to the matrix obtained by cbind
ing all of those objects together; this means that the first few rows of design
correspond to the cells from the first object, then the next rows correspond to the second object and so on.
Like rescaleBatches
, this function assumes that the batch effect is orthogonal to the interesting factors of variation.
For example, each batch is assumed to have the same composition of cell types.
The same reasoning applies to any uninteresting factors specified in design
, including continuous variables.
For example, if one were to use this function to regress out cell cycle, the assumption is that all cell types are similarly distributed across cell cycle phases.
If this is not true, the correction will not only be incomplete but can introduce spurious differences.
See ?"batchelor-restrict"
for a description of the restrict
argument.
Specifically, this function will compute the model coefficients using only the specified subset of cells.
The regression will then be applied to all cells in each batch.
If set, the d
option will perform a PCA via multiBatchPCA
.
This is provided for convenience as efficiently executing a PCA on a ResidualMatrix is not always intuitive.
(Specifically, BiocSingularParam objects must be set up with deferred=TRUE
for best performance.)
The arguments BSPARAM
, deferred
and BPPARAM
only have an effect when d
is set to a non-NA
value.
All genes are used with the default setting of subset.row=NULL
.
If a subset of genes is specified, residuals are only returned for that subset.
Similarly, if d
is set, only the genes in the subset are used to perform the PCA.
If additionally correct.all=TRUE
, residuals are returned for all genes but only the subset is used for the PCA.
A SingleCellExperiment object containing the corrected
assay.
This contains the computed residuals for each gene (row) in each cell (column) in each batch.
A batch
field is present in the column data, specifying the batch of origin for each cell.
Cells in the output object are always ordered in the same manner as supplied in ...
.
For a single input object, cells will be reported in the same order as they are arranged in that object.
In cases with multiple input objects, the cell identities are simply concatenated from successive objects,
i.e., all cells from the first object (in their provided order), then all cells from the second object, and so on.
If d
is not NA
, a PCA is performed on the residual matrix via multiBatchPCA
,
and an additional corrected
field is present in the reducedDims
of the output object.
Aaron Lun
rescaleBatches
, for another approach to regressing out the batch effect.
The ResidualMatrix class, for the class of the residual matrix.
applyMultiSCE
, to apply this across multiple altExps
.
means <- 2^rgamma(1000, 2, 1) A1 <- matrix(rpois(10000, lambda=means), ncol=50) # Batch 1 A2 <- matrix(rpois(10000, lambda=means*runif(1000, 0, 2)), ncol=50) # Batch 2 B1 <- log2(A1 + 1) B2 <- log2(A2 + 1) out <- regressBatches(B1, B2)
means <- 2^rgamma(1000, 2, 1) A1 <- matrix(rpois(10000, lambda=means), ncol=50) # Batch 1 A2 <- matrix(rpois(10000, lambda=means*runif(1000, 0, 2)), ncol=50) # Batch 2 B1 <- log2(A1 + 1) B2 <- log2(A2 + 1) out <- regressBatches(B1, B2)
Scale counts so that the average count within each batch is the same for each gene.
rescaleBatches( ..., batch = NULL, restrict = NULL, log.base = 2, pseudo.count = 1, subset.row = NULL, correct.all = FALSE, assay.type = "logcounts" )
rescaleBatches( ..., batch = NULL, restrict = NULL, log.base = 2, pseudo.count = 1, subset.row = NULL, correct.all = FALSE, assay.type = "logcounts" )
... |
One or more log-expression matrices where genes correspond to rows and cells correspond to columns.
Alternatively, one or more SingleCellExperiment objects can be supplied containing a log-expression matrix in the If multiple objects are supplied, each object is assumed to contain all and only cells from a single batch.
If a single object is supplied, it is assumed to contain cells from all batches, so Alternatively, one or more lists of matrices or SingleCellExperiments can be provided;
this is flattened as if the objects inside each list were passed directly to |
batch |
A vector or factor specifying the batch of origin for all cells when only a single object is supplied in |
restrict |
A list of length equal to the number of objects in |
log.base |
A numeric scalar specifying the base of the log-transformation. |
pseudo.count |
A numeric scalar specifying the pseudo-count used for the log-transformation. |
subset.row |
A vector specifying which features to use for correction. |
correct.all |
Logical scalar indicating whether corrected expression values should be computed for genes not in |
assay.type |
A string or integer scalar specifying the assay containing the log-expression values. Only used for SingleCellExperiment inputs. |
This function assumes that the log-expression values were computed by a log-transformation of normalized count data, plus a pseudo-count. It reverses the log-transformation and scales the underlying counts in each batch so that the average (normalized) count is equal across batches. The assumption here is that each batch contains the same population composition. Thus, any scaling difference between batches is technical and must be removed.
This function is approximately equivalent to centering in log-expression space, the simplest application of linear regression methods for batch correction. However, by scaling the raw counts, it avoids loss of sparsity that would otherwise result from centering. It also mitigates issues with artificial differences in variance due to log-transformation. This is done by always downscaling to the lowest average expression for each gene such that differences in variance are dampened by the addition of the pseudo-count.
Use of rescaleBatches
assumes that the uninteresting factors described in design
are orthogonal to the interesting factors of variation.
For example, each batch is assumed to have the same composition of cell types.
If this is not true, the correction will not only be incomplete but may introduce spurious differences.
The output values are always re-log-transformed with the same log.base
and pseudo.count
.
These can be used directly in place of the input values for downstream operations.
All genes are used with the default setting of subset.row=NULL
.
Users can set subset.row
to subset the inputs, though this is purely for convenience as each gene is processed independently of other genes.
See ?"batchelor-restrict"
for a description of the restrict
argument.
Specifically, the function will compute the scaling differences using only the specified subset of cells, and then apply the re-scaling to all cells in each batch.
A SingleCellExperiment object containing the corrected
assay.
This contains corrected log-expression values for each gene (row) in each cell (column) in each batch.
A batch
field is present in the column data, specifying the batch of origin for each cell.
Cells in the output object are always ordered in the same manner as supplied in ...
.
For a single input object, cells will be reported in the same order as they are arranged in that object.
In cases with multiple input objects, the cell identities are simply concatenated from successive objects,
i.e., all cells from the first object (in their provided order), then all cells from the second object, and so on.
Aaron Lun
regressBatches
, for a residual calculation based on a fitted linear model.
applyMultiSCE
, to apply this function over multiple altExps
.
means <- 2^rgamma(1000, 2, 1) A1 <- matrix(rpois(10000, lambda=means), ncol=50) # Batch 1 A2 <- matrix(rpois(10000, lambda=means*runif(1000, 0, 2)), ncol=50) # Batch 2 B1 <- log2(A1 + 1) B2 <- log2(A2 + 1) out <- rescaleBatches(B1, B2)
means <- 2^rgamma(1000, 2, 1) A1 <- matrix(rpois(10000, lambda=means), ncol=50) # Batch 1 A2 <- matrix(rpois(10000, lambda=means*runif(1000, 0, 2)), ncol=50) # Batch 2 B1 <- log2(A1 + 1) B2 <- log2(A2 + 1) out <- rescaleBatches(B1, B2)