Title: | Multi-Modal Single-Cell Analysis Methods |
---|---|
Description: | Assorted utilities for multi-modal analyses of single-cell datasets. Includes functions to combine multiple modalities for downstream analysis, perform MNN-based batch correction across multiple modalities, and to compute correlations between assay values for different modalities. |
Authors: | Aaron Lun [aut, cre] |
Maintainer: | Aaron Lun <[email protected]> |
License: | GPL-3 |
Version: | 1.15.0 |
Built: | 2024-12-29 07:42:52 UTC |
Source: | https://github.com/bioc/mumosa |
Compute Spearman correlations between two sets of features, using data collected for the same cells in different modalities.
computeCorrelations(x, y, ...) ## S4 method for signature 'ANY' computeCorrelations( x, y, subset.cols = NULL, block = NULL, equiweight = TRUE, use.names = TRUE, BPPARAM = SerialParam() ) ## S4 method for signature 'SummarizedExperiment' computeCorrelations(x, y, use.names = TRUE, ..., assay.type = "logcounts")
computeCorrelations(x, y, ...) ## S4 method for signature 'ANY' computeCorrelations( x, y, subset.cols = NULL, block = NULL, equiweight = TRUE, use.names = TRUE, BPPARAM = SerialParam() ) ## S4 method for signature 'SummarizedExperiment' computeCorrelations(x, y, use.names = TRUE, ..., assay.type = "logcounts")
x , y
|
Normalized expression matrices containing features in the rows and cells in the columns. Each matrix should have the same set of columns but a different set of features, usually corresponding to different modes for the same cells. Alternatively, SummarizedExperiment objects containing such a matrix. Finally, |
... |
For the generic, further arguments to pass to specific methods. For the SummarizedExperiment method, further arguments to pass to the ANY method. |
subset.cols |
Vector indicating the columns of |
block |
A vector or factor of length equal to the number of cells, specifying the block of origin for each cell. |
equiweight |
Logical scalar indicating whether each block should be given equal weight, if |
use.names |
Logical scalar specifying whether row names of For the SummarizedExperiment method, this may also be a string specifying the |
BPPARAM |
A BiocParallelParam object specifying the parallelization scheme to use. |
assay.type |
String or integer scalar specifying the assay containing the matrix of interest in |
If block
is specified, correlations are computed separately for each block of cells.
For each feature pair, the reported rho
is set to the average of the correlations across all blocks.
If equiweight=FALSE
, the average is weighted by the number of cells in each block.
Similarly, the p-value corresponding to each correlation is computed separately for each block and then combined across blocks with Stouffer's method.
More specifically, combining is done using the one-sided p-values for both signs of the correlation, and the smaller p-value is taken (and multiplied by 2).
This ensures that a low p-value can only be achieved if the blocks agree in the sign.
If equiweight=FALSE
, each per-block p-value is weighted by the number of cells.
A DataFrame where each row corresponds to a pair of features in x
and y
.
(If y=NULL
, each pair corresponds to a pair of features in x
.)
This contains the following fields:
feature1
, the name (character) or row index (integer) of each feature in x
.
feature2
, the name (character) or row index (integer) of one of the top correlated features to feature1
.
This is another feature in x
if y=NULL
, otherwise it is a feature in y
.
rho
, the Spearman rank correlation for the current pair of feature1
and feature2
.
p.value
, the approximate p-value associated with rho
under the null hypothesis that the correlation is zero.
FDR
, the adjusted p-value.
The rows are sorted by feature1
and then p.value
.
Aaron Lun
findTopCorrelations
, to avoid computing correlations for all pairs of features when y
has many rows.
library(scuttle) sce1 <- mockSCE() sce1 <- logNormCounts(sce1) sce2 <- mockSCE(ngenes=10) # pretend this is protein data. sce2 <- logNormCounts(sce2) output <- computeCorrelations(sce1, sce2) output
library(scuttle) sce1 <- mockSCE() sce1 <- logNormCounts(sce1) sce2 <- mockSCE(ngenes=10) # pretend this is protein data. sce2 <- logNormCounts(sce2) output <- computeCorrelations(sce1, sce2) output
For each feature, find the subset of other features in the same or another modality that have strongest positive/negative Spearman's rank correlations in a pair of normalized expression matrices.
findTopCorrelations(x, number, ...) ## S4 method for signature 'ANY' findTopCorrelations( x, number = 10, y = NULL, d = 50, direction = c("both", "positive", "negative"), subset.cols = NULL, block = NULL, equiweight = TRUE, use.names = TRUE, deferred = TRUE, BSPARAM = IrlbaParam(), BNPARAM = KmknnParam(), BPPARAM = SerialParam() ) ## S4 method for signature 'SummarizedExperiment' findTopCorrelations( x, number, y = NULL, use.names = TRUE, ..., assay.type = "logcounts" )
findTopCorrelations(x, number, ...) ## S4 method for signature 'ANY' findTopCorrelations( x, number = 10, y = NULL, d = 50, direction = c("both", "positive", "negative"), subset.cols = NULL, block = NULL, equiweight = TRUE, use.names = TRUE, deferred = TRUE, BSPARAM = IrlbaParam(), BNPARAM = KmknnParam(), BPPARAM = SerialParam() ) ## S4 method for signature 'SummarizedExperiment' findTopCorrelations( x, number, y = NULL, use.names = TRUE, ..., assay.type = "logcounts" )
x , y
|
Normalized expression matrices containing features in the rows and cells in the columns. Each matrix should have the same set of columns but a different set of features, usually corresponding to different modes for the same cells. Alternatively, SummarizedExperiment objects containing such a matrix. Finally, |
number |
Integer scalar specifying the number of top correlated features to report for each feature in |
... |
For the generic, further arguments to pass to specific methods. For the SummarizedExperiment method, further arguments to pass to the ANY method. |
d |
Integer scalar specifying the number of dimensions to use for the approximate search via PCA.
If |
direction |
String specifying the sign of the correlations to search for. |
subset.cols |
Vector indicating the columns of |
block |
A vector or factor of length equal to the number of cells, specifying the block of origin for each cell. |
equiweight |
Logical scalar indicating whether each block should be given equal weight, if |
use.names |
Logical scalar specifying whether row names of For the SummarizedExperiment method, this may also be a string specifying the |
deferred |
Logical scalar indicating whether a fast deferred calculation should be used for the rank-based PCA. |
BSPARAM |
A BiocSingularParam object specifying the algorithm to use for the PCA. |
BNPARAM |
A BiocNeighborParam object specifying the algorithm to use for the neighbor search. |
BPPARAM |
A BiocParallelParam object specifying the parallelization scheme to use. |
assay.type |
String or integer scalar specifying the assay containing the matrix of interest in |
In most cases, we only care about the top-correlated features, allowing us to skip a lot of unnecessary computation. This is achieved by transforming the problem of finding the largest Spearman correlation into a nearest-neighbor search in rank space. For the sake of speed, we approximate the search by performing PCA to compress the rank values for all features.
For each direction, we compute the one-sided p-value for each feature using the approximate method implemented in cor.test
.
The FDR correction is performed by considering all possible pairs of features, as these are implicitly tested in the neighbor search.
Note that this is somewhat conservative as it does not consider strong correlations outside the reported features.
If block
is specified, correlations are computed separately for each block of cells.
For each feature pair, the reported rho
is set to the average of the correlations across all blocks.
Similarly, the p-value corresponding to each correlation is computed separately for each block and then combined across blocks with Stouffer's method.
If equiweight=FALSE
, the average correlation and each per-block p-value is weighted by the number of cells.
We only consider pairs of features that have computable correlations in at least one block.
Blocks are ignored if one or the other feature has tied values (typically zeros) for all cells in that block.
This means that a feature may not have any entries in feature1
if it forms no valid pairs, e.g., because it is not expressed.
Similarly, the total number of rows may be less than the maximum if insufficient valid pairs are available.
A List containing one or two DataFrames for results in each direction.
These are named "positive"
and "negative"
, and are generated according to direction
;
if direction="both"
, both DataFrames will be present.
Each DataFrame has up to nrow(x) * number
rows, containing the top number
correlated features for each feature in x
.
This contains the following fields:
feature1
, the name (character) or row index (integer) of each feature in x
.
Not all features may be reported here, see Details.
feature2
, the name (character) or row index (integer) of one of the top correlated features to feature1
.
This is another feature in x
if y=NULL
, otherwise it is a feature in y
.
rho
, the Spearman rank correlation for the current pair of feature1
and feature2
.
p.value
, the approximate p-value associated with rho
under the null hypothesis that the correlation is zero.
FDR
, the adjusted p-value.
The rows are sorted by feature1
and then p.value
.
Aaron Lun
computeCorrelations
, to compute correlations for all pairs of features.
library(scuttle) sce1 <- mockSCE() sce1 <- logNormCounts(sce1) sce2 <- mockSCE(ngenes=20) # pretend this is CITE-seq data, or something. sce2 <- logNormCounts(sce2) # Top 20 correlated features in 'sce2' for each feature in 'sce1': df <- findTopCorrelations(sce1, sce2, number=20) df
library(scuttle) sce1 <- mockSCE() sce1 <- logNormCounts(sce1) sce2 <- mockSCE(ngenes=20) # pretend this is CITE-seq data, or something. sce2 <- logNormCounts(sce2) # Top 20 correlated features in 'sce2' for each feature in 'sce1': df <- findTopCorrelations(sce1, sce2, number=20) df
Intersect pre-defined clusters from multiple modalities, pruning out combinations that are poorly separated based on the within-cluster sum of squares (WCSS).
intersectClusters(clusters, coords, scale = 1, BPPARAM = SerialParam())
intersectClusters(clusters, coords, scale = 1, BPPARAM = SerialParam())
clusters |
A list of factors or vectors of the same length. Each element corresponds to one modality and contains the cluster assignments for the same set of cells. |
coords |
A list of matrices of length equal to |
scale |
Numeric scalar specifying the scaling factor to apply to the limit on the WCSS for each modality. |
BPPARAM |
A BiocParallelParam object specifying how parallelization should be performed. |
We intersect clusters by only considering two cells to be in the same “output” cluster if they are also clustered together in each modality.
In other words, all cells with a particular combination of identities in clusters
are assigned to a separate output cluster.
The simplest implementation of the above idea suffers from noise in the cluster definitions that introduces combinations with very few cells. We eliminate these by greedily merging pairs of combinations, starting with the pairs that minimize the gain in the WCSS. In this process, we only consider pairs of combinations that share at least cluster across all modalities (to avoid merges across unrelated clusters).
A natural stopping point for this merging process is when the WCSS of the output clustering exceeds the WCSS of the original clustering for any modality. This aims to preserve the original clustering in each modality by preventing overly aggressive merges that would greatly increase the WCSS, while reducing the complexity of the output clustering by ensuring that the variance explained is comparable.
Users can increase the aggressiveness of the merging procedure by increasing scale
, e.g., to 1.05 or 1.
This will scale up the limit on the WCSS, allowing more merges to be performed before termination.
An integer vector of length equal to the number of cells, containing the assignments to the output clusters.
Aaron Lun
mat1 <- matrix(rnorm(10000), ncol=20) chosen <- 1:250 mat1[chosen,1] <- mat1[chosen,1] + 10 clusters1 <- kmeans(mat1, 5)$cluster table(clusters1, chosen=mat1[,1] > 5) # Pretending we have some other data for the same cells, e.g., ADT. mat2 <- matrix(rnorm(10000), ncol=20) chosen <- c(1:125, 251:375) mat2[chosen,2] <- mat2[chosen,2] + 10 clusters2 <- kmeans(mat2, 5)$cluster table(clusters2, mat2[,2] > 5) # Intersecting the clusters: clusters3 <- intersectClusters(list(clusters1, clusters2), list(mat1, mat2)) table(clusters3, mat1[,1] > 5) table(clusters3, mat2[,2] > 5)
mat1 <- matrix(rnorm(10000), ncol=20) chosen <- 1:250 mat1[chosen,1] <- mat1[chosen,1] + 10 clusters1 <- kmeans(mat1, 5)$cluster table(clusters1, chosen=mat1[,1] > 5) # Pretending we have some other data for the same cells, e.g., ADT. mat2 <- matrix(rnorm(10000), ncol=20) chosen <- c(1:125, 251:375) mat2[chosen,2] <- mat2[chosen,2] + 10 clusters2 <- kmeans(mat2, 5)$cluster table(clusters2, mat2[,2] > 5) # Intersecting the clusters: clusters3 <- intersectClusters(list(clusters1, clusters2), list(mat1, mat2)) table(clusters3, mat1[,1] > 5) table(clusters3, mat2[,2] > 5)
Intersect two graphs by taking the product of their edge weights.
intersectGraphs(..., mixing = NULL, nominal = 1e-06)
intersectGraphs(..., mixing = NULL, nominal = 1e-06)
... |
Any number of graph objects for the same set and order of nodes. |
mixing |
Numeric vector of length equal to the number of graphs, specifying the mixing weights for the edge weights. |
nominal |
Numeric scalar specifying the scaling factor to compute the nominal weight. |
The idea of taking an intersection of a weighted graph is based on the intersection of simplicial sets in the UMAP algorithm. For each edge that exists in either graph, we compute the product of the weights across all graphs and assign that value as the edge weight in the output graph. This means that edges in the output only have high weight if they are present and highly weighted in all graphs - hence, an intersection.
This approach would make for a very sparse graph if the product was taken directly.
To maintain some connectivity, edges that exist in one graph but not the other are assigned nominal weights in the latter to ensure that the product is not zero.
The nominal weight for each graph is defined as the product of its smallest non-zero edge weight and nominal
.
Decreasing this value will yield a more conservative intersection and a less connected graph, usually manifesting as smaller clusters after application of community detection algorithms.
By default, mixing
is a vector of length equal to the numbeer of graphs, containing values of 1.
This means that edge weights from each graph in ...
contribute equally to the product.
However, it is possible to increase the contribution of some of the graphs by supplying a higher mixing
values for those graphs.
Unweighted graphs are supported and are considered to have edge weights of 1.
A graph object containing the intersection of g1
and g2
.
Aaron Lun
library(scran) mat1 <- matrix(rnorm(10000), ncol=20) chosen <- 1:250 mat1[chosen,1] <- mat1[chosen,1] + 10 g1 <- buildSNNGraph(mat1, d=NA, transposed=TRUE) clusters1 <- igraph::cluster_walktrap(g1)$membership table(clusters1, chosen=mat1[,1] > 5) # Pretending we have some other data for the same cells, e.g., ADT. mat2 <- matrix(rnorm(10000), ncol=20) chosen <- c(1:125, 251:375) mat2[chosen,2] <- mat2[chosen,2] + 10 g2 <- buildSNNGraph(mat2, d=NA, transposed=TRUE) clusters2 <- igraph::cluster_walktrap(g2)$membership table(clusters2, mat2[,2] > 5) # Intersecting the graphs and clustering: gcom <- intersectGraphs(g1, g2) clustersC <- igraph::cluster_walktrap(gcom)$membership table(clustersC, mat1[,1] > 5) table(clustersC, mat2[,2] > 5)
library(scran) mat1 <- matrix(rnorm(10000), ncol=20) chosen <- 1:250 mat1[chosen,1] <- mat1[chosen,1] + 10 g1 <- buildSNNGraph(mat1, d=NA, transposed=TRUE) clusters1 <- igraph::cluster_walktrap(g1)$membership table(clusters1, chosen=mat1[,1] > 5) # Pretending we have some other data for the same cells, e.g., ADT. mat2 <- matrix(rnorm(10000), ncol=20) chosen <- c(1:125, 251:375) mat2[chosen,2] <- mat2[chosen,2] + 10 g2 <- buildSNNGraph(mat2, d=NA, transposed=TRUE) clusters2 <- igraph::cluster_walktrap(g2)$membership table(clusters2, mat2[,2] > 5) # Intersecting the graphs and clustering: gcom <- intersectGraphs(g1, g2) clustersC <- igraph::cluster_walktrap(gcom)$membership table(clustersC, mat1[,1] > 5) table(clustersC, mat2[,2] > 5)
Perform MNN correction on multi-modal data, based on a generalization of fastMNN
to multiple feature sets.
multiModalMNN( ..., batch = NULL, which = NULL, rescale.k = 50, common.args = list(), main.args = list(), alt.args = list(), mnn.args = list(), BNPARAM = KmknnParam(), BPPARAM = SerialParam() )
multiModalMNN( ..., batch = NULL, which = NULL, rescale.k = 50, common.args = list(), main.args = list(), alt.args = list(), mnn.args = list(), BNPARAM = KmknnParam(), BPPARAM = SerialParam() )
... |
One or more SingleCellExperiment objects, containing a shared set of alternative Experiments corresponding to different data modalities. Alternatively, one or more lists of such objects. |
batch |
Factor specifying the batch to which each cell belongs, when |
which |
Character vector containing the names of the alternative Experiments to use for correction.
Defaults to the names of all alternative Experiments that are present in every object of |
rescale.k |
Integer scalar specifying the number of neighbors to use in |
common.args |
Named list of further arguments to control the PCA for all modalities. |
main.args |
Named list of further arguments to control the PCA for the main Experiments.
Overrides any arguments of the same name in |
alt.args |
Named list of named lists of further arguments to control the PCA for each alternative Experiment specified in |
mnn.args |
Further arguments to pass to |
BNPARAM |
A BiocNeighborParam object specifying how the nearest neighbor searches should be performed. |
BPPARAM |
A BiocParallelParam object specifying how parallelization should be performed. |
This function implements a multi-modal MNN correction for SingleCellExperiment inputs where each main and alternative Experiment corresponds to one modality.
We perform a PCA within each modality with multiBatchPCA
,
rescale the PCs to be of a comparable scale with rescaleByNeighbors
,
and finally correct in low-dimensional space with reducedMNN
.
Corrected expression values for each modality are then recovered in the same manner as described for fastMNN
.
Modality-specific arguments can be passed to the PCA step via the common.args
, main.args
and alt.args
arguments.
These mirror the corresponding arguments in applyMultiSCE
- see the documentation for that function for more details.
Additional arguments for the MNN step can be passed via mnn.args
.
Note that batch
is used across all steps and must be specified as its own argument in the multiModalMNN
function signature.
Most arguments in multiBatchPCA
can be specified in common.args
, main.args
or each entry of alt.args
.
This includes passing d=NA
to turn off the PCA or subset.row
to only use a subset of features for the PCA.
Additionally, the following arguments are supported:
By default, a cosine-normalization is performed prior to the PCA for each modality.
This can be disabled by passing cos.norm=FALSE
to common.args
, main.args
or each entry of alt.args
.
Setting correct.all
will reported corrected expression values for all features even when subset.row
is specified.
This can be used in common.args
, main.args
or each entry of alt.args
.
Note that the function will look for assay.type="logcounts"
by default in each entry of ...
.
Users should perform log-normalization prior to calling multiModalMNN
, most typically with multiBatchNorm
- see Examples.
A SingleCellExperiment of the same structure as that returned by fastMNN
,
i.e., with a corrected
entry of corrected low-dimensional coordinates and a reconstructed
assay of corrected expression values.
In addition, the altExps
entries contain corrected values for each data modality used in the correction.
Aaron Lun
fastMNN
, for MNN correction within a single modality.
multiBatchPCA
, to perform a batch-aware PCA within each modality.
applyMultiSCE
, which inspired this interface for Experiment-specific arguments.
# Mocking up a gene expression + ADT dataset: library(scater) exprs_sce <- mockSCE() adt_sce <- mockSCE(ngenes=20) altExp(exprs_sce, "ADT") <- adt_sce # Pretend we have three batches for the sake of demonstration: batch <- sample(1:3, ncol(exprs_sce), replace=TRUE) # Normalizing first with batchelor::multiBatchNorm: library(batchelor) exprs_sce <- applyMultiSCE(exprs_sce, batch=batch, FUN=multiBatchNorm) # and perform batch correction: corrected <- multiModalMNN(exprs_sce, batch=batch, which="ADT") # Pass arguments to, e.g., use a subset of features for the RNA, # turn off the PCA for the ADTs: corrected2 <- multiModalMNN(exprs_sce, batch=batch, which="ADT", main.args=list(subset.row=1:500), alt.args=list(ADT=list(d=NA)))
# Mocking up a gene expression + ADT dataset: library(scater) exprs_sce <- mockSCE() adt_sce <- mockSCE(ngenes=20) altExp(exprs_sce, "ADT") <- adt_sce # Pretend we have three batches for the sake of demonstration: batch <- sample(1:3, ncol(exprs_sce), replace=TRUE) # Normalizing first with batchelor::multiBatchNorm: library(batchelor) exprs_sce <- applyMultiSCE(exprs_sce, batch=batch, FUN=multiBatchNorm) # and perform batch correction: corrected <- multiModalMNN(exprs_sce, batch=batch, which="ADT") # Pass arguments to, e.g., use a subset of features for the RNA, # turn off the PCA for the ADTs: corrected2 <- multiModalMNN(exprs_sce, batch=batch, which="ADT", main.args=list(subset.row=1:500), alt.args=list(ADT=list(d=NA)))
Rescale matrices for different data modalities so that their distances are more comparable, using the distances to neighbors to approximate noise.
rescaleByNeighbors(x, ...) ## S4 method for signature 'ANY' rescaleByNeighbors( x, k = 50, weights = NULL, combine = TRUE, num.threads = 1, BNPARAM = NULL, BPPARAM = NULL ) ## S4 method for signature 'SummarizedExperiment' rescaleByNeighbors(x, assays, extras = list(), ...) ## S4 method for signature 'SingleCellExperiment' rescaleByNeighbors( x, assays = NULL, dimreds = NULL, altexps = NULL, altexp.assay = "logcounts", extras = list(), ... )
rescaleByNeighbors(x, ...) ## S4 method for signature 'ANY' rescaleByNeighbors( x, k = 50, weights = NULL, combine = TRUE, num.threads = 1, BNPARAM = NULL, BPPARAM = NULL ) ## S4 method for signature 'SummarizedExperiment' rescaleByNeighbors(x, assays, extras = list(), ...) ## S4 method for signature 'SingleCellExperiment' rescaleByNeighbors( x, assays = NULL, dimreds = NULL, altexps = NULL, altexp.assay = "logcounts", extras = list(), ... )
x |
A list of numeric matrices where each row is a cell and each column is some dimension/variable. For gene expression data, this is usually the matrix of PC coordinates. All matrices should have the same number of rows. Alternatively, a SummarizedExperiment containing relevant matrices in its assays. Alternatively, a SingleCellExperiment containing relevant matrices in its assays, |
... |
For the generic, further arguments to pass to specific methods. For the SummarizedExperiment and SingleCellExperiment methods, further arguments to pass to the ANY method. |
k |
An integer scalar specifying the number of neighbors to use for the distance calculation. |
weights |
A numeric vector of length equal to |
combine |
A logical scalar specifying whether the rescaled matrices should be combined into a single matrix. |
num.threads |
Integer scalar specifying the number of threads to use for the neighbor search. |
BNPARAM |
A BiocNeighborParam object specifying the algorithm to use for the nearest-neighbor search. |
BPPARAM |
Deprecated, use |
assays |
A character or integer vector of assays to extract and transpose for use in the ANY method. For the SingleCellExperiment, this argument can be missing, in which case no assays are used. |
extras |
A list of further matrices of similar structure to those matrices in a list-like |
dimreds |
A character or integer vector of |
altexps |
A character or integer vector of |
altexp.assay |
A character or integer vector specifying the assay to extract from alternative experiments, when |
When dealing with multi-modal data, we may wish to combine all modes into a single matrix for downstream processing.
However, a naive cbind
does not account for the fact that different modes may very different scales and number of features.
A mode with a larger scale or more features may dominate steps such as clustering or dimensionality reduction.
This function attempts to rescale the contents for each matrix so that the modes are more comparable.
A naive approach to rescaling would be to just equalize the total variances across matrices. This is not ideal as it fails to consider the differences in biological variation captured by each mode. For example, if a biological phenomenon is only present in one mode, that matrix's total variance would naturally be higher. Scaling all matrices to the same total variance would suppress genuine variation and inflate the relative contribution of noise.
We instead use the distance to the k
th nearest neighbor as an estimate of the per-mode “noise”.
Modes with more features or higher technical noise will have larger distances,
and downscaling each matrix by the median distance will correct for differences between modes.
At the same time, by only considering the nearest neighbors,
we avoid capturing (and inadvertently eliminating) variance due to mode-specific population structure.
The default approach is to weight each mode equally during the rescaling process,
i.e., the median distance to the k
th nearest neighbor will be equal for all modes after rescaling.
However, we can also set weights
to control the fold-differences in the median distances.
For example, a weight of 2 for one mode would mean that its median distance after rescaling is twice as large as that from a mode with a weight of 1.
This may be useful for prioritizing modes that are more likely to be important.
The correspondence between non-NULL
weights
and the modes is slightly tricky whe x
is not a list.
If x
is a SummarizedExperiment, the modes are ordered as: all entries in assays
in the specified order, then all entries in extras
.
If x
is a SingleCellExperiment, the modes are ordered as: all entries in assays
in the specified order,
then all entries in dimreds
, then all entries in altexps
, and finally all entries in extras
.
A numeric matrix with number of rows equal to the number of cells,
where the columns span all variables across all modes supplied in x
.
Values are scaled so that each mode contributes the specified weight to downstream Euclidean distance calculations.
If combine=FALSE
, a list of rescaled matrices is returned instead.
Aaron Lun
# Mocking up a gene expression + ADT dataset: library(scater) exprs_sce <- mockSCE() exprs_sce <- logNormCounts(exprs_sce) exprs_sce <- runPCA(exprs_sce) adt_sce <- mockSCE(ngenes=20) adt_sce <- logNormCounts(adt_sce) altExp(exprs_sce, "ADT") <- adt_sce combined <- rescaleByNeighbors(exprs_sce, dimreds="PCA", altexps="ADT") dim(combined)
# Mocking up a gene expression + ADT dataset: library(scater) exprs_sce <- mockSCE() exprs_sce <- logNormCounts(exprs_sce) exprs_sce <- runPCA(exprs_sce) adt_sce <- mockSCE(ngenes=20) adt_sce <- logNormCounts(adt_sce) altExp(exprs_sce, "ADT") <- adt_sce combined <- rescaleByNeighbors(exprs_sce, dimreds="PCA", altexps="ADT") dim(combined)
Perform UMAP with multiple input matrices by intersecting their simplicial sets. Typically used to combine results from multiple data modalities into a single embedding.
calculateMultiUMAP(x, ...) ## S4 method for signature 'ANY' calculateMultiUMAP(x, ..., metric = "euclidean") ## S4 method for signature 'SummarizedExperiment' calculateMultiUMAP(x, assays, extras = list(), ...) ## S4 method for signature 'SingleCellExperiment' calculateMultiUMAP( x, assays = NULL, dimreds = NULL, altexps = NULL, altexp.assay = "logcounts", extras = list(), ... ) runMultiUMAP(x, ..., name = "MultiUMAP")
calculateMultiUMAP(x, ...) ## S4 method for signature 'ANY' calculateMultiUMAP(x, ..., metric = "euclidean") ## S4 method for signature 'SummarizedExperiment' calculateMultiUMAP(x, assays, extras = list(), ...) ## S4 method for signature 'SingleCellExperiment' calculateMultiUMAP( x, assays = NULL, dimreds = NULL, altexps = NULL, altexp.assay = "logcounts", extras = list(), ... ) runMultiUMAP(x, ..., name = "MultiUMAP")
x |
For Alternatively, a SummarizedExperiment containing relevant matrices in its assays. Alternatively, a SingleCellExperiment containing relevant matrices in its assays, |
... |
For the generic, further arguments to pass to specific methods. For the ANY method, further arguments to pass to For the SummarizedExperiment and SingleCellExperiment methods, and for |
metric |
Character vector specifying the type of distance to use for each matrix in |
assays |
A character or integer vector of assays to extract and transpose for use in the UMAP. For the SingleCellExperiment, this argument can be missing, in which case no assays are used. |
extras |
A list of further matrices of similar structure to those matrices in a list-like |
dimreds |
A character or integer vector of |
altexps |
A character or integer vector of |
altexp.assay |
A character or integer vector specifying the assay to extract from alternative experiments, when |
name |
String specifying the name of the |
These functions serve as convenience wrappers around umap
for multi-modal analysis.
The idea is that each input matrix in x
corresponds to data for a different mode.
A typical example would consist of the PC coordinates generated from gene expression counts,
plus the log-abundance matrix for ADT counts from CITE-seq experiments;
one might also include matrices of transformed intensities from indexed FACS, to name some more possibilities.
Roughly speaking, the idea is to identify nearest neighbors within each mode to construct the simplicial sets. Integration of multiple modes is performed by intersecting the sets to obtain a single graph, which is used in the rest of the UMAP algorithm. By performing an intersection, we focus on relationships between cells that are consistently neighboring across all the modes, thus providing greater resolution of differences at any mode. The neighbor search within each mode also avoids difficulties with quantitative comparisons of distances between modes.
The most obvious use of this function is to generate a low-dimensional embedding for visualization.
However, users can also set n_components
to a higher value (e.g., 10-20) to retain more information for downstream steps like clustering.
Do, however, remember to set the seed appropriately.
By default, all modes use the distance metric of metric
to construct the simplicial sets within each mode.
However, it is possible to vary this by supplying a vector of metrics, e.g., "euclidean"
for the first matrix, "manhattan"
for the second.
For the SingleCellExperiment method, matrices are extracted in the order of assays, reduced dimensions and alternative experiments,
so any variation in metrics
is also assumed to follow this order.
For calculateMultiUMAP
, a numeric matrix containing the low-dimensional UMAP embedding.
For runMultiUMAP
, x
is returned with a MultiUMAP
field in its reducedDims
.
Aaron Lun
runUMAP
, for the more straightforward application of UMAP.
# Mocking up a gene expression + ADT dataset: library(scater) exprs_sce <- mockSCE() exprs_sce <- logNormCounts(exprs_sce) exprs_sce <- runPCA(exprs_sce) adt_sce <- mockSCE(ngenes=20) adt_sce <- logNormCounts(adt_sce) altExp(exprs_sce, "ADT") <- adt_sce # Running a multimodal analysis using PCs for expression # and log-counts for the ADTs. Annoyingly, have to prefix # this for the time being to distinguish from the scater generic. exprs_sce <- mumosa::runMultiUMAP(exprs_sce, dimreds="PCA", altexps="ADT") plotReducedDim(exprs_sce, "MultiUMAP")
# Mocking up a gene expression + ADT dataset: library(scater) exprs_sce <- mockSCE() exprs_sce <- logNormCounts(exprs_sce) exprs_sce <- runPCA(exprs_sce) adt_sce <- mockSCE(ngenes=20) adt_sce <- logNormCounts(adt_sce) altExp(exprs_sce, "ADT") <- adt_sce # Running a multimodal analysis using PCs for expression # and log-counts for the ADTs. Annoyingly, have to prefix # this for the time being to distinguish from the scater generic. exprs_sce <- mumosa::runMultiUMAP(exprs_sce, dimreds="PCA", altexps="ADT") plotReducedDim(exprs_sce, "MultiUMAP")