Title: | Evaluate Cellspecific Mixing |
---|---|
Description: | CellMixS provides metrics and functions to evaluate batch effects, data integration and batch effect correction in single cell trancriptome data with single cell resolution. Results can be visualized and summarised on different levels, e.g. on cell, celltype or dataset level. |
Authors: | Almut Lütge [aut, cre] |
Maintainer: | Almut Lütge <[email protected]> |
License: | GPL (>=2) |
Version: | 1.23.0 |
Built: | 2024-10-30 04:35:14 UTC |
Source: | https://github.com/bioc/CellMixS |
CellMixS provides metrics and functions to evaluate batch effects, data integration and batch effect correction in single cell trancriptome data with single cell resolution. Results can be visualized and summarised on different levels, e.g. on cell, celltype or dataset level.
In particular, CellMixS includes two main metrics:
Cellspecific mixing scores to determine the probability of random mixing
in each cell's neighbourhood. It can be assesed via the cms
function.
Local Density Factor Differences to evaluate the effect of data
integration methods on batch internal structures. It can be assesed via the
ldfDiff
function.
Almut Lütge [email protected]
Mark D Robinson [email protected]
Function to calculate a cellspecific mixing score (cms) of groups/batches.
.cmsCell( cell, group, knn, k_min = NA, batch_min = NULL, cell_min = 4, unbalanced = FALSE, sce )
.cmsCell( cell, group, knn, k_min = NA, batch_min = NULL, cell_min = 4, unbalanced = FALSE, sce )
cell |
Character. Name of the cell to calculate cms for.
Needs to be one of |
group |
Character. Name of group/batch variable.
Needs to be one of |
knn |
List with three elements. First "index" with indices of knn cells.
Second "distance" with distances to knn cells. Third a slot named by
|
k_min |
Numeric. Minimum number of knn to include. Default is NA (see Details). |
batch_min |
Numeric. Minimum number of cells per batch to include in to the AD test. If set neighbours will be included until batch_min cells from each batch are present. |
cell_min |
Numeric. Minimum number of cells from each group to be included into the AD test. Should be > 4 to make 'ad.test' working. |
unbalanced |
Boolean. If True neighbourhoods with only one batch present will be set to NA. This way they are not included into any summaries or smoothening. |
sce |
A |
The cms function tests the hypothesis, that group-specific distance
distributions of knn cells have the same underlying unspecified distribution.
It performs Anderson-Darling tests as implemented in the
kSamples package
.
In default the function uses all distances and group label defined in knn.
If k_min
is specified, the first local minimum of the overall distance
distribution with at least kmin cells is used. This can be used to adapt to
the local structure of the datatset e.g. prevent cells from a distinct
different cluster to be included.
A p.value as resulting from the ad.test.
Other helper functions:
.defineSubspace()
,
.filterKnn()
,
.filterLocMin()
,
.ldfKnn()
,
.smoothCms()
Helper function for ldfSce and cms to define or recalculate the subspace for analysis.
.defineSubspace(sce, assay_name, dim_red, n_dim)
.defineSubspace(sce, assay_name, dim_red, n_dim)
sce |
A |
assay_name |
Character. Name of the assay to use for PCA. Only relevant
if no existing 'dim_red' is provided. Must be one of
|
dim_red |
Character. Name of embeddings to use as subspace. |
n_dim |
Numeric. Number of subspace elements to include to define subspace. |
Function to determine the subspace for ldfDiff
and
cms
. Checks whether the defined 'dim_red' is present.
Only if no subspace is defined or present it will perform a PCA using
runPCA
. To calculate PCA counts defined in 'assay_name' are used.
A matrix of cell embeddings with reduced dimensions as columns.
Other helper functions:
.cmsCell()
,
.filterKnn()
,
.filterLocMin()
,
.ldfKnn()
,
.smoothCms()
.filterKnn
.filterKnn(knn_cell, batch_min, group, sce)
.filterKnn(knn_cell, batch_min, group, sce)
knn_cell |
Data frame with one column "distance" and one column named by the group variable. Rows correspond to the knn cells and do not need rownames. |
batch_min |
Numeric. Minimum number of cells per batch to include. |
group |
Character. Name of group/batch variable.
Needs to be one of |
sce |
A |
data.frame with two columns (index, distance) for filtered knn cells.
Other helper functions:
.cmsCell()
,
.defineSubspace()
,
.filterLocMin()
,
.ldfKnn()
,
.smoothCms()
Function to filter knn by overall distance density distribution.
.filterLocMin(knn_cell, k_min)
.filterLocMin(knn_cell, k_min)
knn_cell |
Data frame with one column "distance" and one column named by the group variable. Rows correspond to the knn cells and do not need rownames. |
k_min |
Numeric. Minimum number of Knn to include. |
Internal function to filter cells used for cms testing to come
from a continous overall density distribution function
(similar to cluster definitions). 'filterLocMin' is only applied, if k-min
is specified as parameter in .cmsCell
or cms
.
data.frame with two columns (index, distance) for filtered knn cells.
Other helper functions:
.cmsCell()
,
.defineSubspace()
,
.filterKnn()
,
.ldfKnn()
,
.smoothCms()
Calculates the Local Density Factor as implemented in the DDoutlier
package with a predefined knn neighbourhood.
.ldfKnn(dataset, knn_object, k = k, h = 1, c = 1)
.ldfKnn(dataset, knn_object, k = k, h = 1, c = 1)
dataset |
Matrix with cell embeddings with cells as rows and reduced dimensions as cloumns. Subspace to determine LDF in. |
knn_object |
List with k-nearest neighbours (knn) as provided by
|
k |
Numeric. Number of knn used. Should correspond to |
h |
Numeric. Bandwidth for kernel functions. The greater the bandwidth, the smoother kernels and lesser weight are put on outliers. Default is 1 |
c |
Scaling constant for comparison of LDE to neighboring observations. Default is 1. |
LDF fuction modified from the DDoutlier
package.
Calculates a Local Density Estimate (LDE) and Local Density Factor (LDF) with
a gaussian kernel. Modified to use a predefined knn neighbourhood.
For ldfSce
this is essential to determine LDF after data
integration on the same set of cells.
List with two elements "LDE" and "LDF".
Other helper functions:
.cmsCell()
,
.defineSubspace()
,
.filterKnn()
,
.filterLocMin()
,
.smoothCms()
Performs weighted smoothening of cms scores
.smoothCms(knn, cms_raw, cell_names, k_min, k)
.smoothCms(knn, cms_raw, cell_names, k_min, k)
knn |
List with three elements. First "index" with indices of knn cells.
Second "distance" with distances to knn cells. Third a slot named by
|
cms_raw |
Matrix with raw cms scores for all cells specified in
|
cell_names |
Character vector with cell names corresponding to the
rownames of the list elements in |
k_min |
Numeric. Minimum number of knn to include. Default is NA (see Details). |
k |
Numeric. Number of k-nearest neighbours (knn) to use. |
Internal function to smooth cms scores. In a complete random setting cms scores are uniform distributed. To reduce the resulting random variance and enable visualization of local pattern cms scores can be smoothened assuming that within one region mixing is uniform. Generates smoothened cms scores using weigthed means of cms scores within the k-nearest neighbourhood. Reciprocal distances are used as weights.
matrix with two columns ("cms_smooth", "cms").
Other helper functions:
.cmsCell()
,
.defineSubspace()
,
.filterKnn()
,
.filterLocMin()
,
.ldfKnn()
Calculates cell-specific mixing scores based on euclidean distances within a subspace of integrated data.
cms( sce, k, group, dim_red = "PCA", assay_name = "logcounts", res_name = NULL, k_min = NA, smooth = TRUE, n_dim = 20, cell_min = 10, batch_min = NULL, unbalanced = FALSE, BPPARAM = SerialParam() )
cms( sce, k, group, dim_red = "PCA", assay_name = "logcounts", res_name = NULL, k_min = NA, smooth = TRUE, n_dim = 20, cell_min = 10, batch_min = NULL, unbalanced = FALSE, BPPARAM = SerialParam() )
sce |
A |
k |
Numeric. Number of k-nearest neighbours (knn) to use. |
group |
Character. Name of group/batch variable.
Needs to be one of |
dim_red |
Character. Name of embeddings to use as subspace for distance distributions. Default is "PCA". |
assay_name |
Character. Name of the assay to use for PCA.
Only relevant if no existing 'dim_red' is provided.
Must be one of |
res_name |
Character. Appendix of the result score's name (e.g. method used to combine batches). |
k_min |
Numeric. Minimum number of knn to include. Default is NA (see Details). |
smooth |
Logical. Indicating if cms results should be smoothened within each neighbourhood using the weigthed mean. |
n_dim |
Numeric. Number of dimensions to include to define the subspace. |
cell_min |
Numeric. Minimum number of cells from each group to be included into the AD test. |
batch_min |
Numeric. Minimum number of cells per batch to include in to the AD test. If set neighbours will be included until batch_min cells from each batch are present. |
unbalanced |
Boolean. If True neighbourhoods with only one batch present will be set to NA. This way they are not included into any summaries or smoothening. |
BPPARAM |
A BiocParallelParam object specifying whether cms scores shall be calculated in parallel. |
The cms function tests the hypothesis, that group-specific distance
distributions of knn cells have the same underlying unspecified distribution.
It performs Anderson-Darling tests as implemented in the
kSamples package
.
In default the function uses all distances and group label defined in knn.
Alternative a density based neighbourhood can be defined by specifying
k_min
. In this case the first local minimum of the overall distance
distribution with at least k_min cells is used. This can be used to adapt to
the local structure of the datatset e.g. prevent cells from a
different cluster to be included. Third the neighbourhood can be defined by
batch occurences. batch_min
specifies the minimal number of cells from
each batch that should be included to define the neighbourhood.
If 'dim_red' is not defined or default cms will calculate a PCA using
runPCA
. Results will be appended to colData(sce)
.
Names can be specified using res_name
.
If multiple cores are available cms scores can be calculated in parallel
(does not work on Windows). Parallelization can be specified using BPPARAM.
A SingleCellExperiment
with cms (and cms_smooth) within
colData.
Scholz, F. W. and Stephens, M. A. (1987). K-Sample Anderson-Darling Tests. J. Am. Stat. Assoc.
library(SingleCellExperiment) sim_list <- readRDS(system.file("extdata/sim50.rds", package = "CellMixS")) sce <- sim_list[[1]][, c(1:50)] sce_cms <- cms(sce, k = 20, group = "batch", n_dim = 2)
library(SingleCellExperiment) sim_list <- readRDS(system.file("extdata/sim50.rds", package = "CellMixS")) sce <- sim_list[[1]][, c(1:50)] sce_cms <- cms(sce, k = 20, group = "batch", n_dim = 2)
entropy
entropy( sce, group, k, dim_red = "PCA", assay_name = "logcounts", n_dim = 10, res_name = NULL )
entropy( sce, group, k, dim_red = "PCA", assay_name = "logcounts", n_dim = 10, res_name = NULL )
sce |
|
group |
Character. Name of group/batch variable.
Needs to be one of |
k |
Numeric. Number of k-nearest neighbours (knn) to use. |
dim_red |
Character. Name of embeddings to use as subspace for distance distributions. Default is "PCA". |
assay_name |
Character. Name of the assay to use for PCA.
Only relevant if no existing 'dim_red' is provided.
Must be one of |
n_dim |
Numeric. Number of dimensions to include to define the subspace. |
res_name |
Character. Appendix of the result score's name (e.g. method used to combine batches). |
The entropy function calculates the Shannon entropy of the group variable within each cell's k-nearest neighbourhood. For balanced batches a Shannon entropy close to 1 indicates high randomness and mixing. For unbalanced batches entropy should be interpreted with caution, but could work as a relative measure in a comparative setting.
A SingleCellExperiment
with the entropy score within colData.
library(SingleCellExperiment) sim_list <- readRDS(system.file("extdata/sim50.rds", package = "CellMixS")) sce <- sim_list[[1]][, c(1:15, 400:420, 16:30)] sce <- entropy(sce, "batch", k = 20)
library(SingleCellExperiment) sim_list <- readRDS(system.file("extdata/sim50.rds", package = "CellMixS")) sce <- sim_list[[1]][, c(1:15, 400:420, 16:30)] sce <- entropy(sce, "batch", k = 20)
Function to evaluate sc data integration providing a framework for different metrics. Metrics to evaluate mixing and preservance of the local/individual structure are provided.
evalIntegration( metrics, sce, group, dim_red = "PCA", assay_name = "logcounts", n_dim = 10, res_name = NULL, k = NULL, k_min = NA, smooth = TRUE, cell_min = 10, batch_min = NULL, unbalanced = FALSE, weight = TRUE, k_pos = 5, sce_pre_list = NULL, dim_combined = dim_red, assay_pre = "logcounts", n_combined = 10, BPPARAM = SerialParam() )
evalIntegration( metrics, sce, group, dim_red = "PCA", assay_name = "logcounts", n_dim = 10, res_name = NULL, k = NULL, k_min = NA, smooth = TRUE, cell_min = 10, batch_min = NULL, unbalanced = FALSE, weight = TRUE, k_pos = 5, sce_pre_list = NULL, dim_combined = dim_red, assay_pre = "logcounts", n_combined = 10, BPPARAM = SerialParam() )
metrics |
Character vector. Name of the metrics to apply. Must be one to all of 'cms', 'ldfDiff', 'isi', 'mixingMetric', 'localStructure', 'entropy'. |
sce |
|
group |
Character. Name of group/batch variable.
Needs to be one of |
dim_red |
Character. Name of embedding to use as subspace for distance distributions. Default is "PCA". |
assay_name |
Character. Name of the assay to use for PCA.
Only relevant if no existing 'dim_red' is provided.
Must be one of |
n_dim |
Numeric. Number of dimensions to include to define the subspace. |
res_name |
Character vector. Appendix of the result score's name (e.g. method used to combine batches). Needs to have the same length as metrics or NULL. |
k |
Numeric. Number of k-nearest neighbours (knn) to use. |
k_min |
Numeric. Minimum number of knn to include
(see |
smooth |
Logical. Indicating if cms results should be smoothened within each neighbourhood using the weigthed mean. Relevant for metric: 'cms'. |
cell_min |
Numeric. Minimum number of cells from each group to be included into the AD test. Should be > 4. Relevant for metric: 'cms'. |
batch_min |
Numeric. Minimum number of cells per batch to include in to the AD test. If set, neighbours will be included until batch_min cells from each batch are present. Relevant for metrics: 'cms'. |
unbalanced |
Boolean. If TRUE, neighbourhoods with only one batch present will be set to NA. This way they are not included into any summaries or smoothening. Relevant for metrics: 'cms'. |
weight |
Boolean. If TRUE, batch probabilities to calculate the isi score are weighted by the mean distance of their cells towards the cell of interest. Relevant for metrics: 'isi'. |
k_pos |
Numeric. Position of cell to be used as reference within mixing
metric. See |
sce_pre_list |
A list of |
dim_combined |
Character. Name of embeddings to use as subspace to
calculate LDF after integration. Default is |
assay_pre |
Character. Name of the assay to use for PCA.
Only relevant if no existing 'dim_red' is provided.
Must be one of |
n_combined |
Number of PCs to use in original space.
See |
BPPARAM |
A BiocParallelParam object specifying whether cms scores shall be calculated in parallel. Relevant for metric: 'cms'. |
evalIntegration is a wrapper function for different metrics to understand results of integrated single cell data sets. In general there are metrics evaluationg the *mixing* of datasets, that is, metrics that show whether there still is a bias for different datasets after integration. Furthermore there are metrics to evaluate how well the dataset internal structure has been retained, that is, metrics that show whether there has been (potentially biological) signal removed or noise added by integration.
A SingleCellExperiment
with the chosen metric's score within
colData.
Here we provide the following metrics:
Cellspecific Mixing Score. Metric that tests the hypothesis
that group-specific distance distributions of knn cells have the same
underlying unspecified distribution. The score can be interpreted as the
data's probability within an equally mixed neighbourhood according to the
batch variable (see cms
).
Inverse Simpson Index. Metric that uses the Inverse Simpson’s Index to calculate the diversification within a specified neighbourhood. The Simpson index describes the probability that two entities are taken at random from the dataset and its inverse represent the effective number of batches in a neighbourhood. The inverse Simpson index has been proposed as a diversity score for batch mixing in single cell RNAseq by Korunsky et al. They provide a distance-based neighbourhood weightening in their Lisi package.
Mixing Metric. Metric using the median position of the
kth cell from each batch within its knn as a score. The lower the better
mixed is the neighbourhood. We implemented an equivalent version to the
one in the Seurat package (See MixingMetric
and
mixMetric
.)
Shannon entropy. Metric calculating the Shannon entropy of
the batch/group variable within each cell's k-nearest neigbours.
For balanced batches the entropy is closer to 1 the higher the variables
randomness. For unbalanced batches entropy should only be used as a
relative metric in a comparative setting (See entropy
.)
Local density factor differences. Metric that determines
cell-specific changes in the Local Density Factor before and after data
integration. A metric/difference close to 0 indicates no distortion of
the previous structure (see ldfDiff
).
Local structure. Metric that compares the
intersection of knn from the same batch before and after integration
returning the average between all groups. The higher the more neighbours
were reproduced after integration. Here we implemented an equivalent
version to the one in the Seurat package
(See LocalStruct
and locStructure
).
Korsunsky I Fan J Slowikowski K Zhang F Wei K et. al. (2018). Fast, sensitive, and accurate integration of single cell data with Harmony. bioRxiv (preprint).
Stuart T Butler A Hoffman P Hafemeister C Papalexi E et. al. (2019) Comprehensive Integration of Single-Cell Data. Cell.
library(SingleCellExperiment) sim_list <- readRDS(system.file("extdata/sim50.rds", package = "CellMixS")) sce <- sim_list[[1]][, c(1:15, 300:320, 16:30)] sce_batch1 <- sce[,colData(sce)$batch == "1"] sce_batch2 <- sce[,colData(sce)$batch == "2"] pre <- list("1" = sce_batch1, "2" = sce_batch2) sce <- evalIntegration(metrics = c("cms", "mixingMetric", "isi", "entropy"), sce, "batch", k = 20) sce <- evalIntegration("ldfDiff", sce, "batch", k = 20, sce_pre_list = pre)
library(SingleCellExperiment) sim_list <- readRDS(system.file("extdata/sim50.rds", package = "CellMixS")) sce <- sim_list[[1]][, c(1:15, 300:320, 16:30)] sce_batch1 <- sce[,colData(sce)$batch == "1"] sce_batch2 <- sce[,colData(sce)$batch == "2"] pre <- list("1" = sce_batch1, "2" = sce_batch2) sce <- evalIntegration(metrics = c("cms", "mixingMetric", "isi", "entropy"), sce, "batch", k = 20) sce <- evalIntegration("ldfDiff", sce, "batch", k = 20, sce_pre_list = pre)
isi
isi( sce, group, k, dim_red = "PCA", assay_name = "logcounts", n_dim = 10, weight = TRUE, res_name = NULL )
isi( sce, group, k, dim_red = "PCA", assay_name = "logcounts", n_dim = 10, weight = TRUE, res_name = NULL )
sce |
|
group |
Character. Name of group/batch variable.
Needs to be one of |
k |
Numeric. Number of k-nearest neighbours (knn) to use. |
dim_red |
Character. Name of embeddings to use as subspace for distance distributions. Default is "PCA". |
assay_name |
Character. Name of the assay to use for PCA. Only relevant if no existing 'dim_red' is provided. |
n_dim |
Numeric. Number of dimensions to include to define the subspace. |
weight |
Boolean. If TRUE, batch probabilities to calculate the isi score are weighted by the mean distance of their cells towards the cell of interest. Relevant for metrics: 'isi'. |
res_name |
Character. Appendix of the result score's name (e.g. method used to combine batches). |
The isi function calculates the inverse Simpson index of the group
variable within each cell's k-nearest neighbourhood.
The Simpson index describes the probability that two entities are taken at
random from the dataset and its inverse represent the effective number of
batches in a neighbourhood. The inverse Simpson index has been proposed as a
diversity score for batch mixing in single cell RNAseq by Korunsky et al.
They provide a distance-based neighbourhood weightening in their Lisi package.
Here, we provide a simplified way of weightening probabilitities, if the
weight
argument is enabled.
A SingleCellExperiment
with the entropy score within colData.
Korsunsky I Fan J Slowikowski K Zhang F Wei K et. al. (2018). Fast, sensitive, and accurate integration of single cell data with Harmony. bioRxiv (preprint)
library(SingleCellExperiment) sim_list <- readRDS(system.file("extdata/sim50.rds", package = "CellMixS")) sce <- sim_list[[1]][, c(1:15, 400:420, 16:30)] sce <- isi(sce, "batch", k = 20)
library(SingleCellExperiment) sim_list <- readRDS(system.file("extdata/sim50.rds", package = "CellMixS")) sce <- sim_list[[1]][, c(1:15, 400:420, 16:30)] sce <- isi(sce, "batch", k = 20)
Determines cell-specific changes in the Local Density Factor before and after data integration.
ldfDiff( sce_pre_list, sce_combined, group, k = 75, dim_red = "PCA", dim_combined = dim_red, assay_pre = "logcounts", assay_combined = "logcounts", n_dim = 20, res_name = NULL )
ldfDiff( sce_pre_list, sce_combined, group, k = 75, dim_red = "PCA", dim_combined = dim_red, assay_pre = "logcounts", assay_combined = "logcounts", n_dim = 20, res_name = NULL )
sce_pre_list |
A list of |
sce_combined |
A |
group |
Character. Name of group/batch variable that separates elements
of |
k |
Numeric. Number of k-nearest neighbours (knn) to use. |
dim_red |
Character. Name of embeddings to use as subspace to calculate LDF before integration. Default is "PCA". |
dim_combined |
Character. Name of embeddings to use as subspace to
calculate LDF after integration. Default is |
assay_pre |
Character. Name of the assay to use for PCA.
Only relevant if no existing 'dim_red' is provided.
Must be one of |
assay_combined |
Character. Name of the assay to use for PCA.
Only relevant if no existing 'dim_red' is provided.
Must be one of |
n_dim |
Numeric. Number of PCs to include to define subspaces. |
res_name |
Character. Appendix of the result score's name (e.g. method used to combine batches). Used to specify result name for more than one run on the same input. |
The ldfDiff function calculates differences in LDF for each element
in sce_pre_list
and their corresponding cells in sce_combined
using ldfSce
.
If 'dim_red' is not defined a PCA will be calculated using runPCA
.
In this case 'assay_pre' need to refer to the data slot that shall define
the subspace. Similar refer 'dim-combined' and 'assay_combined' to the
integrated subspace or to the resp. "corrected" count data slot.
'k' can be used to define the level of local structure that is tested.
The smaller 'k' the more focus is on detailed structures,
while a large k will tets overall changes.
A SingleCellExperiment
object.
Latecki, Longin Jan and Lazarevic, Aleksandar and Pokrajac, Dragoljub (2007). Outlier Detection with Kernel Density Functions. Mach. Learn. Data Min. Pattern Recognit.. Springer Berlin Heidelberg.
Other ldf functions:
ldfSce()
library(SingleCellExperiment) sim_list <- readRDS(system.file("extdata/sim50.rds", package = "CellMixS")) sce <- sim_list[["batch20"]][, c(1:50, 300:350)] sce_batch1 <- sce[,colData(sce)$batch == "1"] sce_batch2 <- sce[,colData(sce)$batch == "2"] sce_pre_list <- list("1" = sce_batch1, "2" = sce_batch2) sce_ldf <- ldfDiff(sce_pre_list, sce, k = 10, group = "batch", dim_combined = "MNN", n_dim = 2)
library(SingleCellExperiment) sim_list <- readRDS(system.file("extdata/sim50.rds", package = "CellMixS")) sce <- sim_list[["batch20"]][, c(1:50, 300:350)] sce_batch1 <- sce[,colData(sce)$batch == "1"] sce_batch2 <- sce[,colData(sce)$batch == "2"] sce_pre_list <- list("1" = sce_batch1, "2" = sce_batch2) sce_ldf <- ldfDiff(sce_pre_list, sce, k = 10, group = "batch", dim_combined = "MNN", n_dim = 2)
Determines cell-specific changes in the Local Density Factor before and after data integration for one specific group.
ldfSce( sce_name, sce_pre_list, sce_combined, group, k = 75, dim_red = "PCA", dim_combined = dim_red, assay_pre = "logcounts", assay_combined = "logcounts", n_dim = 20 )
ldfSce( sce_name, sce_pre_list, sce_combined, group, k = 75, dim_red = "PCA", dim_combined = dim_red, assay_pre = "logcounts", assay_combined = "logcounts", n_dim = 20 )
sce_name |
Character. Name of the element in |
sce_pre_list |
A list of |
sce_combined |
A |
group |
Character. Name of group/batch variable that separates elements
of |
k |
Numeric. Number of k-nearest neighbours (knn) to use. |
dim_red |
Character. Name of embeddings to use as subspace to calculate LDF before integration. Default is "PCA". |
dim_combined |
Character. Name of embeddings to use as subspace to
calculate LDF after integration. Default is |
assay_pre |
Character. Name of the assay to use for PCA.
Only relevant if no existing 'dim_red' is provided.
Must be one of |
assay_combined |
Character. Name of the assay to use for PCA.
Only relevant if no existing 'dim_red' is provided.
Must be one of |
n_dim |
Numeric. Number of PCs to include to define subspaces. |
The ldfSce function calculates differences in LDF for one specified
element in sce_pre_list
and their corresponding cells in
sce_combined
. If 'dim_red' is not defined a PCA will be calculated
using runPCA
. In this case 'assay_pre' need to refer to the data slot
that shall define the subspace.
Similar refer 'dim-combined' and 'assay_combined' to the integrated subspace
or to the resp. "corrected" count data slot.
'k' can be used to define the level of local structure that is tested.
The smaller 'k' the more focus is on detailed structures, while a large k
will tets overall changes.
K-nearest neighbours (knn) are determined in the subspaces before integration
defined by 'dim_red'.
The same set of knn are used to determine LDF before and after integration.
A data.frame with difference in LDF as column named "diff_ldf".
Latecki, Longin Jan and Lazarevic, Aleksandar and Pokrajac, Dragoljub (2007). Outlier Detection with Kernel Density Functions. Mach. Learn. Data Min. Pattern Recognit.. Springer Berlin Heidelberg.
Other ldf functions:
ldfDiff()
library(SingleCellExperiment) sim_list <- readRDS(system.file("extdata/sim50.rds", package = "CellMixS")) sce <- sim_list[["batch20"]][, c(1:50, 300:350)] sce_batch1 <- sce[,colData(sce)$batch == "1"] sce_pre_list <- list("1" = sce_batch1) ldf_1 <- ldfSce("1", sce_pre_list, sce, k = 10, group = "batch", dim_combined = "MNN", n_dim = 5)
library(SingleCellExperiment) sim_list <- readRDS(system.file("extdata/sim50.rds", package = "CellMixS")) sce <- sim_list[["batch20"]][, c(1:50, 300:350)] sce_batch1 <- sce[,colData(sce)$batch == "1"] sce_pre_list <- list("1" = sce_batch1) ldf_1 <- ldfSce("1", sce_pre_list, sce, k = 10, group = "batch", dim_combined = "MNN", n_dim = 5)
locStructure
locStructure( sce, group, dim_combined, k = 100, dim_red = "PCA", assay_name = "logcounts", n_dim = 10, n_combined = 10, res_name = NULL )
locStructure( sce, group, dim_combined, k = 100, dim_red = "PCA", assay_name = "logcounts", n_dim = 10, n_combined = 10, res_name = NULL )
sce |
|
group |
Character. Name of group/batch variable.
Needs to be one of |
dim_combined |
Charactyer. Name of the reduced dimensional
representation of the integrated data.
Needs to be one of |
k |
Numeric. Number of k-nearest neighbours (knn) to use. |
dim_red |
Character. Name of embeddings to calculate neighbourhoods before integration. Default is "PCA". |
assay_name |
Character. Name of the assay to use for PCA of the original (not integrated) data. Should not refer to "corrected" counts. |
n_dim |
Numeric. Number of dimensions to include for the original data. |
n_combined |
Numeric. Number of dimensions to include for the integrated data. |
res_name |
Character. Appendix of the result score's name (e.g. method used to combine batches). |
The locStructure function implements the localStructure function
from Seurat (See LocalStruct
. For each group it
calculates the k nearest neighbour within PCA space before integration and
compares it to the knn within the reduced dimensional representation after
integration. The score represents the proportion of overlapping neighbours.
The LocalStruct
function is based on the
RunPCA
function, while here runPCA
is used. This can cause small deviance from the
LocalStruct
function, but overall these functions are
equivalent.
A SingleCellExperiment
with the mixing metric within colData.
Stuart T Butler A Hoffman P Hafemeister C Papalexi E et. al. (2019) Comprehensive Integration of Single-Cell Data. Cell.
library(SingleCellExperiment) sim_list <- readRDS(system.file("extdata/sim50.rds", package = "CellMixS")) sce <- sim_list[["batch20"]][, c(1:50, 300:350)] sce <- locStructure(sce, "batch", "MNN", k = 20, assay_name = "counts")
library(SingleCellExperiment) sim_list <- readRDS(system.file("extdata/sim50.rds", package = "CellMixS")) sce <- sim_list[["batch20"]][, c(1:50, 300:350)] sce <- locStructure(sce, "batch", "MNN", k = 20, assay_name = "counts")
mixMetric
mixMetric( sce, group, k = 300, dim_red = "PCA", assay_name = "logcounts", n_dim = 10, k_pos = 5, res_name = NULL )
mixMetric( sce, group, k = 300, dim_red = "PCA", assay_name = "logcounts", n_dim = 10, k_pos = 5, res_name = NULL )
sce |
|
group |
Character. Name of group/batch variable.
Needs to be one of |
k |
Numeric. Number of k-nearest neighbours (knn) to use. |
dim_red |
Character. Name of embeddings to use as subspace for distance distributions. Default is "PCA". |
assay_name |
Character. Name of the assay to use for PCA. Only relevant if no existing 'dim_red' is provided. |
n_dim |
Numeric. Number of dimensions to include to define the subspace. |
k_pos |
Position of the cell, which rank to use for scoring, defaults to 5. |
res_name |
Character. Appendix of the result score's name (e.g. method used to combine batches). |
The mixMetric function implements the mixingMetric function from
Seurat (See MixingMetric
. It takes the median rank of
the '__k_pos__ neighbour from each batch as estimation for the data's entropy
according to the batch variable. The same result can be assesed using the
MixingMetric
function and a seurat object from the
__Seurat__ package.
A SingleCellExperiment
with the mixing metric within colData.
Stuart T Butler A Hoffman P Hafemeister C Papalexi E et. al. (2019) Comprehensive Integration of Single-Cell Data. Cell.
library(SingleCellExperiment) sim_list <- readRDS(system.file("extdata/sim50.rds", package = "CellMixS")) sce <- sim_list[[1]][, c(1:15, 400:420, 16:30)] sce <- mixMetric(sce, "batch", k = 20)
library(SingleCellExperiment) sim_list <- readRDS(system.file("extdata/sim50.rds", package = "CellMixS")) sce <- sim_list[[1]][, c(1:15, 400:420, 16:30)] sce <- mixMetric(sce, "batch", k = 20)
Creates summary plots of metric scores for different groups/cluster.
visCluster(sce_cms, cluster_var, metric_var = "cms", violin = FALSE)
visCluster(sce_cms, cluster_var, metric_var = "cms", violin = FALSE)
sce_cms |
A |
cluster_var |
Character. Name of the factor level variable to summarize metric scores on. |
metric_var |
Character Name of the metric scores to use. Default is "cms". |
violin |
A logical. If true violin plots are plotted, while the default (FALSE) will plot ridge plots. |
Plots summarized metric scores. This function is intended to visualize and compare metric scores among clusters or other dataset variables spcified in 'cluster_var'.
a ggplot
object.
Other visualize functions:
visGroup()
library(SingleCellExperiment) sim_list <- readRDS(system.file("extdata/sim50.rds", package = "CellMixS")) sce <- sim_list[[1]][, c(1:30,300:320)] sce_cms <- cms(sce, "batch", k = 20, n_dim = 2) visCluster(sce_cms, "batch")
library(SingleCellExperiment) sim_list <- readRDS(system.file("extdata/sim50.rds", package = "CellMixS")) sce <- sim_list[[1]][, c(1:30,300:320)] sce_cms <- cms(sce, "batch", k = 20, n_dim = 2) visCluster(sce_cms, "batch")
Plot group label in a reduced dimensional plot.
visGroup(sce, group, dim_red = "TSNE")
visGroup(sce, group, dim_red = "TSNE")
sce |
A |
group |
Character. Name of group/batch variable.
Needs to be one of |
dim_red |
Character. Name of embeddings to use as subspace for plotting. Default is "TSNE". |
Plots a reduced dimension plot colored by group parameter.
The dimesion reduction embedding can be specified, but only tsne embeddings
will automatically be computed by runTSNE
. Embeddings from data
integration methods (e.g. mnn.correct) can be used as long as they are
specified in reducedDimNames(sce)
.
a ggplot
object.
Other visualize functions:
visCluster()
library(SingleCellExperiment) sim_list <- readRDS(system.file("extdata/sim50.rds", package = "CellMixS")) sce <- sim_list[[1]][, c(1:50, 300:350)] visGroup(sce, "batch")
library(SingleCellExperiment) sim_list <- readRDS(system.file("extdata/sim50.rds", package = "CellMixS")) sce <- sim_list[[1]][, c(1:50, 300:350)] visGroup(sce, "batch")
Plot pvalue histograms of metric score distributions
visHist( res_object, metric = "cms", prefix = TRUE, n_col = 1, metric_prefix = NULL )
visHist( res_object, metric = "cms", prefix = TRUE, n_col = 1, metric_prefix = NULL )
res_object |
|
metric |
Character vector. Specify names of |
prefix |
Boolean. Is ‘metric' used to specify column’s prefix(true) or complete column names (False). |
n_col |
Numeric. Number of columns of the pval histogram. |
metric_prefix |
Former parameter to define prefix of the metric to be plotted. Will stop and ask for the new syntax. |
Plots metric score distribution similar to a pvalue histogram
distribution. Without dataset-specific bias, cms scores should be approx.
flat distributed. If 'res_object' is a matrix or data.frame,
it will create a histogram for each column. If 'res_object' is a
SingleCellExperiment
object, it will create a histogram of all
colData(res_object)
that start with or are specified in 'metric'.
a ggplot
object.
Other visualize metric functions:
visMetric()
,
visOverview()
library(SingleCellExperiment) sim_list <- readRDS(system.file("extdata/sim50.rds", package = "CellMixS")) sce <- sim_list[[1]][, c(1:50)] sce_cms <- cms(sce, "batch", k = 20, n_dim = 2) visHist(sce_cms)
library(SingleCellExperiment) sim_list <- readRDS(system.file("extdata/sim50.rds", package = "CellMixS")) sce <- sim_list[[1]][, c(1:50)] sce_cms <- cms(sce, "batch", k = 20, n_dim = 2) visHist(sce_cms)
Creates a summary plot of metric scores (for different integration methods).
visIntegration( res_object, metric = "cms", prefix = TRUE, violin = FALSE, metric_name = "metric", metric_prefix = NULL )
visIntegration( res_object, metric = "cms", prefix = TRUE, violin = FALSE, metric_name = "metric", metric_prefix = NULL )
res_object |
|
metric |
Character vector. Specify names of |
prefix |
Boolean. Is ‘metric' used to specify column’s prefix(true) or complete column names (False). |
violin |
A logical. If true violin plots are plotted, while the default (FALSE) will plot ridge plots. |
metric_name |
Character. Name of the score metric. |
metric_prefix |
Former parameter to define prefix of the metric to be plotted. Will stop and ask for the new syntax. |
Plots summarized cms scores from an SingleCellExperiment
object, list or dataframe. This function is intended to visualize and
compare different methods and views of the same dataset, not to compare
different datasets.
a ggplot
object.
visCluster
, ggridges
library(SingleCellExperiment) sim_list <- readRDS(system.file("extdata/sim50.rds", package = "CellMixS")) sce <- sim_list[["batch20"]][, c(1:30,300:320)] sce_mnn <- cms(sce,"batch", k = 20, dim_red = "MNN", res_name = "MNN", n_dim = 2) visIntegration(sce_mnn, metric = "cms.", violin = TRUE)
library(SingleCellExperiment) sim_list <- readRDS(system.file("extdata/sim50.rds", package = "CellMixS")) sce <- sim_list[["batch20"]][, c(1:30,300:320)] sce_mnn <- cms(sce,"batch", k = 20, dim_red = "MNN", res_name = "MNN", n_dim = 2) visIntegration(sce_mnn, metric = "cms.", violin = TRUE)
Plot metric scores in a reduced dimensional plot.
visMetric(sce_cms, metric_var = "cms", dim_red = "TSNE", log10_val = FALSE)
visMetric(sce_cms, metric_var = "cms", dim_red = "TSNE", log10_val = FALSE)
sce_cms |
A |
metric_var |
Character Name of the metric scores to use. Default is "cms". |
dim_red |
Character. Name of embeddings to use as subspace for plotting. Default is "TSNE". |
log10_val |
Logical. Indicating if -log10(metric) should be plotted. |
Plots a reduced dimension plot colored by metric scores.
The dimension reduction embedding can be specified, but only tsne embeddings
will automatically be computed using runTSNE
. Embeddings from data
integration methods (e.g. mnn.correct) can be used as long as they are
present in reducedDimNames(sce)
.
a ggplot
object.
Other visualize metric functions:
visHist()
,
visOverview()
library(SingleCellExperiment) sim_list <- readRDS(system.file("extdata/sim50.rds", package = "CellMixS")) sce <- sim_list[[1]][, c(1:30, 300:320)] sce_cms <- cms(sce, "batch", k = 20, n_dim = 2) visMetric(sce_cms)
library(SingleCellExperiment) sim_list <- readRDS(system.file("extdata/sim50.rds", package = "CellMixS")) sce <- sim_list[[1]][, c(1:30, 300:320)] sce_cms <- cms(sce, "batch", k = 20, n_dim = 2) visMetric(sce_cms)
Plot an overview of metric results, group label and any colData variable in a reduced dimensional representation.
visOverview( sce_cms, group, metric = "cms", prefix = TRUE, dim_red = "TSNE", log10_val = FALSE, other_var = NULL, metric_prefix = NULL )
visOverview( sce_cms, group, metric = "cms", prefix = TRUE, dim_red = "TSNE", log10_val = FALSE, other_var = NULL, metric_prefix = NULL )
sce_cms |
A |
group |
Character. Name of group/batch variable. Needs to be one of
|
metric |
Character vector. Specify names of |
prefix |
Boolean. Is ‘metric' used to specify column’s prefix(true) or complete column names (False). |
dim_red |
Character. Name of embeddings to use as subspace for plotting. Default is "TSNE". |
log10_val |
Logical. Indicating if -log10(metric) should be plotted. |
other_var |
Character string. Name(s) of other variables to be plotted
asided. Need correspond to one of |
metric_prefix |
Former parameter to define prefix of the metric to be plotted. Will stop and ask for the new syntax. |
Plots reduced dimensions of cells colored by group variable and
metric score. If 'red_dim' is not defined in reducedDimNames(sce)
a
tsne is calculated using runTSNE
. Other color label as celltype label
or smoothened scores can be plotted aside. Embeddings from data integration
methods (e.g. mnn.correct) can be used if they are specified in
reducedDimNames(sce)
.
a ggplot
object.
Other visualize metric functions:
visHist()
,
visMetric()
library(SingleCellExperiment) sim_list <- readRDS(system.file("extdata/sim50.rds", package = "CellMixS")) sce <- sim_list[[1]][, c(1:30, 300:330)] sce_cms <- cms(sce, "batch", k = 20, n_dim = 2) visOverview(sce_cms, "batch", other_var = "batch")
library(SingleCellExperiment) sim_list <- readRDS(system.file("extdata/sim50.rds", package = "CellMixS")) sce <- sim_list[[1]][, c(1:30, 300:330)] sce_cms <- cms(sce, "batch", k = 20, n_dim = 2) visOverview(sce_cms, "batch", other_var = "batch")