Package 'CellMixS' reference manual

Title:	Evaluate Cellspecific Mixing
Description:	CellMixS provides metrics and functions to evaluate batch effects, data integration and batch effect correction in single cell trancriptome data with single cell resolution. Results can be visualized and summarised on different levels, e.g. on cell, celltype or dataset level.
Authors:	Almut Lütge [aut, cre]
Maintainer:	Almut Lütge <[email protected]>
License:	GPL (>=2)
Version:	1.23.0
Built:	2025-03-29 05:51:02 UTC
Source:	https://github.com/bioc/CellMixS

Toolbox to explore batch effects and data integration in scRNA data.

Description

CellMixS provides metrics and functions to evaluate batch effects, data integration and batch effect correction in single cell trancriptome data with single cell resolution. Results can be visualized and summarised on different levels, e.g. on cell, celltype or dataset level.

Details

In particular, CellMixS includes two main metrics: Cellspecific mixing scores to determine the probability of random mixing in each cell's neighbourhood. It can be assesed via the cms function. Local Density Factor Differences to evaluate the effect of data integration methods on batch internal structures. It can be assesed via the ldfDiff function.

Author(s)

Almut Lütge [email protected]

Mark D Robinson [email protected]

.cmsCell

Description

Function to calculate a cellspecific mixing score (cms) of groups/batches.

Usage

.cmsCell(
  cell,
  group,
  knn,
  k_min = NA,
  batch_min = NULL,
  cell_min = 4,
  unbalanced = FALSE,
  sce
)
.cmsCell(
  cell,
  group,
  knn,
  k_min = NA,
  batch_min = NULL,
  cell_min = 4,
  unbalanced = FALSE,
  sce
)

Arguments

`cell`	Character. Name of the cell to calculate cms for. Needs to be one of `rownames(knn)`.
`group`	Character. Name of group/batch variable. Needs to be one of `names(knn)`.
`knn`	List with three elements. First "index" with indices of knn cells. Second "distance" with distances to knn cells. Third a slot named by `group` variable with group level of knn cells.
`k_min`	Numeric. Minimum number of knn to include. Default is NA (see Details).
`batch_min`	Numeric. Minimum number of cells per batch to include in to the AD test. If set neighbours will be included until batch_min cells from each batch are present.
`cell_min`	Numeric. Minimum number of cells from each group to be included into the AD test. Should be > 4 to make 'ad.test' working.
`unbalanced`	Boolean. If True neighbourhoods with only one batch present will be set to NA. This way they are not included into any summaries or smoothening.
`sce`	A `SingleCellExperiment` object with the combined data.

Details

Value

A p.value as resulting from the ad.test.

.defineSubspace

Description

Helper function for ldfSce and cms to define or recalculate the subspace for analysis.

Usage

.defineSubspace(sce, assay_name, dim_red, n_dim)
.defineSubspace(sce, assay_name, dim_red, n_dim)

Arguments

`sce`	A `SingleCellExperiment` object with the data to define the subspace.
`assay_name`	Character. Name of the assay to use for PCA. Only relevant if no existing 'dim_red' is provided. Must be one of `names(assays(sce))`.
`dim_red`	Character. Name of embeddings to use as subspace.
`n_dim`	Numeric. Number of subspace elements to include to define subspace.

Details

Function to determine the subspace for ldfDiff and cms. Checks whether the defined 'dim_red' is present. Only if no subspace is defined or present it will perform a PCA using runPCA. To calculate PCA counts defined in 'assay_name' are used.

Value

A matrix of cell embeddings with reduced dimensions as columns.

.filterKnn

Description

.filterKnn

Usage

.filterKnn(knn_cell, batch_min, group, sce)
.filterKnn(knn_cell, batch_min, group, sce)

Arguments

`knn_cell`	Data frame with one column "distance" and one column named by the group variable. Rows correspond to the knn cells and do not need rownames.
`batch_min`	Numeric. Minimum number of cells per batch to include.
`group`	Character. Name of group/batch variable. Needs to be one of `names(knn)`.
`sce`	A `SingleCellExperiment` object with the combined data.

Value

data.frame with two columns (index, distance) for filtered knn cells.

.filterLocMin

Description

Function to filter knn by overall distance density distribution.

Usage

.filterLocMin(knn_cell, k_min)
.filterLocMin(knn_cell, k_min)

Arguments

`knn_cell`	Data frame with one column "distance" and one column named by the group variable. Rows correspond to the knn cells and do not need rownames.
`k_min`	Numeric. Minimum number of Knn to include.

Details

Internal function to filter cells used for cms testing to come from a continous overall density distribution function (similar to cluster definitions). 'filterLocMin' is only applied, if k-min is specified as parameter in .cmsCell or cms.

Value

data.frame with two columns (index, distance) for filtered knn cells.

.ldfKnn

Description

Calculates the Local Density Factor as implemented in the DDoutlier package with a predefined knn neighbourhood.

Usage

.ldfKnn(dataset, knn_object, k = k, h = 1, c = 1)
.ldfKnn(dataset, knn_object, k = k, h = 1, c = 1)

Arguments

`dataset`	Matrix with cell embeddings with cells as rows and reduced dimensions as cloumns. Subspace to determine LDF in.
`knn_object`	List with k-nearest neighbours (knn) as provided by `get.knn` from the `FNN` package. First element named "indices" contains indices of knn in `dataset`. Second element named "distance" contains distances of knn in `dataset`. Third element named "cell_name" contains rownames of knn in `dataset`.
`k`	Numeric. Number of knn used. Should correspond to `knn_object`.
`h`	Numeric. Bandwidth for kernel functions. The greater the bandwidth, the smoother kernels and lesser weight are put on outliers. Default is 1
`c`	Scaling constant for comparison of LDE to neighboring observations. Default is 1.

Details

LDF fuction modified from the DDoutlier package. Calculates a Local Density Estimate (LDE) and Local Density Factor (LDF) with a gaussian kernel. Modified to use a predefined knn neighbourhood. For ldfSce this is essential to determine LDF after data integration on the same set of cells.

Value

List with two elements "LDE" and "LDF".

.smoothCms

Description

Performs weighted smoothening of cms scores

Usage

.smoothCms(knn, cms_raw, cell_names, k_min, k)
.smoothCms(knn, cms_raw, cell_names, k_min, k)

Arguments

`knn`	List with three elements. First "index" with indices of knn cells. Second "distance" with distances to knn cells. Third a slot named by `group` variable with group level of knn cells.
`cms_raw`	Matrix with raw cms scores for all cells specified in `cell_names` and `knn`. Colnames need to be "cms.
`cell_names`	Character vector with cell names corresponding to the rownames of the list elements in `knn` and `rownames(cms_raw)`.
`k_min`	Numeric. Minimum number of knn to include. Default is NA (see Details).
`k`	Numeric. Number of k-nearest neighbours (knn) to use.

Details

Internal function to smooth cms scores. In a complete random setting cms scores are uniform distributed. To reduce the resulting random variance and enable visualization of local pattern cms scores can be smoothened assuming that within one region mixing is uniform. Generates smoothened cms scores using weigthed means of cms scores within the k-nearest neighbourhood. Reciprocal distances are used as weights.

Value

matrix with two columns ("cms_smooth", "cms").

cms

Description

Calculates cell-specific mixing scores based on euclidean distances within a subspace of integrated data.

Usage

cms(
  sce,
  k,
  group,
  dim_red = "PCA",
  assay_name = "logcounts",
  res_name = NULL,
  k_min = NA,
  smooth = TRUE,
  n_dim = 20,
  cell_min = 10,
  batch_min = NULL,
  unbalanced = FALSE,
  BPPARAM = SerialParam()
)
cms(
  sce,
  k,
  group,
  dim_red = "PCA",
  assay_name = "logcounts",
  res_name = NULL,
  k_min = NA,
  smooth = TRUE,
  n_dim = 20,
  cell_min = 10,
  batch_min = NULL,
  unbalanced = FALSE,
  BPPARAM = SerialParam()
)

Arguments

`sce`	A `SingleCellExperiment` object with the combined data.
`k`	Numeric. Number of k-nearest neighbours (knn) to use.
`group`	Character. Name of group/batch variable. Needs to be one of `names(colData(sce))`
`dim_red`	Character. Name of embeddings to use as subspace for distance distributions. Default is "PCA".
`assay_name`	Character. Name of the assay to use for PCA. Only relevant if no existing 'dim_red' is provided. Must be one of `names(assays(sce))`. Default is "logcounts".
`res_name`	Character. Appendix of the result score's name (e.g. method used to combine batches).
`k_min`	Numeric. Minimum number of knn to include. Default is NA (see Details).
`smooth`	Logical. Indicating if cms results should be smoothened within each neighbourhood using the weigthed mean.
`n_dim`	Numeric. Number of dimensions to include to define the subspace.
`cell_min`	Numeric. Minimum number of cells from each group to be included into the AD test.
`batch_min`	Numeric. Minimum number of cells per batch to include in to the AD test. If set neighbours will be included until batch_min cells from each batch are present.
`unbalanced`	Boolean. If True neighbourhoods with only one batch present will be set to NA. This way they are not included into any summaries or smoothening.
`BPPARAM`	A BiocParallelParam object specifying whether cms scores shall be calculated in parallel.

Details

The cms function tests the hypothesis, that group-specific distance distributions of knn cells have the same underlying unspecified distribution. It performs Anderson-Darling tests as implemented in the kSamples package. In default the function uses all distances and group label defined in knn. Alternative a density based neighbourhood can be defined by specifying k_min. In this case the first local minimum of the overall distance distribution with at least k_min cells is used. This can be used to adapt to the local structure of the datatset e.g. prevent cells from a different cluster to be included. Third the neighbourhood can be defined by batch occurences. batch_min specifies the minimal number of cells from each batch that should be included to define the neighbourhood. If 'dim_red' is not defined or default cms will calculate a PCA using runPCA. Results will be appended to colData(sce). Names can be specified using res_name. If multiple cores are available cms scores can be calculated in parallel (does not work on Windows). Parallelization can be specified using BPPARAM.

Value

A SingleCellExperiment with cms (and cms_smooth) within colData.

References

Scholz, F. W. and Stephens, M. A. (1987). K-Sample Anderson-Darling Tests. J. Am. Stat. Assoc.

Examples

library(SingleCellExperiment)
sim_list <- readRDS(system.file("extdata/sim50.rds", package = "CellMixS"))
sce <- sim_list[[1]][, c(1:50)]

sce_cms <- cms(sce, k = 20, group = "batch", n_dim = 2)

library(SingleCellExperiment)
sim_list <- readRDS(system.file("extdata/sim50.rds", package = "CellMixS"))
sce <- sim_list[[1]][, c(1:50)]

sce_cms <- cms(sce, k = 20, group = "batch", n_dim = 2)

entropy

Description

entropy

Usage

entropy(
  sce,
  group,
  k,
  dim_red = "PCA",
  assay_name = "logcounts",
  n_dim = 10,
  res_name = NULL
)
entropy(
  sce,
  group,
  k,
  dim_red = "PCA",
  assay_name = "logcounts",
  n_dim = 10,
  res_name = NULL
)

Arguments

`sce`	`SingleCellExperiment` object, with the integrated data.
`group`	Character. Name of group/batch variable. Needs to be one of `names(colData(sce))`.
`k`	Numeric. Number of k-nearest neighbours (knn) to use.
`dim_red`	Character. Name of embeddings to use as subspace for distance distributions. Default is "PCA".
`assay_name`	Character. Name of the assay to use for PCA. Only relevant if no existing 'dim_red' is provided. Must be one of `names(assays(sce))`. Default is "logcounts".
`n_dim`	Numeric. Number of dimensions to include to define the subspace.
`res_name`	Character. Appendix of the result score's name (e.g. method used to combine batches).

Details

The entropy function calculates the Shannon entropy of the group variable within each cell's k-nearest neighbourhood. For balanced batches a Shannon entropy close to 1 indicates high randomness and mixing. For unbalanced batches entropy should be interpreted with caution, but could work as a relative measure in a comparative setting.

Value

A SingleCellExperiment with the entropy score within colData.

Examples

library(SingleCellExperiment)
sim_list <- readRDS(system.file("extdata/sim50.rds", package = "CellMixS"))
sce <- sim_list[[1]][, c(1:15, 400:420, 16:30)]

sce <- entropy(sce, "batch", k = 20)

library(SingleCellExperiment)
sim_list <- readRDS(system.file("extdata/sim50.rds", package = "CellMixS"))
sce <- sim_list[[1]][, c(1:15, 400:420, 16:30)]

sce <- entropy(sce, "batch", k = 20)

evalIntegration

Description

Function to evaluate sc data integration providing a framework for different metrics. Metrics to evaluate mixing and preservance of the local/individual structure are provided.

Usage

evalIntegration(
  metrics,
  sce,
  group,
  dim_red = "PCA",
  assay_name = "logcounts",
  n_dim = 10,
  res_name = NULL,
  k = NULL,
  k_min = NA,
  smooth = TRUE,
  cell_min = 10,
  batch_min = NULL,
  unbalanced = FALSE,
  weight = TRUE,
  k_pos = 5,
  sce_pre_list = NULL,
  dim_combined = dim_red,
  assay_pre = "logcounts",
  n_combined = 10,
  BPPARAM = SerialParam()
)
evalIntegration(
  metrics,
  sce,
  group,
  dim_red = "PCA",
  assay_name = "logcounts",
  n_dim = 10,
  res_name = NULL,
  k = NULL,
  k_min = NA,
  smooth = TRUE,
  cell_min = 10,
  batch_min = NULL,
  unbalanced = FALSE,
  weight = TRUE,
  k_pos = 5,
  sce_pre_list = NULL,
  dim_combined = dim_red,
  assay_pre = "logcounts",
  n_combined = 10,
  BPPARAM = SerialParam()
)

Arguments

`metrics`	Character vector. Name of the metrics to apply. Must be one to all of 'cms', 'ldfDiff', 'isi', 'mixingMetric', 'localStructure', 'entropy'.
`sce`	`SingleCellExperiment` object, with the integrated data.
`group`	Character. Name of group/batch variable. Needs to be one of `names(colData(sce))`.
`dim_red`	Character. Name of embedding to use as subspace for distance distributions. Default is "PCA".
`assay_name`	Character. Name of the assay to use for PCA. Only relevant if no existing 'dim_red' is provided. Must be one of `names(assays(sce))`. Default is "logcounts".
`n_dim`	Numeric. Number of dimensions to include to define the subspace.
`res_name`	Character vector. Appendix of the result score's name (e.g. method used to combine batches). Needs to have the same length as metrics or NULL.
`k`	Numeric. Number of k-nearest neighbours (knn) to use.
`k_min`	Numeric. Minimum number of knn to include (see `cms`). Relevant for metrics: 'cms'.
`smooth`	Logical. Indicating if cms results should be smoothened within each neighbourhood using the weigthed mean. Relevant for metric: 'cms'.
`cell_min`	Numeric. Minimum number of cells from each group to be included into the AD test. Should be > 4. Relevant for metric: 'cms'.
`batch_min`	Numeric. Minimum number of cells per batch to include in to the AD test. If set, neighbours will be included until batch_min cells from each batch are present. Relevant for metrics: 'cms'.
`unbalanced`	Boolean. If TRUE, neighbourhoods with only one batch present will be set to NA. This way they are not included into any summaries or smoothening. Relevant for metrics: 'cms'.
`weight`	Boolean. If TRUE, batch probabilities to calculate the isi score are weighted by the mean distance of their cells towards the cell of interest. Relevant for metrics: 'isi'.
`k_pos`	Numeric. Position of cell to be used as reference within mixing metric. See `MixingMetric` for details. Relevant for metric: 'mixingMetric'
`sce_pre_list`	A list of `SingleCellExperiment` objects with single datasets before integration. Names should correspond to levels in `colData(sce_combined)[,group]`. Relevant for metric: 'ldfDiff'
`dim_combined`	Character. Name of embeddings to use as subspace to calculate LDF after integration. Default is `dim_red`. Relevant for metric 'ldfDiff'.
`assay_pre`	Character. Name of the assay to use for PCA. Only relevant if no existing 'dim_red' is provided. Must be one of `names(assays(sce_pre))`. Default is "logcounts". Relevant for metric 'ldfDiff'.
`n_combined`	Number of PCs to use in original space. See `LocalStruct` for details. Relevant for metric 'localStructure'.
`BPPARAM`	A BiocParallelParam object specifying whether cms scores shall be calculated in parallel. Relevant for metric: 'cms'.

Details

evalIntegration is a wrapper function for different metrics to understand results of integrated single cell data sets. In general there are metrics evaluationg the *mixing* of datasets, that is, metrics that show whether there still is a bias for different datasets after integration. Furthermore there are metrics to evaluate how well the dataset internal structure has been retained, that is, metrics that show whether there has been (potentially biological) signal removed or noise added by integration.

Value

A SingleCellExperiment with the chosen metric's score within colData.

Metrics

Here we provide the following metrics:

cms: Cellspecific Mixing Score. Metric that tests the hypothesis that group-specific distance distributions of knn cells have the same underlying unspecified distribution. The score can be interpreted as the data's probability within an equally mixed neighbourhood according to the batch variable (see cms).
isi: Inverse Simpson Index. Metric that uses the Inverse Simpson’s Index to calculate the diversification within a specified neighbourhood. The Simpson index describes the probability that two entities are taken at random from the dataset and its inverse represent the effective number of batches in a neighbourhood. The inverse Simpson index has been proposed as a diversity score for batch mixing in single cell RNAseq by Korunsky et al. They provide a distance-based neighbourhood weightening in their Lisi package.
mixingMetric: Mixing Metric. Metric using the median position of the kth cell from each batch within its knn as a score. The lower the better mixed is the neighbourhood. We implemented an equivalent version to the one in the Seurat package (See MixingMetric and mixMetric.)
entropy: Shannon entropy. Metric calculating the Shannon entropy of the batch/group variable within each cell's k-nearest neigbours. For balanced batches the entropy is closer to 1 the higher the variables randomness. For unbalanced batches entropy should only be used as a relative metric in a comparative setting (See entropy.)
ldfDiff: Local density factor differences. Metric that determines cell-specific changes in the Local Density Factor before and after data integration. A metric/difference close to 0 indicates no distortion of the previous structure (see ldfDiff).
localStructure: Local structure. Metric that compares the intersection of knn from the same batch before and after integration returning the average between all groups. The higher the more neighbours were reproduced after integration. Here we implemented an equivalent version to the one in the Seurat package (See LocalStruct and locStructure ).

References

Korsunsky I Fan J Slowikowski K Zhang F Wei K et. al. (2018). Fast, sensitive, and accurate integration of single cell data with Harmony. bioRxiv (preprint).

Stuart T Butler A Hoffman P Hafemeister C Papalexi E et. al. (2019) Comprehensive Integration of Single-Cell Data. Cell.

Examples

library(SingleCellExperiment)
sim_list <- readRDS(system.file("extdata/sim50.rds", package = "CellMixS"))
sce <- sim_list[[1]][, c(1:15, 300:320, 16:30)]
sce_batch1 <- sce[,colData(sce)$batch == "1"]
sce_batch2 <- sce[,colData(sce)$batch == "2"]
pre <- list("1" = sce_batch1, "2" = sce_batch2)

sce <- evalIntegration(metrics = c("cms", "mixingMetric", "isi", "entropy"), sce, "batch", k = 20)
sce <- evalIntegration("ldfDiff", sce, "batch", k = 20, sce_pre_list = pre)

library(SingleCellExperiment)
sim_list <- readRDS(system.file("extdata/sim50.rds", package = "CellMixS"))
sce <- sim_list[[1]][, c(1:15, 300:320, 16:30)]
sce_batch1 <- sce[,colData(sce)$batch == "1"]
sce_batch2 <- sce[,colData(sce)$batch == "2"]
pre <- list("1" = sce_batch1, "2" = sce_batch2)

sce <- evalIntegration(metrics = c("cms", "mixingMetric", "isi", "entropy"), sce, "batch", k = 20)
sce <- evalIntegration("ldfDiff", sce, "batch", k = 20, sce_pre_list = pre)

isi

Description

isi

Usage

isi(
  sce,
  group,
  k,
  dim_red = "PCA",
  assay_name = "logcounts",
  n_dim = 10,
  weight = TRUE,
  res_name = NULL
)
isi(
  sce,
  group,
  k,
  dim_red = "PCA",
  assay_name = "logcounts",
  n_dim = 10,
  weight = TRUE,
  res_name = NULL
)

Arguments

`sce`	`SingleCellExperiment` object, with the integrated data.
`group`	Character. Name of group/batch variable. Needs to be one of `names(colData(sce))`.
`k`	Numeric. Number of k-nearest neighbours (knn) to use.
`dim_red`	Character. Name of embeddings to use as subspace for distance distributions. Default is "PCA".
`assay_name`	Character. Name of the assay to use for PCA. Only relevant if no existing 'dim_red' is provided.
`n_dim`	Numeric. Number of dimensions to include to define the subspace.
`weight`	Boolean. If TRUE, batch probabilities to calculate the isi score are weighted by the mean distance of their cells towards the cell of interest. Relevant for metrics: 'isi'.
`res_name`	Character. Appendix of the result score's name (e.g. method used to combine batches).

Details

The isi function calculates the inverse Simpson index of the group variable within each cell's k-nearest neighbourhood. The Simpson index describes the probability that two entities are taken at random from the dataset and its inverse represent the effective number of batches in a neighbourhood. The inverse Simpson index has been proposed as a diversity score for batch mixing in single cell RNAseq by Korunsky et al. They provide a distance-based neighbourhood weightening in their Lisi package. Here, we provide a simplified way of weightening probabilitities, if the weight argument is enabled.

Value

A SingleCellExperiment with the entropy score within colData.

References

Korsunsky I Fan J Slowikowski K Zhang F Wei K et. al. (2018). Fast, sensitive, and accurate integration of single cell data with Harmony. bioRxiv (preprint)

Examples

library(SingleCellExperiment)
sim_list <- readRDS(system.file("extdata/sim50.rds", package = "CellMixS"))
sce <- sim_list[[1]][, c(1:15, 400:420, 16:30)]

sce <- isi(sce, "batch", k = 20)
library(SingleCellExperiment)
sim_list <- readRDS(system.file("extdata/sim50.rds", package = "CellMixS"))
sce <- sim_list[[1]][, c(1:15, 400:420, 16:30)]

sce <- isi(sce, "batch", k = 20)

ldfDiff

Description

Determines cell-specific changes in the Local Density Factor before and after data integration.

Usage

ldfDiff(
  sce_pre_list,
  sce_combined,
  group,
  k = 75,
  dim_red = "PCA",
  dim_combined = dim_red,
  assay_pre = "logcounts",
  assay_combined = "logcounts",
  n_dim = 20,
  res_name = NULL
)
ldfDiff(
  sce_pre_list,
  sce_combined,
  group,
  k = 75,
  dim_red = "PCA",
  dim_combined = dim_red,
  assay_pre = "logcounts",
  assay_combined = "logcounts",
  n_dim = 20,
  res_name = NULL
)

Arguments

`sce_pre_list`	A list of `SingleCellExperiment` objects with single datasets before integration. Names should correspond to levels in `colData(sce_combined)$group`
`sce_combined`	A `SingleCellExperiment` object with the combined data.
`group`	Character. Name of group/batch variable that separates elements of `sce_pre_list`. Needs to be one of `names(colData(sce_combined))`.
`k`	Numeric. Number of k-nearest neighbours (knn) to use.
`dim_red`	Character. Name of embeddings to use as subspace to calculate LDF before integration. Default is "PCA".
`dim_combined`	Character. Name of embeddings to use as subspace to calculate LDF after integration. Default is `dim_red`.
`assay_pre`	Character. Name of the assay to use for PCA. Only relevant if no existing 'dim_red' is provided. Must be one of `names(assays(sce_pre))`. Default is "logcounts".
`assay_combined`	Character. Name of the assay to use for PCA. Only relevant if no existing 'dim_red' is provided. Must be one of `names(assays(sce_combined))`. Default is "logcounts".
`n_dim`	Numeric. Number of PCs to include to define subspaces.
`res_name`	Character. Appendix of the result score's name (e.g. method used to combine batches). Used to specify result name for more than one run on the same input.

Details

The ldfDiff function calculates differences in LDF for each element in sce_pre_list and their corresponding cells in sce_combined using ldfSce. If 'dim_red' is not defined a PCA will be calculated using runPCA. In this case 'assay_pre' need to refer to the data slot that shall define the subspace. Similar refer 'dim-combined' and 'assay_combined' to the integrated subspace or to the resp. "corrected" count data slot. 'k' can be used to define the level of local structure that is tested. The smaller 'k' the more focus is on detailed structures, while a large k will tets overall changes.

Value

A SingleCellExperiment object.

References

Latecki, Longin Jan and Lazarevic, Aleksandar and Pokrajac, Dragoljub (2007). Outlier Detection with Kernel Density Functions. Mach. Learn. Data Min. Pattern Recognit.. Springer Berlin Heidelberg.

Examples

library(SingleCellExperiment)
sim_list <- readRDS(system.file("extdata/sim50.rds", package = "CellMixS"))
sce <- sim_list[["batch20"]][, c(1:50, 300:350)]
sce_batch1 <- sce[,colData(sce)$batch == "1"]
sce_batch2 <- sce[,colData(sce)$batch == "2"]
sce_pre_list <- list("1" = sce_batch1, "2" = sce_batch2)

sce_ldf <- ldfDiff(sce_pre_list, sce, k = 10, group = "batch",
dim_combined = "MNN", n_dim = 2)

library(SingleCellExperiment)
sim_list <- readRDS(system.file("extdata/sim50.rds", package = "CellMixS"))
sce <- sim_list[["batch20"]][, c(1:50, 300:350)]
sce_batch1 <- sce[,colData(sce)$batch == "1"]
sce_batch2 <- sce[,colData(sce)$batch == "2"]
sce_pre_list <- list("1" = sce_batch1, "2" = sce_batch2)

sce_ldf <- ldfDiff(sce_pre_list, sce, k = 10, group = "batch",
dim_combined = "MNN", n_dim = 2)

ldfSce

Description

Determines cell-specific changes in the Local Density Factor before and after data integration for one specific group.

Usage

ldfSce(
  sce_name,
  sce_pre_list,
  sce_combined,
  group,
  k = 75,
  dim_red = "PCA",
  dim_combined = dim_red,
  assay_pre = "logcounts",
  assay_combined = "logcounts",
  n_dim = 20
)
ldfSce(
  sce_name,
  sce_pre_list,
  sce_combined,
  group,
  k = 75,
  dim_red = "PCA",
  dim_combined = dim_red,
  assay_pre = "logcounts",
  assay_combined = "logcounts",
  n_dim = 20
)

Arguments

`sce_name`	Character. Name of the element in `sce_pre_list` to calculate LDF differences in.
`sce_pre_list`	A list of `SingleCellExperiment` objects with single datasets before integration. Names need to correspond to levels in `colData(sce_combined)$group` and `sce_name`!!
`sce_combined`	A `SingleCellExperiment` object with combined data.
`group`	Character. Name of group/batch variable that separates elements of `sce_pre_list`. Needs to be one of `names(colData(sce_combined))`.
`k`	Numeric. Number of k-nearest neighbours (knn) to use.
`dim_red`	Character. Name of embeddings to use as subspace to calculate LDF before integration. Default is "PCA".
`dim_combined`	Character. Name of embeddings to use as subspace to calculate LDF after integration. Default is `dim_red`.
`assay_pre`	Character. Name of the assay to use for PCA. Only relevant if no existing 'dim_red' is provided. Must be one of `names(assays(sce_pre))`. Default is "logcounts".
`assay_combined`	Character. Name of the assay to use for PCA. Only relevant if no existing 'dim_red' is provided. Must be one of `names(assays(sce_combined))`. Default is "logcounts".
`n_dim`	Numeric. Number of PCs to include to define subspaces.

Details

The ldfSce function calculates differences in LDF for one specified element in sce_pre_list and their corresponding cells in sce_combined. If 'dim_red' is not defined a PCA will be calculated using runPCA. In this case 'assay_pre' need to refer to the data slot that shall define the subspace. Similar refer 'dim-combined' and 'assay_combined' to the integrated subspace or to the resp. "corrected" count data slot. 'k' can be used to define the level of local structure that is tested. The smaller 'k' the more focus is on detailed structures, while a large k will tets overall changes. K-nearest neighbours (knn) are determined in the subspaces before integration defined by 'dim_red'. The same set of knn are used to determine LDF before and after integration.

Value

A data.frame with difference in LDF as column named "diff_ldf".

References

Latecki, Longin Jan and Lazarevic, Aleksandar and Pokrajac, Dragoljub (2007). Outlier Detection with Kernel Density Functions. Mach. Learn. Data Min. Pattern Recognit.. Springer Berlin Heidelberg.

Examples

library(SingleCellExperiment)
sim_list <- readRDS(system.file("extdata/sim50.rds", package = "CellMixS"))
sce <- sim_list[["batch20"]][, c(1:50, 300:350)]
sce_batch1 <- sce[,colData(sce)$batch == "1"]
sce_pre_list <- list("1" = sce_batch1)

ldf_1 <- ldfSce("1", sce_pre_list, sce, k = 10, group = "batch",
dim_combined = "MNN", n_dim = 5)

library(SingleCellExperiment)
sim_list <- readRDS(system.file("extdata/sim50.rds", package = "CellMixS"))
sce <- sim_list[["batch20"]][, c(1:50, 300:350)]
sce_batch1 <- sce[,colData(sce)$batch == "1"]
sce_pre_list <- list("1" = sce_batch1)

ldf_1 <- ldfSce("1", sce_pre_list, sce, k = 10, group = "batch",
dim_combined = "MNN", n_dim = 5)

locStructure

Description

locStructure

Usage

locStructure(
  sce,
  group,
  dim_combined,
  k = 100,
  dim_red = "PCA",
  assay_name = "logcounts",
  n_dim = 10,
  n_combined = 10,
  res_name = NULL
)
locStructure(
  sce,
  group,
  dim_combined,
  k = 100,
  dim_red = "PCA",
  assay_name = "logcounts",
  n_dim = 10,
  n_combined = 10,
  res_name = NULL
)

Arguments

`sce`	`SingleCellExperiment` object, with the integrated data.
`group`	Character. Name of group/batch variable. Needs to be one of `names(colData(sce))`.
`dim_combined`	Charactyer. Name of the reduced dimensional representation of the integrated data. Needs to be one of `reducedDimNames(sce))`.
`k`	Numeric. Number of k-nearest neighbours (knn) to use.
`dim_red`	Character. Name of embeddings to calculate neighbourhoods before integration. Default is "PCA".
`assay_name`	Character. Name of the assay to use for PCA of the original (not integrated) data. Should not refer to "corrected" counts.
`n_dim`	Numeric. Number of dimensions to include for the original data.
`n_combined`	Numeric. Number of dimensions to include for the integrated data.
`res_name`	Character. Appendix of the result score's name (e.g. method used to combine batches).

Details

The locStructure function implements the localStructure function from Seurat (See LocalStruct. For each group it calculates the k nearest neighbour within PCA space before integration and compares it to the knn within the reduced dimensional representation after integration. The score represents the proportion of overlapping neighbours. The LocalStruct function is based on the RunPCA function, while here runPCA is used. This can cause small deviance from the LocalStruct function, but overall these functions are equivalent.

Value

A SingleCellExperiment with the mixing metric within colData.

References

Stuart T Butler A Hoffman P Hafemeister C Papalexi E et. al. (2019) Comprehensive Integration of Single-Cell Data. Cell.

Examples

library(SingleCellExperiment)
sim_list <- readRDS(system.file("extdata/sim50.rds", package = "CellMixS"))
sce <- sim_list[["batch20"]][, c(1:50, 300:350)]

sce <- locStructure(sce, "batch", "MNN", k = 20, assay_name = "counts")
library(SingleCellExperiment)
sim_list <- readRDS(system.file("extdata/sim50.rds", package = "CellMixS"))
sce <- sim_list[["batch20"]][, c(1:50, 300:350)]

sce <- locStructure(sce, "batch", "MNN", k = 20, assay_name = "counts")

mixMetric

Description

mixMetric

Usage

mixMetric(
  sce,
  group,
  k = 300,
  dim_red = "PCA",
  assay_name = "logcounts",
  n_dim = 10,
  k_pos = 5,
  res_name = NULL
)
mixMetric(
  sce,
  group,
  k = 300,
  dim_red = "PCA",
  assay_name = "logcounts",
  n_dim = 10,
  k_pos = 5,
  res_name = NULL
)

Arguments

`sce`	`SingleCellExperiment` object, with the integrated data.
`group`	Character. Name of group/batch variable. Needs to be one of `names(colData(sce))`.
`k`	Numeric. Number of k-nearest neighbours (knn) to use.
`dim_red`	Character. Name of embeddings to use as subspace for distance distributions. Default is "PCA".
`assay_name`	Character. Name of the assay to use for PCA. Only relevant if no existing 'dim_red' is provided.
`n_dim`	Numeric. Number of dimensions to include to define the subspace.
`k_pos`	Position of the cell, which rank to use for scoring, defaults to 5.
`res_name`	Character. Appendix of the result score's name (e.g. method used to combine batches).

Details

The mixMetric function implements the mixingMetric function from Seurat (See MixingMetric. It takes the median rank of the '__k_pos__ neighbour from each batch as estimation for the data's entropy according to the batch variable. The same result can be assesed using the MixingMetric function and a seurat object from the __Seurat__ package.

Value

A SingleCellExperiment with the mixing metric within colData.

References

Stuart T Butler A Hoffman P Hafemeister C Papalexi E et. al. (2019) Comprehensive Integration of Single-Cell Data. Cell.

Examples

library(SingleCellExperiment)
sim_list <- readRDS(system.file("extdata/sim50.rds", package = "CellMixS"))
sce <- sim_list[[1]][, c(1:15, 400:420, 16:30)]

sce <- mixMetric(sce, "batch", k = 20)
library(SingleCellExperiment)
sim_list <- readRDS(system.file("extdata/sim50.rds", package = "CellMixS"))
sce <- sim_list[[1]][, c(1:15, 400:420, 16:30)]

sce <- mixMetric(sce, "batch", k = 20)

visCluster

Description

Creates summary plots of metric scores for different groups/cluster.

Usage

visCluster(sce_cms, cluster_var, metric_var = "cms", violin = FALSE)
visCluster(sce_cms, cluster_var, metric_var = "cms", violin = FALSE)

Arguments

`sce_cms`	A `SingleCellExperiment` object with the result scores (e.g. cms) to plot within `colData(res_object)`.
`cluster_var`	Character. Name of the factor level variable to summarize metric scores on.
`metric_var`	Character Name of the metric scores to use. Default is "cms".
`violin`	A logical. If true violin plots are plotted, while the default (FALSE) will plot ridge plots.

Details

Plots summarized metric scores. This function is intended to visualize and compare metric scores among clusters or other dataset variables spcified in 'cluster_var'.

Value

a ggplot object.

Examples

library(SingleCellExperiment)

sim_list <- readRDS(system.file("extdata/sim50.rds", package = "CellMixS"))
sce <- sim_list[[1]][, c(1:30,300:320)]
sce_cms <- cms(sce, "batch", k = 20, n_dim = 2)

visCluster(sce_cms, "batch")

library(SingleCellExperiment)

sim_list <- readRDS(system.file("extdata/sim50.rds", package = "CellMixS"))
sce <- sim_list[[1]][, c(1:30,300:320)]
sce_cms <- cms(sce, "batch", k = 20, n_dim = 2)

visCluster(sce_cms, "batch")

visGroup

Description

Plot group label in a reduced dimensional plot.

Usage

visGroup(sce, group, dim_red = "TSNE")
visGroup(sce, group, dim_red = "TSNE")

Arguments

`sce`	A `SingleCellExperiment` object.
`group`	Character. Name of group/batch variable. Needs to be one of `names(colData(sce))`.
`dim_red`	Character. Name of embeddings to use as subspace for plotting. Default is "TSNE".

Details

Plots a reduced dimension plot colored by group parameter. The dimesion reduction embedding can be specified, but only tsne embeddings will automatically be computed by runTSNE. Embeddings from data integration methods (e.g. mnn.correct) can be used as long as they are specified in reducedDimNames(sce).

Value

a ggplot object.

Examples

library(SingleCellExperiment)
sim_list <- readRDS(system.file("extdata/sim50.rds", package = "CellMixS"))
sce <- sim_list[[1]][, c(1:50, 300:350)]

visGroup(sce, "batch")


library(SingleCellExperiment)
sim_list <- readRDS(system.file("extdata/sim50.rds", package = "CellMixS"))
sce <- sim_list[[1]][, c(1:50, 300:350)]

visGroup(sce, "batch")

visHist

Description

Plot pvalue histograms of metric score distributions

Usage

visHist(
  res_object,
  metric = "cms",
  prefix = TRUE,
  n_col = 1,
  metric_prefix = NULL
)
visHist(
  res_object,
  metric = "cms",
  prefix = TRUE,
  n_col = 1,
  metric_prefix = NULL
)

Arguments

`res_object`	`SingleCellExperiment` object, matrix or data.frame. The SingleCellExperiment object should contain the result scores (e.g. cms) to plot in `colData(res_object)`. Matrix or data frame should have result scores in columns and cells in rows.
`metric`	Character vector. Specify names of `colData(sce)` to be plotted. Applys only if 'res_object' is a `SingleCellExperiment` object. Default is 'cms'. If prefix is TRUE all columns starting with 'metric' will be plotted.
`prefix`	Boolean. Is ‘metric' used to specify column’s prefix(true) or complete column names (False).
`n_col`	Numeric. Number of columns of the pval histogram.
`metric_prefix`	Former parameter to define prefix of the metric to be plotted. Will stop and ask for the new syntax.

Details

Plots metric score distribution similar to a pvalue histogram distribution. Without dataset-specific bias, cms scores should be approx. flat distributed. If 'res_object' is a matrix or data.frame, it will create a histogram for each column. If 'res_object' is a SingleCellExperiment object, it will create a histogram of all colData(res_object) that start with or are specified in 'metric'.

Value

a ggplot object.

Examples

library(SingleCellExperiment)
sim_list <- readRDS(system.file("extdata/sim50.rds", package = "CellMixS"))
sce <- sim_list[[1]][, c(1:50)]
sce_cms <- cms(sce, "batch", k = 20, n_dim = 2)
visHist(sce_cms)


library(SingleCellExperiment)
sim_list <- readRDS(system.file("extdata/sim50.rds", package = "CellMixS"))
sce <- sim_list[[1]][, c(1:50)]
sce_cms <- cms(sce, "batch", k = 20, n_dim = 2)
visHist(sce_cms)

visIntegration

Description

Creates a summary plot of metric scores (for different integration methods).

Usage

visIntegration(
  res_object,
  metric = "cms",
  prefix = TRUE,
  violin = FALSE,
  metric_name = "metric",
  metric_prefix = NULL
)
visIntegration(
  res_object,
  metric = "cms",
  prefix = TRUE,
  violin = FALSE,
  metric_name = "metric",
  metric_prefix = NULL
)

Arguments

`res_object`	`SingleCellExperiment` object, list, matrix or data.frame. The SingleCellExperiment object should contain the result scores (cms) to compare within `colData(res_object)`. List, matrix or data frame should have result scores in list elements resp. columns.
`metric`	Character vector. Specify names of `colData(sce)` to be compared. Applys only if 'res_object' is a `SingleCellExperiment` object. Default is 'cms'. If prefix is TRUE all columns starting with 'metric' will be compared and plotted.
`prefix`	Boolean. Is ‘metric' used to specify column’s prefix(true) or complete column names (False).
`violin`	A logical. If true violin plots are plotted, while the default (FALSE) will plot ridge plots.
`metric_name`	Character. Name of the score metric.
`metric_prefix`	Former parameter to define prefix of the metric to be plotted. Will stop and ask for the new syntax.

Details

Plots summarized cms scores from an SingleCellExperiment object, list or dataframe. This function is intended to visualize and compare different methods and views of the same dataset, not to compare different datasets.

Value

a ggplot object.

Examples

library(SingleCellExperiment)

sim_list <- readRDS(system.file("extdata/sim50.rds", package = "CellMixS"))

sce <- sim_list[["batch20"]][, c(1:30,300:320)]
sce_mnn <- cms(sce,"batch", k = 20, dim_red = "MNN", res_name = "MNN",
n_dim = 2)

visIntegration(sce_mnn, metric = "cms.", violin = TRUE)

library(SingleCellExperiment)

sim_list <- readRDS(system.file("extdata/sim50.rds", package = "CellMixS"))

sce <- sim_list[["batch20"]][, c(1:30,300:320)]
sce_mnn <- cms(sce,"batch", k = 20, dim_red = "MNN", res_name = "MNN",
n_dim = 2)

visIntegration(sce_mnn, metric = "cms.", violin = TRUE)

visMetric

Description

Plot metric scores in a reduced dimensional plot.

Usage

visMetric(sce_cms, metric_var = "cms", dim_red = "TSNE", log10_val = FALSE)
visMetric(sce_cms, metric_var = "cms", dim_red = "TSNE", log10_val = FALSE)

Arguments

`sce_cms`	A `SingleCellExperiment` object with the result scores (e.g. cms) to plot within `colData(res_object)`.
`metric_var`	Character Name of the metric scores to use. Default is "cms".
`dim_red`	Character. Name of embeddings to use as subspace for plotting. Default is "TSNE".
`log10_val`	Logical. Indicating if -log10(metric) should be plotted.

Details

Plots a reduced dimension plot colored by metric scores. The dimension reduction embedding can be specified, but only tsne embeddings will automatically be computed using runTSNE. Embeddings from data integration methods (e.g. mnn.correct) can be used as long as they are present in reducedDimNames(sce).

Value

a ggplot object.

Examples

library(SingleCellExperiment)
sim_list <- readRDS(system.file("extdata/sim50.rds", package = "CellMixS"))
sce <- sim_list[[1]][, c(1:30, 300:320)]
sce_cms <- cms(sce, "batch", k = 20, n_dim = 2)

visMetric(sce_cms)

library(SingleCellExperiment)
sim_list <- readRDS(system.file("extdata/sim50.rds", package = "CellMixS"))
sce <- sim_list[[1]][, c(1:30, 300:320)]
sce_cms <- cms(sce, "batch", k = 20, n_dim = 2)

visMetric(sce_cms)

visOverview

Description

Plot an overview of metric results, group label and any colData variable in a reduced dimensional representation.

Usage

visOverview(
  sce_cms,
  group,
  metric = "cms",
  prefix = TRUE,
  dim_red = "TSNE",
  log10_val = FALSE,
  other_var = NULL,
  metric_prefix = NULL
)
visOverview(
  sce_cms,
  group,
  metric = "cms",
  prefix = TRUE,
  dim_red = "TSNE",
  log10_val = FALSE,
  other_var = NULL,
  metric_prefix = NULL
)

Arguments

`sce_cms`	A `SingleCellExperiment` object with the result scores (e.g. cms) to plot in `colData(sce_cms)`.
`group`	Character. Name of group/batch variable. Needs to be one of `names(colData(sce))`.
`metric`	Character vector. Specify names of `colData(sce)` to be plotted. Applys only if 'res_object' is a `SingleCellExperiment` object. Default is 'cms'. If prefix is TRUE all columns starting with 'metric' will be plotted.
`prefix`	Boolean. Is ‘metric' used to specify column’s prefix(true) or complete column names (False).
`dim_red`	Character. Name of embeddings to use as subspace for plotting. Default is "TSNE".
`log10_val`	Logical. Indicating if -log10(metric) should be plotted.
`other_var`	Character string. Name(s) of other variables to be plotted asided. Need correspond to one of `colData(sce)`.
`metric_prefix`	Former parameter to define prefix of the metric to be plotted. Will stop and ask for the new syntax.

Details

Plots reduced dimensions of cells colored by group variable and metric score. If 'red_dim' is not defined in reducedDimNames(sce) a tsne is calculated using runTSNE. Other color label as celltype label or smoothened scores can be plotted aside. Embeddings from data integration methods (e.g. mnn.correct) can be used if they are specified in reducedDimNames(sce).

Value

a ggplot object.

Examples

library(SingleCellExperiment)
sim_list <- readRDS(system.file("extdata/sim50.rds", package = "CellMixS"))
sce <- sim_list[[1]][, c(1:30, 300:330)]
sce_cms <- cms(sce, "batch", k = 20, n_dim = 2)

visOverview(sce_cms, "batch", other_var = "batch")


library(SingleCellExperiment)
sim_list <- readRDS(system.file("extdata/sim50.rds", package = "CellMixS"))
sce <- sim_list[[1]][, c(1:30, 300:330)]
sce_cms <- cms(sce, "batch", k = 20, n_dim = 2)

visOverview(sce_cms, "batch", other_var = "batch")

Package 'CellMixS'

Help Index

Toolbox to explore batch effects and data integration in scRNA data.

Description

Details

Author(s)

.cmsCell

Description

Usage

Arguments

Details

Value

See Also

.defineSubspace

Description

Usage

Arguments

Details

Value

See Also

.filterKnn

Description

Usage

Arguments

Value

See Also

.filterLocMin

Description

Usage

Arguments

Details

Value

See Also

.ldfKnn

Description

Usage

Arguments

Details

Value

See Also

.smoothCms

Description

Usage

Arguments

Details

Value

See Also

cms

Description

Usage

Arguments

Details

Value

References

See Also

Examples

entropy

Description

Usage

Arguments

Details

Value

Examples

evalIntegration

Description

Usage

Arguments

Details

Value

Metrics

References

Examples

isi

Description

Usage

Arguments

Details

Value

References

Examples