Title: | ccImpute: an accurate and scalable consensus clustering based approach to impute dropout events in the single-cell RNA-seq data (https://doi.org/10.1186/s12859-022-04814-8) |
---|---|
Description: | Dropout events make the lowly expressed genes indistinguishable from true zero expression and different than the low expression present in cells of the same type. This issue makes any subsequent downstream analysis difficult. ccImpute is an imputation algorithm that uses cell similarity established by consensus clustering to impute the most probable dropout events in the scRNA-seq datasets. ccImpute demonstrated performance which exceeds the performance of existing imputation approaches while introducing the least amount of new noise as measured by clustering performance characteristics on datasets with known cell identities. |
Authors: | Marcin Malec [cre, aut] , Parichit Sharma [aut] , Hasan Kurban [aut] , Mehmet Dalkilic [aut] |
Maintainer: | Marcin Malec <[email protected]> |
License: | GPL-3 |
Version: | 1.9.0 |
Built: | 2024-10-30 04:34:03 UTC |
Source: | https://github.com/bioc/ccImpute |
Performs imputation of dropout values in single-cell RNA sequencing (scRNA-seq) data using a consensus clustering-based algorithm (ccImpute). This implementation includes performance enhancements over the original ccImpute method described in the paper "ccImpute: an accurate and scalable consensus clustering based algorithm to impute dropout events in the single-cell RNA-seq data" (DOI: https://doi.org/10.1186/s12859-022-04814-8).
Defines the generic function 'ccImpute' and a specific method for 'SingleCellExperiment' objects.
ccImpute.SingleCellExperiment( object, dist, nCeil = 2000, svdMaxRatio = 0.08, maxSets = 8, k, consMin = 0.75, kmNStart, kmMax = 1000, fastSolver = TRUE, BPPARAM = bpparam(), verbose = TRUE ) ccImpute( object, dist, nCeil = 2000, svdMaxRatio = 0.08, maxSets = 8, k, consMin = 0.75, kmNStart, kmMax = 1000, fastSolver = TRUE, BPPARAM = bpparam(), verbose = TRUE ) ## S4 method for signature 'SingleCellExperiment' ccImpute( object, dist, nCeil = 2000, svdMaxRatio = 0.08, maxSets = 8, k, consMin = 0.75, kmNStart, kmMax = 1000, fastSolver = TRUE, BPPARAM = bpparam(), verbose = TRUE )
ccImpute.SingleCellExperiment( object, dist, nCeil = 2000, svdMaxRatio = 0.08, maxSets = 8, k, consMin = 0.75, kmNStart, kmMax = 1000, fastSolver = TRUE, BPPARAM = bpparam(), verbose = TRUE ) ccImpute( object, dist, nCeil = 2000, svdMaxRatio = 0.08, maxSets = 8, k, consMin = 0.75, kmNStart, kmMax = 1000, fastSolver = TRUE, BPPARAM = bpparam(), verbose = TRUE ) ## S4 method for signature 'SingleCellExperiment' ccImpute( object, dist, nCeil = 2000, svdMaxRatio = 0.08, maxSets = 8, k, consMin = 0.75, kmNStart, kmMax = 1000, fastSolver = TRUE, BPPARAM = bpparam(), verbose = TRUE )
object |
A |
dist |
(Optional) A distance matrix used for cell similarity. calculations. If not provided, a weighted Spearman correlation matrix is calculated. |
nCeil |
(Optional) The maximum number of cells used to compute the
proportion of singular vectors (default: |
svdMaxRatio |
(Optional) The maximum proportion of singular vectors
used for generating subsets (default: |
maxSets |
(Optional) The maximum number of sub-datasets used for
consensus clustering (default: |
k |
(Optional) The number of clusters (cell groups) in the data. If not provided, it is estimated using the Tracy-Widom Bound. |
consMin |
(Optional) The low-pass filter threshold for processing the
consensus matrix (default: |
kmNStart |
nstart parameter passed to |
kmMax |
iter.max parameter passed to |
fastSolver |
(Optional) Whether to use mean of
non-zero values for calculating dropout values or a linear equation solver
(much slower and did show empirical difference in imputation performance)
(default: |
BPPARAM |
(Optional) A |
verbose |
(Optional) Whether to print progress messages
(default: |
A SingleCellExperiment
class object with the imputed
expression values stored in the '"imputed"' assay.
library(BiocParallel) library(splatter) library(scater) sce <- splatSimulate(group.prob = rep(1, 5)/5, sparsify = FALSE, batchCells=100, nGenes=1000, method = "groups", verbose = FALSE, dropout.type = "experiment") sce <- logNormCounts(sce) cores <- 2 BPPARAM = MulticoreParam(cores) sce <- ccImpute(sce, BPPARAM=BPPARAM)
library(BiocParallel) library(splatter) library(scater) sce <- splatSimulate(group.prob = rep(1, 5)/5, sparsify = FALSE, batchCells=100, nGenes=1000, method = "groups", verbose = FALSE, dropout.type = "experiment") sce <- logNormCounts(sce) cores <- 2 BPPARAM = MulticoreParam(cores) sce <- ccImpute(sce, BPPARAM=BPPARAM)
Computes rankings for each column of a matrix in parallel.
colRanks_fast(x, n_cores)
colRanks_fast(x, n_cores)
x |
The input matrix to be ranked. |
n_cores |
The number of CPU cores to use for parallel processing. |
A matrix where each column contains the rankings for the corresponding column in the input matrix.
This function imputes dropout values (zeros) in a count matrix using either a fast numerical solver or a slower linear equations solver.
computeDropouts(consMtx, logX, dropIds, fastSolver = TRUE, nCores)
computeDropouts(consMtx, logX, dropIds, fastSolver = TRUE, nCores)
consMtx |
A numeric matrix representing the processed consensus matrix obtained from clustering analysis. |
logX |
A (sparse or dense) numeric matrix representing the transpose of a log-normalized gene expression matrix. Rows correspond to cells, and columns correspond to genes. |
dropIds |
A numeric vector containing the row/col indices of the dropouts to be imputed. |
fastSolver |
A logical value indicating whether to use the fast solver (default) or the slow solver. |
nCores |
An integer specifying the number of cores to use for parallel processing (if applicable). |
An imputed log-transformed count matrix (same dimensions as 'logX').
library(scater) library(BiocParallel) library(splatter) sce <- splatSimulate(group.prob = rep(1, 5)/5, sparsify = FALSE, batchCells=100, nGenes=1000, method = "groups", verbose = FALSE, dropout.type = "experiment") sce <- logNormCounts(sce) cores <- 2 logX <- as.matrix(logcounts(sce)) w <- rowVars_fast(logX, cores) corMat <- getCorM("spearman", logcounts(sce), w, cores) v <- doSVD(corMat, nCores=cores) BPPARAM = MulticoreParam(cores) consMtx <- runKM(logX, v, BPPARAM=bpparam()) dropIds <- findDropouts(logX, consMtx) impLogX <- computeDropouts(consMtx, logX, dropIds, nCores=cores)
library(scater) library(BiocParallel) library(splatter) sce <- splatSimulate(group.prob = rep(1, 5)/5, sparsify = FALSE, batchCells=100, nGenes=1000, method = "groups", verbose = FALSE, dropout.type = "experiment") sce <- logNormCounts(sce) cores <- 2 logX <- as.matrix(logcounts(sce)) w <- rowVars_fast(logX, cores) corMat <- getCorM("spearman", logcounts(sce), w, cores) v <- doSVD(corMat, nCores=cores) BPPARAM = MulticoreParam(cores) consMtx <- runKM(logX, v, BPPARAM=bpparam()) dropIds <- findDropouts(logX, consMtx) impLogX <- computeDropouts(consMtx, logX, dropIds, nCores=cores)
This function calculates a Pearson correlation matrix
cor_fast(x, n_cores)
cor_fast(x, n_cores)
x |
The input matrix, where each column represents a set of observations. |
n_cores |
The number of CPU cores to utilize for parallel computation (optional, defaults to 1). |
A Pearson correlation matrix if 'useRanks' is 'false'. If 'useRanks' is 'true', returns a Spearman correlation matrix.
Computes a truncated SVD on a matrix using the implicitly restarted Lanczos bidiagonalization algorithm (IRLBA).
doSVD(x, svdMaxRatio = 0.08, nCeil = 2000, nCores)
doSVD(x, svdMaxRatio = 0.08, nCeil = 2000, nCores)
x |
A numeric matrix to perform SVD on. |
svdMaxRatio |
(Optional) The maximum proportion of singular vectors
used for generating subsets (default: |
nCeil |
(Optional) The maximum number of cells used to compute the
proportion of singular vectors (default: |
nCores |
The number of cores to use for parallel processing. |
This function utilizes the 'irlba' function from the 'irlba' package to efficiently calculate the truncated SVD of the input matrix 'x'. The returned matrix contains 'nv' right singular vectors, which are often used for dimensionality reduction and feature extraction in various applications.
A matrix containing the right singular vectors of 'x'.
library(scater) library(splatter) sce <- splatSimulate(group.prob = rep(1, 5)/5, sparsify = FALSE, batchCells=100, nGenes=1000, method = "groups", verbose = FALSE, dropout.type = "experiment") sce <- logNormCounts(sce) cores <- 2 logX <- as.matrix(logcounts(sce)) w <- rowVars_fast(logX, cores) corMat <- getCorM("spearman", logcounts(sce), w, cores) v <- doSVD(corMat, nCores=cores)
library(scater) library(splatter) sce <- splatSimulate(group.prob = rep(1, 5)/5, sparsify = FALSE, batchCells=100, nGenes=1000, method = "groups", verbose = FALSE, dropout.type = "experiment") sce <- logNormCounts(sce) cores <- 2 logX <- as.matrix(logcounts(sce)) w <- rowVars_fast(logX, cores) corMat <- getCorM("spearman", logcounts(sce), w, cores) v <- doSVD(corMat, nCores=cores)
This function estimates the number of clusters (k) in a dataset using the Tracy-Widom distribution as a bound for the eigenvalues of the scaled data covariance matrix.
estkTW(x)
estkTW(x)
x |
A numeric matrix or data frame where rows are observations and columns are variables. |
The estimated number of clusters (k).
library(scater) library(splatter) sce <- splatSimulate(group.prob = rep(1, 5)/5, sparsify = FALSE, batchCells=100, nGenes=1000, method = "groups", verbose = FALSE, dropout.type = "experiment") sce <- logNormCounts(sce) logX <- as.matrix(logcounts(sce)) k <- estkTW(logX)
library(scater) library(splatter) sce <- splatSimulate(group.prob = rep(1, 5)/5, sparsify = FALSE, batchCells=100, nGenes=1000, method = "groups", verbose = FALSE, dropout.type = "experiment") sce <- logNormCounts(sce) logX <- as.matrix(logcounts(sce)) k <- estkTW(logX)
Determines which zero values within a transposed, log-normalized expression matrix are likely dropout events. The identification is based on a weighted cell voting scheme, where weights are derived from a processed consensus matrix.
findDropouts(logX, consMtx)
findDropouts(logX, consMtx)
logX |
A (sparse or dense) numeric matrix representing the transpose of a log-normalized gene expression matrix. Rows correspond to cells, and columns correspond to genes. |
consMtx |
A numeric matrix representing the processed consensus matrix obtained from clustering analysis. |
A two-column matrix (or data frame) where each row indicates the location (row index, column index) of a potential dropout event in the input matrix 'logX'.
library(scater) library(BiocParallel) library(splatter) sce <- splatSimulate(group.prob = rep(1, 5)/5, sparsify = FALSE, batchCells=100, nGenes=1000, method = "groups", verbose = FALSE, dropout.type = "experiment") sce <- logNormCounts(sce) cores <- 2 logX <- as.matrix(logcounts(sce)) w <- rowVars_fast(logX, cores) corMat <- getCorM("spearman", logcounts(sce), w, cores) v <- doSVD(corMat, nCores=cores) BPPARAM = MulticoreParam(cores) consMtx <- runKM(logX, v, BPPARAM=bpparam()) dropIds <- findDropouts(logX, consMtx)
library(scater) library(BiocParallel) library(splatter) sce <- splatSimulate(group.prob = rep(1, 5)/5, sparsify = FALSE, batchCells=100, nGenes=1000, method = "groups", verbose = FALSE, dropout.type = "experiment") sce <- logNormCounts(sce) cores <- 2 logX <- as.matrix(logcounts(sce)) w <- rowVars_fast(logX, cores) corMat <- getCorM("spearman", logcounts(sce), w, cores) v <- doSVD(corMat, nCores=cores) BPPARAM = MulticoreParam(cores) consMtx <- runKM(logX, v, BPPARAM=bpparam()) dropIds <- findDropouts(logX, consMtx)
This function calculates an average consensus matrix from a set of clustering solutions. It filters out values below a specified minimum threshold ('consMin') and normalizes the remaining non-zero columns to sum to 1.
getConsMtx(dat, consMin, n_cores)
getConsMtx(dat, consMin, n_cores)
dat |
An integer matrix where each column represents a different clustering solution (cluster assignments for each data point). |
consMin |
The minimum consensus value to retain. Values below this threshold are set to zero. |
n_cores |
The number of cores to use for parallel processing. This can speed up the normalization step. |
A processed consensus matrix where each element (i, j) represents the proportion of times data points i and j were assigned to the same cluster with filtering and normalization applied.
Efficiently computes a column-wise correlation matrix for a given input matrix. Supports Pearson and Spearman correlations, with optional weighting for features.
getCorM(method, x, w, nCores)
getCorM(method, x, w, nCores)
method |
A character string specifying the correlation
metric to use. Currently supported options are:
- |
x |
A numeric matrix where each column represents a sample. |
w |
(Optional) A numeric vector of weights for each feature (row) in
|
nCores |
The number of cores to use for parallel processing. |
A correlation matrix of the same dimensions as the number of columns in 'x'. The values represent the pairwise correlations between samples (columns) based on the chosen method and optional weights.
library(scater) library(splatter) sce <- splatSimulate(group.prob = rep(1, 5)/5, sparsify = FALSE, batchCells=100, nGenes=1000, method = "groups", verbose = FALSE, dropout.type = "experiment") sce <- logNormCounts(sce) cores <- 2 logX <- as.matrix(logcounts(sce)) w <- rowVars_fast(logX, cores) corMat <- getCorM("spearman", logcounts(sce), w, cores)
library(scater) library(splatter) sce <- splatSimulate(group.prob = rep(1, 5)/5, sparsify = FALSE, batchCells=100, nGenes=1000, method = "groups", verbose = FALSE, dropout.type = "experiment") sce <- logNormCounts(sce) cores <- 2 logX <- as.matrix(logcounts(sce)) w <- rowVars_fast(logX, cores) corMat <- getCorM("spearman", logcounts(sce), w, cores)
Computes Means and Standard Deviations for Scaling
getScale(x, n_cores)
getScale(x, n_cores)
x |
A numeric matrix representing the gene expression data, where rows are genes and columns are samples. |
n_cores |
The number of cores to use for parallel processing. |
A list containing: * 'means': A numeric vector of column means. * 'sds': A numeric vector of column standard deviations.
Computes Row Variances Efficiently
rowVars_fast(x, n_cores)
rowVars_fast(x, n_cores)
x |
A numeric dense matrix for which to compute row variances. |
n_cores |
The number of cores to utilize for parallel processing. |
A numeric vector containing the variance for each row of the input matrix.
library(Matrix) rand_vals <- sample(0:10,1e4,replace=TRUE, p=c(0.99,rep(0.001,10))) x <- as.matrix(Matrix(rand_vals,ncol=5)) cores <- 2 vars_vector <- rowVars_fast(x, cores)
library(Matrix) rand_vals <- sample(0:10,1e4,replace=TRUE, p=c(0.99,rep(0.001,10))) x <- as.matrix(Matrix(rand_vals,ncol=5)) cores <- 2 vars_vector <- rowVars_fast(x, cores)
Executes k-means clustering on multiple subsets of data defined by singular value decomposition (SVD) components, and then aggregates the results into a consensus matrix.
runKM( logX, v, maxSets = 8, k, consMin = 0.75, kmNStart, kmMax = 1000, BPPARAM = bpparam() )
runKM( logX, v, maxSets = 8, k, consMin = 0.75, kmNStart, kmMax = 1000, BPPARAM = bpparam() )
logX |
A (sparse or dense) numeric matrix representing the transpose of a log-normalized gene expression matrix. Rows correspond to cells, and columns correspond to genes. |
v |
A matrix of right singular vectors obtained from SVD of a distance matrix derived from 'logX'. |
maxSets |
(Optional) The maximum number of sub-datasets used for
consensus clustering (default: |
k |
(Optional) The number of clusters (cell groups) in the data. If not provided, it is estimated using the Tracy-Widom Bound. |
consMin |
(Optional) The low-pass filter threshold for processing the
consensus matrix (default: |
kmNStart |
nstart parameter passed to |
kmMax |
iter.max parameter passed to |
BPPARAM |
(Optional) A |
A consensus matrix summarizing the clustering results across multiple sub-datasets.
library(scater) library(BiocParallel) library(splatter) sce <- splatSimulate(group.prob = rep(1, 5)/5, sparsify = FALSE, batchCells=100, nGenes=1000, method = "groups", verbose = FALSE, dropout.type = "experiment") sce <- logNormCounts(sce) cores <- 2 logX <- as.matrix(logcounts(sce)) w <- rowVars_fast(logX, cores) corMat <- getCorM("spearman", logcounts(sce), w, cores) v <- doSVD(corMat, nCores=cores) BPPARAM = MulticoreParam(cores) consMtx <- runKM(logX, v, BPPARAM=bpparam())
library(scater) library(BiocParallel) library(splatter) sce <- splatSimulate(group.prob = rep(1, 5)/5, sparsify = FALSE, batchCells=100, nGenes=1000, method = "groups", verbose = FALSE, dropout.type = "experiment") sce <- logNormCounts(sce) cores <- 2 logX <- as.matrix(logcounts(sce)) w <- rowVars_fast(logX, cores) corMat <- getCorM("spearman", logcounts(sce), w, cores) v <- doSVD(corMat, nCores=cores) BPPARAM = MulticoreParam(cores) consMtx <- runKM(logX, v, BPPARAM=bpparam())
Computes imputed expression matrix using linear eq solver
solver(cm, em, ids, n_cores)
solver(cm, em, ids, n_cores)
cm |
processed consensus matrix |
em |
expression matrix |
ids |
location of values determined to be dropout events |
n_cores |
number of cores to use for parallel computation. |
imputed expression matrix
Fast Calculation of "Dropout" values
solver2(cm, em, ids, n_cores)
solver2(cm, em, ids, n_cores)
cm |
A numeric matrix representing the consensus matrix. |
em |
A dense numeric matrix representing the gene expression data, where rows are genes and columns are samples. |
ids |
An integer matrix specifying the row and column indices of entries for which to calculate importance scores. Each row of 'ids' should contain two integers: the row index (gene) and column index (sample) in the 'em' matrix. |
n_cores |
The number of cores to use for parallel processing. |
A numeric vector of imputed dropout values, corresponding to the entries specified in the 'ids' matrix.
Computes rankings for each column of a matrix in parallel.
sparseColRanks_fast(x, n_cores)
sparseColRanks_fast(x, n_cores)
x |
The input matrix to be ranked. |
n_cores |
The number of CPU cores to use for parallel processing. |
A matrix where each column contains the rankings for the corresponding column in the input matrix.
Fast Calculation of "Dropout" values
sparseSolver2(cm, em, ids, n_cores)
sparseSolver2(cm, em, ids, n_cores)
cm |
A numeric matrix representing the consensus matrix. |
em |
A sparse numeric matrix representing the gene expression data, where rows are genes and columns are samples. |
ids |
An integer matrix specifying the row and column indices of entries for which to calculate importance scores. Each row of 'ids' should contain two integers: the row index (gene) and column index (sample) in the 'em' matrix. |
n_cores |
The number of cores to use for parallel processing. |
A numeric vector of imputed dropout values, corresponding to the entries specified in the 'ids' matrix.
This function calculates weighted Pearson correlation matrix
wCor_fast(x, w, n_cores)
wCor_fast(x, w, n_cores)
x |
The input matrix (dense), where each column represents a set of observations. |
w |
A vector of weights, one for each observation (must have the same number of elements as rows in 'x'). |
n_cores |
The number of CPU cores to utilize for parallel computation. |
A weighted Pearson correlation matrix