Title: | Single-cell Interpretation via Multi-kernel LeaRning (SIMLR) |
---|---|
Description: | Single-cell RNA-seq technologies enable high-throughput gene expression measurement of individual cells, and allow the discovery of heterogeneity within cell populations. Measurement of cell-to-cell gene expression similarity is critical for the identification, visualization and analysis of cell populations. However, single-cell data introduce challenges to conventional measures of gene expression similarity because of the high level of noise, outliers and dropouts. We develop a novel similarity-learning framework, SIMLR (Single-cell Interpretation via Multi-kernel LeaRning), which learns an appropriate distance metric from the data for dimension reduction, clustering and visualization. |
Authors: | Daniele Ramazzotti [aut], Bo Wang [aut], Luca De Sano [cre, aut], Serafim Batzoglou [ctb] |
Maintainer: | Luca De Sano <[email protected]> |
License: | file LICENSE |
Version: | 1.33.0 |
Built: | 2024-11-20 06:22:06 UTC |
Source: | https://github.com/bioc/SIMLR |
example dataset to test SIMLR from the work by Buettner, Florian, et al.
data(BuettnerFlorian)
gene expression measurements of individual cells
list of 6: in_X = input dataset as an (m x n) matrix of gene expression measurements of individual cells, n_clust = number of clusters (number of distinct true labels), true_labs = ground truth cluster assignments for each of the n_clust clusters, seed = seed used to compute the results for the example, results = result by SIMLR for the inputs defined as described, nmi = normalized mutual information as a measure of the inferred clusters compared to the true labels
Buettner, Florian, et al. "Computational analysis of cell-to-cell heterogeneity in single-cell RNA-sequencing data reveals hidden subpopulations of cells." Nature biotechnology 33.2 (2015): 155-160.
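The nmi element compares inferred clusters with the true labels. As a rough illustration of what such a score measures, the following base R sketch computes a normalized mutual information between two label vectors. The helper name nmi and the normalization by sqrt(H(a) * H(b)) are assumptions made for this example; other normalizations exist, and the value stored in the dataset may have been computed differently.

```r
# Normalized mutual information between two label vectors (base R only).
# Assumption: normalization by sqrt(H(a) * H(b)); other conventions exist.
nmi <- function(a, b) {
  stopifnot(length(a) == length(b))
  joint <- table(a, b) / length(a)   # joint distribution of label pairs
  pa <- rowSums(joint)               # marginal distribution of a
  pb <- colSums(joint)               # marginal distribution of b
  nz <- joint > 0                    # skip empty cells to avoid log(0)
  mi <- sum(joint[nz] * log(joint[nz] / outer(pa, pb)[nz]))
  ha <- -sum(pa * log(pa))           # entropy of a
  hb <- -sum(pb * log(pb))           # entropy of b
  if (ha == 0 || hb == 0) return(NA_real_)  # undefined with a single cluster
  mi / sqrt(ha * hb)
}

truth    <- c(1, 1, 2, 2, 3, 3)
inferred <- c(2, 2, 3, 3, 1, 1)      # same partition, labels permuted
nmi(truth, inferred)
```

Because the score depends only on how the cells are grouped, permuting the cluster labels does not change it.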
perform the SIMLR clustering algorithm
SIMLR( X, c, no.dim = NA, k = 10, if.impute = FALSE, normalize = FALSE, cores.ratio = 1 )
X |
an (m x n) data matrix of gene expression measurements of individual cells or an object of class SCESet |
c |
number of clusters to be estimated over X |
no.dim |
number of dimensions |
k |
tuning parameter |
if.impute |
should I transpose the input data? |
normalize |
should I normalize the input data? |
cores.ratio |
ratio of the number of cores to be used when computing the multi-kernel |
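The examples in this manual pass cores.ratio = 0. One plausible reading, sketched here purely as an assumption, is that the ratio scales the number of available cores and the result never drops below one core. The helper cores_from_ratio and its rounding rule are hypothetical and are not taken from the package source.

```r
# Hypothetical helper: map a ratio in [0, 1] to a usable core count.
# Assumption: one core is left free and at least one core is always used.
cores_from_ratio <- function(ratio, available = parallel::detectCores()) {
  if (is.na(available)) available <- 1L        # detectCores() may return NA
  cores <- as.integer(ratio * (available - 1)) # truncates toward zero
  max(cores, 1L)
}

cores_from_ratio(0)    # falls back to a single core
cores_from_ratio(1)    # all but one of the detected cores (minimum 1)
```

Under this reading, a ratio of 0 simply requests serial execution.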
clusters the cells based on SIMLR and their similarities
list of 8 elements describing the clusters obtained by SIMLR, of which y are the resulting clusters: y = results of k-means clustering, S = similarities computed by SIMLR, F = results from network diffusion, ydata = data referring to the results by k-means, alphaK = clustering coefficients, execution.time = execution time of the present run, converge = iterative convergence values by t-SNE, LF = parameters of the clustering
data(BuettnerFlorian)
SIMLR(X = BuettnerFlorian$in_X, c = BuettnerFlorian$n_clust, cores.ratio = 0)
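A slightly longer sketch of the same call may help when exploring the returned list. It assumes the SIMLR package is installed (the block is skipped otherwise), uses an arbitrary seed chosen for this example, and assumes that y is a k-means-style object exposing a cluster vector; check the structure printed by str() before relying on that.

```r
# Runs only if the SIMLR package is available.
if (requireNamespace("SIMLR", quietly = TRUE)) {
  library(SIMLR)
  data(BuettnerFlorian)

  set.seed(1)  # arbitrary seed for this sketch, not the one stored in the dataset
  res <- SIMLR(X = BuettnerFlorian$in_X,
               c = BuettnerFlorian$n_clust,
               cores.ratio = 0)

  str(res, max.level = 1)       # overview of the returned elements

  # Assumption: y behaves like a stats::kmeans result with a `cluster` vector
  clusters <- res$y$cluster
  print(table(clusters))        # number of cells assigned to each cluster
}
```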
estimate the number of clusters by means of two heuristics as discussed in the SIMLR paper
SIMLR_Estimate_Number_of_Clusters(X, NUMC = 2:5, cores.ratio = 1)
X |
an (m x n) data matrix of gene expression measurements of individual cells |
NUMC |
vector of number of clusters to be considered |
cores.ratio |
ratio of the number of cores to be used when computing the multi-kernel |
a list of 2 elements: K1 and K2, each giving an estimate of the best number of clusters (lower values are better) as discussed in the original SIMLR paper
data(BuettnerFlorian)
SIMLR_Estimate_Number_of_Clusters(BuettnerFlorian$in_X, NUMC = 2:5, cores.ratio = 0)
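Since lower values of K1 and K2 are better, a candidate number of clusters can be read off with which.min. The sketch below assumes that K1 and K2 are numeric vectors aligned element by element with NUMC; the helper pick_num_clusters is hypothetical and not part of the package.

```r
# Hypothetical helper: choose the candidate with the lowest heuristic value.
# Assumption: est$K1 and est$K2 are numeric vectors aligned with NUMC.
pick_num_clusters <- function(est, NUMC) {
  stopifnot(length(est$K1) == length(NUMC),
            length(est$K2) == length(NUMC))
  list(by_K1 = NUMC[which.min(est$K1)],
       by_K2 = NUMC[which.min(est$K2)])
}

# Usage, assuming the SIMLR package is installed:
# data(BuettnerFlorian, package = "SIMLR")
# est <- SIMLR::SIMLR_Estimate_Number_of_Clusters(BuettnerFlorian$in_X,
#                                                 NUMC = 2:5, cores.ratio = 0)
# pick_num_clusters(est, NUMC = 2:5)
```

The two heuristics need not agree, so it is worth inspecting both values rather than relying on one.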
perform the SIMLR feature ranking algorithm. This takes as input the original input data and the corresponding similarity matrix computed by SIMLR
SIMLR_Feature_Ranking(A, X)
A |
an (n x n) similarity matrix by SIMLR |
X |
an (m x n) data matrix of gene expression measurements of individual cells |
a list of 2 elements: pvalues and ranking ordering over the n covariates as estimated by the method
data(BuettnerFlorian)
SIMLR_Feature_Ranking(A = BuettnerFlorian$results$S, X = BuettnerFlorian$in_X)
perform the SIMLR clustering algorithm for large scale datasets
SIMLR_Large_Scale(X, c, k = 10, kk = 100, if.impute = FALSE, normalize = FALSE)
X |
an (m x n) data matrix of gene expression measurements of individual cells or an object of class SCESet |
c |
number of clusters to be estimated over X |
k |
tuning parameter |
kk |
number of principal components to be assessed in the PCA |
if.impute |
should I transpose the input data? |
normalize |
should I normalize the input data? |
clusters the cells based on SIMLR Large Scale and their similarities
list of 8 elements describing the clusters obtained by SIMLR, of which y are the resulting clusters: y = results of k-means clustering, S0 = similarities computed by SIMLR, F = results from the large scale iterative procedure, ydata = data referring to the results by k-means, alphaK = clustering coefficients, val = distances from the k-nearest neighbour search, ind = indices from the k-nearest neighbour search, execution.time = execution time of the present run
data(ZeiselAmit)
resized = ZeiselAmit$in_X[, 1:340]
SIMLR_Large_Scale(X = resized, c = ZeiselAmit$n_clust, k = 5, kk = 5)
example dataset to test SIMLR large scale. This is a reduced version of the dataset from the work by Zeisel, Amit, et al.
data(ZeiselAmit)
gene expression measurements of individual cells
list of 6: in_X = input dataset as an (m x n) matrix of gene expression measurements of individual cells, n_clust = number of clusters (number of distinct true labels), true_labs = ground truth cluster assignments for each of the n_clust clusters, seed = seed used to compute the results for the example, results = result by SIMLR for the inputs defined as described, nmi = normalized mutual information as a measure of the inferred clusters compared to the true labels
Zeisel, Amit, et al. "Cell types in the mouse cortex and hippocampus revealed by single-cell RNA-seq." Science 347.6226 (2015): 1138-1142.