Title: | Rank Constrained Similarity Learning for single cell RNA sequencing data |
---|---|
Description: | A novel clustering algorithm and toolkit RCSL (Rank Constrained Similarity Learning) to accurately identify various cell types using scRNA-seq data from a complex tissue. RCSL considers both lo-cal similarity and global similarity among the cells to discern the subtle differences among cells of the same type as well as larger differences among cells of different types. RCSL uses Spearman’s rank correlations of a cell’s expression vector with those of other cells to measure its global similar-ity, and adaptively learns neighbour representation of a cell as its local similarity. The overall similar-ity of a cell to other cells is a linear combination of its global similarity and local similarity. |
Authors: | Qinglin Mei [cre, aut], Guojun Li [fnd], Zhengchang Su [fnd] |
Maintainer: | Qinglin Mei <[email protected]> |
License: | Artistic-2.0 |
Version: | 1.15.0 |
Built: | 2024-10-31 04:23:13 UTC |
Source: | https://github.com/bioc/RCSL |
Cell type annotations of 'yan' datasets by Yan et al.
ann
ann
An object of class data.frame
with 90 rows and 1 columns.
http://dx.doi.org/10.1038/nsmb.2660
Each row corresponds to one cell of 'yan' dataset
Calculate the bolock-diagnal matrix B min_B>=0, B*1=1, F'*F=I ||B - A||_1 + r*||B||^2 + 2*lambda*trace(F'*L*F)
BDSM(S, C)
BDSM(S, C)
S |
the calculated initial similarity matrix S |
C |
the estimated number of clusters C |
B block-diagonal matrix
y clustering results
gfData <- GenesFilter(yan) res_SimS <- SimS(gfData) C <- EstClusters(res_SimS$drData,res_SimS$S) BDSM(res_SimS$S,C)
gfData <- GenesFilter(yan) res_SimS <- SimS(gfData) C <- EstClusters(res_SimS$drData,res_SimS$S) BDSM(res_SimS$S,C)
Solve the problem: min 1/2*x'*L*x-x'*d s.t. x>=0, 1'x=1
EProjSimplexdiag(d, l)
EProjSimplexdiag(d, l)
d |
matrix or vector |
l |
matrix or vector |
x
Estimate the optimal number of clusters C for clustering
EstClusters(drData, S)
EstClusters(drData, S)
drData |
gene expression matrix after PCA processing |
S |
the calculated similarity matrix S from "SimS" |
C the estimated number of clusters
gfData <- GenesFilter(yan) res_SimS <- SimS(gfData) EstClusters(res_SimS$drData,res_SimS$S)
gfData <- GenesFilter(yan) res_SimS <- SimS(gfData) EstClusters(res_SimS$drData,res_SimS$S)
Solve the problem: ||A-B||^2 = ||A||^2 + ||B||^2 - 2*A'*B
EucDist(A, B)
EucDist(A, B)
A |
matrix or vector |
B |
matrix or vector |
d matrix or vector
Perform the step of gene filtering to normalizaed gene expression data
GenesFilter(data, gfRatio = 0.025)
GenesFilter(data, gfRatio = 0.025)
data |
the normalized gene expression matrix |
gfRatio |
the ratio of genes filtering |
the gene expression matrix after genes filtering gfData
data(yan) GenesFilter(yan)
data(yan) GenesFilter(yan)
Infer the development lineage based on the clustering results from RCSL and the pseudotime
getLineage(drData, clustRes, pseudoTime, simMeasure = "kendall")
getLineage(drData, clustRes, pseudoTime, simMeasure = "kendall")
drData |
preprocessed gene expression data (each column represent a cell) |
clustRes |
the clustering results identified by RCSL |
pseudoTime |
inferred by PlotPseudoTime() using the similarity matrix S and starting cell |
simMeasure |
the calculation method of measuring the cluster centers' similarity |
lineage the cell lineages connected all the cluster centers based on the clustering results from RCSL
gfData <- GenesFilter(yan) TrueLabel <- ann$cell_type1 res_SimS <- SimS(gfData) C <- EstClusters(res_SimS$drData,res_SimS$S) res_BDSM <- BDSM(res_SimS$S,C) Pseudo <- PlotPseudoTime(res_SimS$S,TrueLabel,startPoint=1) getLineage(res_SimS$drData,res_BDSM$y,Pseudo$pseudoTime)
gfData <- GenesFilter(yan) TrueLabel <- ann$cell_type1 res_SimS <- SimS(gfData) C <- EstClusters(res_SimS$drData,res_SimS$S) res_BDSM <- BDSM(res_SimS$S,C) Pseudo <- PlotPseudoTime(res_SimS$S,TrueLabel,startPoint=1) getLineage(res_SimS$drData,res_BDSM$y,Pseudo$pseudoTime)
Calculate the neighbor representation of cells to the low-dimensional gene expression matrix
NeigRepresent( drData, NN.method = "KNN", Dis.method = "Euclidean", LSH.TreeNum = 30, LSH.Dim = 500, LSH.Dis = "angular", neiRatio = 0.65 )
NeigRepresent( drData, NN.method = "KNN", Dis.method = "Euclidean", LSH.TreeNum = 30, LSH.Dim = 500, LSH.Dis = "angular", neiRatio = 0.65 )
drData |
gene expression matrix after dimensionality reduced by PCA |
NN.method |
the method of finding neighbors |
Dis.method |
the distance metric in finding neighbors |
LSH.TreeNum |
the tree number of LSH |
LSH.Dim |
the dimension in LSH |
LSH.Dis |
the distance metric in LSH |
neiRatio |
ratio of the number of selected |
the similarity matrix measured by neighbor representation NR
gfData <- GenesFilter(yan) res_SimS <- SimS(gfData) NeigRepresent(res_SimS$drData)
gfData <- GenesFilter(yan) res_SimS <- SimS(gfData) NeigRepresent(res_SimS$drData)
Plot the visualization of constructed Minimum Spanning Tree based on the clustering results of RCSL
PlotMST( drData, clustRes, TrueLabel, dataName = "", fontSize = 12, VisualMethod = "umap" )
PlotMST( drData, clustRes, TrueLabel, dataName = "", fontSize = 12, VisualMethod = "umap" )
drData |
preprocessed gene expression data |
clustRes |
the clustering results identified by RCSL |
TrueLabel |
the real cell types to color the dots in plot |
dataName |
the name of the data that will be showed in the plot |
fontSize |
the font size of the plot |
VisualMethod |
the method for 2D visualization including UMAP,t-SNE and PCA |
MSTPlot ggplot object of the visualization of constructed MST
gfData <- GenesFilter(yan) TrueLabel <- ann$cell_type1 res_SimS <- SimS(gfData) C <- EstClusters(res_SimS$drData,res_SimS$S) res_BDSM <- BDSM(res_SimS$S,C) PlotMST(res_SimS$drData,res_BDSM$y,TrueLabel)
gfData <- GenesFilter(yan) TrueLabel <- ann$cell_type1 res_SimS <- SimS(gfData) C <- EstClusters(res_SimS$drData,res_SimS$S) res_BDSM <- BDSM(res_SimS$S,C) PlotMST(res_SimS$drData,res_BDSM$y,TrueLabel)
Infer the pseudo-temporal ordering between the cell types using the distance from a cell type to the predefined starting cell type.
PlotPseudoTime( S, TrueLabel, startPoint, fontSize = 12, dataName = "", sim = TRUE )
PlotPseudoTime( S, TrueLabel, startPoint, fontSize = 12, dataName = "", sim = TRUE )
S |
the similarity matrix calculated by SimS() function |
TrueLabel |
the real cell types used to indicate the vertical axis |
startPoint |
the posiition of the starting cell in the matrix |
fontSize |
the font size of the plot |
dataName |
the name of the data that will be showed in the plot |
sim |
indicate the input data is simialrity matrix or not |
PstudoTime
PseudoTimePlot ggplot object of the pseudo-temporal ordering of cells
gfData <- GenesFilter(yan) TrueLabel <- ann$cell_type1 res_SimS <- SimS(gfData) PlotPseudoTime(res_SimS$S,TrueLabel,startPoint=1)
gfData <- GenesFilter(yan) TrueLabel <- ann$cell_type1 res_SimS <- SimS(gfData) PlotPseudoTime(res_SimS$S,TrueLabel,startPoint=1)
Infer the developmental trajectories based on the clustering results from RCSL
PlotTrajectory( gfData, clustRes, TrueLabel, lineage, fontSize = 12, dataName = "", VisualMethod = "umap" )
PlotTrajectory( gfData, clustRes, TrueLabel, lineage, fontSize = 12, dataName = "", VisualMethod = "umap" )
gfData |
preprocessed gene expression data (each column represent a cell) |
clustRes |
the clustering results identified by RCSL |
TrueLabel |
the real cell types |
lineage |
the lineage obtained by getLineage() |
fontSize |
the size of font in the plot |
dataName |
the name of the data that will be showed in the plot |
VisualMethod |
the display method of 2-D visualization |
TrajectoryPlot ggplot object of the inferred developmental trajectories
gfData <- GenesFilter(yan) TrueLabel <- ann$cell_type1 res_SimS <- SimS(gfData) C <- EstClusters(res_SimS$drData,res_SimS$S) res_BDSM <- BDSM(res_SimS$S,C) Pseudo <- PlotPseudoTime(res_SimS$S,TrueLabel,startPoint=1) Linea <- getLineage(res_SimS$drData,res_BDSM$y,Pseudo$pseudoTime) PlotTrajectory(gfData,res_BDSM$y,TrueLabel,lineage=Linea)
gfData <- GenesFilter(yan) TrueLabel <- ann$cell_type1 res_SimS <- SimS(gfData) C <- EstClusters(res_SimS$drData,res_SimS$S) res_BDSM <- BDSM(res_SimS$S,C) Pseudo <- PlotPseudoTime(res_SimS$S,TrueLabel,startPoint=1) Linea <- getLineage(res_SimS$drData,res_BDSM$y,Pseudo$pseudoTime) PlotTrajectory(gfData,res_BDSM$y,TrueLabel,lineage=Linea)
Perform the RCSL program
RCSL( data, GF = TRUE, gfRatio = 0.025, pcRatio = 0.95, NN.method = "KNN", Dis.method = "Euclidean", neiRatio = 0.65 )
RCSL( data, GF = TRUE, gfRatio = 0.025, pcRatio = 0.95, NN.method = "KNN", Dis.method = "Euclidean", neiRatio = 0.65 )
data |
normalizaed gene expression matrix(each column represents a cell) |
GF |
should I need the gene filter step? |
gfRatio |
the ratio of the gene filter |
pcRatio |
the ratio between the variance of the |
NN.method |
the method of finding neighbors |
Dis.method |
the distance metric in finding neighbors |
neiRatio |
ratio of the number of selected |
gfData gene expression matrix after genes filtering
B block-diagonal matrix
C estimated number of clusters
y clustering results
data(yan) data <- log2(yan+1) RCSL(yan[,1:20])
data(yan) data <- log2(yan+1) RCSL(yan[,1:20])
Calculate the initial similarity matrix
SimS( data, pcRatio = 0.95, gamma = 0.8, NN.method = "KNN", Dis.method = "Euclidean", LSH.TreeNum = 30, LSH.Dim = 1000, LSH.Dis = "angular", neiRatio = 0.65 )
SimS( data, pcRatio = 0.95, gamma = 0.8, NN.method = "KNN", Dis.method = "Euclidean", LSH.TreeNum = 30, LSH.Dim = 1000, LSH.Dis = "angular", neiRatio = 0.65 )
data |
gene expression matrix after genes filtering |
pcRatio |
the ratio between the variance of the |
gamma |
the ratio of the global simialrity |
NN.method |
the method of finding neighbors |
Dis.method |
the distance metric in finding neighbors |
LSH.TreeNum |
the tree number of LSH |
LSH.Dim |
the dimension in LSH |
LSH.Dis |
the distance metric in LSH |
neiRatio |
ratio of the number of selected |
initial similarity matrix S
gene expression matrix after PCA processing drData
gfData <- GenesFilter(yan) SimS(gfData)
gfData <- GenesFilter(yan) SimS(gfData)
Trajectory analysis
TrajectoryAnalysis( gfData, drData, S, clustRes, fontSize = 12, TrueLabel, startPoint, dataName = "", sim = TRUE, simMeasure = "kendall", VisualMethod = "umap" )
TrajectoryAnalysis( gfData, drData, S, clustRes, fontSize = 12, TrueLabel, startPoint, dataName = "", sim = TRUE, simMeasure = "kendall", VisualMethod = "umap" )
gfData |
preprocessed gene expression data (each column represent a cell) |
drData |
preprocessed gene expression data (each column represent a cell) |
S |
the similarity matrix calculated by SimS() function |
clustRes |
the clustering results identified by RCSL |
fontSize |
the size of font in the plot |
TrueLabel |
the real cell types used to indicate the vertical axis |
startPoint |
the posiition of the starting cell in the matrix |
dataName |
the name of the data that will be showed in the plot |
sim |
indicate the input data is simialrity matrix or not |
simMeasure |
the calculation method of measuring the cluster centers' similarity |
VisualMethod |
the display method of 2-D visualization |
PseudoTimePlot, MSTPlot, TrajectoryPlot
gfData <- GenesFilter(yan) TrueLabel <- ann$cell_type1 res_SimS <- SimS(gfData) C <- EstClusters(res_SimS$drData,res_SimS$S) res_BDSM <- BDSM(res_SimS$S,C) TrajectoryAnalysis(gfData,res_SimS$drData,res_SimS$S,res_BDSM$y, TrueLabel=TrueLabel,startPoint=1)
gfData <- GenesFilter(yan) TrueLabel <- ann$cell_type1 res_SimS <- SimS(gfData) C <- EstClusters(res_SimS$drData,res_SimS$S) res_BDSM <- BDSM(res_SimS$S,C) TrajectoryAnalysis(gfData,res_SimS$drData,res_SimS$S,res_BDSM$y, TrueLabel=TrueLabel,startPoint=1)
A public scRNA-seq dataset by Yan et al.
yan
yan
An object of class data.frame
with 20214 rows and 90 columns.
http://dx.doi.org/10.1038/nsmb.2660
Columns represent cells, rows represent genes expression values.