Title: | FEAture SelcTion (FEAST) for Single-cell clustering |
---|---|
Description: | Cell clustering is one of the most important and commonly performed tasks in single-cell RNA sequencing (scRNA-seq) data analysis. An important step in cell clustering is to select a subset of genes (referred to as “features”), whose expression patterns will then be used for downstream clustering. A good set of features should include the ones that distinguish different cell types, and the quality of such set could have significant impact on the clustering accuracy. FEAST is an R library for selecting most representative features before performing the core of scRNA-seq clustering. It can be used as a plug-in for the etablished clustering algorithms such as SC3, TSCAN, SHARP, SIMLR, and Seurat. The core of FEAST algorithm includes three steps: 1. consensus clustering; 2. gene-level significance inference; 3. validation of an optimized feature set. |
Authors: | Kenong Su [aut, cre], Hao Wu [aut] |
Maintainer: | Kenong Su <[email protected]> |
License: | GPL-2 |
Version: | 1.15.0 |
Built: | 2025-01-05 06:07:08 UTC |
Source: | https://github.com/bioc/FEAST |
Align the cell types from the prediction with the truth.
align_CellType(tt0)
align_CellType(tt0)
tt0 |
a N*N table. |
the matched (re-ordered) table
vec1 = rep(1:4, each=100) vec2 = sample(vec1) tb = table(vec1, vec2) #tb_arg = align_CellType(tb)
vec1 = rep(1:4, each=100) vec2 = sample(vec1) tb = table(vec1, vec2) #tb_arg = align_CellType(tb)
Calculate the gene-level F score and corresponding significance level.
cal_F2(Y, classes)
cal_F2(Y, classes)
Y |
A gene expression matrix |
classes |
The initial cluster labels NA values are allowed. This can directly from the |
The score vector
data(Yan) cal_F2(Y, classes = trueclass)
data(Yan) cal_F2(Y, classes = trueclass)
Calculate the gene-level fisher score.
cal_Fisher2(Y, classes)
cal_Fisher2(Y, classes)
Y |
A gene expression matrix |
classes |
The initial cluster labels NA values are allowed. This can directly from the |
The score vector This is from the paper https://arxiv.org/pdf/1202.3725.pdf Vector based calculation
Calculate 3 metrics and these methods are exported in C codes. flag = 1 — Rand index, flag = 2 — Fowlkes and Mallows's index, flag = 3 — Jaccard index
cal_metrics(cl1, cl2, randMethod = c("Rand", "FM", "Jaccard"))
cal_metrics(cl1, cl2, randMethod = c("Rand", "FM", "Jaccard"))
cl1 |
a vector |
cl2 |
a vector |
randMethod |
a string chosen from "Rand", "FM", or "Jaccard" |
a numeric vector including three values
Standard way to preprocess the count matrix. It is the QC step for the genes.
cal_MSE(Ynorm, cluster, return_mses = FALSE)
cal_MSE(Ynorm, cluster, return_mses = FALSE)
Ynorm |
A normalized gene expression matrix. If not, we will normalize it for you. |
cluster |
The clustering outcomes. Specifically, they are cluster labels. |
return_mses |
True or False indicating whether returning the MSE. |
The MSE of the clustering centers with the predicted Y.
data(Yan) Ynorm = Norm_Y(Y) cluster = trueclass MSE_res = cal_MSE(Ynorm, cluster)
data(Yan) Ynorm = Norm_Y(Y) cluster = trueclass MSE_res = cal_MSE(Ynorm, cluster)
Consensus Clustering
Consensus(Y, num_pcs = 10, top_pctg = 0.33, k = 2, thred = 0.9, nProc = 1)
Consensus(Y, num_pcs = 10, top_pctg = 0.33, k = 2, thred = 0.9, nProc = 1)
Y |
A expression matrix. It is recommended to use the raw count matrix. Users can input normalized matrix directly. |
num_pcs |
The number of top pcs that will be investigated on through consensus clustering. |
top_pctg |
Top percentage of features for dimension reduction |
k |
The number of input clusters (best guess). |
thred |
For the final GMM clustering, the probability of a cell belonging to a certain cluster. |
nProc |
number of cores for BiocParallel enviroment. |
the clustering labels and the featured genes.
data(Yan) set.seed(123) rixs = sample(nrow(Y), 500) cixs = sample(ncol(Y), 40) Y = Y[rixs, cixs] con = Consensus(Y, k=5)
data(Yan) set.seed(123) rixs = sample(nrow(Y), 500) cixs = sample(ncol(Y), 40) Y = Y[rixs, cixs] con = Consensus(Y, k=5)
Calculate the a series of the evaluation statistics.
eval_Cluster(vec1, vec2)
eval_Cluster(vec1, vec2)
vec1 |
a vector. |
vec2 |
a vector. x and y are with the same length. |
a vector of evaluation metrics
vec2 = vec1 = rep(1:4, each = 100) vec2[1:10] = 4 acc = eval_Cluster(vec1, vec2)
vec2 = vec1 = rep(1:4, each = 100) vec2[1:10] = 4 acc = eval_Cluster(vec1, vec2)
FEAST main function
FEAST( Y, k = 2, num_pcs = 10, dim_reduce = c("irlba", "svd", "pca"), split = FALSE, batch_size = 1000, nProc = 1 )
FEAST( Y, k = 2, num_pcs = 10, dim_reduce = c("irlba", "svd", "pca"), split = FALSE, batch_size = 1000, nProc = 1 )
Y |
A expression matrix. Raw count matrix or normalized matrix. |
k |
The number of input clusters (best guess). |
num_pcs |
The number of top pcs that will be investigated through the consensus clustering. |
dim_reduce |
dimension reduction methods chosen from pca, svd, or irlba. |
split |
boolean. If T, using subsampling to calculate the gene-level significance. |
batch_size |
when split is true, need to claim the batch size for spliting the cells. |
nProc |
number of cores for BiocParallel enviroment. |
the rankings of the gene-significance.
data(Yan) k = length(unique(trueclass)) set.seed(123) rixs = sample(nrow(Y), 500) cixs = sample(ncol(Y), 40) Y = Y[rixs, cixs] ixs = FEAST(Y, k=k)
data(Yan) k = length(unique(trueclass)) set.seed(123) rixs = sample(nrow(Y), 500) cixs = sample(ncol(Y), 40) Y = Y[rixs, cixs] ixs = FEAST(Y, k=k)
FEAST main function (fast version)
FEAST_fast(Y, k = 2, num_pcs = 10, split = FALSE, batch_size = 1000, nProc = 1)
FEAST_fast(Y, k = 2, num_pcs = 10, split = FALSE, batch_size = 1000, nProc = 1)
Y |
A expression matrix. Raw count matrix or normalized matrix. |
k |
The number of input clusters (best guess). |
num_pcs |
The number of top pcs that will be investigated through the consensus clustering. |
split |
boolean. If T, using subsampling to calculate the gene-level significance. |
batch_size |
when split is true, need to claim the batch size for spliting the cells. |
nProc |
number of cores for BiocParallel enviroment. |
the rankings of the gene-significance.
data(Yan) k = length(unique(trueclass)) res = FEAST_fast(Y, k=k)
data(Yan) k = length(unique(trueclass)) res = FEAST_fast(Y, k=k)
Normalize the count expression matrix by the size factor and take the log transformation.
Norm_Y(Y)
Norm_Y(Y)
Y |
a count expression matrix |
a normalized matrix
data(Yan) Ynorm = Norm_Y(Y)
data(Yan) Ynorm = Norm_Y(Y)
Standard way to preprocess the count matrix. It is the QC step for the genes.
process_Y(Y, thre = 2)
process_Y(Y, thre = 2)
Y |
A gene expression data (Raw count matrix) |
thre |
The threshold of minimum number of cells expressing a certain gene (default =2) |
A processed gene expression matrix. It is not log transformed
data(Yan) YY = process_Y(Y, thre=2)
data(Yan) YY = process_Y(Y, thre=2)
Calculate the purity between two vectors.
Purity(x, y)
Purity(x, y)
x |
a vector. |
y |
a vector. x and y are with the same length. |
the purity score
SC3 Clustering
SC3_Clust(Y, k = NULL, input_markers = NULL)
SC3_Clust(Y, k = NULL, input_markers = NULL)
Y |
A expression matrix. It is recommended to use the raw count matrix. |
k |
The number of clusters. If it is not provided, k is estimated by the default method in SC3. |
input_markers |
A character vector including the featured genes. If they are not presented, SC3 will take care of this. |
the clustering labels and the featured genes.
Using clustering results based on feature selection to perform model selection.
Select_Model_short_SC3(Y, cluster, tops = c(500, 1000, 2000))
Select_Model_short_SC3(Y, cluster, tops = c(500, 1000, 2000))
Y |
A gene expression matrix |
cluster |
The initial cluster labels NA values are allowed. This can directly from the |
tops |
A numeric vector containing a list of numbers corresponding to top genes; e.g., tops = c(500, 1000, 2000). |
mse and the SC3 clustering result.
data(Yan) k = length(unique(trueclass)) Y = process_Y(Y, thre = 2) # preprocess the data set.seed(123) rixs = sample(nrow(Y), 500) cixs = sample(ncol(Y), 40) Y = Y[rixs, cixs] con_res = Consensus(Y, k=k) # not run # mod_res = Select_Model_short_SC3(Y, cluster = con_res$cluster, top = c(100, 200))
data(Yan) k = length(unique(trueclass)) Y = process_Y(Y, thre = 2) # preprocess the data set.seed(123) rixs = sample(nrow(Y), 500) cixs = sample(ncol(Y), 40) Y = Y[rixs, cixs] con_res = Consensus(Y, k=k) # not run # mod_res = Select_Model_short_SC3(Y, cluster = con_res$cluster, top = c(100, 200))
Using clustering results (from TSCAN) based on feature selection to perform model selection.
Select_Model_short_TSCAN( Y, cluster, minexpr_percent = 0.5, cvcutoff = 1, tops = c(500, 1000, 2000) )
Select_Model_short_TSCAN( Y, cluster, minexpr_percent = 0.5, cvcutoff = 1, tops = c(500, 1000, 2000) )
Y |
A gene expression matrix |
cluster |
The initial cluster labels NA values are allowed. This can directly from the |
minexpr_percent |
The threshold used for processing data in TSCAN. Using it by default. |
cvcutoff |
The threshold used for processing data in TSCAN. Using it by default. |
tops |
A numeric vector containing a list of numbers corresponding to top genes; e.g., tops = c(500, 1000, 2000). |
mse and the TSCAN clustering result.
data(Yan) k = length(unique(trueclass)) Y = process_Y(Y, thre = 2) # preprocess the data set.seed(123) rixs = sample(nrow(Y), 500) cixs = sample(ncol(Y), 40) Y = Y[rixs, cixs] con_res = Consensus(Y, k=k) # not run # mod_res = Select_Model_short_TSCAN(Y, cluster = con_res$cluster, top = c(100, 200))
data(Yan) k = length(unique(trueclass)) Y = process_Y(Y, thre = 2) # preprocess the data set.seed(123) rixs = sample(nrow(Y), 500) cixs = sample(ncol(Y), 40) Y = Y[rixs, cixs] con_res = Consensus(Y, k=k) # not run # mod_res = Select_Model_short_TSCAN(Y, cluster = con_res$cluster, top = c(100, 200))
The true cell type labels for Yan dataset. It includes 8 different cell types.
data("Yan")
data("Yan")
A character vector contains the cell type label
https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE36552
Yan, Liying, et al. "Single-cell RNA-Seq profiling of human preimplantation embryos and embryonic stem cells." Nature structural & molecular biology 20.9 (2013): 1131.
data("Yan") table(trueclass)
data("Yan") table(trueclass)
TSCAN Clustering
TSCAN_Clust(Y, k, minexpr_percent = 0.5, cvcutoff = 1, input_markers = NULL)
TSCAN_Clust(Y, k, minexpr_percent = 0.5, cvcutoff = 1, input_markers = NULL)
Y |
A expression matrix. It is recommended to use the raw count matrix. |
k |
The number of clusters. If it is not provided, k is estimated by the default method in SC3. |
minexpr_percent |
minimum expression threshold (default = 0.5). |
cvcutoff |
the cv cutoff to filter the genes (default = 1). |
input_markers |
A character vector including the featured genes. If they are not presented, SC3 will take care of this. |
the clustering labels and the featured genes.
data(Yan) k = length(unique(trueclass)) # TSCAN_res = TSCAN_Clust(Y, k=k)
data(Yan) k = length(unique(trueclass)) # TSCAN_res = TSCAN_Clust(Y, k=k)
function for convert a vector to a binary matrix
vector2matrix(vec)
vector2matrix(vec)
vec |
a vector. |
a n by n binary matrix indicating the adjacency.
Using clustering results based on feature selection to perform model selection.
Visual_Rslt(model_cv_res, trueclass)
Visual_Rslt(model_cv_res, trueclass)
model_cv_res |
model selection result from |
trueclass |
The real class labels |
a list of mse dataframe, clustering accuracy dataframe, and ggplot object.
data(Yan) k = length(unique(trueclass)) Y = process_Y(Y, thre = 2) # preprocess the data set.seed(123) rixs = sample(nrow(Y), 500) cixs = sample(ncol(Y), 40) Y = Y[rixs, ] con_res = Consensus(Y, k=k) # Not run # mod_res = Select_Model_short_SC3(Y, cluster = con_res$cluster, top = c(100, 200)) library(ggpubr) # Visual_Rslt(model_cv_res = mod_res, trueclass = trueclass)
data(Yan) k = length(unique(trueclass)) Y = process_Y(Y, thre = 2) # preprocess the data set.seed(123) rixs = sample(nrow(Y), 500) cixs = sample(ncol(Y), 40) Y = Y[rixs, ] con_res = Consensus(Y, k=k) # Not run # mod_res = Select_Model_short_SC3(Y, cluster = con_res$cluster, top = c(100, 200)) library(ggpubr) # Visual_Rslt(model_cv_res = mod_res, trueclass = trueclass)
Y is a count expression matrix which belongs to "matrix" class. The data includes 124 cells about human preimplantation embryos and embryonic stem cells. It contains 19304 genes after removing genes with extreme high dropout rate.
data("Yan")
data("Yan")
An object of "matrix" class contains the count expressions
https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE36552
Yan, Liying, et al. "Single-cell RNA-Seq profiling of human preimplantation embryos and embryonic stem cells." Nature structural & molecular biology 20.9 (2013): 1131.
data("Yan") Y[1:10, 1:4]
data("Yan") Y[1:10, 1:4]