A common application of single-cell RNA sequencing (RNA-seq) data is
to identify discrete cell types. To take advantage of the large
collection of well-annotated scRNA-seq datasets, scClassify
package implements a set of methods to perform accurate cell type
classification based on ensemble learning and sample size
calculation. This vignette demonstrates the usage of
scClassify
, providing a pithy description of each method
with workable examples.
First, install scClassify
via
BiocManager
.
We assume that you have log-transformed (size-factor normalized) matrices where each row is a gene and each column a cell for a reference dataset and a query dataset. For demonstration purposes, we will take a subset of single-cell pancreas datasets from two independent studies (Wang et al., and Xin et al.).
library("scClassify")
data("scClassify_example")
xin_cellTypes <- scClassify_example$xin_cellTypes
exprsMat_xin_subset <- scClassify_example$exprsMat_xin_subset
wang_cellTypes <- scClassify_example$wang_cellTypes
exprsMat_wang_subset <- scClassify_example$exprsMat_wang_subset
exprsMat_xin_subset <- as(exprsMat_xin_subset, "dgCMatrix")
exprsMat_wang_subset <- as(exprsMat_wang_subset, "dgCMatrix")
The original cell type annotations and compositions of the example datasets can be easily accessed as shown below.
table(xin_cellTypes)
#> xin_cellTypes
#> alpha beta delta gamma
#> 285 261 49 79
table(wang_cellTypes)
#> wang_cellTypes
#> acinar alpha beta delta ductal gamma stellate
#> 5 206 118 10 96 21 45
We can see that Xin et al. data only have 4 cell types, while Wang et al. has 7 cell types.
We first perform non-ensemble scClassify
by using Xin et
al. as our reference dataset and Wang et al. data as ur query data. We
use WKNN
as the KNN algorithm, DE
(differential expression genes) as the gene selection method, and lastly
pearson
as the similarity calculation method.
scClassify_res <- scClassify(exprsMat_train = exprsMat_xin_subset,
cellTypes_train = xin_cellTypes,
exprsMat_test = list(wang = exprsMat_wang_subset),
cellTypes_test = list(wang = wang_cellTypes),
tree = "HOPACH",
algorithm = "WKNN",
selectFeatures = c("limma"),
similarity = c("pearson"),
returnList = FALSE,
verbose = FALSE)
We can check the cell type tree generated by the reference data:
scClassify_res$trainRes
#> Class: scClassifyTrainModel
#> Model name: training
#> Feature selection methods: limma
#> Number of cells in the training data: 674
#> Number of cell types in the training data: 4
plotCellTypeTree(cellTypeTree(scClassify_res$trainRes))
Noted that scClassify_res$trainRes
is a
scClassifyTrainModel
class.
Check the prediction results.
table(scClassify_res$testRes$wang$pearson_WKNN_limma$predRes, wang_cellTypes)
#> wang_cellTypes
#> acinar alpha beta delta ductal gamma stellate
#> alpha 0 206 0 0 0 2 0
#> beta 0 0 118 0 1 0 0
#> beta_delta_gamma 0 0 0 0 25 0 0
#> delta 0 0 0 10 0 0 0
#> gamma 0 0 0 0 0 19 0
#> unassigned 5 0 0 0 70 0 45
We next perform ensemble scClassify
by using Xin et al.
as our reference dataset and Wang et al. data as our query data. We use
WKNN
as the KNN algorithm, DE
as the gene
selection method, and pearson
and spearman
as
the similarity calculation methods. Thus, we will generate two
combinations of gene selection models and similarity metrics as training
classifiers:
WKNN
+ DE
+ pearson
WKNN
+ DE
+ spearman
Here, we will weight these two classifiers equally by setting
weighted_ensemble = FALSE
. By default this is set as
TRUE
, so each base classifier will be weighted by the
accuracy rates trained in the reference data.
scClassify_res_ensemble <- scClassify(exprsMat_train = exprsMat_xin_subset,
cellTypes_train = xin_cellTypes,
exprsMat_test = list(wang = exprsMat_wang_subset),
cellTypes_test = list(wang = wang_cellTypes),
tree = "HOPACH",
algorithm = "WKNN",
selectFeatures = c("limma"),
similarity = c("pearson", "cosine"),
weighted_ensemble = FALSE,
returnList = FALSE,
verbose = FALSE)
We can compare the two base classifiers predictions as below.
table(scClassify_res_ensemble$testRes$wang$pearson_WKNN_limma$predRes,
scClassify_res_ensemble$testRes$wang$cosine_WKNN_limma$predRes)
#>
#> alpha beta beta_delta_gamma delta gamma unassigned
#> alpha 208 0 0 0 0 0
#> beta 0 119 0 0 0 0
#> beta_delta_gamma 0 0 22 0 0 3
#> delta 0 0 0 10 0 0
#> gamma 0 0 0 0 19 0
#> unassigned 1 3 15 0 0 101
Now, check the final ensemble results:
table(scClassify_res_ensemble$testRes$wang$ensembleRes$cellTypes,
wang_cellTypes)
#> wang_cellTypes
#> acinar alpha beta delta ductal gamma stellate
#> alpha 0 206 0 0 0 2 0
#> beta 0 0 118 0 1 0 0
#> beta_delta_gamma 0 0 0 0 25 0 0
#> delta 0 0 0 10 0 0 0
#> gamma 0 0 0 0 0 19 0
#> unassigned 5 0 0 0 70 0 45
You can also train your own model scClassifyTrainModel
using train_scClassify()
. Note that by setting
weightsCal = TRUE
, we will calculate the training error of
the reference data as the weights for the individual classifiers.
Here, we illustrate the training function with gene selection methods based on differential expression (“limma”) and biomodal distribution (“BI”).
trainClass <- train_scClassify(exprsMat_train = exprsMat_xin_subset,
cellTypes_train = xin_cellTypes,
selectFeatures = c("limma", "BI"),
returnList = FALSE
)
#> after filtering not expressed genes
#> [1] 980 674
#> [1] "Feature Selection..."
#> [1] "Number of genes selected to construct HOPACH tree 160"
#> [1] "Constructing tree ..."
#> [1] "Training...."
#> [1] "=== selecting features by: limma ===="
#> [1] "=== selecting features by: BI ===="
sessionInfo()
#> R version 4.4.2 (2024-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] scClassify_1.19.0 BiocStyle_2.35.0
#>
#> loaded via a namespace (and not attached):
#> [1] gridExtra_2.3 rlang_1.1.4
#> [3] magrittr_2.0.3 matrixStats_1.4.1
#> [5] compiler_4.4.2 mgcv_1.9-1
#> [7] DelayedMatrixStats_1.29.0 vctrs_0.6.5
#> [9] reshape2_1.4.4 stringr_1.5.1
#> [11] pkgconfig_2.0.3 crayon_1.5.3
#> [13] fastmap_1.2.0 XVector_0.47.0
#> [15] labeling_0.4.3 ggraph_2.2.1
#> [17] rmarkdown_2.29 UCSC.utils_1.3.0
#> [19] purrr_1.0.2 xfun_0.49
#> [21] zlibbioc_1.52.0 cachem_1.1.0
#> [23] GenomeInfoDb_1.43.2 jsonlite_1.8.9
#> [25] rhdf5filters_1.19.0 DelayedArray_0.33.3
#> [27] Rhdf5lib_1.29.0 BiocParallel_1.41.0
#> [29] tweenr_2.0.3 parallel_4.4.2
#> [31] cluster_2.1.8 R6_2.5.1
#> [33] bslib_0.8.0 stringi_1.8.4
#> [35] limma_3.63.2 diptest_0.77-1
#> [37] GenomicRanges_1.59.1 jquerylib_0.1.4
#> [39] Rcpp_1.0.13-1 SummarizedExperiment_1.37.0
#> [41] knitr_1.49 mixtools_2.0.0
#> [43] IRanges_2.41.2 Matrix_1.7-1
#> [45] splines_4.4.2 igraph_2.1.2
#> [47] tidyselect_1.2.1 abind_1.4-8
#> [49] yaml_2.3.10 hopach_2.67.0
#> [51] viridis_0.6.5 codetools_0.2-20
#> [53] minpack.lm_1.2-4 Cepo_1.13.0
#> [55] lattice_0.22-6 tibble_3.2.1
#> [57] plyr_1.8.9 Biobase_2.67.0
#> [59] withr_3.0.2 evaluate_1.0.1
#> [61] survival_3.8-3 proxy_0.4-27
#> [63] polyclip_1.10-7 kernlab_0.9-33
#> [65] pillar_1.10.0 BiocManager_1.30.25
#> [67] MatrixGenerics_1.19.0 stats4_4.4.2
#> [69] plotly_4.10.4 generics_0.1.3
#> [71] S4Vectors_0.45.2 ggplot2_3.5.1
#> [73] sparseMatrixStats_1.19.0 munsell_0.5.1
#> [75] scales_1.3.0 glue_1.8.0
#> [77] lazyeval_0.2.2 proxyC_0.4.1
#> [79] maketools_1.3.1 tools_4.4.2
#> [81] data.table_1.16.4 sys_3.4.3
#> [83] buildtools_1.0.0 graphlayouts_1.2.1
#> [85] tidygraph_1.3.1 rhdf5_2.51.1
#> [87] grid_4.4.2 tidyr_1.3.1
#> [89] colorspace_2.1-1 SingleCellExperiment_1.29.1
#> [91] nlme_3.1-166 GenomeInfoDbData_1.2.13
#> [93] patchwork_1.3.0 ggforce_0.4.2
#> [95] HDF5Array_1.35.2 cli_3.6.3
#> [97] segmented_2.1-3 S4Arrays_1.7.1
#> [99] viridisLite_0.4.2 dplyr_1.1.4
#> [101] gtable_0.3.6 sass_0.4.9
#> [103] digest_0.6.37 BiocGenerics_0.53.3
#> [105] SparseArray_1.7.2 ggrepel_0.9.6
#> [107] htmlwidgets_1.6.4 farver_2.1.2
#> [109] memoise_2.0.1 htmltools_0.5.8.1
#> [111] lifecycle_1.0.4 httr_1.4.7
#> [113] statmod_1.5.0 MASS_7.3-61