One of the key functions of the scAnnotatR package is to provide users with easy-to-use tools to train their own models for classifying new cell types from labeled scRNA-seq data.
This vignette shows how to train a basic classification model for an independent cell type, which is not a child of any other cell type.
The workflow starts with either a Seurat or SingleCellExperiment object where cells have already been assigned to different cell types.
To do this, users may start from annotated scRNA-seq data (annotated through a FACS-sorting process, for example), create a Seurat/SingleCellExperiment (SCE) object based on the sequencing data, and assign the predetermined cell types as cell metadata. If the scRNA-seq data has not been annotated yet, another possible approach is to follow the basic workflow (Seurat, for example) up to the step of assigning cell type identities to clusters.
In this vignette, we use the human lung dataset from Zilionis et al., 2019, which is available in the scRNAseq (2.4.0) library. The dataset is stored as an SCE object.
To start the training workflow, we first install and load the necessary libraries.
# use BiocManager to install from Bioconductor
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
# the scAnnotatR package
if (!require(scAnnotatR))
    BiocManager::install("scAnnotatR")
# we use the scRNAseq package to load example data
if (!require(scRNAseq))
    BiocManager::install("scRNAseq")
First, we load the dataset. To reduce the computational complexity of this vignette, we only use the first 5000 cells of the dataset.
zilionis <- ZilionisLungData()
zilionis <- zilionis[, 1:5000]
# now we add simple colnames (= cell ids) to the dataset
# Note: This is normally not necessary
colnames(zilionis) <- paste0("Cell", 1:ncol(zilionis))
We split this dataset into two parts: one for training and the other for testing.
pivot = ncol(zilionis)%/%2
train_set <- zilionis[, 1:pivot]
test_set <- zilionis[, (1+pivot):ncol(zilionis)]
In this dataset, the cell type metadata is stored in the Most likely LM22 cell type slot of the SingleCellExperiment object (in both the train and the test object). If the cell type is not stored in the default identification slot (set through Idents for Seurat objects), the slot must be passed as a parameter to the training and testing functions (see below).
unique(train_set$`Most likely LM22 cell type`)
#> [1] "Macrophages M0" "Macrophages M2"
#> [3] "Monocytes" "Macrophages M1"
#> [5] "Mast cells activated" NA
#> [7] "T cells CD4 memory resting" "NK cells resting"
#> [9] "Neutrophils" "T cells CD8"
#> [11] "Dendritic cells resting" "T cells CD4 memory activated"
#> [13] "Plasma cells" "T cells regulatory (Tregs)"
#> [15] "T cells follicular helper" "Eosinophils"
#> [17] "Dendritic cells activated" "B cells memory"
#> [19] "Mast cells resting" "NK cells activated"
#> [21] "B cells naive"
unique(test_set$`Most likely LM22 cell type`)
#> [1] "Monocytes" NA
#> [3] "Macrophages M0" "T cells CD4 memory resting"
#> [5] "T cells regulatory (Tregs)" "T cells follicular helper"
#> [7] "B cells memory" "T cells CD8"
#> [9] "T cells CD4 memory activated" "Dendritic cells resting"
#> [11] "Neutrophils" "Plasma cells"
#> [13] "Mast cells activated" "Macrophages M2"
#> [15] "NK cells activated" "B cells naive"
#> [17] "Dendritic cells activated" "NK cells resting"
#> [19] "Macrophages M1" "Eosinophils"
#> [21] "Mast cells resting"
We want to train a classifier for B cells and their phenotypes. Considering memory B cells, naive B cells and plasma cells as B cell phenotypes, we convert all those cells to a uniform cell label, i.e. B cells. All non-B cells are converted into 'others'.
# change cell label
train_set$B_cell <- unlist(lapply(train_set$`Most likely LM22 cell type`,
    function(x) if (is.na(x)) {'ambiguous'}
        else if (x %in% c('Plasma cells', 'B cells memory', 'B cells naive')) {'B cells'}
        else {'others'}))
test_set$B_cell <- unlist(lapply(test_set$`Most likely LM22 cell type`,
    function(x) if (is.na(x)) {'ambiguous'}
        else if (x %in% c('Plasma cells', 'B cells memory', 'B cells naive')) {'B cells'}
        else {'others'}))
We observe that some cells are marked NA. Those can be understood either as (1) cells different from all indicated cell types or (2) cells of an unknown cell type. Here we consider the second case, i.e. we do not know whether they are positive or negative for B cells. To avoid any effect of these cells, we assign them the label 'ambiguous'. All cells tagged 'ambiguous' will be ignored by scAnnotatR during training and testing.
We may want to check the number of cells in each category:
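A quick way to do this (a minimal sketch using base R; note that table() drops NA values by default) is:
# count the number of cells per label in the training and test sets
table(train_set$B_cell)
table(test_set$B_cell)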
Next, we define a set of marker genes, which will be used to train the classification model. Since we are training a model to classify B cells, we define the set of B cell marker genes as follows:
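The exact marker set is a modelling choice. As a sketch consistent with the 15 marker genes reported by the trained classifier further below, the variable used in the training call could be defined like this:
# 15 B cell marker genes; this list mirrors the genes reported
# in the model summary below and is only one possible choice
selected_marker_genes_B <- c("BLK", "CD19", "CD27", "CD79A", "CD79B",
                             "CD86", "CR2", "DERL3", "IGHG1", "IGHG2",
                             "IGHM", "LY86", "MEF2C", "MS4A1", "VPREB3")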
When the model is being trained, three pieces of information must be provided: the Seurat/SCE object used for training, the set of applied marker genes, and the cell type defining the trained model.
In case the dataset does not contain any cell classified as the target cell type, the function will fail.
If the cell type annotation is not set in the default identification slot (Idents for Seurat objects), the name of the metadata field must be provided to the tag_slot parameter.
When training on an imbalanced dataset (e.g. a dataset containing 90% B cells and only very few other cell types), the trained model may be biased toward the majority group and ignore the presence of the minority group. To avoid this, the number of positive cells and negative cells is automatically balanced before training: a smaller number of cells is randomly picked from the majority group. To use the same set of cells when training a model multiple times, users can call set.seed.
set.seed(123)
classifier_B <- train_classifier(train_obj = train_set, cell_type = "B cells",
marker_genes = selected_marker_genes_B,
assay = 'counts', tag_slot = 'B_cell')
#> Warning: Cell types containing /, ,, -, [+], [.], and , or , _or_, -or-, [(], [)], ambiguous are considered as ambiguous. They are removed from training and testing.
#> Loading required package: ggplot2
#> Loading required package: lattice
#>
#> Attaching package: 'caret'
#> The following object is masked from 'package:generics':
#>
#> train
#> Warning in .local(x, ...): Variable(s) `' constant. Cannot scale data.
classifier_B
#> An object of class scAnnotatR for B cells
#> * 15 marker genes applied: BLK, CD19, CD27, CD79A, CD79B, CD86, CR2, DERL3, IGHG1, IGHG2, IGHM, LY86, MEF2C, MS4A1, VPREB3
#> * Predicting probability threshold: 0.5
#> * No parent model
The classification model is a scAnnotatR object. Details about the classification model are accessible via getter methods. For example:
caret_model(classifier_B)
#> Support Vector Machines with Linear Kernel
#>
#> No pre-processing
#> Resampling: Cross-Validated (10 fold)
#> Summary of sample sizes: 984, 984, 985, 985, 985, 985, ...
#> Addtional sampling using down-sampling
#>
#> Resampling results:
#>
#> Accuracy Kappa
#> 0.9543119 0.6263564
#>
#> Tuning parameter 'C' was held constant at a value of 1
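Other getter methods work the same way. For instance, the prediction probability threshold shown in the model summary above should be retrievable with the p_thres getter that is also referred to in the test results below (a minimal sketch):
# query the probability threshold stored in the classifier (0.5 for this model)
p_thres(classifier_B)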
The test_classifier function automatically tests a classifier's performance against another dataset. Here, we use the test_set created before:
classifier_B_test <- test_classifier(classifier = classifier_B, test_obj = test_set,
assay = 'counts', tag_slot = 'B_cell')
#> Warning: Cell types containing /, ,, -, [+], [.], and , or , _or_, -or-, [(], [)], ambiguous are considered as ambiguous. They are removed from training and testing.
#> Current probability threshold: 0.5
#> Positive Negative Total
#> Actual 90 1024 1114
#> Predicted 115 999 1114
#> Accuracy: 0.957809694793537
#> Sensivity (True Positive Rate) for B cells: 0.877777777777778
#> Specificity (1 - False Positive Rate) for B cells: 0.96484375
#> Area under the curve: 0.959456380208333
Apart from the output printed to the console, the test_classifier function also returns an object, which is a list of:
test_tag: the actual cell labels; these can differ from the labels provided by users because of ambiguous characters or incoherence in the cell type and sub cell type label assignment.
pred: the cell type prediction made by the current classifier.
acc: the prediction accuracy at the fixed probability threshold; the threshold value can be queried using p_thres(classifier).
auc: the AUC score obtained by the current classifier.
overall_roc: the True Positive Rate and False Positive Rate at a range of prediction probability thresholds.
Every classifier internally consists of a trained SVM and a probability threshold. Only cells predicted with a probability exceeding this threshold are assigned to the respective cell type. The overall_roc slot summarizes the True Positive Rate (sensitivity) and False Positive Rate (1 - specificity) obtained by the trained model at different thresholds.
classifier_B_test$overall_roc
#> p_thres fpr tpr
#> [1,] 0.1 0.96582031 1.0000000
#> [2,] 0.2 0.88378906 1.0000000
#> [3,] 0.3 0.07519531 0.8888889
#> [4,] 0.4 0.06445312 0.8888889
#> [5,] 0.5 0.03515625 0.8777778
#> [6,] 0.6 0.03222656 0.8333333
#> [7,] 0.7 0.02929688 0.8222222
#> [8,] 0.8 0.02343750 0.7555556
#> [9,] 0.9 0.01269531 0.6777778
In this example of a B cell classifier, the current threshold is 0.5. A higher sensitivity can be reached if we set p_thres to 0.4. However, we will then have a lower specificity, which means that we will incorrectly classify some cells as B cells. At the same time, we may not retrieve all actual B cells with a higher p_thres (0.6, for example).
There is of course a certain trade-off between the sensitivity and the specificity of the model. Depending on the needs of the project or the user's own preference, a probability threshold giving higher sensitivity or higher specificity can be chosen. From our perspective, a p_thres of 0.5 is a good choice for the current B cell model.
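If a different trade-off is preferred, the threshold stored in the classifier can be adjusted before re-testing. The following is only a sketch and assumes that scAnnotatR provides a p_thres<- replacement method matching the p_thres getter mentioned above:
# lower the probability threshold to favour sensitivity over specificity
# (assumes a p_thres<- setter exists alongside the p_thres getter)
p_thres(classifier_B) <- 0.4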
Apart from numbers, we also provide a method to plot the ROC curve.
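As a sketch, assuming the package exports a plot_roc_curve function that accepts the test result object, the curve for the example above could be drawn as follows:
# plot sensitivity against 1 - specificity for the tested classifier
# (plot_roc_curve and its test_result argument are assumptions; check the package help)
roc_curve <- plot_roc_curve(test_result = classifier_B_test)
plot(roc_curve)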
Changes in the training data, in the set of marker genes and in the prediction probability threshold will all lead to a change in model performance.
There are several ways to evaluate the trained model, including the overall accuracy, the AUC score and the sensitivity/specificity of the model when testing on an independent dataset. In this example, we choose the model which has the best AUC score.
Tip: Using more general markers of the whole population leads to higher sensitivity. This sometimes produces lower specificity because of closely related cell types (T cells and NK cells, for example). While training some models, we observed that we can use markers producing high sensitivity and at the same time improve the specificity by increasing the probability threshold. Of course, this can only be applied in some cases, because some markers can have an even larger effect on the specificity than the prediction probability threshold.
New classification models can be stored using the save_new_model function:
# no copy of pretrained models is performed
save_new_model(new_model = classifier_B, path_to_models = tempdir(),
include.default = FALSE)
#> Saving new models to /tmp/Rtmp0XkCt7/new_models.rda...
#> Finished saving new model
Parameters: new_model is the model to be added to the database, path_to_models is the directory where the new model database is stored, and include.default indicates whether all pretrained models of the package should be copied into the new model database. If they are not included, users can later only use either the default pretrained models or the new models, by specifying one path to models at a time.
Models can be deleted from the model database using the delete_model function:
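For example (a sketch, assuming delete_model takes the cell type name and the path to the model database):
# remove the B cell classifier from the model database created above
# (the signature delete_model(cell_type, path_to_models) is an assumption)
delete_model("B cells", path_to_models = tempdir())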
sessionInfo()
#> R version 4.4.2 (2024-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] caret_6.0-94 lattice_0.22-6
#> [3] ggplot2_3.5.1 scRNAseq_2.20.0
#> [5] scAnnotatR_1.13.0 SingleCellExperiment_1.29.1
#> [7] SummarizedExperiment_1.37.0 Biobase_2.67.0
#> [9] GenomicRanges_1.59.1 GenomeInfoDb_1.43.2
#> [11] IRanges_2.41.1 S4Vectors_0.45.2
#> [13] BiocGenerics_0.53.3 generics_0.1.3
#> [15] MatrixGenerics_1.19.0 matrixStats_1.4.1
#> [17] Seurat_5.1.0 SeuratObject_5.0.2
#> [19] sp_2.1-4 rmarkdown_2.29
#>
#> loaded via a namespace (and not attached):
#> [1] ProtGenerics_1.39.0 spatstat.sparse_3.1-0 bitops_1.0-9
#> [4] lubridate_1.9.3 httr_1.4.7 RColorBrewer_1.1-3
#> [7] alabaster.base_1.7.2 tools_4.4.2 sctransform_0.4.1
#> [10] utf8_1.2.4 R6_2.5.1 HDF5Array_1.35.2
#> [13] lazyeval_0.2.2 uwot_0.2.2 rhdf5filters_1.19.0
#> [16] withr_3.0.2 gridExtra_2.3 progressr_0.15.1
#> [19] cli_3.6.3 spatstat.explore_3.3-3 fastDummies_1.7.4
#> [22] alabaster.se_1.7.0 labeling_0.4.3 sass_0.4.9
#> [25] spatstat.data_3.1-4 proxy_0.4-27 ggridges_0.5.6
#> [28] pbapply_1.7-2 Rsamtools_2.23.1 parallelly_1.39.0
#> [31] RSQLite_2.3.8 BiocIO_1.17.1 ica_1.0-3
#> [34] spatstat.random_3.3-2 dplyr_1.1.4 Matrix_1.7-1
#> [37] fansi_1.0.6 abind_1.4-8 lifecycle_1.0.4
#> [40] yaml_2.3.10 rhdf5_2.51.0 recipes_1.1.0
#> [43] SparseArray_1.7.2 BiocFileCache_2.15.0 Rtsne_0.17
#> [46] grid_4.4.2 blob_1.2.4 promises_1.3.2
#> [49] ExperimentHub_2.15.0 crayon_1.5.3 miniUI_0.1.1.1
#> [52] cowplot_1.1.3 GenomicFeatures_1.59.1 KEGGREST_1.47.0
#> [55] sys_3.4.3 maketools_1.3.1 pillar_1.9.0
#> [58] knitr_1.49 rjson_0.2.23 future.apply_1.11.3
#> [61] codetools_0.2-20 leiden_0.4.3.1 glue_1.8.0
#> [64] spatstat.univar_3.1-1 data.table_1.16.2 gypsum_1.3.0
#> [67] vctrs_0.6.5 png_0.1-8 spam_2.11-0
#> [70] gtable_0.3.6 kernlab_0.9-33 cachem_1.1.0
#> [73] gower_1.0.1 xfun_0.49 S4Arrays_1.7.1
#> [76] mime_0.12 prodlim_2024.06.25 survival_3.7-0
#> [79] timeDate_4041.110 iterators_1.0.14 hardhat_1.4.0
#> [82] lava_1.8.0 fitdistrplus_1.2-1 ROCR_1.0-11
#> [85] ipred_0.9-15 nlme_3.1-166 bit64_4.5.2
#> [88] alabaster.ranges_1.7.0 filelock_1.0.3 RcppAnnoy_0.0.22
#> [91] data.tree_1.1.0 bslib_0.8.0 irlba_2.3.5.1
#> [94] KernSmooth_2.23-24 rpart_4.1.23 colorspace_2.1-1
#> [97] DBI_1.2.3 nnet_7.3-19 tidyselect_1.2.1
#> [100] bit_4.5.0 compiler_4.4.2 curl_6.0.1
#> [103] httr2_1.0.7 DelayedArray_0.33.2 plotly_4.10.4
#> [106] rtracklayer_1.67.0 scales_1.3.0 lmtest_0.9-40
#> [109] rappdirs_0.3.3 stringr_1.5.1 digest_0.6.37
#> [112] goftest_1.2-3 spatstat.utils_3.1-1 alabaster.matrix_1.7.2
#> [115] XVector_0.47.0 htmltools_0.5.8.1 pkgconfig_2.0.3
#> [118] ensembldb_2.31.0 dbplyr_2.5.0 fastmap_1.2.0
#> [121] rlang_1.1.4 htmlwidgets_1.6.4 UCSC.utils_1.3.0
#> [124] shiny_1.9.1 farver_2.1.2 jquerylib_0.1.4
#> [127] zoo_1.8-12 jsonlite_1.8.9 BiocParallel_1.41.0
#> [130] ModelMetrics_1.2.2.2 RCurl_1.98-1.16 magrittr_2.0.3
#> [133] GenomeInfoDbData_1.2.13 dotCall64_1.2 patchwork_1.3.0
#> [136] Rhdf5lib_1.29.0 munsell_0.5.1 Rcpp_1.0.13-1
#> [139] ape_5.8 reticulate_1.40.0 alabaster.schemas_1.7.0
#> [142] stringi_1.8.4 pROC_1.18.5 zlibbioc_1.52.0
#> [145] MASS_7.3-61 AnnotationHub_3.15.0 plyr_1.8.9
#> [148] parallel_4.4.2 listenv_0.9.1 ggrepel_0.9.6
#> [151] deldir_2.0-4 Biostrings_2.75.1 splines_4.4.2
#> [154] tensor_1.5 igraph_2.1.1 spatstat.geom_3.3-4
#> [157] RcppHNSW_0.6.0 buildtools_1.0.0 reshape2_1.4.4
#> [160] BiocVersion_3.21.1 XML_3.99-0.17 evaluate_1.0.1
#> [163] BiocManager_1.30.25 foreach_1.5.2 httpuv_1.6.15
#> [166] RANN_2.6.2 tidyr_1.3.1 purrr_1.0.2
#> [169] polyclip_1.10-7 alabaster.sce_1.7.0 future_1.34.0
#> [172] scattermore_1.2 xtable_1.8-4 AnnotationFilter_1.31.0
#> [175] restfulr_0.0.15 e1071_1.7-16 RSpectra_0.16-2
#> [178] later_1.4.1 viridisLite_0.4.2 class_7.3-22
#> [181] tibble_3.2.1 GenomicAlignments_1.43.0 memoise_2.0.1
#> [184] AnnotationDbi_1.69.0 cluster_2.1.6 timechange_0.3.0
#> [187] globals_0.16.3