SingleR is an automatic annotation method for single-cell RNA sequencing (scRNAseq) data (Aran et al. 2019). Given a reference dataset of samples (single-cell or bulk) with known labels, it labels new cells from a test dataset based on similarity to the reference. Thus, the burden of manually interpreting clusters and defining marker genes only has to be done once, for the reference dataset, and this biological knowledge can be propagated to new datasets in an automated manner.
To keep things brief, this vignette only provides a brief summary of the basic capabilities of SingleR. However, the package also provides more advanced functionality that includes the use of multiple references simultaneously, manipulating the cell ontology and improving performance on big datasets. Readers are referred to the book for more details.
The easiest way to use SingleR
is to annotate cells against built-in references. In particular, the
celldex
package provides access to several reference datasets (mostly derived
from bulk RNA-seq or microarray data) through dedicated retrieval
functions. Here, we will use the Human Primary Cell Atlas (Mabbott et al. 2013), represented as a
SummarizedExperiment
object containing a matrix of
log-expression values with sample-level labels.
## class: SummarizedExperiment
## dim: 19363 713
## metadata(0):
## assays(1): logcounts
## rownames(19363): A1BG A1BG-AS1 ... ZZEF1 ZZZ3
## rowData names(0):
## colnames(713): GSM112490 GSM112491 ... GSM92233 GSM92234
## colData names(3): label.main label.fine label.ont
Our test dataset consists of some human embryonic stem cells (La Manno et al. 2016) from the scRNAseq package. For the sake of speed, we will only label the first 100 cells from this dataset.
We use our hpca.se
reference to annotate each cell in
hESCs
via the SingleR()
function. This
identifies marker genes from the reference and uses them to compute
assignment scores (based on the Spearman correlation across markers) for
each cell in the test dataset against each label in the reference. The
label with the highest score is the assigned to the test cell, possibly
with further fine-tuning to resolve closely related labels.
library(SingleR)
pred.hesc <- SingleR(test = hESCs, ref = hpca.se, assay.type.test=1,
labels = hpca.se$label.main)
Each row of the output DataFrame
contains prediction
results for a single cell. Labels are shown before (labels
)
and after pruning (pruned.labels
), along with the
associated scores.
## DataFrame with 100 rows and 4 columns
## scores labels delta.next
## <matrix> <character> <numeric>
## 1772122_301_C02 0.347652:0.109547:0.123901:... Neuroepithelial_cell 0.08332864
## 1772122_180_E05 0.361187:0.134934:0.148672:... Neurons 0.07283500
## 1772122_300_H02 0.446411:0.190084:0.222594:... Neuroepithelial_cell 0.13882912
## 1772122_180_B09 0.373512:0.143537:0.164743:... Neuroepithelial_cell 0.00317443
## 1772122_180_G04 0.357341:0.126511:0.141987:... Neuroepithelial_cell 0.09717938
## ... ... ... ...
## 1772122_299_E07 0.371989:0.169379:0.1986877:... Neuroepithelial_cell 0.0837521
## 1772122_180_D02 0.353314:0.115864:0.1374981:... Neuroepithelial_cell 0.0842804
## 1772122_300_D09 0.348789:0.136732:0.1303042:... Neuroepithelial_cell 0.0595056
## 1772122_298_F09 0.332361:0.141439:0.1437860:... Neuroepithelial_cell 0.1200606
## 1772122_302_A11 0.324928:0.101609:0.0949826:... Astrocyte 0.0509478
## pruned.labels
## <character>
## 1772122_301_C02 Neuroepithelial_cell
## 1772122_180_E05 Neurons
## 1772122_300_H02 Neuroepithelial_cell
## 1772122_180_B09 Neuroepithelial_cell
## 1772122_180_G04 Neuroepithelial_cell
## ... ...
## 1772122_299_E07 Neuroepithelial_cell
## 1772122_180_D02 Neuroepithelial_cell
## 1772122_300_D09 Neuroepithelial_cell
## 1772122_298_F09 Neuroepithelial_cell
## 1772122_302_A11 Astrocyte
##
## Astrocyte Neuroepithelial_cell Neurons
## 14 81 5
At this point, it is worth noting that SingleR
is workflow/package agnostic. The above example uses
SummarizedExperiment
objects, but the same functions will
accept any (log-)normalized expression matrix.
Here, we will use two human pancreas datasets from the scRNAseq package. The aim is to use one pre-labelled dataset to annotate the other unlabelled dataset. First, we set up the Muraro et al. (2016) dataset to be our reference.
library(scRNAseq)
sceM <- MuraroPancreasData()
# One should normally do cell-based quality control at this point, but for
# brevity's sake, we will just remove the unlabelled libraries here.
sceM <- sceM[,!is.na(sceM$label)]
# SingleR() expects reference datasets to be normalized and log-transformed.
library(scuttle)
sceM <- logNormCounts(sceM)
We then set up our test dataset from Grun et al. (2016). To speed up this demonstration, we will subset to the first 100 cells.
sceG <- GrunPancreasData()
sceG <- sceG[,colSums(counts(sceG)) > 0] # Remove libraries with no counts.
sceG <- logNormCounts(sceG)
We then run SingleR()
as described previously but with a
marker detection mode that considers the variance of expression across
cells. Here, we will use the Wilcoxon ranked sum test to identify the
top markers for each pairwise comparison between labels. This is slower
but more appropriate for single-cell data compared to the default marker
detection algorithm (which may fail for low-coverage data where the
median is frequently zero).
pred.grun <- SingleR(test=sceG, ref=sceM, labels=sceM$label, de.method="wilcox")
table(pred.grun$labels)
##
## acinar alpha beta delta duct endothelial
## 722 228 233 65 345 34
## epsilon mesenchymal pp unclear
## 2 49 36 4
plotScoreHeatmap()
displays the scores for all cells
across all reference labels, which allows users to inspect the
confidence of the predicted labels across the dataset. Ideally, each
cell (i.e., column of the heatmap) should have one score that is
obviously larger than the rest, indicating that it is unambiguously
assigned to a single label. A spread of similar scores for a given cell
indicates that the assignment is uncertain, though this may be
acceptable if the uncertainty is distributed across similar cell types
that cannot be easily resolved.
Another diagnostic is based on the per-cell “deltas”, i.e., the
difference between the score for the assigned label and the median
across all labels for each cell. Low deltas indicate that the assignment
is uncertain, which is especially relevant if the cell’s true label does
not exist in the reference. We can inspect these deltas across cells for
each label using the plotDeltaDistribution()
function.
The pruneScores()
function will remove potentially
poor-quality or ambiguous assignments based on the deltas. The minimum
threshold on the deltas is defined using an outlier-based approach that
accounts for differences in the scale of the correlations in various
contexts - see ?pruneScores
for more details.
SingleR()
will also report the pruned scores automatically
in the pruned.labels
field where low-quality assignments
are replaced with NA
.
## Mode FALSE TRUE
## logical 1660 58
Finally, a simple yet effective diagnostic is to examine the
expression of the marker genes for each label in the test dataset. We
extract the identity of the markers from the metadata of the
SingleR()
results and use them in the
plotMarkerHeatmap()
function, as shown below for beta cell
markers. If a cell in the test dataset is confidently assigned to a
particular label, we would expect it to have strong expression of that
label’s markers. At the very least, it should exhibit upregulation of
those markers relative to cells assigned to other labels.
How can I use this with my Seurat
,
SingleCellExperiment
, or cell_data_set
object?
SingleR is workflow agnostic - all it needs is normalized counts. An example showing how to map its results back to common single-cell data objects is available in the README.
Where can I find reference sets appropriate for my data?
celldex contains many built-in references that were previously in SingleR but have been migrated into a separate package for more general use by other Bioconductor packages. scRNAseq contains many single-cell datasets, many of which contain the authors’ manual annotations. ArrayExpress and GEOquery can be used to download any of the bulk or single-cell datasets in ArrayExpress or GEO, respectively.
Where can I get more help?
It is likely that your questions is already answered by the function
documentation (e.g., ?SingleR
). Further explanations on the
reasoning behind certain functions can be found in the book. If
this is not sufficient, we recommend posting an issue on the Bioconductor support site or
the GitHub
repository for this package. Be sure to include your session
information and a minimal reproducible example.
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] scuttle_1.17.0 SingleR_2.9.1
## [3] scRNAseq_2.20.0 SingleCellExperiment_1.29.1
## [5] celldex_1.16.0 SummarizedExperiment_1.37.0
## [7] Biobase_2.67.0 GenomicRanges_1.59.1
## [9] GenomeInfoDb_1.43.1 IRanges_2.41.1
## [11] S4Vectors_0.45.2 BiocGenerics_0.53.3
## [13] generics_0.1.3 MatrixGenerics_1.19.0
## [15] matrixStats_1.4.1 BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] RColorBrewer_1.1-3 sys_3.4.3
## [3] jsonlite_1.8.9 magrittr_2.0.3
## [5] GenomicFeatures_1.59.1 gypsum_1.3.0
## [7] farver_2.1.2 rmarkdown_2.29
## [9] BiocIO_1.17.0 zlibbioc_1.52.0
## [11] vctrs_0.6.5 memoise_2.0.1
## [13] Rsamtools_2.23.0 DelayedMatrixStats_1.29.0
## [15] RCurl_1.98-1.16 htmltools_0.5.8.1
## [17] S4Arrays_1.7.1 AnnotationHub_3.15.0
## [19] curl_6.0.1 BiocNeighbors_2.1.0
## [21] Rhdf5lib_1.29.0 SparseArray_1.7.2
## [23] rhdf5_2.51.0 sass_0.4.9
## [25] alabaster.base_1.7.2 bslib_0.8.0
## [27] alabaster.sce_1.7.0 httr2_1.0.6
## [29] cachem_1.1.0 buildtools_1.0.0
## [31] GenomicAlignments_1.43.0 lifecycle_1.0.4
## [33] pkgconfig_2.0.3 Matrix_1.7-1
## [35] R6_2.5.1 fastmap_1.2.0
## [37] GenomeInfoDbData_1.2.13 digest_0.6.37
## [39] colorspace_2.1-1 AnnotationDbi_1.69.0
## [41] ExperimentHub_2.15.0 RSQLite_2.3.8
## [43] beachmat_2.23.1 labeling_0.4.3
## [45] filelock_1.0.3 fansi_1.0.6
## [47] httr_1.4.7 abind_1.4-8
## [49] compiler_4.4.2 withr_3.0.2
## [51] bit64_4.5.2 BiocParallel_1.41.0
## [53] viridis_0.6.5 DBI_1.2.3
## [55] HDF5Array_1.35.1 alabaster.ranges_1.7.0
## [57] alabaster.schemas_1.7.0 rappdirs_0.3.3
## [59] DelayedArray_0.33.2 rjson_0.2.23
## [61] tools_4.4.2 glue_1.8.0
## [63] restfulr_0.0.15 rhdf5filters_1.19.0
## [65] grid_4.4.2 gtable_0.3.6
## [67] ensembldb_2.31.0 utf8_1.2.4
## [69] XVector_0.47.0 BiocVersion_3.21.1
## [71] pillar_1.9.0 dplyr_1.1.4
## [73] BiocFileCache_2.15.0 lattice_0.22-6
## [75] rtracklayer_1.67.0 bit_4.5.0
## [77] tidyselect_1.2.1 maketools_1.3.1
## [79] Biostrings_2.75.1 knitr_1.49
## [81] gridExtra_2.3 scrapper_1.1.3
## [83] ProtGenerics_1.39.0 xfun_0.49
## [85] pheatmap_1.0.12 UCSC.utils_1.3.0
## [87] lazyeval_0.2.2 yaml_2.3.10
## [89] evaluate_1.0.1 codetools_0.2-20
## [91] tibble_3.2.1 alabaster.matrix_1.7.0
## [93] BiocManager_1.30.25 cli_3.6.3
## [95] munsell_0.5.1 jquerylib_0.1.4
## [97] Rcpp_1.0.13-1 dbplyr_2.5.0
## [99] png_0.1-8 XML_3.99-0.17
## [101] parallel_4.4.2 ggplot2_3.5.1
## [103] blob_1.2.4 AnnotationFilter_1.31.0
## [105] sparseMatrixStats_1.19.0 bitops_1.0-9
## [107] alabaster.se_1.7.0 viridisLite_0.4.2
## [109] scales_1.3.0 crayon_1.5.3
## [111] rlang_1.1.4 KEGGREST_1.47.0