The scDblFinder package gathers various methods for the detection and handling of doublets/multiplets in single-cell sequencing data (i.e. multiple cells captured within the same droplet or reaction volume). This vignette provides a brief overview of the different approaches for single-cell RNA sequencing, each of which is covered in its own vignette. For doublet detection in genomic data, see the scATACseq vignette. For a more general introduction to the topic of doublets, refer to the OSCA book.
All methods require as input either a matrix of counts or a SingleCellExperiment containing count data. With the exception of findDoubletClusters, which operates at the level of clusters (and consequently requires clustering information), all methods try to assign each cell a score indicating its likelihood (broadly understood) of being a doublet.
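As a minimal sketch of the two accepted input types, the following builds a small random count matrix and wraps it in a SingleCellExperiment. The dimensions, the Poisson rate and the object names (m, sce) are arbitrary choices made for illustration only.

```r
library(SingleCellExperiment)

set.seed(42)
# Toy count matrix: 100 genes (rows) x 50 cells (columns)
m <- matrix(rpois(100 * 50, lambda = 2), nrow = 100,
            dimnames = list(paste0("gene", 1:100), paste0("cell", 1:50)))

# The same counts wrapped in a SingleCellExperiment, stored in the "counts" assay
sce <- SingleCellExperiment(assays = list(counts = m))
dim(counts(sce))
```

Either m or sce could then be handed to the methods described below.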
The approaches described here are complementary to doublet identification via cell hashes and SNPs in multiplexed samples. Hashing and genotypes can identify doublets formed by cells of the same type (homotypic doublets) from two samples, which are often nearly indistinguishable from real cells transcriptionally (and hence generally unidentifiable through the present package). However, they cannot identify doublets formed by cells of the same sample, even if these are heterotypic (formed by different cell types). Indeed, recent evidence suggests that doublets are a serious and strongly underestimated issue in, for instance, 10x Flex datasets (see Howitt et al., 2024). The methods presented here, in contrast, are primarily geared towards the identification of heterotypic doublets, which for most purposes are also the most critical ones.
The computeDoubletDensity method (formerly scran::doubletCells) generates random artificial doublets from the real cells, and tries to identify cells whose neighborhood has a high local density of artificial doublets. See computeDoubletDensity for more information.
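A basic call might look like the sketch below. It assumes simulated data from mockDoubletSCE() (a helper shipped with the package), default parameters throughout, and that the simulated object labels each cell as singlet or doublet in its type column. The seed is arbitrary.

```r
library(scDblFinder)

set.seed(123)
# Simulated dataset with two cell types and a fraction of known doublets
sce <- mockDoubletSCE()

# One score per cell; higher values indicate a neighborhood
# that is richer in artificial doublets
scores <- computeDoubletDensity(sce)
summary(scores)

# Median score among simulated singlets and doublets
tapply(scores, sce$type, median)
```

The scores are relative rather than calibrated probabilities, so any cutoff would need to be chosen with the dataset in mind.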
The recoverDoublets method is meant to be used when some doublets are already known, for instance through genotype-based calls or cell hashing in multiplexed experiments. The function then tries to identify intra-sample doublets that are neighbors to the known inter-sample doublets. See recoverDoublets for more information.
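The idea can be sketched on mock data as follows. Everything about the simulation is an assumption for illustration: two pure cell types, twenty simulated doublets of which half are pretended to be known, and two samples of equal size.

```r
library(scDblFinder)

set.seed(100)
ngenes <- 1000
mu1 <- 2^rnorm(ngenes, sd = 2)
mu2 <- 2^rnorm(ngenes, sd = 2)
counts.1 <- matrix(rpois(ngenes * 100, mu1), nrow = ngenes)       # pure type 1
counts.2 <- matrix(rpois(ngenes * 100, mu2), nrow = ngenes)       # pure type 2
counts.m <- matrix(rpois(ngenes * 20, mu1 + mu2), nrow = ngenes)  # doublets (1 & 2)
all.counts <- cbind(counts.1, counts.2, counts.m)

# recoverDoublets expects log-normalized expression for matrix input
lcounts <- scuttle::normalizeCounts(all.counts)

# Pretend that half of the simulated doublets were caught by hashing/genotyping
known <- c(rep(FALSE, 200), rep(c(TRUE, FALSE), each = 10))

# samples = c(1, 1): two samples contributing equal numbers of cells
out <- recoverDoublets(lcounts, doublets = known, k = 10, samples = c(1, 1))
out
```

The returned table has one row per cell, with the proportion of known doublets among its neighbors, the known status, and the predicted intra-sample doublets.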
The scDblFinder method combines both known doublets (if available) and cluster-based artificial doublets to identify doublets. The approach builds and improves on a variety of earlier efforts, and is at present the most accurate approach included in this package. See scDblFinder for more information.
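A minimal usage sketch, assuming simulated data from mockDoubletSCE() and default settings. The seed is arbitrary, and the comparison against sce$type is only possible because the simulation records the truth.

```r
library(scDblFinder)
library(SingleCellExperiment)

set.seed(123)
sce <- mockDoubletSCE()

# Adds doublet scores and calls to the colData of the object
sce <- scDblFinder(sce)

# Compare the calls with the simulated truth
table(truth = sce$type, call = sce$scDblFinder.class)
head(sce$scDblFinder.score)
```

For data comprising several captures, the sample labels would typically be supplied through the samples argument so that each capture is processed separately.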
The directDblClassification method identifies doublets by training a classifier directly on gene expression. This follows the same procedure as scDblFinder for doublet generation and iterative training, but skips the k-nearest neighbor step and directly uses the matrix of real cells and artificial doublets. This is computationally more intensive and generally leads to worse predictions than scDblFinder, and it is included chiefly for comparative purposes. See ?directDblClassification for more information.
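For comparison purposes, a call could be sketched as below, again assuming simulated data from mockDoubletSCE() and default parameters. Rather than assuming the names of the output columns, the sketch simply lists whatever the call adds to the colData.

```r
library(scDblFinder)
library(SingleCellExperiment)

set.seed(123)
sce <- mockDoubletSCE()

# Train the classifier directly on expression (no kNN step)
out <- directDblClassification(sce)

# Columns added to the cell metadata by the call
setdiff(colnames(colData(out)), colnames(colData(sce)))
```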
The findDoubletClusters method identifies clusters that are likely to be composed of doublets by estimating whether their expression profile lies between two other clusters. See findDoubletClusters for more information.
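The following sketch illustrates the cluster-level approach on mock data. The simulation is an assumption made for illustration: two pure clusters plus a third, smaller cluster whose expected expression is the sum of the other two. Gene and cluster names are made up.

```r
library(scDblFinder)

set.seed(100)
ngenes <- 100
mu1 <- 2^rexp(ngenes)
mu2 <- 2^rnorm(ngenes)
counts.1 <- matrix(rpois(ngenes * 100, mu1), nrow = ngenes)       # cluster 1
counts.2 <- matrix(rpois(ngenes * 100, mu2), nrow = ngenes)       # cluster 2
counts.m <- matrix(rpois(ngenes * 20, mu1 + mu2), nrow = ngenes)  # mixture of 1 & 2
counts <- cbind(counts.1, counts.2, counts.m)
rownames(counts) <- paste0("gene", seq_len(ngenes))
clusters <- rep(c("1", "2", "3"), c(100, 100, 20))

# One row per cluster, with its most likely pair of source clusters
dbl <- findDoubletClusters(counts, clusters)
dbl
```

Clusters with few genes that are uniquely differentially expressed against both putative sources (the num.de column) are the more likely doublet candidates; here that would be expected to be cluster 3.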
A benchmark of the main methods available in the package is presented in the scDblFinder paper. While the different methods included here each have their merits, the scDblFinder method had the best overall performance (also superior to other methods not included in this package), and should be used by default.
## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] bluster_1.17.0 scDblFinder_1.21.0
## [3] scater_1.34.0 ggplot2_3.5.1
## [5] scran_1.34.0 scuttle_1.16.0
## [7] ensembldb_2.31.0 AnnotationFilter_1.31.0
## [9] GenomicFeatures_1.59.0 AnnotationDbi_1.69.0
## [11] scRNAseq_2.19.1 SingleCellExperiment_1.28.0
## [13] SummarizedExperiment_1.36.0 Biobase_2.67.0
## [15] GenomicRanges_1.59.0 GenomeInfoDb_1.43.0
## [17] IRanges_2.41.0 S4Vectors_0.44.0
## [19] BiocGenerics_0.53.0 MatrixGenerics_1.19.0
## [21] matrixStats_1.4.1 BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] sys_3.4.3 jsonlite_1.8.9 magrittr_2.0.3
## [4] ggbeeswarm_0.7.2 gypsum_1.3.0 farver_2.1.2
## [7] rmarkdown_2.28 BiocIO_1.17.0 zlibbioc_1.52.0
## [10] vctrs_0.6.5 memoise_2.0.1 Rsamtools_2.22.0
## [13] RCurl_1.98-1.16 htmltools_0.5.8.1 S4Arrays_1.6.0
## [16] AnnotationHub_3.15.0 curl_5.2.3 BiocNeighbors_2.1.0
## [19] xgboost_1.7.8.1 Rhdf5lib_1.28.0 SparseArray_1.6.0
## [22] rhdf5_2.50.0 sass_0.4.9 alabaster.base_1.7.0
## [25] bslib_0.8.0 alabaster.sce_1.7.0 httr2_1.0.5
## [28] cachem_1.1.0 buildtools_1.0.0 GenomicAlignments_1.43.0
## [31] igraph_2.1.1 mime_0.12 lifecycle_1.0.4
## [34] pkgconfig_2.0.3 rsvd_1.0.5 Matrix_1.7-1
## [37] R6_2.5.1 fastmap_1.2.0 GenomeInfoDbData_1.2.13
## [40] digest_0.6.37 colorspace_2.1-1 dqrng_0.4.1
## [43] irlba_2.3.5.1 ExperimentHub_2.15.0 RSQLite_2.3.7
## [46] beachmat_2.23.0 labeling_0.4.3 filelock_1.0.3
## [49] fansi_1.0.6 httr_1.4.7 abind_1.4-8
## [52] compiler_4.4.1 bit64_4.5.2 withr_3.0.2
## [55] BiocParallel_1.41.0 viridis_0.6.5 DBI_1.2.3
## [58] highr_0.11 HDF5Array_1.35.0 alabaster.ranges_1.7.0
## [61] alabaster.schemas_1.7.0 MASS_7.3-61 rappdirs_0.3.3
## [64] DelayedArray_0.33.1 rjson_0.2.23 tools_4.4.1
## [67] vipor_0.4.7 beeswarm_0.4.0 glue_1.8.0
## [70] restfulr_0.0.15 rhdf5filters_1.18.0 grid_4.4.1
## [73] Rtsne_0.17 cluster_2.1.6 generics_0.1.3
## [76] gtable_0.3.6 data.table_1.16.2 BiocSingular_1.23.0
## [79] ScaledMatrix_1.14.0 metapod_1.14.0 utf8_1.2.4
## [82] XVector_0.46.0 ggrepel_0.9.6 BiocVersion_3.21.1
## [85] pillar_1.9.0 limma_3.63.0 dplyr_1.1.4
## [88] BiocFileCache_2.15.0 lattice_0.22-6 rtracklayer_1.66.0
## [91] bit_4.5.0 tidyselect_1.2.1 locfit_1.5-9.10
## [94] maketools_1.3.1 Biostrings_2.75.0 knitr_1.48
## [97] gridExtra_2.3 ProtGenerics_1.38.0 edgeR_4.4.0
## [100] xfun_0.48 statmod_1.5.0 UCSC.utils_1.2.0
## [103] lazyeval_0.2.2 yaml_2.3.10 evaluate_1.0.1
## [106] codetools_0.2-20 tibble_3.2.1 alabaster.matrix_1.7.0
## [109] BiocManager_1.30.25 cli_3.6.3 munsell_0.5.1
## [112] jquerylib_0.1.4 Rcpp_1.0.13 dbplyr_2.5.0
## [115] png_0.1-8 XML_3.99-0.17 parallel_4.4.1
## [118] blob_1.2.4 bitops_1.0-9 viridisLite_0.4.2
## [121] alabaster.se_1.7.0 scales_1.3.0 purrr_1.0.2
## [124] crayon_1.5.3 rlang_1.1.4 KEGGREST_1.47.0