See the relevant section of the OSCA book for an example of the recoverDoublets() function in action on real data. A toy example is also provided in ?recoverDoublets.
Consider any two cell states $C_1$ and $C_2$ forming a doublet population $D_{12}$. We will focus on the relative frequency of inter-sample to intra-sample doublets in $D_{12}$. Given a vector $\vec p_X$ containing the proportion of cells from each sample in state $X$, and assuming that doublets form randomly between pairs of samples, the expected proportion of intra-sample doublets in $D_{12}$ is $\vec p_{C_1} \cdot \vec p_{C_2}$. Subtracting this from 1 gives us the expected proportion of inter-sample doublets $q_{D_{12}}$. Similarly, the expected proportion of inter-sample doublets in $C_1$ is just $q_{C_1} = 1 - \| \vec p_{C_1} \|_2^2$.
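As a rough numerical illustration of these expectations, the sketch below evaluates the dot product and the squared norm for three hypothetical samples. The variable names (p_C1, p_C2, and so on) and the proportions themselves are made up for illustration and do not come from any real dataset.

```r
# Hypothetical proportions of cells from each of three samples in states
# C1 and C2. Each vector sums to 1; the values are made up.
p_C1 <- c(0.5, 0.3, 0.2)
p_C2 <- c(0.2, 0.3, 0.5)

# Expected proportion of intra-sample doublets in D12 (dot product).
intra_D12 <- sum(p_C1 * p_C2)

# Expected proportion of inter-sample doublets in D12.
q_D12 <- 1 - intra_D12

# Expected proportion of inter-sample doublets among doublets formed
# within C1, i.e., one minus the squared L2 norm of p_C1.
q_C1 <- 1 - sum(p_C1^2)

print(c(intra_D12 = intra_D12, q_D12 = q_D12, q_C1 = q_C1))
```

With these particular values, the inter-sample proportion is higher in the doublet population than within C1, which is the situation assumed in the reasoning that follows.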
Now, let’s consider the observed proportion of events $r_X$ in each state $X$ that are known doublets. We have $r_{D_{12}} = q_{D_{12}}$ as there are no other events in $D_{12}$ beyond actual doublets. On the other hand, we expect that $r_{C_1} \ll q_{C_1}$ due to the presence of a large majority of non-doublet cells in $C_1$ (same for $C_2$). If we assume that $q_{D_{12}} \ge q_{C_1}$ and $q_{D_{12}} \ge q_{C_2}$, the observed proportion $r_{D_{12}}$ should be larger than $r_{C_1}$ and $r_{C_2}$. (The last assumption is not always true but the $\ll$ should give us enough wiggle room to be robust to violations.)
The above reasoning motivates the use of the proportion of known doublet neighbors as a “doublet score” to identify events that are most likely to be themselves doublets. recoverDoublets() computes the proportion of known doublet neighbors for each cell by performing a k-nearest neighbor search against all other cells in the dataset. It is then straightforward to calculate the proportion of neighboring cells that are marked as known doublets, representing our estimate of $r_X$ for each cell.
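The scoring idea can be sketched in base R with a brute-force distance matrix. This is only an illustration of the concept under stated assumptions, not how recoverDoublets() is implemented internally: the toy coordinates, the choice of k = 5, and the names coords, known and score are all hypothetical.

```r
set.seed(42)

# Toy data: 60 events in 2 dimensions (rows are events), purely made up.
coords <- rbind(
    matrix(rnorm(80, mean = 0), ncol = 2),  # 40 events in a singlet state
    matrix(rnorm(40, mean = 6), ncol = 2)   # 20 events in a doublet state
)

# Pretend that half of the doublet state are known inter-sample doublets.
known <- logical(nrow(coords))
known[41:50] <- TRUE

k <- 5
d <- as.matrix(dist(coords))
diag(d) <- Inf  # an event is not its own neighbor

# For each event, take the k nearest neighbors and compute the
# proportion of them that are known doublets.
score <- apply(d, 1, function(dists) {
    nn <- order(dists)[seq_len(k)]
    mean(known[nn])
})

# Average score within each group of events.
print(tapply(score, rep(c("singlet", "doublet"), c(40, 20)), mean))
```

Events in the doublet state that were not flagged as known (rows 51 to 60 here) should still receive high scores because their neighbors are enriched for known doublets.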
While the proportions are informative, there comes a time when we need to convert these into explicit doublet calls. This is achieved with $\vec S$, the vector of the proportion of cells from each sample across the entire dataset (i.e., samples). We assume that all cell states contributing to doublet states have proportion vectors equal to $\vec S$, such that the expected proportion of doublets that occur between cells from the same sample is $\| \vec S \|_2^2$. We then solve
$$ \frac{N_{intra}}{N_{intra} + N_{inter}} = \| \vec S\|_2^2 $$
for $N_{intra}$, where $N_{inter}$ is the number of observed inter-sample doublets. The top $N_{intra}$ events with the highest scores (excluding those that are already known inter-sample doublets) are marked as putative intra-sample doublets.
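Rearranging the equation gives the number of intra-sample doublets as the number of inter-sample doublets multiplied by the squared norm of the sample proportions and divided by one minus that squared norm. The sketch below applies this to made-up numbers and then picks the top-scoring events. The sample sizes, the scores, and the use of round() to obtain an integer are assumptions for illustration; the function itself may handle these details differently.

```r
# Hypothetical number of cells in each of three multiplexed samples.
samples <- c(1000, 1000, 2000)
S <- samples / sum(samples)
s2 <- sum(S^2)  # expected proportion of intra-sample doublets

# Hypothetical number of observed inter-sample doublets.
N_inter <- 100

# Solve N_intra / (N_intra + N_inter) = s2 for N_intra.
N_intra <- round(N_inter * s2 / (1 - s2))

# Made-up doublet scores for 500 events, of which the first 100 are
# the known inter-sample doublets.
set.seed(1)
score <- runif(500)
known <- seq_len(500) <= N_inter

# Mark the top N_intra events by score, skipping known doublets.
candidates <- which(!known)
top <- candidates[order(score[candidates], decreasing = TRUE)[seq_len(N_intra)]]
predicted <- logical(length(score))
predicted[top] <- TRUE

print(N_intra)
print(table(known = known, predicted = predicted))
```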
The rate and manner of doublet formation are (mostly) irrelevant as we condition on the number of events in $D_{12}$. This means that we do not have to make any assumptions about the relative likelihood of doublets forming between pairs of cell types, especially when cell types have different levels of “stickiness” (or worse, stick specifically to certain other cell types). Such convenience is only possible because of the known doublet calls that allow us to focus on the inter- to intra-sample ratio.
The most problematic assumption is the one required to obtain $N_{intra}$ from $\vec S$. Obtaining a better estimate would require, at least, the knowledge of the two parent states for each doublet population. This can be determined with some simulation-based heuristics but it is likely to be more trouble than it is worth.
In this theoretical framework, we can easily spot a case where our method fails. If both $C_1$ and $C_2$ are unique to a given sample, all events in $D_{12}$ will be intra-sample doublets. This means that no events in $D_{12}$ will ever be detected as inter-sample doublets, which precludes their detection as intra-sample doublets by recoverDoublets(). The computational remedy is to augment the predictions with simulation-based methods (e.g., scDblFinder()) while the experimental remedy is to ensure that multiplexed samples include technical or biological replicates.
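The failure case can be checked with the same dot-product arithmetic used earlier. In this hypothetical sketch, both parent states occur only in the first of three samples, and the expected inter-sample proportion drops to zero; splitting the same states across a replicate restores a non-zero proportion. All values and names are made up for illustration.

```r
# Both parent states are present only in sample 1 (hypothetical values).
p_C1 <- c(1, 0, 0)
p_C2 <- c(1, 0, 0)
q_unique <- 1 - sum(p_C1 * p_C2)

# Same states, but now split evenly across samples 1 and 2, as might
# happen if sample 2 were a replicate of sample 1.
p_C1_rep <- c(0.5, 0.5, 0)
p_C2_rep <- c(0.5, 0.5, 0)
q_replicate <- 1 - sum(p_C1_rep * p_C2_rep)

print(c(unique = q_unique, replicate = q_replicate))
```

With no expected inter-sample doublets in the first scenario, there are no known doublets to seed the neighbor-based score for that population.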
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] bluster_1.17.0 scDblFinder_1.21.0
## [3] scater_1.35.0 ggplot2_3.5.1
## [5] scran_1.35.0 scuttle_1.17.0
## [7] ensembldb_2.31.0 AnnotationFilter_1.31.0
## [9] GenomicFeatures_1.59.1 AnnotationDbi_1.69.0
## [11] scRNAseq_2.20.0 SingleCellExperiment_1.29.1
## [13] SummarizedExperiment_1.37.0 Biobase_2.67.0
## [15] GenomicRanges_1.59.1 GenomeInfoDb_1.43.2
## [17] IRanges_2.41.1 S4Vectors_0.45.2
## [19] BiocGenerics_0.53.3 generics_0.1.3
## [21] MatrixGenerics_1.19.0 matrixStats_1.4.1
## [23] BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] sys_3.4.3 jsonlite_1.8.9 magrittr_2.0.3
## [4] ggbeeswarm_0.7.2 gypsum_1.3.0 farver_2.1.2
## [7] rmarkdown_2.29 BiocIO_1.17.1 zlibbioc_1.52.0
## [10] vctrs_0.6.5 memoise_2.0.1 Rsamtools_2.23.1
## [13] RCurl_1.98-1.16 htmltools_0.5.8.1 S4Arrays_1.7.1
## [16] AnnotationHub_3.15.0 curl_6.0.1 BiocNeighbors_2.1.1
## [19] xgboost_1.7.8.1 Rhdf5lib_1.29.0 SparseArray_1.7.2
## [22] rhdf5_2.51.0 sass_0.4.9 alabaster.base_1.7.2
## [25] bslib_0.8.0 alabaster.sce_1.7.0 httr2_1.0.7
## [28] cachem_1.1.0 buildtools_1.0.0 GenomicAlignments_1.43.0
## [31] igraph_2.1.1 mime_0.12 lifecycle_1.0.4
## [34] pkgconfig_2.0.3 rsvd_1.0.5 Matrix_1.7-1
## [37] R6_2.5.1 fastmap_1.2.0 GenomeInfoDbData_1.2.13
## [40] digest_0.6.37 colorspace_2.1-1 dqrng_0.4.1
## [43] irlba_2.3.5.1 ExperimentHub_2.15.0 RSQLite_2.3.8
## [46] beachmat_2.23.2 labeling_0.4.3 filelock_1.0.3
## [49] fansi_1.0.6 httr_1.4.7 abind_1.4-8
## [52] compiler_4.4.2 bit64_4.5.2 withr_3.0.2
## [55] BiocParallel_1.41.0 viridis_0.6.5 DBI_1.2.3
## [58] HDF5Array_1.35.2 alabaster.ranges_1.7.0 alabaster.schemas_1.7.0
## [61] MASS_7.3-61 rappdirs_0.3.3 DelayedArray_0.33.2
## [64] rjson_0.2.23 tools_4.4.2 vipor_0.4.7
## [67] beeswarm_0.4.0 glue_1.8.0 restfulr_0.0.15
## [70] rhdf5filters_1.19.0 grid_4.4.2 Rtsne_0.17
## [73] cluster_2.1.6 gtable_0.3.6 data.table_1.16.2
## [76] BiocSingular_1.23.0 ScaledMatrix_1.15.0 metapod_1.15.0
## [79] utf8_1.2.4 XVector_0.47.0 ggrepel_0.9.6
## [82] BiocVersion_3.21.1 pillar_1.9.0 limma_3.63.2
## [85] dplyr_1.1.4 BiocFileCache_2.15.0 lattice_0.22-6
## [88] rtracklayer_1.67.0 bit_4.5.0 tidyselect_1.2.1
## [91] locfit_1.5-9.10 maketools_1.3.1 Biostrings_2.75.1
## [94] knitr_1.49 gridExtra_2.3 ProtGenerics_1.39.0
## [97] edgeR_4.5.0 xfun_0.49 statmod_1.5.0
## [100] UCSC.utils_1.3.0 lazyeval_0.2.2 yaml_2.3.10
## [103] evaluate_1.0.1 codetools_0.2-20 tibble_3.2.1
## [106] alabaster.matrix_1.7.2 BiocManager_1.30.25 cli_3.6.3
## [109] munsell_0.5.1 jquerylib_0.1.4 Rcpp_1.0.13-1
## [112] dbplyr_2.5.0 png_0.1-8 XML_3.99-0.17
## [115] parallel_4.4.2 blob_1.2.4 bitops_1.0-9
## [118] viridisLite_0.4.2 alabaster.se_1.7.0 scales_1.3.0
## [121] purrr_1.0.2 crayon_1.5.3 rlang_1.1.4
## [124] KEGGREST_1.47.0