tl;dr

See the relevant section of the OSCA book for an example of the recoverDoublets() function in action on real data. A toy example is also provided in ?recoverDoublets.

Mathematical background

Consider any two cell states C₁ and C₂ forming a doublet population D₁₂. We will focus on the relative frequency of inter-sample to intra-sample doublets in D₁₂. Given a vector p⃗_X containing the proportion of cells from each sample in state X, and assuming that doublets form randomly between pairs of samples, the expected proportion of intra-sample doublets in D₁₂ is p⃗_C₁ ⋅ p⃗_C₂. Subtracting this from 1 gives us the expected proportion of inter-sample doublets q_D₁₂. Similarly, the expected proportion of inter-sample doublets in C₁ is just q_C₁ = 1−∥p⃗_C₁∥₂².
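As a quick numerical check of these expressions, the sketch below uses made-up proportion vectors for three samples. The names p_C1, p_C2 and the sample labels are hypothetical and chosen only for illustration.

```r
# Hypothetical proportions of cells from samples A, B and C in each state.
p_C1 <- c(A = 0.5, B = 0.3, C = 0.2)
p_C2 <- c(A = 0.2, B = 0.2, C = 0.6)

# Expected proportion of intra-sample doublets in D12 (dot product).
intra_D12 <- sum(p_C1 * p_C2)   # 0.28

# Expected proportion of inter-sample doublets in D12.
q_D12 <- 1 - intra_D12          # 0.72

# Expected proportion of inter-sample doublets among C1-C1 doublets.
q_C1 <- 1 - sum(p_C1^2)         # 0.62
```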

Now, let’s consider the observed proportion of events r_X in each state X that are known doublets. We have r_D₁₂ = q_D₁₂ as there are no other events in D₁₂ beyond actual doublets. On the other hand, we expect that r_C₁ ≪ q_C₁ due to the presence of a large majority of non-doublet cells in C₁ (and likewise for C₂). If we assume that q_D₁₂ ≥ q_C₁ and q_D₁₂ ≥ q_C₂, the observed proportion r_D₁₂ should be larger than r_C₁ and r_C₂. (The last assumption is not always true but the ≪ should give us enough wiggle room to be robust to violations.)

The above reasoning motivates the use of the proportion of known doublet neighbors as a “doublet score” to identify events that are most likely to be themselves doublets. For each cell, recoverDoublets() performs a k-nearest neighbor search against all other cells in the dataset. It is then straightforward to calculate the proportion of neighboring cells that are marked as known doublets, which serves as our estimate of r_X for that cell.
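The scoring idea can be sketched in base R on simulated data. This is only an illustration under stated assumptions, not the implementation used by recoverDoublets(): the toy coordinates, the state labels, the known vector and the brute-force dist() search are all hypothetical stand-ins.

```r
set.seed(42)

# Hypothetical toy data: two singlet states and their doublet state in 2D.
state <- rep(c("C1", "C2", "D12"), c(100, 100, 40))
centers <- rbind(C1 = c(0, 0), C2 = c(10, 0), D12 = c(5, 8))
coords <- centers[state, ] + matrix(rnorm(length(state) * 2), ncol = 2)

# Hypothetical known inter-sample doublets: most of D12, very few of C1 and C2.
known <- logical(length(state))
known[which(state == "D12")[1:28]] <- TRUE
known[which(state == "C1")[1:2]] <- TRUE
known[which(state == "C2")[1:2]] <- TRUE

k <- 10

# Brute-force pairwise distances; only sensible for toy data of this size.
d <- as.matrix(dist(coords))
diag(d) <- Inf  # exclude each cell from its own neighbor set

# Score: proportion of each cell's k nearest neighbors that are known doublets.
score <- apply(d, 1, function(row) {
    nn <- order(row)[seq_len(k)]
    mean(known[nn])
})

# Average score per state; D12 should stand out from C1 and C2.
tapply(score, state, mean)
```

Cells in the simulated doublet state receive much higher scores than cells in either parent state, even though some of the D12 cells were never marked as known doublets.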

Obtaining explicit calls

While the proportions are informative, there comes a time when we need to convert these into explicit doublet calls. This is achieved with S⃗, the vector of the proportion of cells from each sample across the entire dataset (i.e., the samples argument). We assume that all cell states contributing to doublet states have proportion vectors equal to S⃗, such that the expected proportion of doublets that occur between cells from the same sample is ∥S⃗∥₂². We then solve

$$ \frac{N_{intra}}{N_{intra} + N_{inter}} = \| \vec S\|_2^2 $$

for N_intra, where N_inter is the number of observed inter-sample doublets. The N_intra events with the highest scores that are not already known inter-sample doublets are marked as putative intra-sample doublets.
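Rearranging the equation gives N_intra = N_inter ∥S⃗∥₂² / (1 − ∥S⃗∥₂²). The sketch below works through this with made-up numbers and then selects the top-scoring events. The values of S and N_inter, the random scores and the rounding of N_intra are assumptions made for illustration and may differ from what recoverDoublets() does internally.

```r
# Hypothetical proportions of cells from each of four samples (sums to 1).
S <- c(0.4, 0.3, 0.2, 0.1)

# Expected proportion of intra-sample doublets under random pairing.
s2 <- sum(S^2)                        # 0.3

# Hypothetical number of observed inter-sample doublets.
N_inter <- 700

# Solve N_intra / (N_intra + N_inter) = s2 for N_intra.
N_intra <- N_inter * s2 / (1 - s2)    # 300

# Hypothetical scores and known calls for 5000 events.
set.seed(1)
score <- runif(5000)
known <- seq_along(score) <= N_inter

# Mark the top-scoring events that are not already known doublets.
n_call <- round(N_intra)
candidates <- which(!known)
top <- candidates[order(score[candidates], decreasing = TRUE)[seq_len(n_call)]]
predicted <- logical(length(score))
predicted[top] <- TRUE
```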

Discussion

The rate and manner of doublet formation are (mostly) irrelevant as we condition on the number of events in D₁₂. This means that we do not have to make any assumptions about the relative likelihood of doublets forming between pairs of cell types, especially when cell types have different levels of “stickiness” (or worse, stick specifically to certain other cell types). Such convenience is only possible because of the known doublet calls that allow us to focus on the inter- to intra-sample ratio.

The most problematic assumption is the one required to obtain N_intra from S⃗, namely that every contributing cell state has the same sample proportions as the entire dataset. Obtaining a better estimate would require, at least, knowledge of the two parent states for each doublet population. This can be determined with some simulation-based heuristics but it is likely to be more trouble than it is worth.

In this theoretical framework, we can easily spot a case where our method fails. If both C₁ and C₂ are unique to a given sample, all events in D₁₂ will be intra-sample doublets. This means that no events in D₁₂ will ever be detected as inter-sample doublets, which precludes their detection as intra-sample doublets by recoverDoublets(). The computational remedy is to augment the predictions with simulation-based methods (e.g., scDblFinder()) while the experimental remedy is to ensure that multiplexed samples include technical or biological replicates.
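The failure case, and the effect of adding a replicate, can be seen directly from the expected inter-sample proportion. The proportion vectors below are hypothetical and describe a three-sample experiment.

```r
# Both states occur only in sample A: every D12 doublet is intra-sample.
p_C1 <- c(A = 1, B = 0, C = 0)
p_C2 <- c(A = 1, B = 0, C = 0)
q_D12 <- 1 - sum(p_C1 * p_C2)              # 0

# With a replicate, the same states are split evenly across samples A and B.
p_C1_rep <- c(A = 0.5, B = 0.5, C = 0)
p_C2_rep <- c(A = 0.5, B = 0.5, C = 0)
q_D12_rep <- 1 - sum(p_C1_rep * p_C2_rep)  # 0.5
```

With the replicate, half of the D12 doublets are expected to be inter-sample, so the population has known doublets to anchor the neighbor-based score.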

Session information

sessionInfo()

## R version 4.4.3 (2025-02-28)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.2 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Etc/UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] bluster_1.17.0              scDblFinder_1.21.1         
##  [3] scater_1.35.4               ggplot2_3.5.1              
##  [5] scran_1.35.0                scuttle_1.17.0             
##  [7] ensembldb_2.31.0            AnnotationFilter_1.31.0    
##  [9] GenomicFeatures_1.59.1      AnnotationDbi_1.69.0       
## [11] scRNAseq_2.20.0             SingleCellExperiment_1.29.2
## [13] SummarizedExperiment_1.37.0 Biobase_2.67.0             
## [15] GenomicRanges_1.59.1        GenomeInfoDb_1.43.4        
## [17] IRanges_2.41.3              S4Vectors_0.45.4           
## [19] BiocGenerics_0.53.6         generics_0.1.3             
## [21] MatrixGenerics_1.19.1       matrixStats_1.5.0          
## [23] BiocStyle_2.35.0           
## 
## loaded via a namespace (and not attached):
##   [1] sys_3.4.3                jsonlite_1.9.1           magrittr_2.0.3          
##   [4] ggbeeswarm_0.7.2         gypsum_1.3.0             farver_2.1.2            
##   [7] rmarkdown_2.29           BiocIO_1.17.1            vctrs_0.6.5             
##  [10] memoise_2.0.1            Rsamtools_2.23.1         RCurl_1.98-1.16         
##  [13] htmltools_0.5.8.1        S4Arrays_1.7.3           AnnotationHub_3.15.0    
##  [16] curl_6.2.1               BiocNeighbors_2.1.3      xgboost_1.7.8.1         
##  [19] Rhdf5lib_1.29.1          SparseArray_1.7.6        rhdf5_2.51.2            
##  [22] sass_0.4.9               alabaster.base_1.7.8     bslib_0.9.0             
##  [25] alabaster.sce_1.7.0      httr2_1.1.1              cachem_1.1.0            
##  [28] buildtools_1.0.0         GenomicAlignments_1.43.0 igraph_2.1.4            
##  [31] mime_0.12                lifecycle_1.0.4          pkgconfig_2.0.3         
##  [34] rsvd_1.0.5               Matrix_1.7-2             R6_2.6.1                
##  [37] fastmap_1.2.0            GenomeInfoDbData_1.2.13  digest_0.6.37           
##  [40] colorspace_2.1-1         dqrng_0.4.1              irlba_2.3.5.1           
##  [43] ExperimentHub_2.15.0     RSQLite_2.3.9            beachmat_2.23.6         
##  [46] labeling_0.4.3           filelock_1.0.3           httr_1.4.7              
##  [49] abind_1.4-8              compiler_4.4.3           bit64_4.6.0-1           
##  [52] withr_3.0.2              BiocParallel_1.41.2      viridis_0.6.5           
##  [55] DBI_1.2.3                HDF5Array_1.35.15        alabaster.ranges_1.7.0  
##  [58] alabaster.schemas_1.7.0  MASS_7.3-65              rappdirs_0.3.3          
##  [61] DelayedArray_0.33.6      rjson_0.2.23             tools_4.4.3             
##  [64] vipor_0.4.7              beeswarm_0.4.0           glue_1.8.0              
##  [67] h5mread_0.99.4           restfulr_0.0.15          rhdf5filters_1.19.2     
##  [70] grid_4.4.3               Rtsne_0.17               cluster_2.1.8           
##  [73] gtable_0.3.6             data.table_1.17.0        BiocSingular_1.23.0     
##  [76] ScaledMatrix_1.15.0      metapod_1.15.0           XVector_0.47.2          
##  [79] ggrepel_0.9.6            BiocVersion_3.21.1       pillar_1.10.1           
##  [82] limma_3.63.8             dplyr_1.1.4              BiocFileCache_2.15.1    
##  [85] lattice_0.22-6           rtracklayer_1.67.1       bit_4.6.0               
##  [88] tidyselect_1.2.1         locfit_1.5-9.12          maketools_1.3.2         
##  [91] Biostrings_2.75.4        knitr_1.49               gridExtra_2.3           
##  [94] ProtGenerics_1.39.2      edgeR_4.5.8              xfun_0.51               
##  [97] statmod_1.5.0            UCSC.utils_1.3.1         lazyeval_0.2.2          
## [100] yaml_2.3.10              evaluate_1.0.3           codetools_0.2-20        
## [103] tibble_3.2.1             alabaster.matrix_1.7.8   BiocManager_1.30.25     
## [106] cli_3.6.4                munsell_0.5.1            jquerylib_0.1.4         
## [109] Rcpp_1.0.14              dbplyr_2.5.0             png_0.1-8               
## [112] XML_3.99-0.18            parallel_4.4.3           blob_1.2.4              
## [115] bitops_1.0-9             viridisLite_0.4.2        alabaster.se_1.7.0      
## [118] scales_1.3.0             purrr_1.0.4              crayon_1.5.3            
## [121] rlang_1.1.5              KEGGREST_1.47.0

Recovering intra-sample doublets
