scds:single cell doublet scoring: In-silico doublet annotation for single cell RNA sequencing data

Introduction

In this vignette, we provide an overview of the basic functionality and usage of the scds package, which interfaces with SingleCellExperiment objects.

Installation

Install the scds package using Bioconductor:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("scds", version = "3.9")

Or from github:

library(devtools)
devtools::install_github('kostkalab/scds')

Quick start

scds takes as input a SingleCellExperiment object (see here SingleCellExperiment), where raw counts are stored in a counts assay, i.e. assay(sce,"counts"). An example dataset created by sub-sampling the cell-hashing cell-lines data set (see https://satijalab.org/seurat/hashing_vignette.html) is included with the package and accessible via data("sce").Note that scds is designed to workd with larger datasets, but for the purposes of this vignette, we work with a smaller example dataset. We apply scds to this data and compare/visualize reasults:

Example data set

Get example data set provided with the package.

library(scds)
library(scater)
library(rsvd)
library(Rtsne)
library(cowplot)
set.seed(30519)
data("sce_chcl")
sce = sce_chcl #- less typing
dim(sce)
## [1] 2000 2000

We see it contains 2,000 genes and 2,000 cells, 216 of which are identified as doublets:

table(sce$hto_classification_global)
## 
##  Doublet Negative  Singlet 
##      216       83     1701

We can visualize cells/doublets after projecting into two dimensions:

logcounts(sce) = log1p(counts(sce))
vrs            = apply(logcounts(sce),1,var)
pc             = rpca(t(logcounts(sce)[order(vrs,decreasing=TRUE)[1:100],]))
ts             = Rtsne(pc$x[,1:10],verb=FALSE)

reducedDim(sce,"tsne") = ts$Y; rm(ts,vrs,pc)
plotReducedDim(sce,"tsne",color_by="hto_classification_global")

Computational doublet annotation

We now run the scds doublet annotation approaches. Briefly, we identify doublets in two complementary ways: cxds is based on co-expression of gene pairs and works with absence/presence calls only, while bcds uses the full count information and a binary classification approach using artificially generated doublets. cxds_bcds_hybrid combines both approaches, for more details please consult (this manuscript). Each of the three methods returns a doublet score, with higher scores indicating more “doublet-like” barcodes.

#- Annotate doublet using co-expression based doublet scoring:
sce = cxds(sce,retRes = TRUE)
sce = bcds(sce,retRes = TRUE,verb=TRUE)
sce = cxds_bcds_hybrid(sce)
par(mfcol=c(1,3))
boxplot(sce$cxds_score   ~ sce$doublet_true_labels, main="cxds")
boxplot(sce$bcds_score   ~ sce$doublet_true_labels, main="bcds")
boxplot(sce$hybrid_score ~ sce$doublet_true_labels, main="hybrid")

Visualizing gene pairs

For cxds we can identify and visualize gene pairs driving doublet annoataions, with the expectation that the two genes in a pair might mark different types of cells (see manuscript). In the following we look at the top three pairs, each gene pair is a row in the plot below:

scds =
top3 = metadata(sce)$cxds$topPairs[1:3,]
rs   = rownames(sce)
hb   = rowData(sce)$cxds_hvg_bool
ho   = rowData(sce)$cxds_hvg_ordr[hb]
hgs  = rs[ho]

l1 =  ggdraw() + draw_text("Pair 1", x = 0.5, y = 0.5)
p1 = plotReducedDim(sce,"tsne",color_by=hgs[top3[1,1]])
p2 = plotReducedDim(sce,"tsne",color_by=hgs[top3[1,2]])

l2 =  ggdraw() + draw_text("Pair 2", x = 0.5, y = 0.5)
p3 = plotReducedDim(sce,"tsne",color_by=hgs[top3[2,1]])
p4 = plotReducedDim(sce,"tsne",color_by=hgs[top3[2,2]])

l3 = ggdraw() + draw_text("Pair 3", x = 0.5, y = 0.5)
p5 = plotReducedDim(sce,"tsne",color_by=hgs[top3[3,1]])
p6 = plotReducedDim(sce,"tsne",color_by=hgs[top3[3,2]])

plot_grid(l1,p1,p2,l2,p3,p4,l3,p5,p6,ncol=3, rel_widths = c(1,2,2))

Session Info

sessionInfo()
## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Etc/UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] cowplot_1.1.3               Rtsne_0.17                 
##  [3] rsvd_1.0.5                  scater_1.34.0              
##  [5] ggplot2_3.5.1               scuttle_1.16.0             
##  [7] SingleCellExperiment_1.28.0 SummarizedExperiment_1.36.0
##  [9] Biobase_2.67.0              GenomicRanges_1.59.0       
## [11] GenomeInfoDb_1.43.0         IRanges_2.41.0             
## [13] S4Vectors_0.44.0            BiocGenerics_0.53.0        
## [15] MatrixGenerics_1.19.0       matrixStats_1.4.1          
## [17] scds_1.23.0                 BiocStyle_2.35.0           
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_1.2.1        viridisLite_0.4.2       farver_2.1.2           
##  [4] dplyr_1.1.4             vipor_0.4.7             viridis_0.6.5          
##  [7] fastmap_1.2.0           pROC_1.18.5             digest_0.6.37          
## [10] lifecycle_1.0.4         magrittr_2.0.3          compiler_4.4.1         
## [13] rlang_1.1.4             sass_0.4.9              tools_4.4.1            
## [16] utf8_1.2.4              yaml_2.3.10             data.table_1.16.2      
## [19] knitr_1.48              labeling_0.4.3          S4Arrays_1.6.0         
## [22] xgboost_1.7.8.1         DelayedArray_0.33.1     plyr_1.8.9             
## [25] abind_1.4-8             BiocParallel_1.41.0     withr_3.0.2            
## [28] sys_3.4.3               grid_4.4.1              fansi_1.0.6            
## [31] beachmat_2.23.0         colorspace_2.1-1        scales_1.3.0           
## [34] cli_3.6.3               rmarkdown_2.28          crayon_1.5.3           
## [37] generics_0.1.3          httr_1.4.7              ggbeeswarm_0.7.2       
## [40] cachem_1.1.0            zlibbioc_1.52.0         parallel_4.4.1         
## [43] BiocManager_1.30.25     XVector_0.46.0          vctrs_0.6.5            
## [46] Matrix_1.7-1            jsonlite_1.8.9          BiocSingular_1.23.0    
## [49] BiocNeighbors_2.1.0     ggrepel_0.9.6           irlba_2.3.5.1          
## [52] beeswarm_0.4.0          maketools_1.3.1         jquerylib_0.1.4        
## [55] glue_1.8.0              codetools_0.2-20        gtable_0.3.6           
## [58] UCSC.utils_1.2.0        ScaledMatrix_1.14.0     munsell_0.5.1          
## [61] tibble_3.2.1            pillar_1.9.0            htmltools_0.5.8.1      
## [64] GenomeInfoDbData_1.2.13 R6_2.5.1                evaluate_1.0.1         
## [67] lattice_0.22-6          highr_0.11              bslib_0.8.0            
## [70] Rcpp_1.0.13             gridExtra_2.3           SparseArray_1.6.0      
## [73] xfun_0.48               buildtools_1.0.0        pkgconfig_2.0.3