Decontamination of single cell protein expression data with DecontPro

Introduction

DecontPro assess and decontaminate single-cell protein expression data, such as those generated from CITE-seq or Total-seq. The count matrix is decomposed into three matrices, the native, the ambient and the background that represent the contribution from the true protein expression on cells, the ambient material and other non-specific background contamination.

Installation

DecontX Package can be installed from Bioconductor:

if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
}
BiocManager::install("decontX")

Then the package can be loaded in R using the following command:

library(decontX)

To see the latest updates and releases or to post a bug, see our GitHub page at https://github.com/campbio/decontX.

Importing data

Here we use an example dataset from SingleCellMultiModal package.

library(SingleCellMultiModal)
dat <- CITEseq("cord_blood", dry.run = FALSE)
#> Warning: 'ExperimentList' contains 'data.frame' or 'DataFrame',
#>   potential for errors with mixed data types
counts <- experiments(dat)$scADT

For this tutorial, we sample only 1000 droplets from the dataset to demonstrate the use of functions. When analyzing your dataset, sub-sampling should be done with caution, as decontPro approximates contamination profile using the dataset. A biased sampling may introduce bias to the contamination profile approximation.

set.seed(42)
sample_id <- sample(dim(counts)[2], 1000, replace = FALSE)
counts_sample <- counts[, sample_id]

Generate cell clusters

decontPro requires a vector indicating the cell types of each droplet. Here we use Seurat for clustering.

library(Seurat)
library(dplyr)
adt_seurat <- CreateSeuratObject(counts_sample, assay = "ADT")
#> Warning: Data is of class matrix. Coercing to dgCMatrix.
adt_seurat <- NormalizeData(adt_seurat, normalization.method = "CLR", margin = 2) %>%
  ScaleData(assay = "ADT") %>%
  RunPCA(assay = "ADT", features = rownames(adt_seurat), npcs = 10,
  reduction.name = "pca_adt") %>%
  FindNeighbors(dims = 1:10, assay = "ADT", reduction = "pca_adt") %>%
  FindClusters(resolution = 0.5)
#> Warning in irlba(A = t(x = object), nv = npcs, ...): You're computing too large
#> a percentage of total singular values, use a standard svd instead.
#> Warning: Requested number is larger than the number of available items (13).
#> Setting to 13.
#> Warning: Requested number is larger than the number of available items (13).
#> Setting to 13.
#> Warning: Requested number is larger than the number of available items (13).
#> Setting to 13.
#> Warning: Requested number is larger than the number of available items (13).
#> Setting to 13.
#> Warning: Requested number is larger than the number of available items (13).
#> Setting to 13.
#> Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
#> 
#> Number of nodes: 1000
#> Number of edges: 32498
#> 
#> Running Louvain algorithm...
#> Maximum modularity in 10 random starts: 0.8567
#> Number of communities: 9
#> Elapsed time: 0 seconds
adt_seurat <- RunUMAP(adt_seurat,
                      dims = 1:10,
                      assay = "ADT",
                      reduction = "pca_adt",
                      reduction.name = "adtUMAP",
                      verbose = FALSE)
#> Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
#> To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
#> This message will be shown once per session
DimPlot(adt_seurat, reduction = "adtUMAP", label = TRUE)

FeaturePlot(adt_seurat, 
            features = c("CD3", "CD4", "CD8", "CD19", "CD14", "CD16", "CD56"))


clusters <- as.integer(Idents(adt_seurat))

Run DecontPro

You can run DecontPro by simply:

counts <- as.matrix(counts_sample)
out <- deconPro(counts,
                clusters)

Priors (delta_sd and background_sd) may need tuning with the help of plotting the decontaminated results. The two priors encode belief if a more spread-out count should be considered contamination vs. native. We tested the default values on datasets ranging 5k to 10k droplets and 10 to 30 ADTs and the results are reasonable. A more contaminated or a smaller dataset may need larger priors. In this tutorial, since we sampled only 1k droplets from the original 10k droplets, we use slightly larger priors:

counts <- as.matrix(counts_sample)
out <- decontPro(counts,
                 clusters,
                 delta_sd = 2e-4,
                 background_sd = 2e-5)
#> Chain 1: ------------------------------------------------------------
#> Chain 1: EXPERIMENTAL ALGORITHM:
#> Chain 1:   This procedure has not been thoroughly tested and may be unstable
#> Chain 1:   or buggy. The interface is subject to change.
#> Chain 1: ------------------------------------------------------------
#> Chain 1: 
#> Chain 1: 
#> Chain 1: 
#> Chain 1: Gradient evaluation took 0.008034 seconds
#> Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 80.34 seconds.
#> Chain 1: Adjust your expectations accordingly!
#> Chain 1: 
#> Chain 1: 
#> Chain 1: Begin eta adaptation.
#> Chain 1: Iteration:   1 / 250 [  0%]  (Adaptation)
#> Chain 1: Iteration:  50 / 250 [ 20%]  (Adaptation)
#> Chain 1: Iteration: 100 / 250 [ 40%]  (Adaptation)
#> Chain 1: Iteration: 150 / 250 [ 60%]  (Adaptation)
#> Chain 1: Iteration: 200 / 250 [ 80%]  (Adaptation)
#> Chain 1: Success! Found best value [eta = 1] earlier than expected.
#> Chain 1: 
#> Chain 1: Begin stochastic gradient ascent.
#> Chain 1:   iter             ELBO   delta_ELBO_mean   delta_ELBO_med   notes 
#> Chain 1:    100   -513803659.665             1.000            1.000
#> Chain 1:    200     -7177548.141            35.792           70.585
#> Chain 1:    300      -342758.348            30.508           19.941
#> Chain 1:    400      -253644.497            22.969           19.941
#> Chain 1:    500      -223785.346            18.402            1.000
#> Chain 1:    600      -212411.740            15.344            1.000
#> Chain 1:    700      -206476.954            13.156            0.351
#> Chain 1:    800      -202104.000            11.514            0.351
#> Chain 1:    900      -199781.145            10.236            0.133
#> Chain 1:   1000      -198758.154             9.213            0.133
#> Chain 1:   1100      -198020.264             8.376            0.054   MAY BE DIVERGING... INSPECT ELBO
#> Chain 1:   1200      -197397.014             7.678            0.054   MAY BE DIVERGING... INSPECT ELBO
#> Chain 1:   1300      -196906.436             7.088            0.029   MAY BE DIVERGING... INSPECT ELBO
#> Chain 1:   1400      -196442.570             6.582            0.029   MAY BE DIVERGING... INSPECT ELBO
#> Chain 1:   1500      -196230.683             6.143            0.022   MAY BE DIVERGING... INSPECT ELBO
#> Chain 1:   1600      -195981.145             5.759            0.022   MAY BE DIVERGING... INSPECT ELBO
#> Chain 1:   1700      -195830.350             5.420            0.012   MAY BE DIVERGING... INSPECT ELBO
#> Chain 1:   1800      -195732.312             5.119            0.012   MAY BE DIVERGING... INSPECT ELBO
#> Chain 1:   1900      -195651.552             4.850            0.005   MEDIAN ELBO CONVERGED   MAY BE DIVERGING... INSPECT ELBO
#> Chain 1: 
#> Chain 1: Drawing a sample of size 1000 from the approximate posterior... 
#> Chain 1: COMPLETED.
#> Warning: Pareto k diagnostic value is 41.49. Resampling is disabled. Decreasing
#> tol_rel_obj may help if variational algorithm has terminated prematurely.
#> Otherwise consider using sampling instead.

The output contains three matrices, and model parameters after inference. decontaminated_counts represent the true protein expression on cells.

decontaminated_counts <- out$decontaminated_counts

Plot Results

Plot ADT density before and after decontamination. For bimodal ADTs, the background peak should be removed. Note CD4 is tri-modal with the intermediate mode corresponding to monocytes. This can be used as a QC metric for decontamination as only the lowest mode should be removed.

plotDensity(counts,
            decontaminated_counts,
            c("CD3", "CD4", "CD8", "CD14", "CD16", "CD19"))

We can also visualize the decontamination by each cell cluster.

plotBoxByCluster(counts,
                 decontaminated_counts,
                 clusters,
                 c("CD3", "CD4", "CD8", "CD16"))

Session Information

sessionInfo()
#> R version 4.4.1 (2024-06-14)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] dplyr_1.1.4                 Seurat_5.1.0               
#>  [3] SeuratObject_5.0.2          sp_2.1-4                   
#>  [5] SingleCellMultiModal_1.17.4 MultiAssayExperiment_1.31.5
#>  [7] SummarizedExperiment_1.35.5 Biobase_2.67.0             
#>  [9] GenomicRanges_1.57.2        GenomeInfoDb_1.41.2        
#> [11] IRanges_2.39.2              S4Vectors_0.43.2           
#> [13] BiocGenerics_0.53.0         MatrixGenerics_1.17.1      
#> [15] matrixStats_1.4.1           decontX_1.5.0              
#> [17] BiocStyle_2.35.0           
#> 
#> loaded via a namespace (and not attached):
#>   [1] RcppAnnoy_0.0.22            splines_4.4.1              
#>   [3] later_1.3.2                 filelock_1.0.3             
#>   [5] tibble_3.2.1                polyclip_1.10-7            
#>   [7] fastDummies_1.7.4           lifecycle_1.0.4            
#>   [9] StanHeaders_2.32.10         globals_0.16.3             
#>  [11] lattice_0.22-6              MASS_7.3-61                
#>  [13] magrittr_2.0.3              plotly_4.10.4              
#>  [15] sass_0.4.9                  rmarkdown_2.28             
#>  [17] jquerylib_0.1.4             yaml_2.3.10                
#>  [19] httpuv_1.6.15               sctransform_0.4.1          
#>  [21] spam_2.11-0                 pkgbuild_1.4.5             
#>  [23] spatstat.sparse_3.1-0       reticulate_1.39.0          
#>  [25] DBI_1.2.3                   cowplot_1.1.3              
#>  [27] pbapply_1.7-2               buildtools_1.0.0           
#>  [29] RColorBrewer_1.1-3          abind_1.4-8                
#>  [31] zlibbioc_1.51.2             Rtsne_0.17                 
#>  [33] purrr_1.0.2                 rappdirs_0.3.3             
#>  [35] GenomeInfoDbData_1.2.13     ggrepel_0.9.6              
#>  [37] inline_0.3.19               irlba_2.3.5.1              
#>  [39] listenv_0.9.1               spatstat.utils_3.1-0       
#>  [41] maketools_1.3.1             goftest_1.2-3              
#>  [43] RSpectra_0.16-2             spatstat.random_3.3-2      
#>  [45] fitdistrplus_1.2-1          parallelly_1.38.0          
#>  [47] DelayedArray_0.31.14        leiden_0.4.3.1             
#>  [49] codetools_0.2-20            tidyselect_1.2.1           
#>  [51] UCSC.utils_1.1.0            farver_2.1.2               
#>  [53] BiocFileCache_2.15.0        spatstat.explore_3.3-3     
#>  [55] jsonlite_1.8.9              progressr_0.15.0           
#>  [57] ggridges_0.5.6              survival_3.7-0             
#>  [59] tools_4.4.1                 ica_1.0-3                  
#>  [61] Rcpp_1.0.13                 glue_1.8.0                 
#>  [63] BiocBaseUtils_1.9.0         SparseArray_1.5.45         
#>  [65] gridExtra_2.3               xfun_0.48                  
#>  [67] loo_2.8.0                   withr_3.0.2                
#>  [69] combinat_0.0-8              BiocManager_1.30.25        
#>  [71] fastmap_1.2.0               MCMCprecision_0.4.0        
#>  [73] fansi_1.0.6                 digest_0.6.37              
#>  [75] R6_2.5.1                    mime_0.12                  
#>  [77] colorspace_2.1-1            scattermore_1.2            
#>  [79] tensor_1.5                  RSQLite_2.3.7              
#>  [81] spatstat.data_3.1-2         utf8_1.2.4                 
#>  [83] tidyr_1.3.1                 generics_0.1.3             
#>  [85] data.table_1.16.2           S4Arrays_1.5.11            
#>  [87] httr_1.4.7                  htmlwidgets_1.6.4          
#>  [89] uwot_0.2.2                  pkgconfig_2.0.3            
#>  [91] gtable_0.3.6                blob_1.2.4                 
#>  [93] lmtest_0.9-40               SingleCellExperiment_1.27.2
#>  [95] XVector_0.45.0              sys_3.4.3                  
#>  [97] htmltools_0.5.8.1           dotCall64_1.2              
#>  [99] scales_1.3.0                png_0.1-8                  
#> [101] SpatialExperiment_1.15.1    spatstat.univar_3.0-1      
#> [103] knitr_1.48                  rjson_0.2.23               
#> [105] reshape2_1.4.4              nlme_3.1-166               
#> [107] curl_5.2.3                  cachem_1.1.0               
#> [109] zoo_1.8-12                  stringr_1.5.1              
#> [111] BiocVersion_3.21.1          KernSmooth_2.23-24         
#> [113] parallel_4.4.1              miniUI_0.1.1.1             
#> [115] AnnotationDbi_1.69.0        pillar_1.9.0               
#> [117] grid_4.4.1                  vctrs_0.6.5                
#> [119] RANN_2.6.2                  promises_1.3.0             
#> [121] dbplyr_2.5.0                xtable_1.8-4               
#> [123] cluster_2.1.6               evaluate_1.0.1             
#> [125] magick_2.8.5                cli_3.6.3                  
#> [127] compiler_4.4.1              crayon_1.5.3               
#> [129] rlang_1.1.4                 rstantools_2.4.0           
#> [131] future.apply_1.11.3         labeling_0.4.3             
#> [133] plyr_1.8.9                  stringi_1.8.4              
#> [135] rstan_2.32.6                viridisLite_0.4.2          
#> [137] deldir_2.0-4                QuickJSR_1.4.0             
#> [139] Biostrings_2.75.0           munsell_0.5.1              
#> [141] lazyeval_0.2.2              spatstat.geom_3.3-3        
#> [143] V8_6.0.0                    Matrix_1.7-1               
#> [145] ExperimentHub_2.13.1        RcppHNSW_0.6.0             
#> [147] patchwork_1.3.0             bit64_4.5.2                
#> [149] future_1.34.0               ggplot2_3.5.1              
#> [151] KEGGREST_1.45.1             shiny_1.9.1                
#> [153] highr_0.11                  AnnotationHub_3.15.0       
#> [155] ROCR_1.0-11                 memoise_2.0.1              
#> [157] igraph_2.1.1                RcppParallel_5.1.9         
#> [159] bslib_0.8.0                 bit_4.5.0