Normalization by distributional resampling of high throughput single-cell RNA-sequencing data

Introduction

Over the past decade, advances in single-cell RNA-sequencing (scRNA-seq) technologies have significantly increased the sensitivity and specificity with which cellular transcriptional dynamics can be analyzed. Further, parallel increases in the number of cells that can be sequenced simultaneously have enabled novel analysis pipelines, including the description of transcriptional trajectories and the discovery of rare sub-populations of cells. The development of droplet-based, unique-molecular-identifier (UMI) protocols such as Drop-seq, inDrop, and the 10x Genomics Chromium platform has significantly contributed to these advances. In particular, the commercially available 10x Genomics platform has allowed the rapid and cost-effective gene expression profiling of hundreds to tens of thousands of cells across many studies to date.

The use of UMIs in the 10x Genomics and related platforms has augmented these developments in sequencing technology by tagging individual mRNA transcripts with unique cell- and transcript-specific identifiers. In this way, biases due to transcript length and PCR amplification have been significantly reduced. However, technical variability in sequencing depth remains and, consequently, normalization to adjust for sequencing depth is required to ensure accurate downstream analyses. To address this, we introduce Dino, an R package implementing the Dino normalization method.

Dino utilizes a flexible mixture of Negative Binomials model of gene expression to reconstruct full gene-specific expression distributions which are independent of sequencing depth. By giving exact zeros positive probability, the Negative Binomial components are applicable to shallow sequencing (high proportions of zeros). Additionally, the mixture component is robust to cell heterogeneity as it accommodates multiple centers of gene expression in the distribution. By directly modeling (possibly heterogeneous) gene-specific expression distributions, Dino outperforms competing approaches, especially for datasets in which the proportion of zeros is high, as is typical for modern, UMI-based protocols.

Dino does not attempt to correct for batch or other sample-specific effects, and will only do so to the extent that they are correlated with sequencing depth. In situations where batch effects are expected, downstream analysis may benefit from a separate batch-correction step.

Quick Start

Installation

Dino is now available on Bioconductor and can be easily installed from that repository by running:

# Install Bioconductor if not present, skip otherwise
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

# Install Dino package
BiocManager::install("Dino")

# View (this) vignette from R
browseVignettes("Dino")

Dino is also available from GitHub, and bug fixes, patches, and updates are available there first. To install Dino from GitHub, run:

devtools::install_github('JBrownBiostat/Dino', build_vignettes = TRUE)

Note, building vignettes can take a little time, so for a quicker install, consider setting build_vignettes = FALSE.

All-in-one function

The Dino function is an all-in-one function to normalize raw UMI count data from 10X Cell Ranger or similar protocols. Under default options, Dino outputs a sparse matrix of normalized expression. SeuratFromDino provides one-line functionality to return a Seurat object from raw UMI counts or from a previously normalized expression matrix.

library(Dino)

# Return a sparse matrix of normalized expression
Norm_Mat <- Dino(UMI_Mat)

# Return a Seurat object from already normalized expression
# Use normalized (doNorm = FALSE) and un-transformed (doLog = FALSE) expression
Norm_Seurat <- SeuratFromDino(Norm_Mat, doNorm = FALSE, doLog = FALSE)

# Return a Seurat object from UMI expression
# Transform normalized expression as log(x + 1) to improve
# some types of downstream analysis
Norm_Seurat <- SeuratFromDino(UMI_Mat)

Detailed steps

Read UMI data

To facilitate concrete examples, we demonstrate normalization on a small subset of sequencing data from about 3,000 peripheral blood mononuclear cells (PBMCs) published by 10X Genomics. This dataset, named pbmcSmall, contains 200 cells and 1,000 genes and is included with the Dino package.

set.seed(1)

# Bring pbmcSmall into R environment
library(Dino)
library(Seurat)
library(Matrix)
data("pbmcSmall")
print(dim(pbmcSmall))
## [1] 1000  200

While Dino was developed to normalize UMI count data, it will run on any matrix of non-negative expression data; user caution is advised if applying Dino to non-UMI sequencing protocols. Input formats may be sparse or dense matrices of expression with genes (features) on the rows and cells (samples) on the columns.
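As a minimal sketch of the expected input layout, the block below builds a hypothetical toy count matrix (the names toyDense and toySparse are illustrative and not part of Dino) with genes on the rows and cells on the columns, converts it to a sparse format using the Matrix package, and checks that all entries are non-negative.

```r
library(Matrix)

# Hypothetical toy counts: 4 genes (rows) by 3 cells (columns)
toyDense <- matrix(
    c(0, 3, 1,
      2, 0, 0,
      5, 1, 4,
      0, 0, 7),
    nrow = 4, byrow = TRUE,
    dimnames = list(paste0("gene", 1:4), paste0("cell", 1:3))
)

# Sparse representation of the same counts
toySparse <- Matrix(toyDense, sparse = TRUE)

# Basic sanity checks before normalization
stopifnot(all(toySparse@x >= 0))  # stored (non-zero) values are non-negative
print(dim(toySparse))             # genes x cells
```

Either toyDense or toySparse follows the genes-by-cells orientation described above.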

Clean UMI data

While Dino can normalize the pbmcSmall dataset as it currently exists, the resulting normalized matrix and, in particular, downstream analysis are likely to be improved by cleaning the data. Of greatest use is removing genes that are not expected to contain useful information. This set of genes may be case dependent, but a good rule of thumb for UMI protocols is to remove genes lacking a minimum number of non-zero expression values prior to normalization and analysis.

By default, Dino will not perform the resampling algorithm on any genes without at least 10 non-zero samples, and will rather normalize such genes by scaling with sequencing depth. To demonstrate a stricter threshold, we remove genes lacking at least 20 non-zero samples prior to normalization.

# Filter genes for a minimum of non-zero expression
pbmcSmall <- pbmcSmall[rowSums(pbmcSmall != 0) >= 20, ]
print(dim(pbmcSmall))
## [1] 907 200
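The fallback described above, scaling by sequencing depth for genes with too few non-zero samples, can be illustrated with a small base R sketch. The toy counts, the depth vector, and the lowered threshold of 3 are hypothetical values chosen for illustration; this is not the package's internal code.

```r
# Hypothetical toy counts: 3 genes (rows) by 6 cells (columns)
toyCounts <- rbind(
    geneA = c(4, 0, 2, 6, 0, 8),
    geneB = c(0, 0, 0, 2, 0, 0),
    geneC = c(1, 2, 3, 4, 5, 6)
)

# Hypothetical sequencing depths with median 1
depth <- c(0.8, 0.5, 1.0, 1.6, 1.0, 2.0)

# Number of non-zero samples per gene
nNonZero <- rowSums(toyCounts != 0)

# Toy threshold for illustration (the default described above is 10)
minNonZero <- 3
lowGenes <- nNonZero < minNonZero

# Scale-factor normalization: divide each cell's count by its depth
scaledLow <- sweep(toyCounts[lowGenes, , drop = FALSE], 2, depth, "/")
print(scaledLow)
```

Only geneB falls below the toy threshold here, so only its counts are scaled by depth.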

Normalize UMI data

Dino contains several options to tune output. One of particular interest is nCores, which allows for parallel computation of normalized expression. By default, Dino runs with two threads. Setting nCores = 0 uses all available cores; otherwise, an integer number of parallel instances can be chosen.

# Normalize data
pbmcSmall_Norm <- Dino(pbmcSmall)

Clustering with Seurat

After normalization, Dino makes it easy to perform data analysis. The default output is the normalized matrix in sparse format, and Dino additionally provides a function to transform normalized output into a Seurat object. We demonstrate this by running a quick clustering pipeline in Seurat. Much of the pipeline is modified from the tutorial at https://satijalab.org/seurat/v3.1/pbmc3k_tutorial.html

# Reformat normalized expression as a Seurat object
pbmcSmall_Seurat <- SeuratFromDino(pbmcSmall_Norm, doNorm = FALSE)

# Cluster pbmcSmall_Seurat
pbmcSmall_Seurat <- FindVariableFeatures(pbmcSmall_Seurat, 
                        selection.method = "mvp")
pbmcSmall_Seurat <- ScaleData(pbmcSmall_Seurat, 
                        features = rownames(pbmcSmall_Norm))
pbmcSmall_Seurat <- RunPCA(pbmcSmall_Seurat, 
                        features = VariableFeatures(object = pbmcSmall_Seurat),
                        verbose = FALSE)
pbmcSmall_Seurat <- FindNeighbors(pbmcSmall_Seurat, dims = 1:10)
pbmcSmall_Seurat <- FindClusters(pbmcSmall_Seurat, verbose = FALSE)
pbmcSmall_Seurat <- RunUMAP(pbmcSmall_Seurat, dims = 1:10)
DimPlot(pbmcSmall_Seurat, reduction = "umap")

Normalizing data formatted as SingleCellExperiment

Dino additionally supports the normalization of datasets formatted as SingleCellExperiment. As with the Seurat pipeline, this functionality is implemented through the use of a wrapper function. We demonstrate this by quickly converting the pbmcSmall dataset to a SingleCellExperiment object and then normalizing.

# Reformatting pbmcSmall as a SingleCellExperiment
library(SingleCellExperiment)
pbmc_SCE <- SingleCellExperiment(assays = list("counts" = pbmcSmall))

# Run Dino
pbmc_SCE <- Dino_SCE(pbmc_SCE)
str(normcounts(pbmc_SCE))
## Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
##   ..@ i       : int [1:162808] 0 1 2 3 4 5 6 7 8 9 ...
##   ..@ p       : int [1:201] 0 807 1620 2442 3249 4066 4874 5680 6483 7283 ...
##   ..@ Dim     : int [1:2] 907 200
##   ..@ Dimnames:List of 2
##   .. ..$ : chr [1:907] "ENSG00000087086" "ENSG00000167996" "ENSG00000251562" "ENSG00000205542" ...
##   .. ..$ : chr [1:200] "CCAACCTGACGTAC-1" "ATCTGGGATTCCGC-1" "TACTTTCTTTTGGG-1" "CAGGCCGAACACGT-1" ...
##   ..@ x       : num [1:162808] 107.5 31.4 12.3 29.5 9.6 ...
##   ..@ factors : list()

Alternate sequencing depth

By default, Dino computes sequencing depth, which is corrected for in the normalized data, as the sum of expression for a cell (sample) across genes. This sum is then scaled such that the median depth is 1. For some datasets, however, it may be beneficial to run Dino on an alternately computed set of sequencing depths. Note: it is generally recommended that the median depth not be far from 1 as this corresponds to recomputing expression as though all cells had been sequenced at the median depth.
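The default depth computation described above can be sketched in base R as follows. The toy matrix is hypothetical, and the code illustrates only the sum-and-rescale step as stated in the text, not the package's internal implementation.

```r
# Hypothetical toy counts: 3 genes (rows) by 5 cells (columns)
toyCounts <- matrix(
    c(10, 2, 0, 5, 8,
       0, 1, 3, 0, 4,
       6, 3, 1, 5, 0),
    nrow = 3, byrow = TRUE
)

# Per-cell library size: sum of expression across genes
libSize <- colSums(toyCounts)

# Scale so that the median depth is 1
depth <- libSize / median(libSize)
print(depth)
```

A cell with depth 1 is treated as having been sequenced at the median depth, so its expression is left on its original scale.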

A simple pipeline to compute alternate sequencing depths utilizes the scran method for computing normalization scale factors, and is demonstrated below.

library(scran)

# Compute scran size factors
scranSizes <- calculateSumFactors(pbmcSmall)

# Re-normalize data
pbmcSmall_SNorm <- Dino(pbmcSmall, nCores = 1, depth = log(scranSizes))

A fuller discussion of a specific use case for providing alternate sequencing depths can be viewed on the Dino GitHub page: Issue #1

Method

Model

Dino models observed UMI counts as a mixture of Negative Binomial random variables. The Negative Binomial distribution can, however, be decomposed into a hierarchical Gamma-Poisson distribution, so for gene $g$ and cell $j$, the Dino model for UMI counts is: $$y_{gj}\sim f^{P}(\lambda_{gj}\delta_{j})\\ \lambda_{gj}\sim\sum_{k=1}^{K}\pi_{k}f^{G}\left(\frac{\mu_{gk}}{\theta_g},\theta_g\right)$$ where $f^{P}$ is a Poisson distribution parameterized by mean $\lambda_{gj}\delta_{j}$ and $f^{G}$ is a Gamma distribution parameterized by shape $\mu_{gk}/\theta_g$ and scale $\theta_g$. $\delta_{j}$ is the cell-specific sequencing depth, $\lambda_{gj}$ is the latent level of gene/cell-specific expression independent of depth, component probabilities $\pi_{k}$ sum to 1, the Gamma distribution is parameterized such that $\mu_{gk}$ denotes the distribution mean, and the Gamma scale parameter, $\theta_g$, is constant across mixture components.
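To make the hierarchy concrete, the following sketch simulates counts for a single gene from the Gamma-Poisson mixture above using base R. All parameter values (nCells, piK, muK, theta, and the simulated depths delta) are hypothetical and chosen only for illustration.

```r
set.seed(1)

# Hypothetical parameters for one gene (subscript g dropped)
nCells <- 500
piK    <- c(0.6, 0.4)   # component probabilities, sum to 1
muK    <- c(2, 10)      # component means
theta  <- 0.5           # Gamma scale shared across components

# Hypothetical sequencing depths, scaled to median 1
delta <- exp(rnorm(nCells, sd = 0.3))
delta <- delta / median(delta)

# Draw a mixture component for each cell
k <- sample(seq_along(piK), nCells, replace = TRUE, prob = piK)

# Latent expression: Gamma with shape mu_k / theta and scale theta
lambda <- rgamma(nCells, shape = muK[k] / theta, scale = theta)

# Observed UMI counts: Poisson with mean lambda_j * delta_j
y <- rpois(nCells, lambda * delta)
```

With this parameterization each Gamma component has mean muK[k], so the latent expression is centered on the component means regardless of depth, while the observed counts y scale with delta.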

Following model fitting for a fixed gene through an accelerated EM algorithm, Dino produces normalized expression values by resampling from the posterior distribution of the latent expression parameters, $\lambda_{gj}$. It can be shown that the posterior distribution of $\lambda_{j}$ (dropping the gene-specific subscript $g$, as calculations are repeated across genes) is a mixture of Gammas, specifically: $$\mathbb{P}(\lambda_{j}|y_{j},\delta_j)=\sum_{k=1}^{K}\tau_{kj}f^{G}\left(\frac{\mu_{k}}{\theta}+\gamma y_{j},\frac{1}{\frac{1}{\theta}+\gamma\delta_j}\right)$$ where $\tau_{kj}$ denotes the conditional probability that $\lambda_{j}$ was sampled from mixture component $k$ and $\gamma$ is a global concentration parameter. The $\tau_{kj}$ are estimated as part of the implementation of the EM algorithm in Dino. The adjustment from the concentration parameter can be seen as a bias in the normalized values towards a scale-factor version of normalization, since, in the limit as $\gamma$ grows large, the normalized expression for cell $j$ converges to $y_{j}/\delta_{j}$. The default value of $\gamma = 15$ has proven successful.
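The resampling step can be sketched for a single cell as follows. The fitted values (piK, muK, theta) and the observation (y, delta) are hypothetical. As an assumption, the sketch computes the component probabilities as standard mixture responsibilities under the Negative Binomial marginal implied by the Gamma-Poisson hierarchy; the package's EM implementation may compute them differently.

```r
set.seed(1)

# Hypothetical fitted values for one gene (subscript g dropped)
piK   <- c(0.6, 0.4)   # component probabilities
muK   <- c(2, 10)      # component means
theta <- 0.5           # shared Gamma scale
gam   <- 15            # concentration parameter (default noted above)

# One cell's observed count and sequencing depth (hypothetical)
y     <- 6
delta <- 1.5

# Assumption: tau taken as standard mixture responsibilities under the
# Negative Binomial marginal (size = mu_k / theta, mean = mu_k * delta)
tau <- piK * dnbinom(y, size = muK / theta, mu = muK * delta)
tau <- tau / sum(tau)

# Posterior Gamma parameters for each component
postShape <- muK / theta + gam * y
postScale <- 1 / (1 / theta + gam * delta)

# Resample a normalized value: pick a component, then draw from its Gamma
k <- sample(seq_along(tau), 1, prob = tau)
normY <- rgamma(1, shape = postShape[k], scale = postScale)

# Posterior mean as a function of the concentration parameter
postMean <- function(g) {
    sum(tau * (muK / theta + g * y) / (1 / theta + g * delta))
}
print(c(postMean(gam), postMean(1e6), y / delta))
```

Evaluating postMean at a very large concentration value illustrates the limiting behavior described above: the posterior mean approaches y / delta, the scale-factor normalized value.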

Mixture components K

Approximating the flexibility of a non-parametric method, Dino uses a large number of mixture components, $K$, in order to capture the full heterogeneity of expression that may exist for a given gene. The gene-specific number of components is estimated as the square root of the number of strictly positive UMI counts for a given gene. By default, $K$ is limited to be no larger than 100. In simulation, large values of $K$ are shown to successfully reconstruct both unimodal and multimodal underlying distributions. For example, when UMI counts are simulated under a single Negative Binomial distribution, the Dino fitted prior distribution (black, right panel), which is used to sample normalized expression, closely matches the theoretical sampling distribution (red, right panel). Likewise, the fitted means ($\mu_{k}$ in the model, gray lines, left panel) span the range of the simulated data (heat map of counts, left panel), but concentrate around the theoretical mean of the sampling distribution (red, left panel).

[Figure: unimodal simulation. Left panel: heat map of simulated counts with fitted means (gray lines) and the theoretical mean (red). Right panel: Dino fitted prior distribution (black) and theoretical sampling distribution (red).]

Simulating data from a pair of Negative Binomial distributions with different means and different dispersion parameters yields similar results in the multimodal case.

[Figure: multimodal simulation from a pair of Negative Binomial distributions, with left and right panels as described for the unimodal case.]
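The rule for choosing the number of mixture components described in this section can be sketched as a small helper. The function name chooseK is hypothetical, and rounding the square root up is an assumption made here; the package may round differently.

```r
# Hypothetical helper: number of mixture components for one gene
# y:    vector of UMI counts for the gene across cells
# maxK: upper limit on the number of components (default of 100 noted above)
chooseK <- function(y, maxK = 100) {
    nPos <- sum(y > 0)                 # strictly positive counts
    # Assumption: round the square root up to the next integer
    min(ceiling(sqrt(nPos)), maxK)
}

# Example with simulated counts for 1,000 cells
set.seed(1)
yToy <- rpois(1000, lambda = 0.5)
print(chooseK(yToy))
```

Under this rule, a gene needs 10,000 or more strictly positive counts before the cap of 100 components becomes active.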

Session Information

sessionInfo()
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Etc/UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] grid      stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] ggpubr_0.6.0                gridExtra_2.3              
##  [3] ggplot2_3.5.1               SingleCellExperiment_1.29.1
##  [5] SummarizedExperiment_1.37.0 Biobase_2.67.0             
##  [7] GenomicRanges_1.59.1        GenomeInfoDb_1.43.2        
##  [9] IRanges_2.41.2              S4Vectors_0.45.2           
## [11] BiocGenerics_0.53.3         generics_0.1.3             
## [13] MatrixGenerics_1.19.0       matrixStats_1.4.1          
## [15] Matrix_1.7-1                Seurat_5.1.0               
## [17] SeuratObject_5.0.2          sp_2.1-4                   
## [19] Dino_1.13.0                 knitr_1.49                 
## [21] BiocStyle_2.35.0           
## 
## loaded via a namespace (and not attached):
##   [1] RcppAnnoy_0.0.22        splines_4.4.2           later_1.4.1            
##   [4] tibble_3.2.1            polyclip_1.10-7         fastDummies_1.7.4      
##   [7] lifecycle_1.0.4         rstatix_0.7.2           edgeR_4.5.1            
##  [10] globals_0.16.3          lattice_0.22-6          MASS_7.3-61            
##  [13] backports_1.5.0         magrittr_2.0.3          limma_3.63.2           
##  [16] plotly_4.10.4           sass_0.4.9              rmarkdown_2.29         
##  [19] jquerylib_0.1.4         yaml_2.3.10             metapod_1.15.0         
##  [22] httpuv_1.6.15           sctransform_0.4.1       spam_2.11-0            
##  [25] spatstat.sparse_3.1-0   reticulate_1.40.0       cowplot_1.1.3          
##  [28] pbapply_1.7-2           buildtools_1.0.0        RColorBrewer_1.1-3     
##  [31] abind_1.4-8             zlibbioc_1.52.0         Rtsne_0.17             
##  [34] purrr_1.0.2             GenomeInfoDbData_1.2.13 ggrepel_0.9.6          
##  [37] irlba_2.3.5.1           listenv_0.9.1           spatstat.utils_3.1-1   
##  [40] maketools_1.3.1         goftest_1.2-3           RSpectra_0.16-2        
##  [43] spatstat.random_3.3-2   dqrng_0.4.1             fitdistrplus_1.2-1     
##  [46] parallelly_1.40.1       leiden_0.4.3.1          codetools_0.2-20       
##  [49] DelayedArray_0.33.3     scuttle_1.17.0          tidyselect_1.2.1       
##  [52] UCSC.utils_1.3.0        farver_2.1.2            ScaledMatrix_1.15.0    
##  [55] spatstat.explore_3.3-3  jsonlite_1.8.9          BiocNeighbors_2.1.2    
##  [58] Formula_1.2-5           progressr_0.15.1        ggridges_0.5.6         
##  [61] survival_3.8-3          tools_4.4.2             ica_1.0-3              
##  [64] Rcpp_1.0.13-1           glue_1.8.0              SparseArray_1.7.2      
##  [67] xfun_0.49               dplyr_1.1.4             withr_3.0.2            
##  [70] BiocManager_1.30.25     fastmap_1.2.0           bluster_1.17.0         
##  [73] digest_0.6.37           rsvd_1.0.5              R6_2.5.1               
##  [76] mime_0.12               colorspace_2.1-1        scattermore_1.2        
##  [79] tensor_1.5              spatstat.data_3.1-4     hexbin_1.28.5          
##  [82] tidyr_1.3.1             data.table_1.16.4       httr_1.4.7             
##  [85] htmlwidgets_1.6.4       S4Arrays_1.7.1          uwot_0.2.2             
##  [88] pkgconfig_2.0.3         gtable_0.3.6            lmtest_0.9-40          
##  [91] XVector_0.47.0          sys_3.4.3               htmltools_0.5.8.1      
##  [94] carData_3.0-5           dotCall64_1.2           scales_1.3.0           
##  [97] png_0.1-8               spatstat.univar_3.1-1   scran_1.35.0           
## [100] reshape2_1.4.4          nlme_3.1-166            cachem_1.1.0           
## [103] zoo_1.8-12              stringr_1.5.1           KernSmooth_2.23-24     
## [106] parallel_4.4.2          miniUI_0.1.1.1          pillar_1.10.0          
## [109] vctrs_0.6.5             RANN_2.6.2              promises_1.3.2         
## [112] car_3.1-3               BiocSingular_1.23.0     beachmat_2.23.5        
## [115] xtable_1.8-4            cluster_2.1.8           evaluate_1.0.1         
## [118] cli_3.6.3               locfit_1.5-9.10         compiler_4.4.2         
## [121] rlang_1.1.4             crayon_1.5.3            future.apply_1.11.3    
## [124] ggsignif_0.6.4          labeling_0.4.3          plyr_1.8.9             
## [127] stringi_1.8.4           viridisLite_0.4.2       deldir_2.0-4           
## [130] BiocParallel_1.41.0     munsell_0.5.1           lazyeval_0.2.2         
## [133] spatstat.geom_3.3-4     RcppHNSW_0.6.0          patchwork_1.3.0        
## [136] future_1.34.0           statmod_1.5.0           shiny_1.10.0           
## [139] ROCR_1.0-11             igraph_2.1.2            broom_1.0.7            
## [142] bslib_0.8.0

Citation

If you use Dino in your analysis, please cite our paper:

Brown, J., Ni, Z., Mohanty, C., Bacher, R., and Kendziorski, C. (2021). “Normalization by distributional resampling of high throughput single-cell RNA-sequencing data.” Bioinformatics, 37, 4123-4128. https://academic.oup.com/bioinformatics/article/37/22/4123/6306403.

Other work referenced in this vignette includes:

Satija, R., Farrell, J.A., Gennert, D., Schier, A.F. and Regev, A. (2015). “Spatial reconstruction of single-cell gene expression data.” Nat. Biotechnol., 33, 495–502. https://doi.org/10.1038/nbt.3192

Amezquita, R.A., Lun, A.T.L., Becht, E., Carey, V.J., Carpp, L.N., Geistlinger, L., Marini, F., Rue-Albrecht, K., Risso, D., Soneson, C., et al. (2020). “Orchestrating single-cell analysis with Bioconductor.” Nat. Methods, 17, 137–145. https://doi.org/10.1038/s41592-019-0654-x

Lun, A. T. L., Bach, K. and Marioni, J. C. (2016). “Pooling across cells to normalize single-cell RNA sequencing data with many zero counts.” Genome Biol., 17, 1–14. https://doi.org/10.1186/s13059-016-0947-7

Contact

Jared Brown:

Christina Kendziorski: