Over the past decade, advances in single-cell RNA-sequencing (scRNA-seq) technologies have significantly increased the sensitivity and specificity with which cellular transcriptional dynamics can be analyzed. Further, parallel increases in the number cells which can be simultaneously sequenced have allowed for novel analysis pipelines including the description of transcriptional trajectories and the discovery of rare sub-populations of cells. The development of droplet-based, unique-molecular-identifier (UMI) protocols such as Drop-seq, inDrop, and the 10x Genomics Chromium platform have significantly contributed to these advances. In particular, the commercially available 10x Genomics platform has allowed the rapid and cost effective gene expression profiling of hundreds to tens of thousands of cells across many studies to date.
The use of UMIs in the 10x Genomics and related platforms has
augmented these developments in sequencing technology by tagging
individual mRNA transcripts with unique cell and transcript specific
identifiers. In this way, biases due to transcript length and PCR
amplification have been significantly reduced. However, technical
variability in sequencing depth remains and, consequently, normalization
to adjust for sequencing depth is required to ensure accurate downstream
analyses. To address this, we introduce Dino
, an
R
package implementing the Dino
normalization method.
Dino
utilizes a flexible mixture of Negative Binomials
model of gene expression to reconstruct full gene-specific expression
distributions which are independent of sequencing depth. By giving exact
zeros positive probability, the Negative Binomial components are
applicable to shallow sequencing (high proportions of zeros).
Additionally, the mixture component is robust to cell heterogeneity as
it accommodates multiple centers of gene expression in the distribution.
By directly modeling (possibly heterogenous) gene-specific expression
distributions, Dino outperforms competing approaches, especially for
datasets in which the proportion of zeros is high as is typical for
modern, UMI based protocols.
Dino
does not attempt to correct for batch or other
sample specific effects, and will only do so to the extent that they are
correlated with sequencing depth. In situations where batch effects are
expected, downstream analysis may benefit from such accommodations.
Dino
is now available on BioConductor
and
can be easily installed from that repository by running:
# Install Bioconductor if not present, skip otherwise
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
# Install Dino package
BiocManager::install("Dino")
# View (this) vignette from R
browseVignettes("Dino")
Dino
is also available from Github, and bug fixes,
patches, and updates are available there first. To install
Dino
from Github, run
Note, building vignettes can take a little time, so for a
quicker install, consider setting
build_vignettes = FALSE
.
Dino
(function) is an all-in-one function to normalize
raw UMI count data from 10X Cell Ranger or similar protocols. Under
default options, Dino
outputs a sparse matrix of normalized
expression. SeuratFromDino
provides one-line functionality
to return a Seurat object from raw UMI counts or from a previously
normalized expression matrix.
library(Dino)
# Return a sparse matrix of normalized expression
Norm_Mat <- Dino(UMI_Mat)
# Return a Seurat object from already normalized expression
# Use normalized (doNorm = FALSE) and un-transformed (doLog = FALSE) expression
Norm_Seurat <- SeuratFromDino(Norm_Mat, doNorm = FALSE, doLog = FALSE)
# Return a Seurat object from UMI expression
# Transform normalized expression as log(x + 1) to improve
# some types of downstream analysis
Norm_Seurat <- SeuratFromDino(UMI_Mat)
To facilitate concrete examples, we demonstrate normalization on a
small subset of sequencing data from about 3,000 peripheral blood
mononuclear cells (PBMCs) published by 10X
Genomics. This dataset, named pbmcSmall
contains 200
cells and 1,000 genes and is included with the Dino
package.
set.seed(1)
# Bring pbmcSmall into R environment
library(Dino)
library(Seurat)
library(Matrix)
data("pbmcSmall")
print(dim(pbmcSmall))
## [1] 1000 200
While Dino
was developed to normalize UMI count data, it
will run on any matrix of non-negative expression data; user caution is
advised if applying Dino
to non-UMI sequencing protocols.
Input formats may be sparse or dense matrices of expression with genes
(features) on the rows and cells (samples) on the columns.
While Dino
can normalize the pbmcSmall
dataset as it currently exists, the resulting normalized matrix, and in
particular, downstream analysis are likely to be improved by cleaning
the data. Of greatest use is removing genes that are expected
not to contain useful information. This set of genes may be
case dependent, but a good rule of thumb for UMI protocols is to remove
genes lacking a minimum of non-zero expression prior to normalization
and analysis.
By default, Dino
will not perform the resampling
algorithm on any genes without at least 10 non-zero samples, and will
rather normalize such genes by scaling with sequencing depth. To
demonstrate a stricter threshold, we remove genes lacking at least 20
non-zero samples prior to normalization.
# Filter genes for a minimum of non-zero expression
pbmcSmall <- pbmcSmall[rowSums(pbmcSmall != 0) >= 20, ]
print(dim(pbmcSmall))
## [1] 907 200
Dino
contains several options to tune output. One of
particular interest is nCores
which allows for parallel
computation of normalized expression. By default, Dino
runs
with two threads. Choosing nCores = 0
will utilize all
available cores, and otherwise an integer number of parallel instances
can be chosen.
After normalization, Dino
makes it easy to perform data
analysis. The default output is the normalized matrix in sparse format,
and Dino
additionally provides a function to transform
normalized output into a Seurat
object. We demonstrate this
by running a quick clustering pipeline in Seurat
. Much of
the pipeline is modified from the tutorial at https://satijalab.org/seurat/v3.1/pbmc3k_tutorial.html
# Reformat normalized expression as a Seurat object
pbmcSmall_Seurat <- SeuratFromDino(pbmcSmall_Norm, doNorm = FALSE)
# Cluster pbmcSmall_Seurat
pbmcSmall_Seurat <- FindVariableFeatures(pbmcSmall_Seurat,
selection.method = "mvp")
pbmcSmall_Seurat <- ScaleData(pbmcSmall_Seurat,
features = rownames(pbmcSmall_Norm))
pbmcSmall_Seurat <- RunPCA(pbmcSmall_Seurat,
features = VariableFeatures(object = pbmcSmall_Seurat),
verbose = FALSE)
pbmcSmall_Seurat <- FindNeighbors(pbmcSmall_Seurat, dims = 1:10)
pbmcSmall_Seurat <- FindClusters(pbmcSmall_Seurat, verbose = FALSE)
pbmcSmall_Seurat <- RunUMAP(pbmcSmall_Seurat, dims = 1:10)
DimPlot(pbmcSmall_Seurat, reduction = "umap")
Dino
additionally supports the normalization of datasets
formatted as SingleCellExperiment. As with the
Seurat
pipeline, this functionality is implemented through
the use of a wrapper function. We demonstrate this by quickly converting
the pbmcSmall dataset to a SingleCellExperiment object
and then normalizing.
# Reformatting pbmcSmall as a SingleCellExperiment
library(SingleCellExperiment)
pbmc_SCE <- SingleCellExperiment(assays = list("counts" = pbmcSmall))
# Run Dino
pbmc_SCE <- Dino_SCE(pbmc_SCE)
str(normcounts(pbmc_SCE))
## Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
## ..@ i : int [1:162808] 0 1 2 3 4 5 6 7 8 9 ...
## ..@ p : int [1:201] 0 807 1620 2442 3249 4066 4874 5680 6483 7283 ...
## ..@ Dim : int [1:2] 907 200
## ..@ Dimnames:List of 2
## .. ..$ : chr [1:907] "ENSG00000087086" "ENSG00000167996" "ENSG00000251562" "ENSG00000205542" ...
## .. ..$ : chr [1:200] "CCAACCTGACGTAC-1" "ATCTGGGATTCCGC-1" "TACTTTCTTTTGGG-1" "CAGGCCGAACACGT-1" ...
## ..@ x : num [1:162808] 107.5 31.4 12.3 29.5 9.6 ...
## ..@ factors : list()
By default, Dino
computes sequencing depth, which is
corrected for in the normalized data, as the sum of expression for a
cell (sample) across genes. This sum is then scaled such that the median
depth is 1. For some datasets, however, it may be beneficial to run
Dino
on an alternately computed set of sequencing depths.
Note: it is generally recommended that the median depth not be
far from 1 as this corresponds to recomputing expression as though all
cells had been sequenced at the median depth.
A simple pipeline to compute alternate sequencing depths utilizes the
Scran
method for computing normalization scale factors, and
is demonstrated below.
library(scran)
# Compute scran size factors
scranSizes <- calculateSumFactors(pbmcSmall)
# Re-normalize data
pbmcSmall_SNorm <- Dino(pbmcSmall, nCores = 1, depth = log(scranSizes))
A fuller discussion of a specific use case for providing alternate
sequencing depths can be viewed on the Dino
Github page: Issue #1
Dino
models observed UMI counts as a mixture of Negative
Binomial random variables. The Negative Binomial distribution can,
however, be decomposed into a hierarchical Gamma-Poisson distribution,
so for gene g and cell j, the Dino
model for
UMI counts is: $$y_{gj}\sim
f^{P}(\lambda_{gj}\delta_{j})\\
\lambda_{gj}\sim\sum_{K}\pi_{k}f^{G}\left(\frac{\mu_{gk}}{\theta_g},\theta_g\right)$$
where fP
is a Poisson distribution parameterized by mean λgjδj
and fG is
a Gamma distribution parameterized by shape μgk/θg
and scale θg. δj is the
cell-specific sequencing depth, λgj is
the latent level of gene/cell-specific expression independent of depth,
component probabilities πk sum to 1, the
Gamma distribution is parameterized such that μgk
denotes the distribution mean, and the Gamma scale paramter, θg, is constant
across mixture components.
Following model fitting for a fixed gene through an accelerated EM
algorithm, Dino
produces normalized expression values by
resampling from the posterior distribution of the latent expression
parameters, λgj. It
can be shown that the distribution on the λj (dropping the
gene-specific subscript g as
calculations are repreated across genes) is a mixture of Gammas,
specifically: $$\mathbb{P}(\lambda_{j}|y_{j},\delta_j)=\sum_{K}\tau_{kj}f^{G}\left(\frac{\mu_{k}}{\theta}+\gamma
y_{j},\frac{1}{\frac{1}{\theta}+\gamma\delta_j}\right)$$ where
τkj
denotes the conditional probability that λgj was
sampled from mixture component k and γ is a global concentration
parameter. The τkj are
estimated as part of the implementation of the EM algorithm in
Dino
. The adjustment from the concentration parameter can
be seen as a bias in the normalized values towards a scale-factor
version of normalization, since, in the limit of γ, the normalized expression for
cell j converges to yj/δj.
Default values of γ = 15 have
proven successful.
Approximating the flexibility of a non-parametric method,
Dino
uses a large number of mixture components, K, in order to capture the full
heterogeneity of expression that may exist for a given gene. The
gene-specific number of components is estimated as the square root of
the number of strictly positive UMI counts for a given gene. By default,
K is limited to be no larger
than 100. In simulation, large values of K are shown to successfully
reconstruct both unimodal and multimodal underlying distributions. For
example, when UMI counts are estimated under a single negative binomial
distribution, the Dino
fitted prior distribution (black,
right panel) which is used to sample normalized expression closely
matches the theoretical sampling distribution (red, right panel).
Likewise, the fitted means (μk in the model,
gray lines, left panel) span the range of the simulated data (heat map
of counts, left panel), but concentrate around the theoretical mean of
the sampling distribution (red, left panel).
## TableGrob (2 x 2) "arrange": 3 grobs
## z cells name grob
## 1 1 (2-2,1-1) arrange gtable[layout]
## 2 2 (2-2,2-2) arrange gtable[layout]
## 3 3 (1-1,1-2) arrange text[GRID.text.278]
Simulating data from a pair of Negative Binomial distributions with different means and different dispersion parameters yields similar results in the multimodal case.
## TableGrob (2 x 2) "arrange": 3 grobs
## z cells name grob
## 1 1 (2-2,1-1) arrange gtable[layout]
## 2 2 (2-2,2-2) arrange gtable[layout]
## 3 3 (1-1,1-2) arrange text[GRID.text.480]
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] grid stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] ggpubr_0.6.0 gridExtra_2.3
## [3] ggplot2_3.5.1 SingleCellExperiment_1.29.1
## [5] SummarizedExperiment_1.37.0 Biobase_2.67.0
## [7] GenomicRanges_1.59.1 GenomeInfoDb_1.43.2
## [9] IRanges_2.41.2 S4Vectors_0.45.2
## [11] BiocGenerics_0.53.3 generics_0.1.3
## [13] MatrixGenerics_1.19.0 matrixStats_1.4.1
## [15] Matrix_1.7-1 Seurat_5.1.0
## [17] SeuratObject_5.0.2 sp_2.1-4
## [19] Dino_1.13.0 knitr_1.49
## [21] BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] RcppAnnoy_0.0.22 splines_4.4.2 later_1.4.1
## [4] tibble_3.2.1 polyclip_1.10-7 fastDummies_1.7.4
## [7] lifecycle_1.0.4 rstatix_0.7.2 edgeR_4.5.1
## [10] globals_0.16.3 lattice_0.22-6 MASS_7.3-61
## [13] backports_1.5.0 magrittr_2.0.3 limma_3.63.2
## [16] plotly_4.10.4 sass_0.4.9 rmarkdown_2.29
## [19] jquerylib_0.1.4 yaml_2.3.10 metapod_1.15.0
## [22] httpuv_1.6.15 sctransform_0.4.1 spam_2.11-0
## [25] spatstat.sparse_3.1-0 reticulate_1.40.0 cowplot_1.1.3
## [28] pbapply_1.7-2 buildtools_1.0.0 RColorBrewer_1.1-3
## [31] abind_1.4-8 zlibbioc_1.52.0 Rtsne_0.17
## [34] purrr_1.0.2 GenomeInfoDbData_1.2.13 ggrepel_0.9.6
## [37] irlba_2.3.5.1 listenv_0.9.1 spatstat.utils_3.1-1
## [40] maketools_1.3.1 goftest_1.2-3 RSpectra_0.16-2
## [43] spatstat.random_3.3-2 dqrng_0.4.1 fitdistrplus_1.2-1
## [46] parallelly_1.40.1 leiden_0.4.3.1 codetools_0.2-20
## [49] DelayedArray_0.33.3 scuttle_1.17.0 tidyselect_1.2.1
## [52] UCSC.utils_1.3.0 farver_2.1.2 ScaledMatrix_1.15.0
## [55] spatstat.explore_3.3-3 jsonlite_1.8.9 BiocNeighbors_2.1.2
## [58] Formula_1.2-5 progressr_0.15.1 ggridges_0.5.6
## [61] survival_3.8-3 tools_4.4.2 ica_1.0-3
## [64] Rcpp_1.0.13-1 glue_1.8.0 SparseArray_1.7.2
## [67] xfun_0.49 dplyr_1.1.4 withr_3.0.2
## [70] BiocManager_1.30.25 fastmap_1.2.0 bluster_1.17.0
## [73] digest_0.6.37 rsvd_1.0.5 R6_2.5.1
## [76] mime_0.12 colorspace_2.1-1 scattermore_1.2
## [79] tensor_1.5 spatstat.data_3.1-4 hexbin_1.28.5
## [82] tidyr_1.3.1 data.table_1.16.4 httr_1.4.7
## [85] htmlwidgets_1.6.4 S4Arrays_1.7.1 uwot_0.2.2
## [88] pkgconfig_2.0.3 gtable_0.3.6 lmtest_0.9-40
## [91] XVector_0.47.0 sys_3.4.3 htmltools_0.5.8.1
## [94] carData_3.0-5 dotCall64_1.2 scales_1.3.0
## [97] png_0.1-8 spatstat.univar_3.1-1 scran_1.35.0
## [100] reshape2_1.4.4 nlme_3.1-166 cachem_1.1.0
## [103] zoo_1.8-12 stringr_1.5.1 KernSmooth_2.23-24
## [106] parallel_4.4.2 miniUI_0.1.1.1 pillar_1.10.0
## [109] vctrs_0.6.5 RANN_2.6.2 promises_1.3.2
## [112] car_3.1-3 BiocSingular_1.23.0 beachmat_2.23.5
## [115] xtable_1.8-4 cluster_2.1.8 evaluate_1.0.1
## [118] cli_3.6.3 locfit_1.5-9.10 compiler_4.4.2
## [121] rlang_1.1.4 crayon_1.5.3 future.apply_1.11.3
## [124] ggsignif_0.6.4 labeling_0.4.3 plyr_1.8.9
## [127] stringi_1.8.4 viridisLite_0.4.2 deldir_2.0-4
## [130] BiocParallel_1.41.0 munsell_0.5.1 lazyeval_0.2.2
## [133] spatstat.geom_3.3-4 RcppHNSW_0.6.0 patchwork_1.3.0
## [136] future_1.34.0 statmod_1.5.0 shiny_1.10.0
## [139] ROCR_1.0-11 igraph_2.1.2 broom_1.0.7
## [142] bslib_0.8.0
If you use Dino in your analysis, please cite our paper:
Brown, J., Ni, Z., Mohanty, C., Bacher, R., and Kendziorski, C. (2021). “Normalization by distributional resampling of high throughput single-cell RNA-sequencing data.” Bioinformatics, 37, 4123-4128. https://academic.oup.com/bioinformatics/article/37/22/4123/6306403.
Other work referenced in this vignette include:
Satija, R., Farrell, J.A., Gennert, D., Schier, A.F. and Regev, A. (2015). “Spatial reconstruction of single-cell gene expression data.” Nat. Biotechnol., 33, 495–502. https://doi.org/10.1038/nbt.3192
Amezquita, R.A., Lun, A.T.L., Becht, E., Carey, V.J., Carpp, L.N., Geistlinger, L., Marini, F., Rue-Albrecht, K., Risso, D., Soneson, C., et al. (2020). “Orchestrating single-cell analysis with Bioconductor.” Nat. Methods, 17, 137–145. https://doi.org/10.1038/s41592-019-0654-x
Lun, A. T. L., Bach, K. and Marioni, J. C. (2016). “Pooling across cells to normalize single-cell RNA sequencing data with many zero counts.” Genome Biol., 17, 1–14. https://doi.org/10.1186/s13059-016-0947-7
Jared Brown:
Christina Kendziorski: