The package can be installed from bioconductor
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("scDDboost")
Issue can be reported at “https://github.com/wiscstatman/scDDboost/issues”
scDDboost scores evidence of a gene being differentially distributed(DD) across two conditions for single cell RNA-seq data. Higher resolution brings several chanllenges for analyzing the data, specifically, the distribution of gene expression tends to have high prevalence of zero and multi-modes. To account for those characteristics and utilizing some biological intuition, we view the expression values sampled from a pool of cells mixed by distinct cellular subtypes blind to condition label. Consequently, the distributional change can be fully determined by the the change of subtype proportions. One tricky part is that not any change of proportions will lead to a distributional change. Given that some genes could be equivalent expressed across several subtypes, even the individual subytpe proportion may differ between conditions but as long as the aggregated proportions over those subtypes remain the same between conditions, it will not introduce different distribution. For example
Proportions of subtypes 1 and 2 changed between the 2 conditions. The gene is not DD if subtype 1 and 2 have the same expression level
For subtype 1 and 2 have different expression level, there is different distribution
pdd
is the core function developed to quantify the
posterior probabilities of DD for input genes. Let’s look at an
example,
Next, we load the toy simulated example a object that we will use for identifying and classifying DD genes.
Verify that this object is a member of the SingleCellExperiment class
and that it contains 200 cells and 1000 genes. The colData slot (which
contains a dataframe of metadata for the cells) should have a column
that contains the biological condition or grouping of interest. In this
example data, that variable is the condition
variable. Note
that the input gene set needs to be a matrix of normalized counts. We
run the function pdd
data_counts <- SummarizedExperiment::assays(sim_dat)$counts
conditions <- SummarizedExperiment::colData(sim_dat)$conditions
rownames(data_counts) <- seq_len(1000)
##here we use 2 cores to compute the distance matrix
bp <- BiocParallel::MulticoreParam(2)
D_c <- calD(data_counts,bp)
ProbDD <- pdd(data = data_counts,cd = conditions, bp = bp, D = D_c)
There are 4 input parameters needed to be specified by user, the dataset, the condition label, number of cpu cores used for computation and a distance matrix of cells. Other input parameters have default settings.
We provide a default method of getting the distance matrix, archived
by calD
, in general pdd
accept all valid
distance matrix. User can also input a cluster label rather than
distance matrix for the argument D
, but the random
distancing mechanism which relies on distance matrix will be disabled
and random
should be set to false.
For the number of sutypes, we provide a default function
detK
, which consider the smallest number of sutypes such
that the ratio of difference within cluster between difference between
clusters become smaller than a threshold (default setting is 1).
If user have other ways to determine K, K should be specified in
pdd
.
If we set threshold to be 5% then we have estimated DD genes
Notice that, pdd is actually local false discovery rate, this is a conservative estimation of DD genes. We could gain further power, let index gene by g = 1, 2, ..., G and let pg = P(DDg|data), p(1), ..., p(G) be ranked local false discovery rate from small to large. To control the false discovery rate at 5%, our positive set is those genes with the s* smallest lFDR, where $$s^* = \text{argmax}_s\{s,\frac{\Sigma_{i = 1}^s p_{(i)}}{s} \leq 0.05\}$$
Function getDD
extracts the estimated DD genes using the
above transformation.
## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] scDDboost_1.9.0 ggplot2_3.5.1 BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] SummarizedExperiment_1.36.0 gtable_0.3.6
## [3] xfun_0.48 bslib_0.8.0
## [5] caTools_1.18.3 Biobase_2.67.0
## [7] lattice_0.22-6 bitops_1.0-9
## [9] vctrs_0.6.5 tools_4.4.1
## [11] stats4_4.4.1 parallel_4.4.1
## [13] tibble_3.2.1 fansi_1.0.6
## [15] cluster_2.1.6 highr_0.11
## [17] pkgconfig_2.0.3 blockmodeling_1.1.5
## [19] KernSmooth_2.23-24 Matrix_1.7-1
## [21] EBSeq_2.5.0 desc_1.4.3
## [23] S4Vectors_0.44.0 lifecycle_1.0.4
## [25] GenomeInfoDbData_1.2.13 compiler_4.4.1
## [27] farver_2.1.2 brio_1.1.5
## [29] gplots_3.2.0 munsell_0.5.1
## [31] codetools_0.2-20 GenomeInfoDb_1.43.0
## [33] htmltools_0.5.8.1 sys_3.4.3
## [35] buildtools_1.0.0 sass_0.4.9
## [37] yaml_2.3.10 pillar_1.9.0
## [39] crayon_1.5.3 jquerylib_0.1.4
## [41] BiocParallel_1.41.0 SingleCellExperiment_1.28.0
## [43] cachem_1.1.0 DelayedArray_0.33.1
## [45] abind_1.4-8 mclust_6.1.1
## [47] gtools_3.9.5 digest_0.6.37
## [49] labeling_0.4.3 maketools_1.3.1
## [51] rprojroot_2.0.4 fastmap_1.2.0
## [53] grid_4.4.1 colorspace_2.1-1
## [55] cli_3.6.3 SparseArray_1.6.0
## [57] magrittr_2.0.3 S4Arrays_1.6.0
## [59] utf8_1.2.4 withr_3.0.2
## [61] scales_1.3.0 UCSC.utils_1.2.0
## [63] rmarkdown_2.28 XVector_0.46.0
## [65] httr_1.4.7 matrixStats_1.4.1
## [67] RcppEigen_0.3.4.0.2 evaluate_1.0.1
## [69] knitr_1.48 GenomicRanges_1.59.0
## [71] IRanges_2.41.0 testthat_3.2.1.1
## [73] Oscope_1.37.0 rlang_1.1.4
## [75] Rcpp_1.0.13 glue_1.8.0
## [77] BiocManager_1.30.25 BiocGenerics_0.53.0
## [79] pkgload_1.4.0 jsonlite_1.8.9
## [81] R6_2.5.1 MatrixGenerics_1.19.0
## [83] zlibbioc_1.52.0