$ singularity pull --name scShapes.sif shub://Malindrie/scShapes
$ singularity shell scShapes.sif
Singularity scShapes.sif:~ >
$ docker pull maldharm/scshapes
$ docker run -it maldharm/scshapes
root@516e456b0762:/#
To use containerized scShapes
package, need to launch
R
or execute an R script through Rscript
.
scShapes
from install_github
This is a basic example which shows how you can use scShapes for
identifying differential distributions in single-cell RNA-seq data. For
this example data we use the toy example data scData
included in the package.
We first filter the genes to keep only genes expressed in at least 10% of cells:
In order to normalize for differences in sequencing depth, the log of
the total UMI counts assigned per cell will be used as an offset in the
GLM. This function is inbuilt in the algorithm; however the user is
required to input the library sizes. In our example data this
information together with the covariates is given under the lists
lib_size
and covariates
respectively.
Perform Kolmogorov-Smirnov test to select genes belonging to the family of ZINB distributions.
scData_KS <- ks_test(counts=scData$counts, cexpr=scData$covariates, lib.size=scData$lib_size, BPPARAM=SnowParam(workers=8,type="SOCK"))
# Select genes significant from the KS test.
# By default the 'ks_sig' function performs Benjamini-Hochberg correction for multiple hypothese testing
# and selects genes significant at p-value of 0.01
scData_KS_sig <- ks_sig(scData_KS)
# Subset UMI counts corresponding to the genes significant from the KS test
scData.sig.genes <- scData$counts[rownames(scData$counts) %in% names(scData_KS_sig$genes),]
Fit the 4 distributions P,NB,ZIP,ZINB for genes that belong to the ZINB family of distributions by fitting GLM with log of the library sizes as an offset and cell types as a covariate in the GLM.
scData_models <- fit_models(counts=scData.sig.genes, cexpr=scData$covariates, lib.size=scData$lib_size, BPPARAM=SnowParam(workers=8,type="SOCK"))
Once the 4 distributions are fitted, we next calculate the BIC value for each model and select the model with the least BIC value.
scData_bicvals <- model_bic(scData_models)
# select model with least bic value
scData_least.bic <- lbic_model(scData_bicvals, scData$counts)
To ensure the fit of the models selected based on the least BIC value, additionally we perform LRT to test for model adequacy and presence of zero-inflation.
scData_gof <- gof_model(scData_least.bic, cexpr=scData$covariates, lib.size=scData$lib_size, BPPARAM=SnowParam(workers=8,type="SOCK"))
Finally based on the results of the model adequacy tests, we can identify the distribution of best fit for each gene.
Once the distribution of best fit is identified for genes of interest, it is also possible to extract parameters of interest for the models.
If our dataset consists of multiple conditions we can follow the above approach to identify the best fir distribution shape for each gene under each treatment condition (selecting the subset of genes common between conditions). Then using the dataframe of genes and distribution followed under each condition, now we can identify genes changing distribution between conditions.
For example suppose we follow above pipeline for scRNA-seq data on
two treatment conditions ‘CTRL’ and ‘STIM’ and have identified the best
distribution fit for each gene under each condition independently.
Suppose the dataframe ifnb.distr
; with genes as rows and
columns as ‘CTRL’ and ‘STIM’ with the corresponding distribution name a
particular gene follows, then we can identify genes changing
distribution shape between ‘CTRL’ and ‘STIM’ as;
This will give a list of two lists with genes changing distribution between condition and genes changing distribution from unimodal in one condition to zero-inflated in the other condition.
Here is the output of sessionInfo() on the system on which this document was compiled:
sessionInfo()
#> R version 4.4.2 (2024-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] BiocParallel_1.41.0 scShapes_1.13.0 rmarkdown_2.29
#>
#> loaded via a namespace (and not attached):
#> [1] Matrix_1.7-1 jsonlite_1.8.9 compiler_4.4.2
#> [4] Rcpp_1.0.14 dgof_1.5.1 parallel_4.4.2
#> [7] VGAM_1.1-12 jquerylib_0.1.4 splines_4.4.2
#> [10] yaml_2.3.10 fastmap_1.2.0 lattice_0.22-6
#> [13] coda_0.19-4.1 R6_2.5.1 plyr_1.8.9
#> [16] bbmle_1.0.25.1 knitr_1.49 MASS_7.3-64
#> [19] bdsmatrix_1.3-7 snow_0.4-4 maketools_1.3.1
#> [22] bslib_0.8.0 rlang_1.1.4 cachem_1.1.0
#> [25] xfun_0.50 sass_0.4.9 sys_3.4.3
#> [28] cli_3.6.3 magrittr_2.0.3 emdbook_1.3.13
#> [31] digest_0.6.37 grid_4.4.2 mvtnorm_1.3-3
#> [34] lifecycle_1.0.4 pscl_1.5.9 evaluate_1.0.3
#> [37] numDeriv_2016.8-1.1 codetools_0.2-20 buildtools_1.0.0
#> [40] stats4_4.4.2 tools_4.4.2 htmltools_0.5.8.1