scDesign3 is a unified probabilistic framework that generates realistic in silico high-dimensional single-cell omics data of various cell states, including discrete cell types, continuous trajectories, and spatial locations by learning from real datasets. Since the functions of scDesign3 is very comprehensive, here we only introduce how scDesign3 simulates an scRNA-seq dataset with one continuous developmental trajectory. For more information, please check the Articles on our website: (https://songdongyuan1994.github.io/scDesign3/docs/index.html).
The raw data is from the scvelo, which describes pancreatic endocrinogenesis. We pre-select the top 1000 highly variable genes and filter out some cell types to ensure a single trajectory.
example_sce <- readRDS((url("https://figshare.com/ndownloader/files/40581992")))
print(example_sce)
#> class: SingleCellExperiment
#> dim: 1000 2087
#> metadata(5): clusters_coarse_colors clusters_colors day_colors
#> neighbors pca
#> assays(6): X spliced ... cpm logcounts
#> rownames(1000): Pyy Iapp ... Eya2 Kif21a
#> rowData names(1): highly_variable_genes
#> colnames(2087): AAACCTGAGAGGGATA AAACCTGGTAAGTGGC ... TTTGTCAAGTGACATA
#> TTTGTCAAGTGTGGCA
#> colData names(7): clusters_coarse clusters ... sizeFactor pseudotime
#> reducedDimNames(4): X_pca X_umap PCA UMAP
#> mainExpName: NULL
#> altExpNames(0):
To save computational time, we only use the top 100 genes.
The function scdesign3()
takes in a
SinglecellExperiment
object with the cell covariates (such
as cell types, pesudotime, or spatial coordinates) stored in the
colData
of the SinglecellExperiment
object.
set.seed(123)
example_simu <- scdesign3(
sce = example_sce,
assay_use = "counts",
celltype = "cell_type",
pseudotime = "pseudotime",
spatial = NULL,
other_covariates = NULL,
mu_formula = "s(pseudotime, k = 10, bs = 'cr')",
sigma_formula = "1", # If you want your dispersion also varies along pseudotime, use "s(pseudotime, k = 5, bs = 'cr')"
family_use = "nb",
n_cores = 2,
usebam = FALSE,
corr_formula = "1",
copula = "gaussian",
DT = TRUE,
pseudo_obs = FALSE,
return_model = FALSE,
nonzerovar = FALSE
)
The output of scdesign3()
is a list which includes:
new_count
: This is the synthetic count matrix generated
by scdesign3()
.new_covariate
:
ncell
is set to a number that is
different from the number of cells in the input data, this will be a
matrix that has the new cell covariates that are used for generating new
data.ncell
is the default value, this will
be NULL
.model_aic
: This is a vector include the genes’ marginal
models’ AIC, fitted copula’s AIC, and total AIC, which is the sum of the
previous two AIC.model_bic
: This is a vector include the genes’ marginal
models’ BIC, fitted copula’s BIC, and total BIC, which is the sum of the
previous two BIC.marginal_list
:
return_model
is set to
TRUE
, this will be a list which contains the fitted gam or
gamlss model for all genes in the input data.return_model
is set to the default
value FALSE
, this will be NULL
.corr_list
:
return_model
is set to
TRUE
, this will be a list which contains the either a
correlation matrix (when copula = "gaussian"
) or the fitted
vine copula (when copula = "vine
) for each user specified
correlation groups (based on the parameter corr_by
).return_model
is set to the default
value FALSE
, this will be NULL
.In this example, since we did not change the parameter
ncell
, the synthetic count matrix will have the same
dimension as the input count matrix.
Then, we can create the SinglecellExperiment
object
using the synthetic count matrix and store the logcounts
to
the input and synthetic SinglecellExperiment
objects.
sessionInfo()
#> R version 4.4.1 (2024-06-14)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] ggplot2_3.5.1 SingleCellExperiment_1.28.0
#> [3] SummarizedExperiment_1.36.0 Biobase_2.67.0
#> [5] GenomicRanges_1.59.0 GenomeInfoDb_1.43.0
#> [7] IRanges_2.41.0 S4Vectors_0.44.0
#> [9] BiocGenerics_0.53.0 MatrixGenerics_1.19.0
#> [11] matrixStats_1.4.1 scDesign3_1.5.0
#> [13] BiocStyle_2.35.0
#>
#> loaded via a namespace (and not attached):
#> [1] gtable_0.3.6 xfun_0.48 bslib_0.8.0
#> [4] lattice_0.22-6 gamlss_5.4-22 vctrs_0.6.5
#> [7] tools_4.4.1 generics_0.1.3 parallel_4.4.1
#> [10] tibble_3.2.1 fansi_1.0.6 highr_0.11
#> [13] pkgconfig_2.0.3 Matrix_1.7-1 lifecycle_1.0.4
#> [16] GenomeInfoDbData_1.2.13 farver_2.1.2 compiler_4.4.1
#> [19] munsell_0.5.1 coop_0.6-3 htmltools_0.5.8.1
#> [22] sys_3.4.3 buildtools_1.0.0 sass_0.4.9
#> [25] yaml_2.3.10 pillar_1.9.0 crayon_1.5.3
#> [28] jquerylib_0.1.4 MASS_7.3-61 openssl_2.2.2
#> [31] cachem_1.1.0 DelayedArray_0.33.1 viridis_0.6.5
#> [34] abind_1.4-8 mclust_6.1.1 RSpectra_0.16-2
#> [37] nlme_3.1-166 tidyselect_1.2.1 digest_0.6.37
#> [40] mvtnorm_1.3-1 dplyr_1.1.4 labeling_0.4.3
#> [43] maketools_1.3.1 splines_4.4.1 gamlss.dist_6.1-1
#> [46] fastmap_1.2.0 grid_4.4.1 gamlss.data_6.0-6
#> [49] colorspace_2.1-1 cli_3.6.3 SparseArray_1.6.0
#> [52] magrittr_2.0.3 S4Arrays_1.6.0 survival_3.7-0
#> [55] utf8_1.2.4 withr_3.0.2 scales_1.3.0
#> [58] UCSC.utils_1.2.0 rmarkdown_2.28 XVector_0.46.0
#> [61] httr_1.4.7 umap_0.2.10.0 gridExtra_2.3
#> [64] reticulate_1.39.0 png_0.1-8 askpass_1.2.1
#> [67] evaluate_1.0.1 knitr_1.48 viridisLite_0.4.2
#> [70] irlba_2.3.5.1 mgcv_1.9-1 rlang_1.1.4
#> [73] Rcpp_1.0.13 glue_1.8.0 BiocManager_1.30.25
#> [76] jsonlite_1.8.9 R6_2.5.1 zlibbioc_1.52.0