An introduction to BASiCStan

Outline

This package implements alternative inference methods for BASiCS. The original package uses adaptive Metropolis within Gibbs sampling, while BASiCStan uses Stan to enable the use of maximum a posteriori estimation, variational inference, and Hamiltonian Monte Carlo. These each have advantages for different use cases.

Use

To use BASiCStan, we can first use BASICS_MockSCE() to generate an toy example dataset. We will also set a seed for reproducibility.

suppressPackageStartupMessages(library("BASiCStan"))
set.seed(42)
sce <- BASiCS_MockSCE()

The interface for running MCMC using BASiCS and using alternative inference methods using Stan is very similar. It is worth noting that the joint prior between mean and over-dispersion parameters, corresponding to Regression = TRUE, is the only supported mode in BASiCStan().

amwg_fit <- BASiCS_MCMC(
    sce,
    N = 200,
    Thin = 10,
    Burn = 100,
    Regression = TRUE
)
#> altExp 'spike-ins' is assumed to contain spike-in genes.
#> see help(altExp) for details.
#> Running with spikes BASiCS sampler (regression case) ...
#> -------------------------------------------------------------
#> MCMC running time 
#> -------------------------------------------------------------
#> user: 0.188
#> system: 0.008
#> elapsed: 0.196
#> -------------------------------------------------------------
#> Output 
#> -------------------------------------------------------------
#> -------------------------------------------------------------
#> BASiCS version 2.19.0 : 
#> vertical integration (spikes case) 
#> -------------------------------------------------------------
stan_fit <- BASiCStan(sce, Method = "sampling", iter = 10)
#> Warning: There were 5 divergent transitions after warmup. See
#> https://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
#> to find out why this is a problem and how to eliminate them.
#> Warning: Examine the pairs() plot to diagnose sampling problems

The output of BASiCStan() is an object of class BASiCS_Chain, similar to the output of BASiCS_MCMC(). Therefore, you could use these as you would an object generated using Metropolis within Gibbs sampling. For example, we can plot the relationship between mean and over-dispersion estimated using the joint regression prior:

BASiCS_ShowFit(amwg_fit)

BASiCS_ShowFit(stan_fit)

Stan MCMC diagnostics

Using Stan has advantages outside of the variety of inference engines available. By returning a Stan object that we can later convert to a BASiCS_Chain object, we can leverage an even broader set of functionality. For example, Stan has the ability to easily generate MCMC diagnostics using simple functions. For example, summary() outputs a number of useful per-chain and across-chain diagnostics:

stan_obj <- BASiCStan(sce, ReturnBASiCS = FALSE, Method = "sampling", iter = 10)
#> 
#> SAMPLING FOR MODEL 'basics_regression' NOW (CHAIN 1).
#> Chain 1: 
#> Chain 1: Gradient evaluation took 0.00188 seconds
#> Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 18.8 seconds.
#> Chain 1: Adjust your expectations accordingly!
#> Chain 1: 
#> Chain 1: 
#> Chain 1: WARNING: No variance estimation is
#> Chain 1:          performed for num_warmup < 20
#> Chain 1: 
#> Chain 1: Iteration: 1 / 10 [ 10%]  (Warmup)
#> Chain 1: Iteration: 2 / 10 [ 20%]  (Warmup)
#> Chain 1: Iteration: 3 / 10 [ 30%]  (Warmup)
#> Chain 1: Iteration: 4 / 10 [ 40%]  (Warmup)
#> Chain 1: Iteration: 5 / 10 [ 50%]  (Warmup)
#> Chain 1: Iteration: 6 / 10 [ 60%]  (Sampling)
#> Chain 1: Iteration: 7 / 10 [ 70%]  (Sampling)
#> Chain 1: Iteration: 8 / 10 [ 80%]  (Sampling)
#> Chain 1: Iteration: 9 / 10 [ 90%]  (Sampling)
#> Chain 1: Iteration: 10 / 10 [100%]  (Sampling)
#> Chain 1: 
#> Chain 1:  Elapsed Time: 0.033 seconds (Warm-up)
#> Chain 1:                0.018 seconds (Sampling)
#> Chain 1:                0.051 seconds (Total)
#> Chain 1:
#> Warning: There were 5 divergent transitions after warmup. See
#> https://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
#> to find out why this is a problem and how to eliminate them.
#> Warning: Examine the pairs() plot to diagnose sampling problems
head(summary(stan_obj)$summary)
#>               mean se_mean sd     2.5%      25%      50%      75%    97.5%
#> log_mu[1] 3.098420     NaN  0 3.098420 3.098420 3.098420 3.098420 3.098420
#> log_mu[2] 1.058488     NaN  0 1.058488 1.058488 1.058488 1.058488 1.058488
#> log_mu[3] 2.107672     NaN  0 2.107672 2.107672 2.107672 2.107672 2.107672
#> log_mu[4] 2.245089     NaN  0 2.245089 2.245089 2.245089 2.245089 2.245089
#> log_mu[5] 2.002693     NaN  0 2.002693 2.002693 2.002693 2.002693 2.002693
#> log_mu[6] 1.457520     NaN  0 1.457520 1.457520 1.457520 1.457520 1.457520
#>           n_eff Rhat
#> log_mu[1]   NaN  NaN
#> log_mu[2]   NaN  NaN
#> log_mu[3]   NaN  NaN
#> log_mu[4]   NaN  NaN
#> log_mu[5]   NaN  NaN
#> log_mu[6]   NaN  NaN

We can then convert this object to a BASiCS_Chain and carry on a workflow as before:

Stan2BASiCS(stan_obj)
#> An object of class BASiCS_Chain
#>  5 MCMC samples.
#>  Dataset contains 100 biological genes and 100 cells (2 batches). 
#>  Object stored using BASiCS version:  2.19.0 
#>  Parameters:  mu delta s nu theta epsilon phi beta

Session info

sessionInfo()
#> R version 4.4.2 (2024-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] BASiCStan_1.9.0             rstan_2.32.6               
#>  [3] StanHeaders_2.32.10         BASiCS_2.19.0              
#>  [5] SingleCellExperiment_1.29.1 SummarizedExperiment_1.37.0
#>  [7] Biobase_2.67.0              GenomicRanges_1.59.1       
#>  [9] GenomeInfoDb_1.43.2         IRanges_2.41.1             
#> [11] S4Vectors_0.45.2            BiocGenerics_0.53.3        
#> [13] generics_0.1.3              MatrixGenerics_1.19.0      
#> [15] matrixStats_1.4.1           rmarkdown_2.29             
#> 
#> loaded via a namespace (and not attached):
#>   [1] gridExtra_2.3             inline_0.3.20            
#>   [3] rlang_1.1.4               magrittr_2.0.3           
#>   [5] compiler_4.4.2            DelayedMatrixStats_1.29.0
#>   [7] loo_2.8.0                 vctrs_0.6.5              
#>   [9] pkgconfig_2.0.3           crayon_1.5.3             
#>  [11] fastmap_1.2.0             backports_1.5.0          
#>  [13] XVector_0.47.0            labeling_0.4.3           
#>  [15] scuttle_1.17.0            utf8_1.2.4               
#>  [17] promises_1.3.2            UCSC.utils_1.3.0         
#>  [19] xfun_0.49                 bluster_1.17.0           
#>  [21] zlibbioc_1.52.0           cachem_1.1.0             
#>  [23] beachmat_2.23.2           jsonlite_1.8.9           
#>  [25] later_1.4.1               DelayedArray_0.33.2      
#>  [27] BiocParallel_1.41.0       irlba_2.3.5.1            
#>  [29] parallel_4.4.2            cluster_2.1.6            
#>  [31] R6_2.5.1                  bslib_0.8.0              
#>  [33] limma_3.63.2              jquerylib_0.1.4          
#>  [35] Rcpp_1.0.13-1             assertthat_0.2.1         
#>  [37] knitr_1.49                splines_4.4.2            
#>  [39] httpuv_1.6.15             Matrix_1.7-1             
#>  [41] igraph_2.1.1              abind_1.4-8              
#>  [43] yaml_2.3.10               viridis_0.6.5            
#>  [45] codetools_0.2-20          miniUI_0.1.1.1           
#>  [47] curl_6.0.1                pkgbuild_1.4.5           
#>  [49] lattice_0.22-6            tibble_3.2.1             
#>  [51] withr_3.0.2               shiny_1.9.1              
#>  [53] posterior_1.6.0           coda_0.19-4.1            
#>  [55] evaluate_1.0.1            RcppParallel_5.1.9       
#>  [57] pillar_1.9.0              tensorA_0.36.2.1         
#>  [59] checkmate_2.3.2           distributional_0.5.0     
#>  [61] ggplot2_3.5.1             sparseMatrixStats_1.19.0 
#>  [63] rstantools_2.4.0          munsell_0.5.1            
#>  [65] scales_1.3.0              xtable_1.8-4             
#>  [67] glue_1.8.0                metapod_1.15.0           
#>  [69] maketools_1.3.1           tools_4.4.2              
#>  [71] hexbin_1.28.5             BiocNeighbors_2.1.1      
#>  [73] sys_3.4.3                 ScaledMatrix_1.15.0      
#>  [75] locfit_1.5-9.10           buildtools_1.0.0         
#>  [77] scran_1.35.0              cowplot_1.1.3            
#>  [79] grid_4.4.2                QuickJSR_1.4.0           
#>  [81] edgeR_4.5.0               colorspace_2.1-1         
#>  [83] GenomeInfoDbData_1.2.13   BiocSingular_1.23.0      
#>  [85] cli_3.6.3                 rsvd_1.0.5               
#>  [87] fansi_1.0.6               S4Arrays_1.7.1           
#>  [89] viridisLite_0.4.2         V8_6.0.0                 
#>  [91] glmGamPoi_1.19.1          gtable_0.3.6             
#>  [93] sass_0.4.9                digest_0.6.37            
#>  [95] SparseArray_1.7.2         dqrng_0.4.1              
#>  [97] farver_2.1.2              htmltools_0.5.8.1        
#>  [99] lifecycle_1.0.4           httr_1.4.7               
#> [101] statmod_1.5.0             mime_0.12                
#> [103] ggExtra_0.10.1            MASS_7.3-61