simPIC: simulating single-cell ATAC-seq data

Introduction to simPIC

simPIC is an R package for the simple simulation of single-cell RNA sequencing data. This vignette gives an overview and introduction to simPIC’s functionality.

Installation

simPIC can be installed from Bioconductor:

BiocManager::install("simPIC")

Quick start

Assuming you already have a matrix of count data similar to that you wish to simulate there are two simple steps to creating a simulated data set with simPIC.

# Load package
suppressPackageStartupMessages({
    library(simPIC)
})

# Load test data
set.seed(12)
counts <- readRDS(system.file("extdata", "test.rds", package = "simPIC"))

# Estimate parameters
est <- simPICestimate(counts)
#> simPIC is:
#> estimating library size parameters...
#> estimating sparsity...
#> estimating peak mean parameters...
#> using weibull distribution for estimating peak mean

# Simulate data using estimated parameters
sim <- simPICsimulate(est)
#> simPIC is:
#> updating parameters...
#> setting up SingleCellExperiment object...
#> Simulating library size...
#> Simulating peak mean...
#> using weibull distribution for simulating peak mean
#> Simulating true counts...
#> Done!!

These steps will be explained in detail in the following sections but briefly the first step takes a dataset and estimates simulation parameters from it and the second step takes those parameters and simulates a new dataset.

Input data

simPIC recommends to use a Paired-Insertion Count (PIC) matrix for optimal use of the quantitative information present in scATAC-seq data. You can convert your own matrix to PIC by following the PICsnATAC vignette. Briefly, you need three input files for PIC -

  1. cell barcodes with metadata (singlecell.csv)
  2. list of peak regions (peaks.bed)
  3. fragment files (fragment.tsv.gz)
pic_mat <- PIC_counting(
    cells = cells,
    fragment_tsv_gz_file_location = fragment_tsv_gz_file_location,
    peak_sets = peak_sets
)

The simPIC simulation

The core of the simPIC model is a gamma-Poisson distribution. This model is used to generate a uniform quantification based peak-by-cell count matrix. Mean chromatin accessibility signals for each peak are simulated from a weibull distribution with default settings. Users have the flexibility to choose from gamma, lognormal or pareto distribution as well. Each cell is given an expected library size, simulated from a log-normal distribution to match to a given dataset. Sparsity is imposed on counts simulated from a Poisson distribution.

The simPICcount class

All the parameters for the simPIC simulation are stored in a simPICcount object. A class specifically desgined for storing simPIC scATAC-seq simulation parameters. Let’s create a new one and see what it looks like.

sim.params <- newsimPICcount()

we can see the default values for the simPICcount object parameters. These values are based on provided test data.

sim.params
#> An object of class "simPICcount"
#> Slot "nPeaks":
#> [1] 5000
#> 
#> Slot "nCells":
#> [1] 700
#> 
#> Slot "seed":
#> [1] 4361
#> 
#> Slot "default":
#> [1] TRUE
#> 
#> Slot "pm.distr":
#> [1] "weibull"
#> 
#> Slot "lib.size.meanlog":
#> [1] 6.687082
#> 
#> Slot "lib.size.sdlog":
#> [1] 0.344361
#> 
#> Slot "peak.mean.shape":
#> [1] 0.7909301
#> 
#> Slot "peak.mean.rate":
#> [1] 7.100648
#> 
#> Slot "peak.mean.scale":
#> [1] 0.09522228
#> 
#> Slot "peak.mean.pi":
#> [1] -17.17441
#> 
#> Slot "peak.mean.meanlog":
#> [1] -2.825233
#> 
#> Slot "peak.mean.sdlog":
#> [1] -1.366378
#> 
#> Slot "sparsity":
#>  [1] 0.03183703 0.12368925 0.71255933 0.29731652 0.59878107 0.27771208
#>  [7] 0.11872009 0.76971198 0.80653983 0.05440784 0.88948437 0.46711477
#> [13] 0.26328122 0.93000336 0.19992803 0.19175698 0.07359594 0.38368023
#> [19] 0.26143361 0.26350299 0.72370504 0.47078848 0.55950116 0.35243315
#> [25] 0.73758191 0.54356535 0.82020172 0.39431261 0.65959837 0.32139775
#>  [ reached getOption("max.print") -- omitted 199970 entries ]

Getting and setting parameters

To get a particular parameter, for e.g., number of peaks, we can use simPICget function:

simPICget(sim.params, "nPeaks")
#> [1] 5000

Alternatively, to give a parameter a new value we can use the setsimPICparameters function:

sim.params <- setsimPICparameters(sim.params, nPeaks = 2000)
simPICget(sim.params, "nPeaks")
#> [1] 2000

To get or set multiple parameters use simPICget or setsimPICparameters functions:

# Set multiple parameters at once (using a list)
sim.params <- setsimPICparameters(sim.params,
    update = list(nPeaks = 8000, nCells = 500)
)
# Extract multiple parameters as a list
params <- simPICgetparameters(
    sim.params,
    c("nPeaks", "nCells", "peak.mean.shape")
)
# Set multiple parameters at once (using additional arguments)
params <- setsimPICparameters(sim.params,
    lib.size.sdlog = 3.5, lib.size.meanlog = 9.07
)
params
#> An object of class "simPICcount"
#> Slot "nPeaks":
#> [1] 8000
#> 
#> Slot "nCells":
#> [1] 500
#> 
#> Slot "seed":
#> [1] 4361
#> 
#> Slot "default":
#> [1] TRUE
#> 
#> Slot "pm.distr":
#> [1] "weibull"
#> 
#> Slot "lib.size.meanlog":
#> [1] 9.07
#> 
#> Slot "lib.size.sdlog":
#> [1] 3.5
#> 
#> Slot "peak.mean.shape":
#> [1] 0.7909301
#> 
#> Slot "peak.mean.rate":
#> [1] 7.100648
#> 
#> Slot "peak.mean.scale":
#> [1] 0.09522228
#> 
#> Slot "peak.mean.pi":
#> [1] -17.17441
#> 
#> Slot "peak.mean.meanlog":
#> [1] -2.825233
#> 
#> Slot "peak.mean.sdlog":
#> [1] -1.366378
#> 
#> Slot "sparsity":
#>  [1] 0.03183703 0.12368925 0.71255933 0.29731652 0.59878107 0.27771208
#>  [7] 0.11872009 0.76971198 0.80653983 0.05440784 0.88948437 0.46711477
#> [13] 0.26328122 0.93000336 0.19992803 0.19175698 0.07359594 0.38368023
#> [19] 0.26143361 0.26350299 0.72370504 0.47078848 0.55950116 0.35243315
#> [25] 0.73758191 0.54356535 0.82020172 0.39431261 0.65959837 0.32139775
#>  [ reached getOption("max.print") -- omitted 199970 entries ]

Estimating parameters

simPIC allows you to estimate many of it’s parameters from a SingleCellExperiment object containing counts or a counts matrix using the simPICestimate function.

# Get the counts from test data
count <- readRDS(system.file("extdata", "test.rds", package = "simPIC"))

# Check that counts is a dgCMatrix
class(count)
#> [1] "dgCMatrix"
#> attr(,"package")
#> [1] "Matrix"
typeof(count)
#> [1] "S4"

# Check the dimensions, each row is a peak, each column is a cell
dim(count)
#> [1] 5000  700

# Show the first few entries
count[1:5, 1:5]
#> 5 x 5 sparse Matrix of class "dgCMatrix"
#>                          TTGCCCACACTCAAGT-1 TATGTGGGTCCAAGAG-1
#> chr22-29837913-29838812                   1                  2
#> chr12-90076794-90077665                   .                  .
#> chr11-63574191-63575102                   .                  .
#> chr12-54977628-54978479                   .                  .
#> chr1-221715272-221716182                  .                  .
#>                          CGATGATCAAGTAACA-1 GTCGTAATCAGGAAGC-1
#> chr22-29837913-29838812                   .                  .
#> chr12-90076794-90077665                   .                  .
#> chr11-63574191-63575102                   .                  .
#> chr12-54977628-54978479                   .                  .
#> chr1-221715272-221716182                  .                  .
#>                          TACTAGGCACAAACAA-1
#> chr22-29837913-29838812                   .
#> chr12-90076794-90077665                   .
#> chr11-63574191-63575102                   .
#> chr12-54977628-54978479                   .
#> chr1-221715272-221716182                  1

new <- newsimPICcount()
new <- simPICestimate(count)
#> simPIC is:
#> estimating library size parameters...
#> estimating sparsity...
#> estimating peak mean parameters...
#> using weibull distribution for estimating peak mean

## estimating using gamma distribution
## new <- simPICestimate(count, pm.distr = "gamma")

Here we estimated parameters from a counts matrix using default parameters. The estimation process has the following steps:

  1. Mean parameters are estimated by fitting a weibull distribution (default) to the peak means.
  2. Library size parameters are estimated by fitting a log-normal distribution to the library sizes.
  3. Sparsity parameter is estimated by fitting a Bernoulli distribution.

For more details of the estimation procedures see ?simPICestimate.

Simulating counts

Once we have a set of parameters we are happy with we can use simPICsimulate to simulate counts. To make adjustments to the parameters provide them as additional arguments. Alternatively if we don’t supply any parameters defaults will be used:

sim <- simPICsimulate(new, nCells = 1000)
#> simPIC is:
#> updating parameters...
#> setting up SingleCellExperiment object...
#> Simulating library size...
#> Simulating peak mean...
#> using weibull distribution for simulating peak mean
#> Simulating true counts...
#> Done!!
sim
#> class: SingleCellExperiment 
#> dim: 5000 1000 
#> metadata(1): Params
#> assays(1): counts
#> rownames(5000): Peak1 Peak2 ... Peak4999 Peak5000
#> rowData names(2): Peak exp.peakmean
#> colnames(1000): Cell1 Cell2 ... Cell999 Cell1000
#> colData names(2): Cell exp.libsize
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):

## simulating using gamma distribution
## sim <- simPICsimulate(new, nCells =1000, pm.distr = "gamma")

Looking at the output of simPICsimulate we can see that sim is SingleCellExperiment object with 5000 features (peaks) and 1000 samples (cells). The main part of this object is a features by samples matrix containing the simulated counts (accessed using counts). Additionally a SingleCellExperiment contains information about each cell (accessed using colData) and each peak (accessed using rowData). simPIC uses these slots, as well as assays, to store information about the intermediate values of the simulation.

# Access the counts
counts(sim)[1:5, 1:5]
#>       Cell1 Cell2 Cell3 Cell4 Cell5
#> Peak1     0     0     0     0     0
#> Peak2     0     0     1     1     0
#> Peak3     1     0     1     1     0
#> Peak4     0     0     0     0     0
#> Peak5     0     0     0     0     0
# Information about peaks
head(rowData(sim))
#> DataFrame with 6 rows and 2 columns
#>              Peak exp.peakmean
#>       <character>    <numeric>
#> Peak1       Peak1  6.75102e-06
#> Peak2       Peak2  2.68961e-04
#> Peak3       Peak3  1.02505e-04
#> Peak4       Peak4  3.80896e-06
#> Peak5       Peak5  4.83026e-05
#> Peak6       Peak6  4.76012e-05
# Information about cells
head(colData(sim))
#> DataFrame with 6 rows and 2 columns
#>              Cell exp.libsize
#>       <character>   <numeric>
#> Cell1       Cell1        1467
#> Cell2       Cell2         575
#> Cell3       Cell3        1122
#> Cell4       Cell4        1105
#> Cell5       Cell5         925
#> Cell6       Cell6         887
# Peak by cell matrices
names(assays(sim))
#> [1] "counts"

For more details about the SingleCellExperiment object refer to thevignette.

The simPICsimulate function provides additional simulation details:

* Cell information (colData)

* Cell - Unique cell identifier.

* exp.libsize - The expected library size for that cell. (not obtained from the final simulated counts)

* Peak information (rowData)

* Peak - Unique peak identifier.

* exp.peakmean - The expected peak means for that peak. (not obtained from the final simulated counts)

* Peak by cell information (assays)

* counts - The final simulated counts.

For more information on the simulation see ?simPICsimulate.

Comparing simulations and real data

simPIC provides a function simPICcompare that aims to make these comparisons easier. This function takes a list of SingleCellExperiment objects, combines the datasets and produces comparison plots. Let’s make two small simulations and see how they compare.

sim1 <- simPICsimulate(nPeaks = 20000, nCells = 1000)
#> simPIC is:
#> updating parameters...
#> setting up SingleCellExperiment object...
#> Simulating library size...
#> Simulating peak mean...
#> using weibull distribution for simulating peak mean
#> Simulating true counts...
#> Done!!
sim2 <- simPICsimulate(nPeaks = 20000, nCells = 1000)
#> simPIC is:
#> updating parameters...
#> setting up SingleCellExperiment object...
#> Simulating library size...
#> Simulating peak mean...
#> using weibull distribution for simulating peak mean
#> Simulating true counts...
#> Done!!
comparison <- simPICcompare(list(real = sim1, simPIC = sim2))

names(comparison)
#> [1] "RowData" "ColData" "Plots"
names(comparison$Plots)
#> [1] "Means"        "Variances"    "MeanVar"      "LibrarySizes" "ZerosPeak"   
#> [6] "ZerosCell"    "MeanZeros"    "PeakMeanNzp"

The returned list has three items. The first two are the combined datasets by peak (RowData) and by cell (ColData) and the third contains some comparison plots (produced using ggplot2), for example a plot of the distribution of means:

comparison$Plots$Means

These are only a few of the plots you might want to consider but it should be easy to make more using the returned data.

Citing simPIC

If you use simPIC in your work please cite our paper:

citation("simPIC")
#> To cite package 'simPIC' in publications use:
#> 
#>   Chugh S, McCarthy D, Shim H (2023). _simPIC: simPIC: flexible
#>   simulation of paired-insertion counts for single-cell ATAC-sequencing
#>   data_. R package version 1.3.0,
#>   <https://github.com/sagrikachugh/simPIC>.
#> 
#> A BibTeX entry for LaTeX users is
#> 
#>   @Manual{,
#>     title = {simPIC: simPIC: flexible simulation of paired-insertion counts for
#> single-cell ATAC-sequencing data},
#>     author = {Sagrika Chugh and Davis McCarthy and Heejung Shim},
#>     year = {2023},
#>     note = {R package version 1.3.0},
#>     url = {https://github.com/sagrikachugh/simPIC},
#>   }

Session information

sessionInfo()
#> R version 4.4.2 (2024-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] simPIC_1.3.0                SingleCellExperiment_1.29.1
#>  [3] SummarizedExperiment_1.37.0 Biobase_2.67.0             
#>  [5] GenomicRanges_1.59.1        GenomeInfoDb_1.43.2        
#>  [7] IRanges_2.41.2              S4Vectors_0.45.2           
#>  [9] BiocGenerics_0.53.3         generics_0.1.3             
#> [11] MatrixGenerics_1.19.0       matrixStats_1.4.1          
#> [13] BiocStyle_2.35.0           
#> 
#> loaded via a namespace (and not attached):
#>  [1] gtable_0.3.6            xfun_0.49               bslib_0.8.0            
#>  [4] ggplot2_3.5.1           lattice_0.22-6          vctrs_0.6.5            
#>  [7] tools_4.4.2             parallel_4.4.2          fansi_1.0.6            
#> [10] tibble_3.2.1            pkgconfig_2.0.3         Matrix_1.7-1           
#> [13] checkmate_2.3.2         actuar_3.3-4            lifecycle_1.0.4        
#> [16] GenomeInfoDbData_1.2.13 farver_2.1.2            compiler_4.4.2         
#> [19] munsell_0.5.1           codetools_0.2-20        htmltools_0.5.8.1      
#> [22] sys_3.4.3               buildtools_1.0.0        sass_0.4.9             
#> [25] yaml_2.3.10             pillar_1.9.0            crayon_1.5.3           
#> [28] jquerylib_0.1.4         MASS_7.3-61             fitdistrplus_1.2-1     
#>  [ reached getOption("max.print") -- omitted 36 entries ]