simPIC is an R package for the simple simulation of single-cell RNA sequencing data. This vignette gives an overview and introduction to simPIC’s functionality.
Assuming you already have a matrix of count data similar to that you wish to simulate there are two simple steps to creating a simulated data set with simPIC.
# Load package
suppressPackageStartupMessages({
library(simPIC)
})
# Load test data
set.seed(12)
counts <- readRDS(system.file("extdata", "test.rds", package = "simPIC"))
# Estimate parameters
est <- simPICestimate(counts)
#> simPIC is:
#> estimating library size parameters...
#> estimating sparsity...
#> estimating peak mean parameters...
#> using weibull distribution for estimating peak mean
# Simulate data using estimated parameters
sim <- simPICsimulate(est)
#> simPIC is:
#> updating parameters...
#> setting up SingleCellExperiment object...
#> Simulating library size...
#> Simulating peak mean...
#> using weibull distribution for simulating peak mean
#> Simulating true counts...
#> Done!!
These steps will be explained in detail in the following sections but briefly the first step takes a dataset and estimates simulation parameters from it and the second step takes those parameters and simulates a new dataset.
simPIC
recommends to use a Paired-Insertion Count (PIC)
matrix for optimal use of the quantitative information present in
scATAC-seq data. You can convert your own matrix to PIC by following the
PICsnATAC
vignette.
Briefly, you need three input files for PIC -
singlecell.csv
)peaks.bed
)fragment.tsv.gz
)The core of the simPIC model is a gamma-Poisson distribution. This
model is used to generate a uniform quantification based peak-by-cell
count matrix. Mean chromatin accessibility signals for each peak are
simulated from a weibull
distribution with default settings. Users have the flexibility to
choose from gamma
, lognormal
or
pareto
distribution as well. Each cell is given an expected
library size, simulated from a log-normal distribution to match to a
given dataset. Sparsity is imposed on counts simulated from a Poisson
distribution.
simPICcount
classAll the parameters for the simPIC simulation are stored in a
simPICcount
object. A class specifically desgined for
storing simPIC scATAC-seq simulation parameters. Let’s create a new one
and see what it looks like.
we can see the default values for the simPICcount
object
parameters. These values are based on provided test data.
sim.params
#> An object of class "simPICcount"
#> Slot "nPeaks":
#> [1] 5000
#>
#> Slot "nCells":
#> [1] 700
#>
#> Slot "seed":
#> [1] 4361
#>
#> Slot "default":
#> [1] TRUE
#>
#> Slot "pm.distr":
#> [1] "weibull"
#>
#> Slot "lib.size.meanlog":
#> [1] 6.687082
#>
#> Slot "lib.size.sdlog":
#> [1] 0.344361
#>
#> Slot "peak.mean.shape":
#> [1] 0.7909301
#>
#> Slot "peak.mean.rate":
#> [1] 7.100648
#>
#> Slot "peak.mean.scale":
#> [1] 0.09522228
#>
#> Slot "peak.mean.pi":
#> [1] -17.17441
#>
#> Slot "peak.mean.meanlog":
#> [1] -2.825233
#>
#> Slot "peak.mean.sdlog":
#> [1] -1.366378
#>
#> Slot "sparsity":
#> [1] 0.03183703 0.12368925 0.71255933 0.29731652 0.59878107 0.27771208
#> [7] 0.11872009 0.76971198 0.80653983 0.05440784 0.88948437 0.46711477
#> [13] 0.26328122 0.93000336 0.19992803 0.19175698 0.07359594 0.38368023
#> [19] 0.26143361 0.26350299 0.72370504 0.47078848 0.55950116 0.35243315
#> [25] 0.73758191 0.54356535 0.82020172 0.39431261 0.65959837 0.32139775
#> [ reached getOption("max.print") -- omitted 199970 entries ]
To get a particular parameter, for e.g., number of peaks, we can use
simPICget
function:
Alternatively, to give a parameter a new value we can use the
setsimPICparameters
function:
sim.params <- setsimPICparameters(sim.params, nPeaks = 2000)
simPICget(sim.params, "nPeaks")
#> [1] 2000
To get or set multiple parameters use simPICget
or
setsimPICparameters
functions:
# Set multiple parameters at once (using a list)
sim.params <- setsimPICparameters(sim.params,
update = list(nPeaks = 8000, nCells = 500)
)
# Extract multiple parameters as a list
params <- simPICgetparameters(
sim.params,
c("nPeaks", "nCells", "peak.mean.shape")
)
# Set multiple parameters at once (using additional arguments)
params <- setsimPICparameters(sim.params,
lib.size.sdlog = 3.5, lib.size.meanlog = 9.07
)
params
#> An object of class "simPICcount"
#> Slot "nPeaks":
#> [1] 8000
#>
#> Slot "nCells":
#> [1] 500
#>
#> Slot "seed":
#> [1] 4361
#>
#> Slot "default":
#> [1] TRUE
#>
#> Slot "pm.distr":
#> [1] "weibull"
#>
#> Slot "lib.size.meanlog":
#> [1] 9.07
#>
#> Slot "lib.size.sdlog":
#> [1] 3.5
#>
#> Slot "peak.mean.shape":
#> [1] 0.7909301
#>
#> Slot "peak.mean.rate":
#> [1] 7.100648
#>
#> Slot "peak.mean.scale":
#> [1] 0.09522228
#>
#> Slot "peak.mean.pi":
#> [1] -17.17441
#>
#> Slot "peak.mean.meanlog":
#> [1] -2.825233
#>
#> Slot "peak.mean.sdlog":
#> [1] -1.366378
#>
#> Slot "sparsity":
#> [1] 0.03183703 0.12368925 0.71255933 0.29731652 0.59878107 0.27771208
#> [7] 0.11872009 0.76971198 0.80653983 0.05440784 0.88948437 0.46711477
#> [13] 0.26328122 0.93000336 0.19992803 0.19175698 0.07359594 0.38368023
#> [19] 0.26143361 0.26350299 0.72370504 0.47078848 0.55950116 0.35243315
#> [25] 0.73758191 0.54356535 0.82020172 0.39431261 0.65959837 0.32139775
#> [ reached getOption("max.print") -- omitted 199970 entries ]
simPIC allows you to estimate many of it’s parameters from a
SingleCellExperiment object containing counts or a counts matrix using
the simPICestimate
function.
# Get the counts from test data
count <- readRDS(system.file("extdata", "test.rds", package = "simPIC"))
# Check that counts is a dgCMatrix
class(count)
#> [1] "dgCMatrix"
#> attr(,"package")
#> [1] "Matrix"
typeof(count)
#> [1] "S4"
# Check the dimensions, each row is a peak, each column is a cell
dim(count)
#> [1] 5000 700
# Show the first few entries
count[1:5, 1:5]
#> 5 x 5 sparse Matrix of class "dgCMatrix"
#> TTGCCCACACTCAAGT-1 TATGTGGGTCCAAGAG-1
#> chr22-29837913-29838812 1 2
#> chr12-90076794-90077665 . .
#> chr11-63574191-63575102 . .
#> chr12-54977628-54978479 . .
#> chr1-221715272-221716182 . .
#> CGATGATCAAGTAACA-1 GTCGTAATCAGGAAGC-1
#> chr22-29837913-29838812 . .
#> chr12-90076794-90077665 . .
#> chr11-63574191-63575102 . .
#> chr12-54977628-54978479 . .
#> chr1-221715272-221716182 . .
#> TACTAGGCACAAACAA-1
#> chr22-29837913-29838812 .
#> chr12-90076794-90077665 .
#> chr11-63574191-63575102 .
#> chr12-54977628-54978479 .
#> chr1-221715272-221716182 1
new <- newsimPICcount()
new <- simPICestimate(count)
#> simPIC is:
#> estimating library size parameters...
#> estimating sparsity...
#> estimating peak mean parameters...
#> using weibull distribution for estimating peak mean
## estimating using gamma distribution
## new <- simPICestimate(count, pm.distr = "gamma")
Here we estimated parameters from a counts matrix using default parameters. The estimation process has the following steps:
For more details of the estimation procedures see
?simPICestimate
.
Once we have a set of parameters we are happy with we can use
simPICsimulate
to simulate counts. To make adjustments to
the parameters provide them as additional arguments. Alternatively if we
don’t supply any parameters defaults will be used:
sim <- simPICsimulate(new, nCells = 1000)
#> simPIC is:
#> updating parameters...
#> setting up SingleCellExperiment object...
#> Simulating library size...
#> Simulating peak mean...
#> using weibull distribution for simulating peak mean
#> Simulating true counts...
#> Done!!
sim
#> class: SingleCellExperiment
#> dim: 5000 1000
#> metadata(1): Params
#> assays(1): counts
#> rownames(5000): Peak1 Peak2 ... Peak4999 Peak5000
#> rowData names(2): Peak exp.peakmean
#> colnames(1000): Cell1 Cell2 ... Cell999 Cell1000
#> colData names(2): Cell exp.libsize
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):
## simulating using gamma distribution
## sim <- simPICsimulate(new, nCells =1000, pm.distr = "gamma")
Looking at the output of simPICsimulate
we can see that
sim
is SingleCellExperiment
object with 5000
features (peaks) and 1000 samples (cells). The main part of this object
is a features by samples matrix containing the simulated counts
(accessed using counts
). Additionally a
SingleCellExperiment
contains information about each cell
(accessed using colData
) and each peak (accessed using
rowData
). simPIC uses these slots, as well as
assays
, to store information about the intermediate values
of the simulation.
# Access the counts
counts(sim)[1:5, 1:5]
#> Cell1 Cell2 Cell3 Cell4 Cell5
#> Peak1 0 0 0 0 0
#> Peak2 0 0 1 1 0
#> Peak3 1 0 1 1 0
#> Peak4 0 0 0 0 0
#> Peak5 0 0 0 0 0
# Information about peaks
head(rowData(sim))
#> DataFrame with 6 rows and 2 columns
#> Peak exp.peakmean
#> <character> <numeric>
#> Peak1 Peak1 6.75102e-06
#> Peak2 Peak2 2.68961e-04
#> Peak3 Peak3 1.02505e-04
#> Peak4 Peak4 3.80896e-06
#> Peak5 Peak5 4.83026e-05
#> Peak6 Peak6 4.76012e-05
# Information about cells
head(colData(sim))
#> DataFrame with 6 rows and 2 columns
#> Cell exp.libsize
#> <character> <numeric>
#> Cell1 Cell1 1467
#> Cell2 Cell2 575
#> Cell3 Cell3 1122
#> Cell4 Cell4 1105
#> Cell5 Cell5 925
#> Cell6 Cell6 887
# Peak by cell matrices
names(assays(sim))
#> [1] "counts"
For more details about the SingleCellExperiment object refer to thevignette.
The simPICsimulate
function provides additional
simulation details:
* Cell information (colData
)
* Cell
- Unique cell identifier.
* exp.libsize
- The expected library size for that cell.
(not obtained from the final simulated counts)
* Peak information (rowData
)
* Peak
- Unique peak identifier.
* exp.peakmean
- The expected peak means for that peak.
(not obtained from the final simulated counts)
* Peak by cell information (assays
)
* counts
- The final simulated counts.
For more information on the simulation see
?simPICsimulate
.
simPIC provides a function simPICcompare
that aims to
make these comparisons easier. This function takes a list of
SingleCellExperiment
objects, combines the datasets and
produces comparison plots. Let’s make two small simulations and see how
they compare.
sim1 <- simPICsimulate(nPeaks = 20000, nCells = 1000)
#> simPIC is:
#> updating parameters...
#> setting up SingleCellExperiment object...
#> Simulating library size...
#> Simulating peak mean...
#> using weibull distribution for simulating peak mean
#> Simulating true counts...
#> Done!!
sim2 <- simPICsimulate(nPeaks = 20000, nCells = 1000)
#> simPIC is:
#> updating parameters...
#> setting up SingleCellExperiment object...
#> Simulating library size...
#> Simulating peak mean...
#> using weibull distribution for simulating peak mean
#> Simulating true counts...
#> Done!!
comparison <- simPICcompare(list(real = sim1, simPIC = sim2))
names(comparison)
#> [1] "RowData" "ColData" "Plots"
names(comparison$Plots)
#> [1] "Means" "Variances" "MeanVar" "LibrarySizes" "ZerosPeak"
#> [6] "ZerosCell" "MeanZeros" "PeakMeanNzp"
The returned list has three items. The first two are the combined
datasets by peak (RowData
) and by cell
(ColData
) and the third contains some comparison plots
(produced using ggplot2
), for example a plot of the
distribution of means:
These are only a few of the plots you might want to consider but it should be easy to make more using the returned data.
If you use simPIC in your work please cite our paper:
citation("simPIC")
#> To cite package 'simPIC' in publications use:
#>
#> Chugh S, McCarthy D, Shim H (2023). _simPIC: simPIC: flexible
#> simulation of paired-insertion counts for single-cell ATAC-sequencing
#> data_. R package version 1.3.0,
#> <https://github.com/sagrikachugh/simPIC>.
#>
#> A BibTeX entry for LaTeX users is
#>
#> @Manual{,
#> title = {simPIC: simPIC: flexible simulation of paired-insertion counts for
#> single-cell ATAC-sequencing data},
#> author = {Sagrika Chugh and Davis McCarthy and Heejung Shim},
#> year = {2023},
#> note = {R package version 1.3.0},
#> url = {https://github.com/sagrikachugh/simPIC},
#> }
sessionInfo()
#> R version 4.4.2 (2024-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] simPIC_1.3.0 SingleCellExperiment_1.29.1
#> [3] SummarizedExperiment_1.37.0 Biobase_2.67.0
#> [5] GenomicRanges_1.59.1 GenomeInfoDb_1.43.2
#> [7] IRanges_2.41.2 S4Vectors_0.45.2
#> [9] BiocGenerics_0.53.3 generics_0.1.3
#> [11] MatrixGenerics_1.19.0 matrixStats_1.4.1
#> [13] BiocStyle_2.35.0
#>
#> loaded via a namespace (and not attached):
#> [1] gtable_0.3.6 xfun_0.49 bslib_0.8.0
#> [4] ggplot2_3.5.1 lattice_0.22-6 vctrs_0.6.5
#> [7] tools_4.4.2 parallel_4.4.2 fansi_1.0.6
#> [10] tibble_3.2.1 pkgconfig_2.0.3 Matrix_1.7-1
#> [13] checkmate_2.3.2 actuar_3.3-4 lifecycle_1.0.4
#> [16] GenomeInfoDbData_1.2.13 farver_2.1.2 compiler_4.4.2
#> [19] munsell_0.5.1 codetools_0.2-20 htmltools_0.5.8.1
#> [22] sys_3.4.3 buildtools_1.0.0 sass_0.4.9
#> [25] yaml_2.3.10 pillar_1.9.0 crayon_1.5.3
#> [28] jquerylib_0.1.4 MASS_7.3-61 fitdistrplus_1.2-1
#> [ reached getOption("max.print") -- omitted 36 entries ]