spatialSimGP
is a simulation tool that generates spatial
transcriptomics data. The purpose of this package is to use a Gaussian
Process for each gene to simulate data with spatial variation. We use
the Poisson distribution to simulate the values on the raw counts scale.
The mean and variance are tied together in the Poisson distribution, so
we simulate the mean-variance relationship with our function. The
mean-variance relationship is a bias in real spatial transcriptomics
data, so we must make sure it is a feature of in silico data as well.
spatialSimGP
provides the option to simulate data with a
fixed or unique length scale for each gene. The simulated data can be
used to evaluate the performance of spatial transcriptomics analysis
methods.
Bioconductor houses the infrastructure to store and analyze spatially
resolved transcriptomics data for R users, including many SVG detection
methods. This simulation framework can be used to benchmark SVG
detection methods and to develop new methods for spatially resolved
transcriptomics data. Additionally, this package interfaces with the
widely used SpatialExperiment
class from Bioconductor.
The following code will install the latest release version of the
spatialSimGP
package from Bioconductor. Additional details
are shown on the Bioconductor
page.
The latest development version can also be installed from the
devel
version of Bioconductor or from GitHub.
The simulation framework is as follows:
c(s)|λ(s) ∼ Poisson(λ(s)); λ(s) = exp(β + C(σ2))
The exponential covariance function is as follows:
$$(C_{ij}(\boldsymbol{\theta})) = \sigma^2\exp(\frac{-||\boldsymbol{s_i}-\boldsymbol{s_j}||}{l})$$
We calculate the covariance matrix using the exponential covariance function. Using mean 0 and covariance C(θ) in the multivariate Normal distribution, we simulate a Gaussian Process per gene. We use the Gaussian process and β to calculate λ and then use the Poisson distribution to simulate the gene expression levels for each spot.
Load packages and data
## Loading required package: SingleCellExperiment
## Loading required package: SummarizedExperiment
## Loading required package: MatrixGenerics
## Loading required package: matrixStats
##
## Attaching package: 'MatrixGenerics'
## The following objects are masked from 'package:matrixStats':
##
## colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
## colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
## colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
## colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
## colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
## colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
## colWeightedMeans, colWeightedMedians, colWeightedSds,
## colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
## rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
## rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
## rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
## rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
## rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
## rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
## rowWeightedSds, rowWeightedVars
## Loading required package: GenomicRanges
## Loading required package: stats4
## Loading required package: BiocGenerics
## Loading required package: generics
##
## Attaching package: 'generics'
## The following objects are masked from 'package:base':
##
## as.difftime, as.factor, as.ordered, intersect, is.element, setdiff,
## setequal, union
##
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:generics':
##
## intersect, setdiff, setequal, union
## The following objects are masked from 'package:stats':
##
## IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
##
## Filter, Find, Map, Position, Reduce, anyDuplicated, aperm, append,
## as.data.frame, basename, cbind, colnames, dirname, do.call,
## duplicated, eval, evalq, get, grep, grepl, intersect, is.unsorted,
## lapply, mapply, match, mget, order, paste, pmax, pmax.int, pmin,
## pmin.int, rank, rbind, rownames, sapply, saveRDS, setdiff,
## setequal, table, tapply, union, unique, unsplit, which.max,
## which.min
## Loading required package: S4Vectors
##
## Attaching package: 'S4Vectors'
## The following object is masked from 'package:utils':
##
## findMatches
## The following objects are masked from 'package:base':
##
## I, expand.grid, unname
## Loading required package: IRanges
## Loading required package: GenomeInfoDb
## Loading required package: Biobase
## Welcome to Bioconductor
##
## Vignettes contain introductory material; view with
## 'browseVignettes()'. To cite Bioconductor, see
## 'citation("Biobase")', and for packages 'citation("pkgname")'.
##
## Attaching package: 'Biobase'
## The following object is masked from 'package:MatrixGenerics':
##
## rowMedians
## The following objects are masked from 'package:matrixStats':
##
## anyMissing, rowMedians
## Loading required package: ExperimentHub
## Loading required package: AnnotationHub
## Loading required package: BiocFileCache
## Loading required package: dbplyr
##
## Attaching package: 'AnnotationHub'
## The following object is masked from 'package:Biobase':
##
## cache
Simulating Data with Prior Coordinates Matrix
One way to simulate data is to provide a matrix of coordinates. In
this example, we use a subset of spots from
STexampleData::Visium_mouseCoronal()
, which is available
from Bioconductor.
## see ?STexampleData and browseVignettes('STexampleData') for documentation
## downloading 1 resources
## retrieving 1 resource
## loading from cache
colData(spe_demo)$subset <- ifelse(
colData(spe_demo)$array_row > 20 &
colData(spe_demo)$array_row < 65 &
colData(spe_demo)$array_col > 30 &
colData(spe_demo)$array_col < 65,
TRUE, FALSE
)
spe_demo <- spe_demo[, colData(spe_demo)$subset]
coords <- spatialCoords(spe_demo)
We also have to define our remaining parameters before simulating the data.
n_genes
is the total number of genes to simulate. In
this example, we simulate 5 genes.proportion
is the proportion of genes that will have no
spatially varying patterns. In other words, these genes will just have
random noise. In this example, 40% of the genes will have no spatial
patterns.range_sigma.sq
is the range of the spatial variance
parameter. In this example, the spatial variance parameter will range
from 1.5 to 3.range_beta
is the range of the mean expression value.
In this example, the mean parameter will range from 3 to 7.(A) Simulating Data with Fixed Length Scale
We first simulate 5 genes with a fixed length scale parameter. The
length scale parameter determines how quickly the correlation decays
with distance. Larger length scale parameters simulate larger spatial
patterns. The simulate
function returns a
SpatialExperiment
object with the simulated data. Remember
to set the seed for reproducibility.
length_scale <- 60
set.seed(16)
spe <- spatial_simulate(n_genes, proportion, coords,
range_sigma.sq, range_beta,
length_scale, length_scale_option = "fixed")
## Simulating gene 1
## Simulating gene 2
## Simulating gene 3
## Simulating gene 4
## Simulating gene 5
We can visualize the first gene in the simulated data below:
df <- as.data.frame(cbind(spatialCoords(spe), expr = counts(spe)[1, ]))
ggplot(df, aes(x = pxl_col_in_fullres, y = pxl_row_in_fullres,
color = expr)) +
geom_point(size = 2.2) +
coord_fixed() +
scale_y_reverse() +
scale_color_gradient(low = "gray90", high = "blue",
trans = "sqrt", breaks = range(df$expr),
name = "counts") +
theme_bw() +
theme(plot.title = element_text(face = "italic"),
panel.grid = element_blank(),
axis.title = element_blank(),
axis.text = element_blank(),
axis.ticks = element_blank())
(B) Simulating Data with Unique Length Scale
We can also simulate data with a unique length scale for each gene. This process is slower than simulating data with a fixed length scale, but it allows for more flexibility in the spatial patterns of each gene. Each gene has a unique length scale parameter, so the Gaussian Process kernel must be calculated for each gene, slowing down the simulation process.
length_scale <- c(60, 40, 20, 80, 100)
set.seed(1)
spe <- spatial_simulate(n_genes, proportion, coords,
range_sigma.sq, range_beta,
length_scale, length_scale_option = "unique")
## Simulating gene 1
## Simulating gene 2
## Simulating gene 3
## Simulating gene 4
## Simulating gene 5
Simulating Data with User-Created Coordinates Matrix
If you have your own coordinates matrix, you can use that to simulate data. We have included an example below.
# 20 spots per side
n_side <- 20
# x and y coordinates for the grid
x_coords <- rep(1:n_side, each = n_side)
y_coords <- rep(1:n_side, times = n_side)
# combine into a matrix
coords <- cbind(x_coords, y_coords)
colnames(coords) <- c("pxl_col_in_fullres", "pxl_row_in_fullres")
# run the simulation
set.seed(1)
length_scale <- 60
spe <- spatial_simulate(n_genes, proportion, coords,
range_sigma.sq, range_beta,
length_scale, length_scale_option = "fixed")
## Simulating gene 1
## Simulating gene 2
## Simulating gene 3
## Simulating gene 4
## Simulating gene 5
We can visualize the first gene in the simulated data below:
df <- as.data.frame(cbind(spatialCoords(spe), expr = counts(spe)[1, ]))
ggplot(df, aes(x = pxl_col_in_fullres, y = pxl_row_in_fullres,
color = expr)) +
geom_point(size = 5) +
coord_fixed() +
scale_y_reverse() +
scale_color_gradient(low = "gray90", high = "blue",
trans = "sqrt", breaks = range(df$expr),
name = "counts") +
theme_bw() +
theme(plot.title = element_text(face = "italic"),
panel.grid = element_blank(),
axis.title = element_blank(),
axis.text = element_blank(),
axis.ticks = element_blank())
Note: If you want to have complete control over each simulated gene,
you can set n_genes
= 1, proportion
= 0,
range_sigma.sq
= c(a,a), and range_beta
=
c(b,b). This will allow you to simulate one gene at a time at the exact
spatial variance and mean expression level desired. You could loop
through this process to simulate multiple genes with different
parameters.
set.seed(123)
n_genes <- 1
proportion <- 0
range_sigma.sq <- c(1, 1)
range_beta <- c(3, 3)
length_scale <- 60
spe <- spatial_simulate(n_genes, proportion, coords,
range_sigma.sq, range_beta,
length_scale, length_scale_option = "fixed")
## Simulating gene 1
## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] spatialSimGP_1.1.0 ggplot2_3.5.1
## [3] STexampleData_1.13.3 ExperimentHub_2.15.0
## [5] AnnotationHub_3.15.0 BiocFileCache_2.15.0
## [7] dbplyr_2.5.0 SpatialExperiment_1.16.0
## [9] SingleCellExperiment_1.28.0 SummarizedExperiment_1.36.0
## [11] Biobase_2.67.0 GenomicRanges_1.59.0
## [13] GenomeInfoDb_1.43.0 IRanges_2.41.0
## [15] S4Vectors_0.44.0 BiocGenerics_0.53.1
## [17] generics_0.1.3 MatrixGenerics_1.19.0
## [19] matrixStats_1.4.1 MASS_7.3-61
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.2.1 farver_2.1.2 dplyr_1.1.4
## [4] blob_1.2.4 filelock_1.0.3 Biostrings_2.75.0
## [7] fastmap_1.2.0 digest_0.6.37 mime_0.12
## [10] lifecycle_1.0.4 KEGGREST_1.47.0 RSQLite_2.3.7
## [13] magrittr_2.0.3 compiler_4.4.1 rlang_1.1.4
## [16] sass_0.4.9 tools_4.4.1 utf8_1.2.4
## [19] yaml_2.3.10 knitr_1.48 labeling_0.4.3
## [22] S4Arrays_1.6.0 bit_4.5.0 curl_5.2.3
## [25] DelayedArray_0.33.1 abind_1.4-8 purrr_1.0.2
## [28] withr_3.0.2 sys_3.4.3 grid_4.4.1
## [31] fansi_1.0.6 colorspace_2.1-1 scales_1.3.0
## [34] cli_3.6.3 rmarkdown_2.28 crayon_1.5.3
## [37] httr_1.4.7 rjson_0.2.23 DBI_1.2.3
## [40] cachem_1.1.0 zlibbioc_1.52.0 AnnotationDbi_1.69.0
## [43] BiocManager_1.30.25 XVector_0.46.0 vctrs_0.6.5
## [46] Matrix_1.7-1 jsonlite_1.8.9 bit64_4.5.2
## [49] maketools_1.3.1 magick_2.8.5 jquerylib_0.1.4
## [52] glue_1.8.0 gtable_0.3.6 BiocVersion_3.21.1
## [55] UCSC.utils_1.2.0 munsell_0.5.1 tibble_3.2.1
## [58] pillar_1.9.0 rappdirs_0.3.3 htmltools_0.5.8.1
## [61] GenomeInfoDbData_1.2.13 R6_2.5.1 evaluate_1.0.1
## [64] lattice_0.22-6 highr_0.11 png_0.1-8
## [67] memoise_2.0.1 bslib_0.8.0 Rcpp_1.0.13
## [70] SparseArray_1.6.0 xfun_0.48 buildtools_1.0.0
## [73] pkgconfig_2.0.3