This vignette demonstrates the use of the smoppix package: Single MOlecule sPatial omics data analysed through the Probabilistic IndeX. It unifies methods for statistical inference on single-molecule spatial transcriptomics datasets with replication in a single framework using probabilistic indices (PIs). Tests provided include univariate testing for vicinity to fixed objects such as cell walls or nuclei, univariate aggregation tests as well as bivariate tests for co- or antilocalization. Two different null hypothesis can be tested against: complete spatial randomness (CSR, also known as homogeneous Poisson process), and the background distribution of molecules of all other genes. In addition, differences between groups in terms of any of these univariate and bivariate PIs can be tested for, while accounting for complex design structures using mixed models. smoppix is scalable to large numbers of molecules and genes thanks to an exact permutation null distribution in combination with a custom C++ implementation. An adequate variance weighting scheme harnesses the high-dimensionality of the data to account for heteroscedasticity. The smoppix package is also applicable to other types of single-molecule omics data such as spatial proteomics or metabolomics, or to cell type localization data.
The package can be installed from Bioconductor as follows:
Alternatively, the latest version can be installed from github as:
Once installed, we can load the package:
The smoppix package can be computationally intensive, especially since the number transcript pairs grows quadratically with the number of transcripts. For this reason we provide multithreading through the BiocParallel package.
All the user needs to do, is choose the number of cores:
and prepare the parallel backend. The code is different for windows or unix-based systems, but is taken care of by the following code snippet:
if(.Platform$OS.type == "unix"){
#On unix-based systems, use MulticoreParam
register(MulticoreParam(nCores))
} else {
#On windows, use makeCluster
library(doParallel)
Clus = makeCluster(nCores)
registerDoParallel(Clus)
register(DoparParam(), default = TRUE)
}
Wherever applicable, the registered backend will be used throughout the analysis. For serial calculations, for instance if memory is limiting, choose
As example dataset we use the data from a publication by Yang et al. (2023) on Selaginella moellendorffii . Only a subset of the data, consisting of sections 1-5 of roots 1-3 is included for computational reasons. The data is loaded as follows.
A quick glimpse …
## x y gene day root section
## 1 357 1265 SmEIL1e day0 Root2 section3
## 2 425 1281 SmEIL1e day0 Root2 section3
## 3 381 1475 SmEIL1e day0 Root2 section3
## 4 317 1919 SmEIL1e day0 Root2 section3
## 5 1617 1801 SmEIL1e day0 Root2 section3
## 6 713 201 SmCASP1 day0 Root2 section3
It is a dataframe containing molecule locations (x- and y-variables), gene identity and covariates: day, root and section. At each day, 3 roots are included, with up to 5 sections for each root. The analysis will need to account for this nested design. The first step in the analysis is to build a hyperframe as provided in the spatstat package, which consists of all different point patterns and their covariates
## Found 29 unique images
As printed by the code, 29 images or point patterns were found. We take a closer look at the hyperframe
## Hyperframe:
## ppp image tabObs day root section
## day0_Root1_section1 (ppp) day0_Root1_section1 (table) day0 Root1 section1
## day0_Root1_section2 (ppp) day0_Root1_section2 (table) day0 Root1 section2
## day0_Root1_section3 (ppp) day0_Root1_section3 (table) day0 Root1 section3
## day0_Root1_section4 (ppp) day0_Root1_section4 (table) day0 Root1 section4
## day0_Root1_section5 (ppp) day0_Root1_section5 (table) day0 Root1 section5
## day0_Root2_section1 (ppp) day0_Root2_section1 (table) day0 Root2 section1
We see the following elements:
Zooming in on the first point pattern:
## Marked planar point pattern: 2279 points
## marks are of storage type 'character'
## window: rectangle = [2451, 6275] x [138, 1929] units
As a final confirmation of successfully reading in the data, we can make an exploratory plot:
A couple of most expressed genes are shown as coloured dots as shown in the legend, all other genes’ transcripts are shown in grey.
The first round of tests concerns the aggregation or regularity of individual transcripts, for which we look at the distribution of nearest neighbour distances (pi = ‘nn’). Since the shape and size differ strongly between the roots, we choose the background of all other transcripts as null distribution, i.e. we look for transcripts aggregated with respect to all other transcripts. As this is computationally most efficient, we also fit the PI for the bivariate analysis in this step (‘nnPair’, see below). For computation reasons, we limit the analysis to 15 genes.
nnObj <- estPis(hypYang,
pis = c("nn", "nnPair"), null = "background", verbose = FALSE,
features = c("SmVND2", "SmPINR", getFeatures(hypYang)[seq_len(numFeats)])
)
A data-driven weighting function is now fitted to the estimates of all transcripts to share information on variances across features. This requires information on the design too: all variables that will be included in the final modelling should be provided to ‘designVars’. This excludes the lowest level of the design variable, ‘section’ in this case. The variance is estimated over the different sections of the same day and root. If there are no design variables, i.e. all point patterns are replicates of the exact same condition, no variable needs to be provided.
The fitted trend can be plotted:
As expected the weight increases with the number of observations; quickly at first but latter plateauing. This relation between number of observations and variance will provide weights for a linear (mixed) model to be fitted per feature. A built-in function prepares the dataframe for a specified gene (here SmAHK4e):
This dataframe can then be used for mixed modelling, either in R or in an other software package. We present an example analysis below using the lmerTest package, which enhances the lmer package by providing approximate p-values. We fit the following linear mixed model (LMM) for once manually, before automating it. The day variable enters the model as a fixed effect, the root as a random effect (a random intercept). We use sum coding for the day variable, meaning that two dummies are created, one for “day0” and one for “day3”, so without a reference level. Instead the corresponding parameters are constrained to sum to zero, i.e. to be each other’s opposite in this case. This maintains the interpretability of the intercept as the baseline pi, from which the different days can depart, and allows meaningful inference on the intercept.
##
## Attaching package: 'lmerTest'
## The following object is masked from 'package:lme4':
##
## lmer
## The following object is masked from 'package:stats':
##
## step
lmeMod <- lmerTest::lmer(pi - 0.5 ~ day + (1 | root),
data = dfUniNN, na.action = na.omit,
weights = weight, contrasts = list("day" = "contr.sum")
)
summary(lmeMod)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: pi - 0.5 ~ day + (1 | root)
## Data: dfUniNN
## Weights: weight
##
## REML criterion at convergence: -47.6
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.3345 -0.4304 0.1746 0.6866 2.5413
##
## Random effects:
## Groups Name Variance Std.Dev.
## root (Intercept) 0.0008015 0.02831
## Residual 0.0001612 0.01270
## Number of obs: 28, groups: root, 3
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) -0.10241 0.02120 1.62379 -4.832 0.0597 .
## day1 0.01397 0.01298 23.71414 1.076 0.2925
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr)
## day1 0.127
Alternatively, a wrapper function is available that fits such LMMs for every transcript. A simple random intercept is assumed for every random effect, for more complex designs it is better to supply your own formula through the Formula argument.
Various errors may occur while fitting, e.g. insufficient non-missing observations. The function finishes silently but returns the error messages in the output. The results can now be viewed as follows using the getResults function, with the genes sorted by p-value.
## Estimate SE pVal pAdj
## SmBIRDa 0.2630099 0.006806981 5.662332e-24 9.625964e-23
## SmAUX1a 0.3401426 0.005119277 9.974217e-23 8.478084e-22
## SmAPL 0.3838950 0.010725703 3.968558e-11 2.248850e-10
## SmARF2 0.3567110 0.020845977 8.567159e-07 3.641043e-06
## dayday0 dayday3 pVal pAdj
## SmARF2 0.046879531 -0.046879531 0.03538130 0.6014820
## SmBIRDa 0.011641263 -0.011641263 0.09869917 0.6264600
## SmAPL 0.017719415 -0.017719415 0.11055177 0.6264600
## SmAUX1a -0.006094273 0.006094273 0.24422918 0.7299121
We see many genes with significant aggregation (intercept estimate smaller than 0.5), but no significant differences between day 0 and 3.
Tests for co/antilocalization are requested by supplying “nnPair” as PI. By default, all combinations of features into pairs are fitted. Again we plot the weighting function, which is now a bivariate spline as a function of the minority and majority gene, i.e. the gene in the gene pair with least respectively most events:
The least expressed gene or minority gene has the strongest influence on the weight. The dataframe is built in a similar way as before, separating the gene pair through a double hyphen:
Finally for the mixed model:
lmeModNN <- lmerTest::lmer(pi - 0.5 ~ day + (1 | root),
data = dfBiNN, na.action = na.omit,
weights = weight, contrasts = list("day" = "contr.sum")
)
summary(lmeModNN)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: pi - 0.5 ~ day + (1 | root)
## Data: dfBiNN
## Weights: weight
##
## REML criterion at convergence: -28.9
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.1537 -0.7921 0.2034 0.6126 1.2885
##
## Random effects:
## Groups Name Variance Std.Dev.
## root (Intercept) 0.0011099 0.03331
## Residual 0.0004211 0.02052
## Number of obs: 27, groups: root, 3
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 0.324946 0.028296 1.664615 11.484 0.014 *
## day1 0.009305 0.020552 22.760120 0.453 0.655
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr)
## day1 0.028
We find a significant positive intercept 0.325, indicating strong antilocalization (PI = 0.825). We investigate this gene pair visually:
Shows clear antilocalization indeed. And similarly, this process can be wrapped using _fitLMMs()_as before:
## Estimate SE pVal pAdj
## SmAUX1a--SmBIRDa 0.7499238 0.01224710 6.110920e-18 8.310851e-16
## SmBIRDa--SmVND2 0.7665930 0.01535273 1.840166e-15 1.251313e-13
## SmAHP6b--SmBIRDa 0.7614779 0.01586418 2.785018e-15 1.262542e-13
## SmAUX1a--SmBRN 0.7673129 0.02560541 2.619859e-09 8.907519e-08
## SmAHK4e--SmBIRDa 0.6636389 0.01923712 5.490238e-09 1.493345e-07
## SmARF2--SmBIRDa 0.6969935 0.02525941 8.986252e-08 2.036884e-06
The results can also be written to a spreadsheet for consultation outside R:
A second dataset was published in a paper by Eng et al. (2019) on mouse fibroblast cells, which will serve to illustrate methods given knowledge of cell boundaries. The dataset was subset to the eight most expressed genes for memory reasons. The hyperframe is built as before:
data(Eng)
hypEng <- buildHyperFrame(Eng,
coordVars = c("x", "y"),
imageVars = c("fov", "experiment")
)
## Found 17 unique images
Plotting an overview
Next, the cell boundaries are added, as a list with names corresponding to the rownames of the hyperframe. Different formats are allowed to provide windows in case the correct packages are installed, see ?addCell. Based on these cells, every event is assigned to a cell. Although available for this dataset, also nuclei can be added analogously via the addNuclei functions.
Now plot the complete hyperframe with cell boundaries:
First, we test for vicinity of the RNA molecules to cell boundaries or centroids, and for localization patterns within the cell. For this we pass different PI arguments to the estPis() function. As null hypothesis, we choose complete spatial randomness (CSR) within the cell, which seems most natural, although the “background” option is also available. The suffix “Cell” indicates that aggregation and colocalization is tested within the same cell only, so not across the tissue.
engPis <- estPis(hypEng,
pis = c("edge", "centroid", "nnCell", "nnPairCell"),
null = "CSR", verbose = FALSE
)
As before, the weight function needs to be added for the nearest neighbour PIs. The lowest level variable defaults to “cell” in this case, so there is no need to supply it:
We the bivariate weight function as an example:
As before, the minority gene of the pair is the main driver of variance. Next, we fit linear mixed models on the obtained PIs. The experiment was conducted twice, so we add the experiment variable as fixed effect. The image and cell variables are added automatically where appropriate, as printed in the output:
## Fitted formula for pi edge:
## pi - 0.5 ~ 1 + experiment + (1 | image/cell)
## Fitted formula for Moran's I for pi edge:
## MoransI ~ 1 + experiment
## Fitted formula for pi centroid:
## pi - 0.5 ~ 1 + experiment + (1 | image/cell)
## Fitted formula for Moran's I for pi centroid:
## MoransI ~ 1 + experiment
## Fitted formula for pi nnCell:
## pi - 0.5 ~ 1 + experiment + (1 | image)
## Fitted formula for Moran's I for pi nnCell:
## MoransI ~ 1 + experiment
## Fitted formula for pi nnPairCell:
## pi - 0.5 ~ 1 + experiment + (1 | image)
## Fitted formula for Moran's I for pi nnPairCell:
## MoransI ~ 1 + experiment
For the “edge” and “centroid” options, each individual RNA-molecule is an observation in the model, and random effects are added for both cell and image. For “nnCell” and “nnPairCell”, the observations are averaged per cell, and so only a random effect common to all cells of a single image remains. Now we plot the genes with the strongest remoteness from the cell edge:
Note that the cells are selected for high expression of the gene and are not shown in their original position! Now plot the gene with strongest intracellular aggregation:
And finally the gene pair with the strongest antilocalization:
We can see that the molecule distribution within the cell is all but uniform!
A side question that may arise is whether the measures of intracellular aggregation have a global spatial patterning throughout the tissue. As a start, we can explore this graphically for the gene most significantly remote from the cell edge:
plotExplore(hypEng,
piEsts = engPis, piColourCell = "edge", features =
remEdge <- rownames(getResults(allModsNNcell, "edge", "Intercept"))[1]
)
The molecules are shown in red, while the yellow-blue scale indicates the PI estimate. It seems like in the central cells, the PI is smaller (molecules closer to the cell edge) than the outer cells. We test whether the Moran’s I statistic, a measure for spatial aggregation of numeric variables, is significant over the different point patterns:
## Estimate SE pVal pAdj
## -0.006229325 0.011647713 0.600619286 0.665717592
Although the estimated Moran’s I is positive indeed, indicating positive spatial autocorrelation, it is not significant over the different repeats. Writing the results to a spreadsheet again:
Another biologically interesting spatial pattern are gradients. Again, they can be estimated image-wide or within cells. Parametric model fitting of Poisson point processes is provided in the spatstat.model package by Baddeley, Rubak, and Turner (2015), we wrap it here for convenience. Detecting presence of gradients over different point patterns is done by fitting a model including interaction terms between x and y coordinates and image identifiers. If this interactions are significant, this implies existence of gradients in the different point patterns, albeit with different directions. The fitting process is quite computation intensive, so we limit ourselves to a subset of the data. By default both overall gradients and within-cell gradients are fitted.
engGrads <- estGradients(hypEng[seq_len(2), ], features = feat <- getFeatures(hypEng)[seq_len(3)])
pValsGrads <- getPvaluesGradient(engGrads, "cell")
Keep in mind though that a gradient that is significant for a computer may look very different from the human perspective; many types of spatial patterns can be captured by a gradient to some extent without providing a satisfactory fit.
Probabilistic indices are features extracted from point patterns that can serve purposes other than hypothesis testing. Here we demonstrate their use as interpretable predictors for predicting tumor type from cell type locations on a breast cancer dataset from a publication by Keren et al. (2018), previously analysed by VanderDoes et al. (2023). Make sure to install the funkycells package to access the data.
if(reqFunky <- require(funkycells)){
data("TNBC_pheno")
data("TNBC_meta")
TNBC_pheno$age <- TNBC_meta$Age[match(TNBC_pheno$Person, TNBC_meta$Person)]
TNBC_pheno$Class <- ifelse(TNBC_pheno$Class=="0", "mixed", "compartentalised")
hypTNBC <- buildHyperFrame(TNBC_pheno, coordVars = c("cellx", "celly"),
imageVars = c("Person", "Class", "age"), featureName = "Phenotype")
hypTNBC <- hypTNBC[order(hypTNBC$Class),] #Order by class for plots
}
## Loading required package: funkycells
## Found 33 unique images
if(reqFunky){
plotExplore(hypTNBC, Cex.main = 0.7, features = c("Tumor", "Macrophage", "CD8T"),
CexLegend = 0.85)
}
Fit a penalised regression model, and estimate its predictive accuracy through cross-validation.
if(reqFunky){
#Select feature through hypothesis tests
pisTNBC = estPis(hypTNBC, pis = c("nn", "nnPair"), null = "background", verbose = FALSE)
pisTNBC = addWeightFunction(pisTNBC)
TNBClmms = fitLMMs(pisTNBC, fixedVars = c("age", "Class"), verbose = TRUE)
#Extract PIs of significant features
nnRes = getResults(TNBClmms, "nn", "Class")
sigLevel = 0.05
nnSig = rownames(nnRes)[nnRes[, "pAdj"]<sigLevel]
nnPairRes = getResults(TNBClmms, "nnPair", "Class")
nnPairSig = rownames(nnPairRes)[nnPairRes[, "pAdj"]<sigLevel & !is.na(nnPairRes[, "pAdj"])]
nnPis = t(vapply(pisTNBC$hypFrame$pimRes, FUN.VALUE = double(length(nnSig)+ length(nnPairSig)), function(x){
c(if(length(length(nnSig))) x$pointDists$nn[nnSig],
if(length(length(nnPairSig))) vapply(nnPairSig, FUN.VALUE = 1, function(y) {
tmp = getGp(gp = y, x$pointDists$nnPair)
if(is.null(tmp)) NA else tmp
}))
}))
colnames(nnPis) = c(nnSig, nnPairSig)
nnPis[is.na(nnPis)] = 0.5 #Replace missing values by 0.5
if(require(glmnet, quietly = TRUE)){
set.seed(20062024)
lassoModel = cv.glmnet(factor(hypTNBC$Class), x = cbind("Age" = hypTNBC$Age, nnPis), family = "binomial", alpha = 1, nfolds = 10, type.measure = "class")
lassoModel #In-sample misclassification error is quite low
}
}
## Fitted formula for pi nn:
## pi - 0.5 ~ 1 + age + Class
## Fitted formula for pi nnPair:
## pi - 0.5 ~ 1 + age + Class
## Loaded glmnet 4.1-8
##
## Call: cv.glmnet(x = cbind(Age = hypTNBC$Age, nnPis), y = factor(hypTNBC$Class), type.measure = "class", nfolds = 10, family = "binomial", alpha = 1)
##
## Measure: Misclassification Error
##
## Lambda Index Measure SE Nonzero
## min 0.1990 9 0.06061 0.03945 2
## 1se 0.2184 8 0.09091 0.04656 1
An error like
Error in reducer$value.cache[[as.character(idx)]] <- values : wrong args for environment subassignment In addition: Warning message: In parallel::mccollect(wait = FALSE, timeout = 1) : 1 parallel job did not deliver a result
often indicates insufficient memory. Try reducing the number of cores requested with MultiCoreParam(), or switch to serial processing with register(SerialParam()).
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] glmnet_4.1-8 funkycells_1.1.1 lmerTest_3.1-3
## [4] lme4_1.1-36 Matrix_1.7-2 BiocParallel_1.41.0
## [7] smoppix_0.99.41 rmarkdown_2.29
##
## loaded via a namespace (and not attached):
## [1] Rdpack_2.6.2 deldir_2.0-4
## [3] rlang_1.1.5 magrittr_2.0.3
## [5] matrixStats_1.5.0 compiler_4.4.2
## [7] spatstat.geom_3.3-5 mgcv_1.9-1
## [9] spatstat.model_3.3-4 vctrs_0.6.5
## [11] pkgconfig_2.0.3 SpatialExperiment_1.17.0
## [13] shape_1.4.6.1 crayon_1.5.3
## [15] fastmap_1.2.0 magick_2.8.5
## [17] XVector_0.47.2 labeling_0.4.3
## [19] UCSC.utils_1.3.1 nloptr_2.1.1
## [21] xfun_0.50 cachem_1.1.0
## [23] GenomeInfoDb_1.43.4 jsonlite_1.8.9
## [25] goftest_1.2-3 DelayedArray_0.33.4
## [27] spatstat.utils_3.1-2 parallel_4.4.2
## [29] R6_2.5.1 bslib_0.9.0
## [31] stringi_1.8.4 spatstat.data_3.1-4
## [33] spatstat.univar_3.1-1 boot_1.3-31
## [35] rpart_4.1.24 GenomicRanges_1.59.1
## [37] jquerylib_0.1.4 numDeriv_2016.8-1.1
## [39] Rcpp_1.0.14 SummarizedExperiment_1.37.0
## [41] iterators_1.0.14 knitr_1.49
## [43] tensor_1.5 IRanges_2.41.2
## [45] splines_4.4.2 tidyselect_1.2.1
## [47] abind_1.4-8 yaml_2.3.10
## [49] codetools_0.2-20 spatstat.random_3.3-2
## [51] spatstat.explore_3.3-4 lattice_0.22-6
## [53] tibble_3.2.1 Biobase_2.67.0
## [55] withr_3.0.2 evaluate_1.0.3
## [57] survival_3.8-3 polyclip_1.10-7
## [59] zip_2.3.1 pillar_1.10.1
## [61] MatrixGenerics_1.19.1 foreach_1.5.2
## [63] stats4_4.4.2 reformulas_0.4.0
## [65] generics_0.1.3 S4Vectors_0.45.2
## [67] ggplot2_3.5.1 munsell_0.5.1
## [69] scales_1.3.0 extraDistr_1.10.0
## [71] minqa_1.2.8 glue_1.8.0
## [73] maketools_1.3.1 tools_4.4.2
## [75] sys_3.4.3 openxlsx_4.2.8
## [77] buildtools_1.0.0 grid_4.4.2
## [79] rbibutils_2.3 colorspace_2.1-1
## [81] SingleCellExperiment_1.29.1 nlme_3.1-167
## [83] GenomeInfoDbData_1.2.13 cli_3.6.3
## [85] spatstat.sparse_3.1-0 S4Arrays_1.7.1
## [87] scam_1.2-18 dplyr_1.1.4
## [89] gtable_0.3.6 sass_0.4.9
## [91] digest_0.6.37 BiocGenerics_0.53.6
## [93] SparseArray_1.7.4 rjson_0.2.23
## [95] farver_2.1.2 htmltools_0.5.8.1
## [97] lifecycle_1.0.4 httr_1.4.7
## [99] mime_0.12 MASS_7.3-64