--- title: "Vignette of the smoppix package" author: Stijn Hawinkel output: rmarkdown::html_vignette: toc: true number_sections: true keep_md: true citation_package: biblatex vignette: > %\VignetteIndexEntry{Vignette of the smoppix package} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} bibliography: ../inst/REFERENCES.bib --- ```{r, echo=FALSE} htmltools::img(src = knitr::image_uri("../inst/Smoppix.jpg"), alt = 'logo', style = 'position:absolute; top:0; right:0; padding:10px;', width = "200px", heigth = "200px") ``` ```{r setup, include=FALSE, echo=FALSE} knitr::opts_chunk$set(fig.heigh = 7, fig.width = 7) ``` # Introduction This vignette demonstrates the use of the _smoppix_ package: Single MOlecule sPatial omics data analysed through the Probabilistic IndeX. It unifies methods for statistical inference on single-molecule spatial transcriptomics datasets with replication in a single framework using probabilistic indices (PIs). Tests provided include univariate testing for vicinity to fixed objects such as cell walls or nuclei, univariate aggregation tests as well as bivariate tests for co- or antilocalization. Two different null hypothesis can be tested against: complete spatial randomness (CSR, also known as homogeneous Poisson process), and the background distribution of molecules of all other genes. In addition, differences between groups in terms of any of these univariate and bivariate PIs can be tested for, while accounting for complex design structures using mixed models. _smoppix_ is scalable to large numbers of molecules and genes thanks to an exact permutation null distribution in combination with a custom C++ implementation. An adequate variance weighting scheme harnesses the high-dimensionality of the data to account for heteroscedasticity. The _smoppix_ package is also applicable to other types of single-molecule omics data such as spatial proteomics or metabolomics, or to cell type localization data. \setcounter{tocdepth}{5} \tableofcontents # Installation instructions The package can be installed from Bioconductor as follows: ```{r install, eval = FALSE} library(Biocmanager) install("smoppix") ``` Alternatively, the latest version can be installed from github as: ```{r installAndLoadGitHub, eval = FALSE} library(devtools) install_github("sthawinke/smoppix") ``` Once installed, we can load the package: ```{r load} library(smoppix) ``` # Multithreading The _smoppix_ package can be computationally intensive, especially since the number transcript pairs grows quadratically with the number of transcripts. For this reason we provide multithreading through the _BiocParallel_ package. ```{r} library(BiocParallel) ``` All the user needs to do, is choose the number of cores: ```{r} nCores <- 2 # The number of CPUs ``` and prepare the parallel backend. The code is different for windows or unix-based systems, but is taken care of by the following code snippet: ```{r nonwindows} if(.Platform$OS.type == "unix"){ #On unix-based systems, use MulticoreParam register(MulticoreParam(nCores)) } else { #On windows, use makeCluster library(doParallel) Clus = makeCluster(nCores) registerDoParallel(Clus) register(DoparParam(), default = TRUE) } ``` Wherever applicable, the registered backend will be used throughout the analysis. For serial calculations, for instance if memory is limiting, choose ```{r , eval = FALSE} register(SerialParam()) ``` # Yang data: lycophyte roots As example dataset we use the data from a [publication](https://doi.org/10.1016/j.cub.2023.08.030) by @Yang2023 on _Selaginella moellendorffii_ . Only a subset of the data, consisting of sections 1-5 of roots 1-3 is included for computational reasons. The data is loaded as follows. ```{r} data(Yang) ``` A quick glimpse ... ```{r} head(Yang) ``` It is a dataframe containing molecule locations (x- and y-variables), gene identity and covariates: day, root and section. At each day, 3 roots are included, with up to 5 sections for each root. The analysis will need to account for this nested design. The first step in the analysis is to build a hyperframe as provided in the _spatstat_ package, which consists of all different point patterns and their covariates ```{r} hypYang <- buildHyperFrame(Yang, coordVars = c("x", "y"), imageVars = c("day", "root", "section") ) ``` As printed by the code, `r nrow(hypYang)` images or point patterns were found. We take a closer look at the hyperframe ```{r} head(hypYang) ``` We see the following elements: - __ppp__: the _spatstat_ point patterns of class _ppp_ - __image__: The image id uniquely defining each point pattern. In our case, it is a concatenation of the day, root and section variables. - __tabObs__: tabulation of the frequency of each transcript in each point pattern - __day__, __root__ and __section__: the covariates Zooming in on the first point pattern: ```{r} hypYang$ppp[[1]] ``` As a final confirmation of successfully reading in the data, we can make an exploratory plot: ```{r plotExplore, fig.height = 6.5} plotExplore(hypYang, Cex = 2) ``` A couple of most expressed genes are shown as coloured dots as shown in the legend, all other genes' transcripts are shown in grey. ## Univariate tests ### Tests for univariate aggregation or regularity: nearest neighbour distances ```{r numFeats} numFeats <- 15 #Limit number of genes in the analysis for computational reasons ``` The first round of tests concerns the aggregation or regularity of individual transcripts, for which we look at the distribution of nearest neighbour distances (pi = 'nn'). Since the shape and size differ strongly between the roots, we choose the background of all other transcripts as null distribution, i.e. we look for transcripts aggregated _with respect to_ all other transcripts. As this is computationally most efficient, we also fit the PI for the bivariate analysis in this step ('nnPair', see below). For computation reasons, we limit the analysis to `r numFeats` genes. ```{r estpisyang} nnObj <- estPis(hypYang, pis = c("nn", "nnPair"), null = "background", verbose = FALSE, features = c("SmVND2", "SmPINR", getFeatures(hypYang)[seq_len(numFeats)]) ) ``` A data-driven weighting function is now fitted to the estimates of all transcripts to share information on variances across features. This requires information on the design too: all variables that will be included in the final modelling should be provided to 'designVars'. This excludes the lowest level of the design variable, 'section' in this case. The variance is estimated over the different sections of the same day and root. If there are no design variables, i.e. all point patterns are replicates of the exact same condition, no variable needs to be provided. ```{r} nnObj <- addWeightFunction(nnObj, designVars = c("day", "root")) ``` The fitted trend can be plotted: ```{r} plotWf(nnObj, pi = "nn") ``` As expected the weight increases with the number of observations; quickly at first but latter plateauing. This relation between number of observations and variance will provide weights for a linear (mixed) model to be fitted per feature. A built-in function prepares the dataframe for a specified gene (here _SmAHK4e_): ```{r} dfUniNN <- buildDataFrame(nnObj, gene = "SmAHK4e", pi = "nn") ``` This dataframe can then be used for mixed modelling, either in R or in an other software package. We present an example analysis below using the _lmerTest_ package, which enhances the _lmer_ package by providing approximate p-values. We fit the following linear mixed model (LMM) for once manually, before automating it. The day variable enters the model as a fixed effect, the root as a random effect (a random intercept). We use sum coding for the day variable, meaning that two dummies are created, one for "day0" and one for "day3", so without a reference level. Instead the corresponding parameters are constrained to sum to zero, i.e. to be each other's opposite in this case. This maintains the interpretability of the intercept as the baseline pi, from which the different days can depart, and allows meaningful inference on the intercept. ```{r} library(lmerTest, quietly = TRUE) lmeMod <- lmerTest::lmer(pi - 0.5 ~ day + (1 | root), data = dfUniNN, na.action = na.omit, weights = weight, contrasts = list("day" = "contr.sum") ) summary(lmeMod) ``` Alternatively, a wrapper function is available that fits such LMMs for every transcript. A simple random intercept is assumed for every random effect, for more complex designs it is better to supply your own formula through the __Formula__ argument. ```{r fitLmms, include = FALSE} allModsNN <- fitLMMs(nnObj, fixedVars = "day", randomVars = "root") ``` Various errors may occur while fitting, e.g. insufficient non-missing observations. The function finishes silently but returns the error messages in the output. The results can now be viewed as follows using the _getResults_ function, with the genes sorted by p-value. ```{r headgetresults} head(getResults(allModsNN, "nn", "Intercept"), 4) head(getResults(allModsNN, "nn", "day"), 4) ``` We see many genes with significant aggregation (intercept estimate smaller than 0.5), but no significant differences between day 0 and 3. ## Bivariate tests ### Tests for colocalization or antilocalization: interfeature nearest neighbour distances Tests for co/antilocalization are requested by supplying "nnPair" as PI. By default, all combinations of features into pairs are fitted. Again we plot the weighting function, which is now a bivariate spline as a function of the minority and majority gene, i.e. the gene in the gene pair with least respectively most events: ```{r plotwf, fig.height = 5} plotWf(nnObj, pi = "nnPair") ``` The least expressed gene or minority gene has the strongest influence on the weight. The dataframe is built in a similar way as before, separating the gene pair through a double hyphen: ```{r builddf} dfBiNN <- buildDataFrame(nnObj, gene = "SmVND2--SmPINR", pi = "nnPair") ``` Finally for the mixed model: ```{r} lmeModNN <- lmerTest::lmer(pi - 0.5 ~ day + (1 | root), data = dfBiNN, na.action = na.omit, weights = weight, contrasts = list("day" = "contr.sum") ) summary(lmeModNN) ``` We find a significant positive intercept `r signif(summary(lmeModNN)$coef["(Intercept)", "Estimate"], 3)`, indicating strong antilocalization (PI = `r signif(summary(lmeModNN)$coef["(Intercept)", "Estimate"], 3)+0.5`). We investigate this gene pair visually: ```{r visualInvest, fig.height = 6} plotExplore(hypYang, features = "SmVND2--SmPINR") ``` Shows clear antilocalization indeed. And similarly, this process can be wrapped using _fitLMMs()_as before: ```{r fitlmmPair} head(getResults(allModsNN, "nnPair", "Intercept")) ``` The results can also be written to a spreadsheet for consultation outside R: ```{r writeXlsxYang, eval = FALSE} writeToXlsx(allModsNN, file = "myfile.xlsx") ``` # Eng2019: Mouse brain cells A second dataset was published in a [paper](https://www.nature.com/articles/s41586-019-1049-y) by @Eng2019 on mouse fibroblast cells, which will serve to illustrate methods given knowledge of cell boundaries. The dataset was subset to the eight most expressed genes for memory reasons. The hyperframe is built as before: ```{r readInEngData, warning=FALSE} data(Eng) hypEng <- buildHyperFrame(Eng, coordVars = c("x", "y"), imageVars = c("fov", "experiment") ) ``` Plotting an overview ```{r engNoCells, fig.height = 6} plotExplore(hypEng) ``` Next, the cell boundaries are added, as a list with names corresponding to the rownames of the hyperframe. Different formats are allowed to provide windows in case the correct packages are installed, see _?addCell_. Based on these cells, every event is assigned to a cell. Although available for this dataset, also nuclei can be added analogously via the _addNuclei_ functions. ```{r CellEngData} # Add cell identifiers hypEng <- addCell(hypEng, EngRois, verbose = FALSE) ``` Now plot the complete hyperframe with cell boundaries: ```{r plotExpCell, fig.height = 7} plotExplore(hypEng) ``` ## Vicinity to cell edge or centroid, and other within-cell localization patterns First, we test for vicinity of the RNA molecules to cell boundaries or centroids, and for localization patterns within the cell. For this we pass different PI arguments to the _estPis()_ function. As null hypothesis, we choose complete spatial randomness (CSR) within the cell, which seems most natural, although the "background" option is also available. The suffix "Cell" indicates that aggregation and colocalization is tested within the same cell only, so not across the tissue. ```{r} engPis <- estPis(hypEng, pis = c("edge", "centroid", "nnCell", "nnPairCell"), null = "CSR", verbose = FALSE ) ``` As before, the weight function needs to be added for the nearest neighbour PIs. The lowest level variable defaults to "cell" in this case, so there is no need to supply it: ```{r addWfCell} engPis <- addWeightFunction(engPis) ``` We the bivariate weight function as an example: ```{r, fig.height = 5} plotWf(engPis, "nnPairCell") ``` As before, the minority gene of the pair is the main driver of variance. Next, we fit linear mixed models on the obtained PIs. The experiment was conducted twice, so we add the experiment variable as fixed effect. The image and cell variables are added automatically where appropriate, as printed in the output: ```{r FitEngModels} allModsNNcell <- fitLMMs(engPis, fixedVars = "experiment", addMoransI = TRUE) ``` For the "edge" and "centroid" options, each individual RNA-molecule is an observation in the model, and random effects are added for both cell and image. For "nnCell" and "nnPairCell", the observations are averaged per cell, and so only a random effect common to all cells of a single image remains. Now we plot the genes with the strongest remoteness from the cell edge: ```{r plotEdgeEng, fig.height = 6.5} plotTopResults(hypEng, results = allModsNNcell, pi = "edge", what = "far", numFeats = 1 ) ``` Note that the cells are selected for high expression of the gene and are not shown in their original position! Now plot the gene with strongest intracellular aggregation: ```{r plotnnCellEng, fig.height = 6.5} plotTopResults(hypEng, results = allModsNNcell, pi = "nnCell", numFeats = 1) ``` And finally the gene pair with the strongest antilocalization: ```{r a, fig.height = 6.5} plotTopResults(hypEng, results = allModsNNcell, pi = "nnPairCell", numFeats = 1, what = "anti" ) ``` We can see that the molecule distribution within the cell is all but uniform! ### Global spatial distribution of intracellular patterns A side question that may arise is whether the measures of intracellular aggregation have a global spatial patterning throughout the tissue. As a start, we can explore this graphically for the gene most significantly remote from the cell edge: ```{r exploreSpatCell, fig.height = 7} plotExplore(hypEng, piEsts = engPis, piColourCell = "edge", features = remEdge <- rownames(getResults(allModsNNcell, "edge", "Intercept"))[1] ) ``` The molecules are shown in red, while the yellow-blue scale indicates the PI estimate. It seems like in the central cells, the PI is smaller (molecules closer to the cell edge) than the outer cells. We test whether the Moran's I statistic, a measure for spatial aggregation of numeric variables, is significant over the different point patterns: ```{r testMoransI} getResults(allModsNNcell, "edge", "Intercept", moransI = TRUE)[remEdge, ] ``` Although the estimated Moran's I is positive indeed, indicating positive spatial autocorrelation, it is not significant over the different repeats. Writing the results to a spreadsheet again: ```{r writeXlsxEng, eval = FALSE} writeToXlsx(allModsNNcell, file = "mysecondfile.xlsx") ``` ## Gradients Another biologically interesting spatial pattern are gradients. Again, they can be estimated image-wide or within cells. Parametric model fitting of Poisson point processes is provided in the [spatstat.model](http://book.spatstat.org/) package by @Baddeley2015, we wrap it here for convenience. Detecting presence of gradients over different point patterns is done by fitting a model including interaction terms between x and y coordinates and image identifiers. If this interactions are significant, this implies existence of gradients in the different point patterns, albeit with different directions. The fitting process is quite computation intensive, so we limit ourselves to a subset of the data. By default both overall gradients and within-cell gradients are fitted. ```{r enggrads} engGrads <- estGradients(hypEng[seq_len(2), ], features = feat <- getFeatures(hypEng)[seq_len(3)]) pValsGrads <- getPvaluesGradient(engGrads, "cell") ``` Keep in mind though that a gradient that is significant for a computer may look very different from the human perspective; many types of spatial patterns can be captured by a gradient to some extent without providing a satisfactory fit. # Predicting tumor types with probabilistic indices as predictor Probabilistic indices are features extracted from point patterns that can serve purposes other than hypothesis testing. Here we demonstrate their use as interpretable predictors for predicting tumor type from cell type locations on a breast cancer dataset from a [publication](https://doi.org/10.1016/j.cell.2018.08.039) by @Keren2018, previously analysed by @VanderDoes2023. Make sure to install the _funkycells_ package to access the data. ```{r loadTNBC} if(reqFunky <- require(funkycells)){ data("TNBC_pheno") data("TNBC_meta") TNBC_pheno$age <- TNBC_meta$Age[match(TNBC_pheno$Person, TNBC_meta$Person)] TNBC_pheno$Class <- ifelse(TNBC_pheno$Class=="0", "mixed", "compartentalised") hypTNBC <- buildHyperFrame(TNBC_pheno, coordVars = c("cellx", "celly"), imageVars = c("Person", "Class", "age"), featureName = "Phenotype") hypTNBC <- hypTNBC[order(hypTNBC$Class),] #Order by class for plots } ``` ```{r plotTNBC, fig.cap = "Explorative plot of triple negative breast cancer dataset.\\parencite{Keren2018}", fig.height = 8} if(reqFunky){ plotExplore(hypTNBC, Cex.main = 0.7, features = c("Tumor", "Macrophage", "CD8T"), CexLegend = 0.85) } ``` Fit a penalised regression model, and estimate its predictive accuracy through cross-validation. ```{r analyseTNBC} if(reqFunky){ #Select feature through hypothesis tests pisTNBC = estPis(hypTNBC, pis = c("nn", "nnPair"), null = "background", verbose = FALSE) pisTNBC = addWeightFunction(pisTNBC) TNBClmms = fitLMMs(pisTNBC, fixedVars = c("age", "Class"), verbose = TRUE) #Extract PIs of significant features nnRes = getResults(TNBClmms, "nn", "Class") sigLevel = 0.05 nnSig = rownames(nnRes)[nnRes[, "pAdj"] Error in reducer$value.cache[[as.character(idx)]] <- values : > wrong args for environment subassignment > In addition: Warning message: > In parallel::mccollect(wait = FALSE, timeout = 1) : > 1 parallel job did not deliver a result often indicates insufficient memory. Try reducing the number of cores requested with _MultiCoreParam()_, or switch to serial processing with _register(SerialParam())_. # Session info ```{r sessionInfo} sessionInfo() ``` # Bibliography \printbibliography