This vignette demonstrates how to use spatialFDA to perform functional data analysis on spatial statistics metrics. The main aim of this package is to detect differential spatial arrangements within and between cell types given several samples/conditions. It does so by calculating spatial statistics metrics via the spatstat package and comparing differences using functional additive mixed models as implemented in the refund package (Baddeley and Turner 2005; Goldsmith et al. 2024).
The use case is a dataset from Damond et al.
(2019) which contains images from 12 human donors. The raw data
is published under a CC-BY-4.0
License on Mendeley.
This package is similar to other packages in python
and
R
. The following table shows the main differences in terms
of functionality (Ali et al. 2024; Canete et al.
2022; Wrobel et al. 2024).
Package name | Foundation | Testing procedure |
---|---|---|
spicyR | L-function | Scalar comparison |
GraphCompass |
Graph-based | Graph and scalar comparison |
mxfda | K-, G- and L- function | Function as input for survival modelling |
spatialFDA | most spatstat functions | Functional comparison over domain |
spatialFDA can be installed and loaded from Bioconductor as follows
In this vignette we will analyse a diabetes dataset acquired by imaging mass cytometry (IMC) as acquired by Damond et al. (2019). The dataset contains images from 12 human donors, 4 healthy and 8 with type 1 diabetes (T1D). With IMC, 35 markers were measured at single cell resolution (Damond et al. 2019).
The Damond et al. (2019) dataset is
easily loaded from ExperimentHub
via a small reader
function .loadExample()
. The entire dataset can be loaded
by setting full = TRUE
. For computational reasons, one can
reduce to three patients as well by setting this flag to
FALSE
. We will subset the entire dataset to two samples per
condition in order to have a multi-condition/multi-sample setting. The
package offers multiple datatypes, we will use the SpatialExperiment
(SPE) object.
We can look at the fields of view (FOVs) of the diabetes dataset. To do so we extract the spatial coordinates, store them as a dataframe and add the colData from the SPE to this. We will look only at the first four FOVs of the healthy sample. We plot both the cell categories of all cells and then the cell types of secretory cells (α, β and δ cells) and T-cells (CD8+ and CD4+ T-cells).
df <- data.frame(spatialCoords(spe), colData(spe))
dfSub <- df %>%
subset(image_name %in% c("E02", "E03", "E04", "E05"))
p <- ggplot(dfSub, aes(x = cell_x, y = cell_y, color = cell_category)) +
geom_point(size= 0.5) +
facet_wrap(~image_name) +
theme(legend.title.size = 20, legend.text.size = 20) +
xlab("x") +
ylab("y") +
labs(color = "cell category")+
coord_equal() +
theme_light()
dfSub <- dfSub %>%
subset(cell_type %in% c("alpha", "beta", "delta", "Th", "Tc"))
q <- ggplot(dfSub, aes(x = cell_x, y = cell_y, color = cell_type)) +
geom_point(size= 0.5) +
facet_wrap(~image_name) +
theme(legend.title.size = 20, legend.text.size = 20) +
xlab("x") +
ylab("y") +
labs(color = "cell type") +
coord_equal() +
theme_light()
wrap_plots(list(p,q), widths = c(1,1), heights = c(1,1), nrow = 2, ncol = 1)
In a first step, we calculate a spatial statistic curve as implemented by spatstat. One can choose from a range of metrics for discrete marks and calculate these within a mark or between two marks. Common metrics are:
Ripley’s K function and its variance stabilised form, Besag’s L
Pair correlation function g
Nearest-neighbour function G
Empty space function F
Note that all of these functions have different implementations to correct for inhomogeneity and for comparison between two marks (cross functions) (Baddeley, Rubak, and Turner 2015).
With correlation metrics, we assess the distances of all points to one another while normalising for density effects and of the window size |W|. Furthermore, spatial metrics are corrected for edge effects, due to the fact that points at the border of a FOV do not have a fully-observed neighborhood (Baddeley, Rubak, and Turner 2015, 203 ff.).
A well-known metric is Ripley’s K function or its
variance-stabilised transformation, the L function. We can calculate a
variant of the L function with
the function calcMetricPerFov
between e.g α and cytotoxic T cells. The output
is a dataframe with the following most important columns:
r
: the radius at which the spatial metric is
evaluated
theo
: the theoretical value of a homogeneous
(Poisson) realisation of a point process
iso
: an isotropic edge corrected value of the L function
metricRes <- calcMetricPerFov(spe = spe, selection = c("alpha", "Tc"),
subsetby = "image_number", fun = "Lcross",
marks = "cell_type",
rSeq = seq(0, 50, length.out = 50),
by = c("patient_stage", "patient_id",
"image_number"),
ncores = 1)
#> [1] "Calculating Lcross from alpha to Tc"
metricRes %>% head(3)
#> r theo border trans iso patient_stage patient_id image_number
#> 1 0.000000 0.000000 0 0 0 Non-diabetic 6126 138
#> 2 1.020408 1.020408 0 0 0 Non-diabetic 6126 138
#> 3 2.040816 2.040816 0 0 0 Non-diabetic 6126 138
#> npoints centroidx centroidy pplevels fun selection
#> 1 237 219.0556 322.1333 alpha to Tc Lcross alpha and Tc
#> 2 237 219.0556 322.1333 alpha to Tc Lcross alpha and Tc
#> 3 237 219.0556 322.1333 alpha to Tc Lcross alpha and Tc
We can visualise this metric with plotMetricPerFov
function. Here, we need to specify which border correction we want to
plot and what the x-axis is. Both can vary from function to
function.
# create a unique plotting ID
metricRes$ID <- paste0(
metricRes$patient_stage, "|", metricRes$patient_id
)
# change levels for plotting
metricRes$ID <- factor(metricRes$ID, levels = c("Non-diabetic|6126",
"Non-diabetic|6134",
"Onset|6228","Onset|6414",
"Long-duration|6089",
"Long-duration|6180"))
# plot metrics
plotMetricPerFov(metricRes, correction = "iso", x = "r",
imageId = "image_number", ID = "ID", ncol = 2)
By eye, we see no visible difference between the conditions in terms of correlation of α and cytotoxic T cells.
Another important aspect of spatial analysis is spacing. Here, the shortest distances or empty space to the next neighbor is calculated. This quantifies a different aspect of a point pattern than correlation or intensity of points. Two well-known functions are (Baddeley, Rubak, and Turner 2015, 255–66):
nearest-neighbor distance distribution G
empty space function F
For spacing metrics, we get different border corrections but otherwise the output stays the same:
metricRes <- calcMetricPerFov(spe = spe, selection = c("alpha", "Tc"),
subsetby = "image_number", fun = "Gcross",
marks = "cell_type",
rSeq = seq(0, 50, length.out = 50),
by = c("patient_stage", "patient_id",
"image_number"),
ncores = 1)
#> [1] "Calculating Gcross from alpha to Tc"
metricRes %>% head(3)
#> r theo han rs km hazard patient_stage patient_id image_number
#> 1 0.000000 0.0000000000 0 0 0 0 Non-diabetic 6126 138
#> 2 1.020408 0.0001499410 0 0 0 0 Non-diabetic 6126 138
#> 3 2.040816 0.0005996292 0 0 0 0 Non-diabetic 6126 138
#> npoints centroidx centroidy pplevels fun selection
#> 1 237 219.0556 322.1333 alpha to Tc Gcross alpha and Tc
#> 2 237 219.0556 322.1333 alpha to Tc Gcross alpha and Tc
#> 3 237 219.0556 322.1333 alpha to Tc Gcross alpha and Tc
# create a unique plotting ID
metricRes$ID <- paste0(
metricRes$patient_stage, "|", metricRes$patient_id
)
# change levels for plotting
metricRes$ID <- factor(metricRes$ID, levels = c("Non-diabetic|6126",
"Non-diabetic|6134",
"Onset|6228","Onset|6414",
"Long-duration|6089",
"Long-duration|6180"))
# plot metrics
plotMetricPerFov(metricRes, correction = "rs", x = "r",
imageId = "image_number", ID = "ID", ncol = 2)
In the nearest-neighbor distance function, we see a strong difference between onset T1D, long-duration T1D and non-diabetic controls in terms of spacing of α and cytotoxic T cells.
Looking at raw spatial statistics curves can be challenging. In order
to summarise this information, we can plot functional boxplots by
aggregating the curves into boxplots via a user-defined variable
aggregate_by
. We use the fbplot
function from
the fda
package (Sun and Genton 2011; James Ramsay
2024).
# create a unique ID per row in the dataframe
metricRes$ID <- paste0(
metricRes$patient_stage, "x", metricRes$patient_id,
"x", metricRes$image_number
)
#removing field of views that have as a curve only zeros - these are cases where
#there is no cells of one type
metricRes <- metricRes %>% dplyr::group_by(ID) %>% dplyr::filter(sum(rs) >= 1)
collector <- plotFbPlot(metricRes, "r", "rs", "patient_stage")
The functional boxplot shows that onset G-curves are more variable than the corresponding long-duration and non-diabetic curves. We note as well, that the variability is heteroscedastic along the domain (i.e., variance increases with radius), which is undesirable for our statistical modelling. Therefore, we can e.g. apply a variance stabilising transformation to our data or model this variance in the statistical model.
Another analysis that can be performed is functional principal componentent analysis (fPCA). This is a method to capture the main modes of variation in functional data (JO Ramsay and Silverman 2005). We use the refund implementation of fPCA.
# filter out all rows that have a constant zero part - all r<10
metricRes <- metricRes %>% filter(r > 10)
# prepare dataframe from calcMetricRes to be in the correct format for pffr
dat <- prepData(metricRes, "r", "rs")
# create meta info of the IDs
splitData <- dat$ID %>%
str_replace("-","_") %>%
str_split_fixed("x", 3) %>%
data.frame(stringsAsFactors = TRUE) %>%
setNames(c("condition", "patient_id", "imageId")) %>%
mutate(condition = relevel(condition,"Non_diabetic"))
dat <- cbind(dat, splitData)
# drop rows with NA
dat <- dat |> drop_na()
# calculate the fPCA
pca <- functionalPCA(dat = dat, r = metricRes$r |> unique(), pve = 0.995)
evalues <- pca$evalues
efunctions <- pca$efunctions
# plot the mean curve and the two first eigenfunctions
p_mu <- ggplot(data.frame(r = unique(metricRes$r), mu = pca$mu),
aes(x = r, y = mu)) +
geom_line() +
theme_light() +
xlab("r [µm]")
p_efunction1 <- ggplot(data.frame(r = unique(metricRes$r),
phi1 = pca$efunctions[,1]),
aes(x = r, y = phi1)) +
geom_line() +
theme_light() +
ylim(-0.3,0.3) +
xlab("r [µm]")
p_efunction2 <- ggplot(data.frame(r = unique(metricRes$r),
phi2 = pca$efunctions[,2]),
aes(x = r, y = phi2)) +
geom_line() +
theme_light() +
ylim(-0.3,0.3) +
xlab("r [µm]")
wrap_plots(list(p_mu, p_efunction1, p_efunction2), ncol = 3)
In the biplot above we get a very basic differentiation of the G curves. Onset T1D shows most variability along the first fPC. The second fPC describes less variation.
The L function above showed no clear difference between the three conditions whereas the G function showed a strong difference between onset T1D and the two other conditions. In order to test these differences we will use generalised functional additive mixed models. These are generalisations of standard additive mixed models to compare functions over their entire domain. The package that we use is the refund package (Scheipl, Staicu, and Greven 2015; Scheipl, Gertheiss, and Greven 2016; Goldsmith et al. 2024).
The model implemented here is of the form:
$$ \mathbb{E}[y_i(r)] = g(\alpha(r) + \beta_{0,g(i)}(r) + \sum_{j=1}^J f_j(X_{ji},r) + \epsilon_i(r)) $$
With the following terms:
yi(r) the functional response, here the spatstat curves
g optional link function
α(r) a global functional intercept varying over the domain r
β0, g(i)(r) a random functional intercept varying over the domain r per grouping variable g(i).
fj(Xji, r) the additive predictors
ϵi(r) residual zero-mean Gaussian errors
For the family we will use a quasi-likelihood distribution where the variance is modeled as quadratic to the mean. We do this to account for the heteroscedastic variance along the domain.
In this context we need to specify a design matrix and contrasts. For
the functional random intercepts we use a principal componend-based
estimation (pve = 0.995) from refund
(Goldsmith et al. 2024).
library('refund')
# create a design matrix
mm <- model.matrix(~condition, data = dat)
colnames(mm)[1] <- "Intercept"
mm %>% head()
#> Intercept conditionLong_duration conditionOnset
#> 1 1 1 0
#> 2 1 1 0
#> 3 1 1 0
#> 4 1 1 0
#> 5 1 1 0
#> 6 1 1 0
r <- metricRes$r |> unique()
# fit the model
mdl <- functionalGam(
data = dat, x = r,
designmat = mm, weights = dat$npoints,
formula = formula(Y ~ 1 + conditionLong_duration +
conditionOnset + pcre(id=patient_id,
efunctions=efunctions,
evalues=evalues, yind=r)),
family = quasi(link = "identity", variance = "mu^2"),
algorithm = "gam"
)
summary(mdl)
#>
#> Family: quasi
#> Link function: identity
#>
#> Formula:
#> Y ~ 1 + conditionLong_duration + conditionOnset + pcre(id = patient_id,
#> efunctions = efunctions, evalues = evalues, yind = r)
#>
#> Constant coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 0.09648 0.01127 8.557 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Smooth terms & functional coefficients:
#> edf Ref.df F p-value
#> Intercept(x) 13.682 19.000 10.778 <2e-16 ***
#> conditionLong_duration(x) 1.000 1.000 1.068 0.302
#> conditionOnset(x) 4.074 4.386 13.875 <2e-16 ***
#> pcre(patient_id,efunctions,evalues,r) 11.107 15.000 72.916 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> R-sq.(adj) = 0.633 Deviance explained = 64.3%
#> -REML score = 2104.3 Scale est. = 0.50877 n = 12720 (318 x 40)
plotLs <- lapply(colnames(mm), plotMdl, mdl = mdl,
shift = mdl$coefficients[["(Intercept)"]])
#> using seWithMean for s(x.vec) .
#> using seWithMean for s(x.vec) .
#> using seWithMean for s(x.vec) .
wrap_plots(plotLs, nrow = 3, axes = 'collect')
We note that there is a small difference in the G function between non-diabetic and long-duration T1D samples, but a strong difference between non-diabetic and onset T1D according to the model summary. The point wise confidence bands are a limitation of this method and could be improved with either bootstrapping or continuous confidence bands (Liebl and Reimherr 2023). Thus, we see not only that a spatial difference in co-localisation of α and cytotoxic T cells is statistically significant but also at which spatial scale this difference occurs.
One open problem is the implementation of confidence bands that reflect the non-independently and non-identically distributed residuals. To visualise how much of a problem this is, we can plot the contours of the correlation/covariance and look at some model diagnostics.
#>
#> Method: REML Optimizer: outer newton
#> full convergence after 14 iterations.
#> Gradient range [-7.393688e-05,7.229055e-05]
#> (score 2104.331 & scale 0.5087694).
#> eigenvalue range [-4.141788e-05,6358.512].
#> Model rank = 45 / 45
#>
#> Basis dimension (k) checking results. Low p-value (k-index<1) may
#> indicate that k is too low, especially if edf is close to k'.
#>
#> k' edf
#> s(x.vec) 19.00 13.68
#> s(x.vec):conditionLong_duration 5.00 1.00
#> s(x.vec):conditionOnset 5.00 4.07
#> s(patient_id.vec,efunctions.PC1,efunctions.PC2,efunctions.PC3) 15.00 11.11
#> k-index p-value
#> s(x.vec) 0.98 0.11
#> s(x.vec):conditionLong_duration 0.98 0.14
#> s(x.vec):conditionOnset 0.98 0.15
#> s(patient_id.vec,efunctions.PC1,efunctions.PC2,efunctions.PC3) NA NA
In these model diagnostics, we note that there is still some variability in the residuals that is not considered by the model. The QQ plot indicates a good model fit. The residuals show a considerable structure that is in line with the structure in the auto-covariance / correlation plots.
In the functional additive mixed model, we have a specified global
intercept varying over the domain r as well as functional random
intercepts varying over the domain r per grouping variable
patient_id
. We can plot these smooth estimates of the
random intercepts.
# look at the smooth random intercepts per patient
#data <- coef(mdl)$smterms$`s(patient_id)`$coef
data <- coef(mdl)$smterms$`pcre(patient_id,efunctions,evalues,r)`$coef %>%
dplyr::rename(patient_id = "patient_id.vec")
#> using seWithMean for s(x.vec) .
data <- data %>% left_join(splitData %>%
select(patient_id, condition) %>% unique)
#> Joining with `by = join_by(patient_id)`
p <- ggplot(data, aes(x.vec, value, colour = patient_id)) +
geom_point(aes(shape=factor(condition))) +
theme_light() +
geom_smooth(aes(group = 1), col = 'black') +
xlab("r [µm]")
p
#> `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
In this implementation, we have used principal component based random
errors as implemented in pcre
from refund
.
These random errors are constrained to zero over t.
sessionInfo()
#> R version 4.4.2 (2024-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.2 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] refund_0.1-37 SpatialExperiment_1.17.0
#> [3] SingleCellExperiment_1.29.1 SummarizedExperiment_1.37.0
#> [5] Biobase_2.67.0 GenomicRanges_1.59.1
#> [7] GenomeInfoDb_1.43.4 IRanges_2.41.3
#> [9] S4Vectors_0.45.4 BiocGenerics_0.53.6
#> [11] generics_0.1.3 MatrixGenerics_1.19.1
#> [13] matrixStats_1.5.0 patchwork_1.3.0
#> [15] stringr_1.5.1 tidyr_1.3.1
#> [17] ggplot2_3.5.1 dplyr_1.1.4
#> [19] spatialFDA_0.99.12 BiocStyle_2.35.0
#>
#> loaded via a namespace (and not attached):
#> [1] sys_3.4.3 jsonlite_1.9.0 magrittr_2.0.3
#> [4] rainbow_3.8 spatstat.utils_3.1-2 magick_2.8.5
#> [7] farver_2.1.2 nloptr_2.1.1 rmarkdown_2.29
#> [10] vctrs_0.6.5 memoise_2.0.1 minqa_1.2.8
#> [13] spatstat.explore_3.3-4 RCurl_1.98-1.16 htmltools_0.5.8.1
#> [16] S4Arrays_1.7.3 AnnotationHub_3.15.0 curl_6.2.1
#> [19] deSolve_1.40 SparseArray_1.7.6 sass_0.4.9
#> [22] hdrcde_3.4 pracma_2.4.4 KernSmooth_2.23-26
#> [25] bslib_0.9.0 RLRsim_3.1-8 cachem_1.1.0
#> [28] buildtools_1.0.0 mime_0.12 lifecycle_1.0.4
#> [31] pkgconfig_2.0.3 Matrix_1.7-2 R6_2.6.1
#> [34] fastmap_1.2.0 magic_1.6-1 GenomeInfoDbData_1.2.13
#> [37] rbibutils_2.3 digest_0.6.37 colorspace_2.1-1
#> [40] AnnotationDbi_1.69.0 tensor_1.5 ExperimentHub_2.15.0
#> [43] RSQLite_2.3.9 filelock_1.0.3 labeling_0.4.3
#> [46] spatstat.sparse_3.1-0 httr_1.4.7 polyclip_1.10-7
#> [49] abind_1.4-8 mgcv_1.9-1 compiler_4.4.2
#> [52] bit64_4.6.0-1 withr_3.0.2 DBI_1.2.3
#> [55] gamm4_0.2-6 MASS_7.3-64 rappdirs_0.3.3
#> [58] DelayedArray_0.33.6 rjson_0.2.23 tools_4.4.2
#> [61] goftest_1.2-3 glue_1.8.0 nlme_3.1-167
#> [64] grid_4.4.2 cluster_2.1.8 pbs_1.1
#> [67] gtable_0.3.6 fda_6.2.0 spatstat.data_3.1-4
#> [70] XVector_0.47.2 spatstat.geom_3.3-5 BiocVersion_3.21.1
#> [73] pillar_1.10.1 splines_4.4.2 BiocFileCache_2.15.1
#> [76] lattice_0.22-6 bit_4.5.0.1 deldir_2.0-4
#> [79] grpreg_3.5.0 ks_1.14.3 tidyselect_1.2.1
#> [82] fds_1.8 maketools_1.3.2 Biostrings_2.75.3
#> [85] knitr_1.49 reformulas_0.4.0 xfun_0.51
#> [88] stringi_1.8.4 UCSC.utils_1.3.1 yaml_2.3.10
#> [91] boot_1.3-31 evaluate_1.0.3 tibble_3.2.1
#> [94] BiocManager_1.30.25 cli_3.6.4 Rdpack_2.6.2
#> [97] munsell_0.5.1 jquerylib_0.1.4 Rcpp_1.0.14
#> [100] spatstat.random_3.3-2 dbplyr_2.5.0 png_0.1-8
#> [103] spatstat.univar_3.1-1 parallel_4.4.2 blob_1.2.4
#> [106] mclust_6.1.1 bitops_1.0-9 lme4_1.1-36
#> [109] mvtnorm_1.3-3 scales_1.3.0 pcaPP_2.0-5
#> [112] purrr_1.0.4 crayon_1.5.3 rlang_1.1.5
#> [115] KEGGREST_1.47.0