A framework to discover Spatially Variable genes via spatial clusters


Introduction

DESpace is an intuitive framework for identifying spatially variable (SV) genes (SVGs) via edgeR (Robinson, McCarthy, and Smyth 2009), one of the most common methods for performing differential expression analyses.

Based on pre-annotated spatial clusters as summarized spatial information, DESpace models gene expression using a negative binomial (NB), via edgeR (Robinson, McCarthy, and Smyth 2009), with spatial clusters as covariates. SV genes (SVGs) are then identified by testing the significance of spatial clusters.

Our approach assumes that the spatial structure can be summarized by spatial clusters, which should reproduce the key features of the tissue (e.g., white matter and layers in brain cortex). A significant test of these covariates indicates that space influences gene expression, hence identifying spatially variable genes.

Our model is flexible and robust, and is significantly faster than the most SV methods. Furthermore, to the best of our knowledge, it is the only SV approach that allows: - performing a SV test on each individual spatial cluster, hence identifying the key regions affected by spatial variability; - jointly fitting multiple samples, targeting genes with consistent spatial patterns across biological replicates.

Below, we illustrate en example usage of the package.

Basics

DESpace is implemented as a R package within Bioconductor, which is the main venue for omics analyses, and we use various other Bioconductor packages (e.g., SpatialLIBD, and edgeR).

DESpace package is available on Bioconductor and can be installed with the following command:

if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
}

BiocManager::install("DESpace")

## Check that you have a valid Bioconductor installation
BiocManager::valid()

The development version of DESpacecan also be installed from the Bioconductor-devel branch or from GitHub.

To access the R code used in the vignettes, type:

browseVignettes("DESpace")

Questions relative to DESpace should be reported as a new issue at BugReports.

To cite DESpace, type:

citation("DESpace")

Load R packages:

suppressMessages({
library(DESpace)
library(ggplot2)
library(SpatialExperiment)
})

Data

As an example dataset, we consider a human dorsolateral pre-frontal cortex (DLPFC) spatial transcriptomics dataset from the 10x Genomics Visium platform, including three neurotypical adult donors (i.e., biological replicates), with four images per subject (Maynard 2021). The full dataset consists of 12 samples, which can be accessed via spatialLIBD Bioconductor package.

Input data

Here, we consider a subset of the original data, consisting of three biological replicates: 1 image for each of the three brain subjects. Initially, in Section 3 individual sample , we fit our approach on a single sample, whose data is stored in spe3 whereas all 3 samples will later be jointly used in Section 4 Multiple samples.

# Connect to ExperimentHub
ehub <- ExperimentHub::ExperimentHub()
# Download the full real data (about 2.1 GB in RAM) use:
spe_all <- spatialLIBD::fetch_data(type = "spe", eh = ehub)

# Create three spe objects, one per sample:
spe1 <- spe_all[, colData(spe_all)$sample_id == '151507']
spe2 <- spe_all[, colData(spe_all)$sample_id == '151669']
spe3 <- spe_all[, colData(spe_all)$sample_id == '151673']
rm(spe_all)
# Select small set of random genes for faster runtime in this example
set.seed(123)
sel_genes <- sample(dim(spe1)[1],2000)
spe1 <- spe1[sel_genes,]
spe2 <- spe2[sel_genes,]
spe3 <- spe3[sel_genes,]
# For covenience, we use “gene names” instead of “gene ids”:
rownames(spe1) <- rowData(spe1)$gene_name
rownames(spe2) <- rowData(spe2)$gene_name
rownames(spe3) <- rowData(spe3)$gene_name
# Specify column names of spatial coordinates in colData(spe) 
coordinates <- c("array_row", "array_col")
# Specify column names of spatial clusters in colData(spe) 
spatial_cluster <- 'layer_guess_reordered'

The spatial tissues of each sample were manually annotated in the original manuscript (Maynard 2021), and spots were labeled into one of the following categories: white matter (WM) and layers 1 to 6. The manual annotations are stored in column layer_guess_reordered of the colData, while columns array_col and array_row provide the spatial coordinates of spots.

# We select a subset of columns
keep_col <- c(coordinates,spatial_cluster,"expr_chrM_ratio","cell_count")
head(colData(spe3)[keep_col])
## DataFrame with 6 rows and 5 columns
##                    array_row array_col layer_guess_reordered expr_chrM_ratio
##                    <integer> <integer>              <factor>       <numeric>
## AAACAAGTATCTCCCA-1        50       102                Layer3        0.166351
## AAACAATCTACTAGCA-1         3        43                Layer1        0.122376
## AAACACCAATAACTGC-1        59        19                WM            0.114089
## AAACAGAGCGACTCCT-1        14        94                Layer3        0.242223
## AAACAGCTTTCAGAAG-1        43         9                Layer5        0.152174
## AAACAGGGTCTATATT-1        47        13                Layer6        0.155095
##                    cell_count
##                     <integer>
## AAACAAGTATCTCCCA-1          6
## AAACAATCTACTAGCA-1         16
## AAACACCAATAACTGC-1          5
## AAACAGAGCGACTCCT-1          2
## AAACAGCTTTCAGAAG-1          4
## AAACAGGGTCTATATT-1          6

Quality control/filtering

Quality control (QC) procedures at the spot and gene level aim to remove both low-quality spots, and lowly abundant genes. For QC, we adhere to the instructions from “Orchestrating Spatially Resolved Transcriptomics Analysis with Bioconductor” (OSTA). The library size, UMI counts, ratio of mitochondrial chromosome (chM) expression, and number of cells per spot are used to identify low-quality spots.

# Sample 1:
# Calculate per-spot QC metrics and store in colData
spe1 <- scuttle::addPerCellQC(spe1,)
# Remove combined set of low-quality spots
spe1 <- spe1[, !(colData(spe1)$sum < 10 |             # library size
                colData(spe1)$detected < 10 |         # number of expressed genes
                colData(spe1)$expr_chrM_ratio > 0.30| # mitochondrial expression ratio
                colData(spe1)$cell_count > 10)]       # number of cells per spot
# Sample 2:
# Calculate per-spot QC metrics and store in colData
spe2 <- scuttle::addPerCellQC(spe2,)
# Remove combined set of low-quality spots
spe2 <- spe2[, !(colData(spe2)$sum < 20 |
                colData(spe2)$detected < 15 |
                colData(spe2)$expr_chrM_ratio > 0.35|
                colData(spe2)$cell_count > 8)]
# Sample 3:
spe3 <- scuttle::addPerCellQC(spe3,)
# Remove combined set of low-quality spots
spe3 <- spe3[, !(colData(spe3)$sum < 25 |
                colData(spe3)$detected < 25 |
                colData(spe3)$expr_chrM_ratio > 0.3|
                colData(spe3)$cell_count > 15)]

Then, we discard lowly abundant genes, which were detected in less than 20 spots.

# For each sample i:
for(i in seq_len(3)){
    spe_i <- eval(parse(text = paste0("spe", i)))
    # Select QC threshold for lowly expressed genes: at least 20 non-zero spots:
    qc_low_gene <- rowSums(assays(spe_i)$counts > 0) >= 20
    # Remove lowly abundant genes
    spe_i <- spe_i[qc_low_gene,]
    assign(paste0("spe", i), spe_i)
    message("Dimension of spe", i, ": ", dim(spe_i)[1], ", ", dim(spe_i)[2])
}
## Dimension of spe1: 847, 4174
## Dimension of spe2: 868, 3635
## Dimension of spe3: 908, 3601

Individual sample

We fit our approach to discover SVGs in an individual sample. In Section 4 Multiple samples, we will show how to jointly embed multiple replicates.

Clustering

This framework relies on spatial clusters being accessible and successfully summarizing the primary spatial characteristics of the data. In most datasets, these spatial features are either accessible or can be easily generated with spatial clustering algorithms.

Manual annotation

If manual annotations are provided (e.g., annotated by a pathologist), we can directly use those. With the spe or spe object that contains coordinates of the spot-level data, we can visualize spatial clusters.

# View LIBD layers for one sample
CD <- as.data.frame(colData(spe3))
ggplot(CD, 
    aes(x=array_col,y=array_row, 
    color=factor(layer_guess_reordered))) +
    geom_point() + 
    theme_void() + scale_y_reverse() + 
    theme(legend.position="bottom") + 
    labs(color = "", title = paste0("Manually annotated spatial clusters"))

Spatially resolved clustering

If manual annotations are not available, we can use spatially resolved clustering tools. These methods, by jointly employing spatial coordinates and gene expression data, enable obtaining spatial clusters. Although, in this vignette we use pre-computed manually annotated clusters, below we provide links to two popular spatially resolved clustering tools: BayesSpace (Zhao et al. 2021) and StLearn (Pham et al. 2020).

BayesSpace

BayesSpace is a Bioconductor package that provides a Bayesian statistical approach for spatial transcriptomics data clustering (BayesSpace). There is a specific vignette for using BayesSpace on this dataset (human DLPFC) here.

After using BayesSpace, the spatial cluster assignments (spatial.cluster) are available in the colData(spe).

StLearn

StLearn, a python-based package, is designed for spatial transciptomics data. It allows spatially-resolved clustering based on Louvain or k-means (stLearn). There is a tutorial for using StLearn on this dataset (human DLPFC) here.

After running stLearn, we can store results as a csv file.

{python.reticulate = FALSE stLearn[py multi-sample], eval = FALSE} # Save spatial results obsm.to_csv("stLearn_clusters.csv")

Then, we can load these results in R and store spatial clusters in the spe object.

stLearn_results <- read.csv("stLearn_clusters.csv", sep = ',', 
                            header = TRUE)
# Match colData(spe) and stLearn results
stLearn_results <- stLearn_results[match(rownames(colData(spe3)), 
                                    rownames(stLearn_results)), ]
colData(spe3)$stLearn_clusters <- stLearn_results$stLearn_pca_kmeans

SV testing

Once we have spatial clusters, we can search for SVGs.

Gene-level test

Fit the model via DESpace_test function. Parameter spe specifies the input SpatialExperiment or SingleCellExperiment object, while spatial_cluster defines the column names of colData(spe) containing spatial clusters. To obtain all statistics, set verbose to TRUE (default value).

set.seed(123)
results <- DESpace_test(spe = spe3,
                        spatial_cluster = spatial_cluster, 
                        verbose = TRUE)
## using 'DESpace_test' for spatial gene/pattern detection.
## Filter low quality genes:
## min_counts = 20; min_non_zero_spots = 10.
## The number of genes that pass filtering is 908.
## single sample test

A list of results is returned. The main results of interest are stored in the gene_results: a data.fame, where columns contain gene names (gene_id), likelihood ratio test statistics (LR), average (across spots) log-2 counts per million (logCPM), raw p-values (PValue) and Benjamini-Hochberg adjusted p-values (FDR).

head(results$gene_results, 3)
##         gene_id       LR   logCPM        PValue           FDR
## SNCG       SNCG 1302.458 14.36222 3.181142e-278 2.888477e-275
## ATP1A3   ATP1A3 1235.646 14.68926 9.224867e-264 4.188090e-261
## PLEKHH1 PLEKHH1 1045.607 13.69914 1.220571e-222 3.694262e-220

The second element of the results (a DGEList object estimated_y) contains the estimated common dispersion, which can later be used to speed-up calculation when testing individual clusters.

The third and forth element of the results (DGEGLM and DGELRT objects) contain full statistics from edgeR::glmFit and edgeR::glmLRT.

class(results$estimated_y); class(results$glmLrt); class(results$glmFit)
## [1] "DGEList"
## attr(,"package")
## [1] "edgeR"
## [1] "DGELRT"
## attr(,"package")
## [1] "edgeR"
## [1] "DGEGLM"
## attr(,"package")
## [1] "edgeR"

Visualize the gene expression of the three most significant genes in FeaturePlot(). Note that the gene names in vector feature, should also appear in the count matrix’s row names. Specifying the column names of spatial coordinates of spots is only necessary when they are not named row and col.

(feature <- results$gene_results$gene_id[seq_len(3)])
## [1] "SNCG"    "ATP1A3"  "PLEKHH1"
FeaturePlot(spe3, feature, 
            coordinates = coordinates, 
            ncol = 3, title = TRUE)

Additionally, function FeaturePlot() can draw an outline around each cluster.

FeaturePlot(spe3, feature, 
            coordinates = coordinates, 
            Annotated_cluster = TRUE,
            spatial_cluster = spatial_cluster, 
            cluster = 'all', 
            legend_cluster = TRUE, title = TRUE)
## Warning in get_plot_component(plot, "guide-box"): Multiple components found;
## returning the first one. To return all, use `return_all = TRUE`.
## Warning: Removed 17 rows containing missing values or values outside the scale range
## (`geom_point()`).

Individual cluster test

DESpace can also be used to reveal the specific areas of the tissue affected by spatial variability; i.e., spatial clusters that are particularly over/under abundant compared to the average. Function individual_test() can be used to identify SVGs for each individual cluster. Parameter spatial_cluster indicates the column names of colData(spe) containing spatial clusters.

For every spatial cluster we test, edgeR would normally re-compute the dispersion estimates based on the specific design of the test. However, this calculation represents the majority of the overall computing time. Therefore, to speed-up calculations, we propose to use the dispersion estimates which were previously computed for the gene-level tests. Albeit this is an approximation, in our benchmarks, it leads to comparable performance in terms of sensitivity and specificity. If you want to use pre-computed gene-level dispersion estimates, set edgeR_y to estimated_y. Alternatively, if you want to re-compute dispersion estimates (significantly slower, but marginally more accurate option), leave edgeR_y empty.

set.seed(123)
cluster_results <- individual_test(spe3, 
                                    edgeR_y = results$estimated_y,
                                    spatial_cluster = spatial_cluster)
## Filter low quality genes:
## min_counts = 20; min_non_zero_spots = 10.
## The number of genes that pass filtering is908.
## Pre-processing
## Start modeling
## Returning results

individual_test() returns a list containing the results of the individual cluster tests. Similarly to above, for each cluster, results are reported as a data.fame, where columns contain gene names (gene_id), likelihood ratio test statistics (LR), log2-fold changes (logFC), raw p-values (PValue) and Benjamini-Hochberg adjusted p-values (FDR). NB: that the logFC compares each cluster to the rest of the tissue; e.g., a logFC of 2 for WM test indicates that the average gene expression in WM is (4 times) higher than the average gene expression in non-WM tissue.

Visualize results for WM.

class(cluster_results)
## [1] "list"
names(cluster_results)
## [1] "Layer1" "Layer2" "Layer3" "Layer4" "Layer5" "Layer6" "WM"

top_results function can be used to combine gene-and cluster-level results. By default, results from top_results() report both adjusted p-values and log2-FC; however, users can also choose to only report either, by specifying select = "FDR" or select = "logFC". Below, gene_PValue and gene_FDR columns refer to the gene-level testing, while the subsequent columns indicate the cluster-specific results.

merge_res <- top_results(results$gene_results, cluster_results)
head(merge_res,3)
##   gene_id  gene_LR gene_logCPM   gene_Pvalue      gene_FDR   Layer1_FDR
## 1    SNCG 1302.458    14.36222 3.181142e-278 2.888477e-275 4.647725e-13
## 2  ATP1A3 1235.646    14.68926 9.224867e-264 4.188090e-261 3.938179e-15
## 3 PLEKHH1 1045.607    13.69914 1.220571e-222 3.694262e-220 6.927321e-15
##     Layer2_FDR   Layer3_FDR   Layer4_FDR   Layer5_FDR   Layer6_FDR
## 1 2.811936e-06 8.360504e-80 7.834189e-13 2.906336e-26 1.927089e-65
## 2 2.025663e-03 5.124609e-60 7.015626e-12 1.445723e-15 8.018028e-25
## 3 1.211797e-06 6.432818e-44 1.052098e-06 4.485554e-17 3.194246e-01
##          WM_FDR Layer1_logFC Layer2_logFC Layer3_logFC Layer4_logFC
## 1 3.682232e-133    0.7714133    0.4604152   -0.8320186   -0.6132229
## 2 2.231319e-182    0.6342124   -0.2225299   -0.5509693   -0.4553823
## 3 7.557305e-209    1.4032918    0.7711747    1.1047706    0.8704022
##   Layer5_logFC Layer6_logFC  WM_logFC
## 1   -0.5492418    1.0925655 -1.966807
## 2   -0.3252814    0.4761009 -1.728821
## 3    0.8053901    0.1317995  2.320427
merge_res <- top_results(results$gene_results, cluster_results, 
                        select = "FDR")
head(merge_res,3)
##   gene_id  gene_LR gene_logCPM   gene_Pvalue      gene_FDR   Layer1_FDR
## 1    SNCG 1302.458    14.36222 3.181142e-278 2.888477e-275 4.647725e-13
## 2  ATP1A3 1235.646    14.68926 9.224867e-264 4.188090e-261 3.938179e-15
## 3 PLEKHH1 1045.607    13.69914 1.220571e-222 3.694262e-220 6.927321e-15
##     Layer2_FDR   Layer3_FDR   Layer4_FDR   Layer5_FDR   Layer6_FDR
## 1 2.811936e-06 8.360504e-80 7.834189e-13 2.906336e-26 1.927089e-65
## 2 2.025663e-03 5.124609e-60 7.015626e-12 1.445723e-15 8.018028e-25
## 3 1.211797e-06 6.432818e-44 1.052098e-06 4.485554e-17 3.194246e-01
##          WM_FDR
## 1 3.682232e-133
## 2 2.231319e-182
## 3 7.557305e-209

We can further specify a cluster and check top genes detected by DESpace.

# Check top genes for WM
results_WM <- top_results(cluster_results = cluster_results, 
                        cluster = "WM")
head(results_WM, 3)
##         Cluster gene_id Cluster_LR Cluster_logCPM Cluster_logFC Cluster_PValue
## PLEKHH1      WM PLEKHH1   964.7325       13.69914      2.320427  8.323023e-212
## ATP1A3       WM  ATP1A3   841.5827       14.68926     -1.728821  4.914799e-185
## SLAIN1       WM  SLAIN1   723.0734       13.78573      1.870821  2.873138e-159
##           Cluster_FDR
## PLEKHH1 7.557305e-209
## ATP1A3  2.231319e-182
## SLAIN1  8.696031e-157

With high_low parameter, we can further filter genes to visualize those with higher (high_low = "high") or lower (high_low = "low") average abundance in the specified cluster, compared to the average abundance in the rest of the tissue. By default, high_low = “both” and all results are provided.

results_WM_both <- top_results(cluster_results = cluster_results, 
                                cluster = "WM", 
                                high_low = "both")

Here we present the highly abundant cluster SVGs; i.e., SVGs with higher expression in WM compared to the rest of the area.

head(results_WM_both$high_genes, 3)
##         Cluster gene_id Cluster_LR Cluster_logCPM Cluster_logFC Cluster_PValue
## PLEKHH1      WM PLEKHH1   964.7325       13.69914      2.320427  8.323023e-212
## SLAIN1       WM  SLAIN1   723.0734       13.78573      1.870821  2.873138e-159
## CMTM5        WM   CMTM5   659.1198       13.65298      2.098647  2.321441e-145
##           Cluster_FDR
## PLEKHH1 7.557305e-209
## SLAIN1  8.696031e-157
## CMTM5   4.215737e-143

We visualize the lowly abundant cluster SVGs; i.e., SVGs with lower expression in WM compared to the rest of the area.

head(results_WM_both$low_genes, 3)
##         Cluster gene_id Cluster_LR Cluster_logCPM Cluster_logFC Cluster_PValue
## ATP1A3       WM  ATP1A3   841.5827       14.68926     -1.728821  4.914799e-185
## PRKAR1B      WM PRKAR1B   669.1243       14.72792     -1.505871  1.548960e-147
## NSF          WM     NSF   628.8597       14.50624     -1.744898  8.847779e-139
##           Cluster_FDR
## ATP1A3  2.231319e-182
## PRKAR1B 3.516140e-145
## NSF     1.338964e-136

Visualize the gene expression of the top genes for layer WM. A cluster outline can be drawn by specifying the column names of clusters stored in colData(spe) and the vector of cluster names via spatial_cluster and cluster.

# SVGs with higher than average abundance in WM
feature <- rownames(results_WM_both$high_genes)[seq_len(3)]
FeaturePlot(spe3, feature, spatial_cluster = spatial_cluster, 
            coordinates = coordinates, cluster = 'WM', 
            legend_cluster = TRUE, Annotated_cluster = TRUE, 
            linewidth = 0.6, title = TRUE)
## Warning in get_plot_component(plot, "guide-box"): Multiple components found;
## returning the first one. To return all, use `return_all = TRUE`.

# SVGs with lower than average abundance in WM
feature <- rownames(results_WM_both$low_genes)[seq_len(3)]
FeaturePlot(spe3, feature, spatial_cluster = spatial_cluster, 
            coordinates = coordinates, cluster = 'WM', 
            legend_cluster = TRUE, Annotated_cluster = TRUE, 
            linewidth = 0.6,title = TRUE)
## Warning in get_plot_component(plot, "guide-box"): Multiple components found;
## returning the first one. To return all, use `return_all = TRUE`.

Multiple samples

If biological replicates are available, our framework allows jointly modeling them to target SVGs with coherent spatial patterns across samples. This approach may be particularly beneficial when data are characterized by a large degree of (biological and technical) variability, such as in cancer. Importantly, only genes detected (above filtering thresholds) in all samples will be analyzed.

Clustering

Similar to gene-level testing, the multi-sample extension requires pre-annotated spatial clusters.

Manual annotation

If the manual annotation for each sample is available, we can combine all samples and use manual annotations directly. Note that cluster labels must be consistent across samples (i.e., WM in sample 1 should represent the same tissue as WM in sample 2). With the spe.combined object that contains coordinates of the spot-level data, we can visualize spatial clusters.

set.seed(123)
# Use common genes
a <- rownames(counts(spe1)); 
b <- rownames(counts(spe2)); 
c <- rownames(counts(spe3))
# find vector of common genes across all samples:
CommonGene <- Reduce(intersect, list(a,b,c))
spe1 <- spe1[CommonGene,]
spe2 <- spe2[CommonGene,]
spe3 <- spe3[CommonGene,]

# Combine three samples
spe.combined <- cbind(spe1, spe2, spe3, deparse.level = 1)
ggplot(as.data.frame(colData(spe.combined)), 
    aes(x=array_col, y=array_row,
    color=factor(layer_guess_reordered))) +
    geom_point() + 
    facet_wrap(~sample_id) +
    theme_void() + scale_y_reverse() +
    theme(legend.position="bottom") + 
    labs(color = "", title = "Manually annotated spatial clusters")

Spatially resolved (multi-sample) clustering

Similarly to above, if manual annotations are not available, we can use spatially resolved clustering tools. Both BayesSpace (Zhao et al. 2021) and StLearn (Pham et al. 2020) allow jointly clustering multiple samples. In particular each tool has a specific vignettes for multi-testing clustering: BayesSpace vignettes, and stLearn vignettes.

Single sample clustering

In our benchmarks, we have noticed that, with both BayesSpace and StLearn, joint spatial clustering of multiple samples is more prone to failure and inaccurate results than spatial clustering of individual samples. Therefore, if multi-sample clustering fails, we suggest trying to cluster individual samples (as in Section 3 Individual sample) and manually match cluster ids across samples, to ensure that “cluster 1” always refers to the same spatial region in all samples.

SV testing

Once we have spatial clusters for multiple samples, we add them to colData(spe.combined) as the column layer_guess_reordered and fit the model with spatial clusters as covariates.

Gene-level test

Fit the model via DESpace_test(). Parameter spe specifies the input SpatialExperiment or SingleCellExperiment object, while spatial_cluster and sample_col define the column names of colData(spe) containing spatial clusters and sample ids. With replicates = TRUE, we fit the multi-sample model.

The second element of the result (a DGEList object estimated_y_multi) contains the estimated common dispersion for the multi-sample case.

set.seed(123)
multi_results <- DESpace_test(spe = spe.combined,
                                spatial_cluster = spatial_cluster,
                                sample_col = 'sample_id',
                                replicates = TRUE)
## using 'DESpace_test' for spatial gene/pattern detection.
## Filter low quality genes:
## min_counts = 20; min_non_zero_spots = 10.
## The number of genes that pass filtering is 828.
## multi-sample test
## Repeated column names found in count matrix

A list of results are returned. The main results of interest are stored in the gene_results.

head(multi_results$gene_results,3)
##        gene_id       LR   logCPM PValue FDR
## HPCAL1  HPCAL1 2063.906 14.41880      0   0
## NSF        NSF 1524.984 14.70442      0   0
## ATP1A3  ATP1A3 1578.194 14.88552      0   0

The second element of the results (a DGEList object estimated_y) contains the estimated common dispersion, which can later be used to speed-up calculation when testing individual clusters.

class(multi_results$estimated_y)
## [1] "DGEList"
## attr(,"package")
## [1] "edgeR"

For each sample, we can visualize the gene expression of the most significant SVGs. Note that column names of spatial coordinates of spots should be row and col.

## Top three spatially variable genes
feature <- multi_results$gene_results$gene_id[seq_len(3)]; feature
## [1] "HPCAL1" "NSF"    "ATP1A3"
## Sample names
samples <- unique(colData(spe.combined)$sample_id); samples
## [1] "151507" "151669" "151673"
## Use purrr::map to combine multiple figures
spot_plots <- purrr::map(seq_along(samples), function(j) {
    ## Subset spe for each sample j
    spe_j <- spe.combined[, colData(spe.combined)$sample_id == samples[j] ]
    ## Store three gene expression plots with gene names in `feature` for spe_j
    spot_plots <- FeaturePlot(spe_j, feature, 
                            coordinates = coordinates,
                            spatial_cluster = spatial_cluster, title = TRUE,
                            Annotated_cluster = TRUE, legend_cluster = TRUE)
    return(spot_plots)
})
patchwork::wrap_plots(spot_plots, ncol=1)

Individual cluster test

Similarly to what shown in Section 3 Individual sample, our framework can discover the key SV spatial clusters also when jointly fitting multiple samples. For a multi-sample testing, set replicates = TRUE in individual_test().

set.seed(123)
cluster_results <- individual_test(spe.combined, 
                                edgeR_y = multi_results$estimated_y,
                                replicates = TRUE, 
                                spatial_cluster = spatial_cluster)
## Filter low quality genes:
## min_counts = 20; min_non_zero_spots = 10.
## The number of genes that pass filtering is828.
## Pre-processing
## Start modeling
## Returning results

individual_test() returns a list containing the results of individual clusters, specified in cluster parameter. In this case, logFC refers to the log2-FC between the average abundance, across all samples, of a spatial cluster and the average abundance of all remaining clusters (e.g., WM vs. non-WM tissue).

Visualize results for WM.

class(cluster_results)
## [1] "list"
names(cluster_results)
## [1] "Layer1" "Layer2" "Layer3" "Layer4" "Layer5" "Layer6" "WM"

As above, top_results function can be used to combine gene-level and cluster-level results.

merge_res <- top_results(multi_results$gene_results, cluster_results, 
                        select = "FDR")
head(merge_res,3)
##   gene_id  gene_LR gene_logCPM gene_Pvalue gene_FDR    Layer1_FDR    Layer2_FDR
## 1  ATP1A3 1578.194    14.88552           0        0  1.970166e-93  2.067235e-02
## 2  HPCAL1 2063.906    14.41880           0        0  4.522810e-15 1.890276e-191
## 3     NSF 1524.984    14.70442           0        0 5.165955e-127  4.084828e-01
##      Layer3_FDR   Layer4_FDR    Layer5_FDR   Layer6_FDR        WM_FDR
## 1  3.715619e-16 1.925546e-23  1.694696e-46 4.982746e-03 8.899716e-194
## 2 4.927089e-126 7.522529e-51 1.038996e-117 3.528734e-15  8.907694e-39
## 3  4.331610e-01 1.906776e-24  3.585842e-60 9.999424e-02 3.143784e-147

We can further select a cluster of interest, and check the top genes detected in that cluster.

# Check top genes for WM
results_WM <- top_results(cluster_results = cluster_results, 
                        cluster = "WM")
# For each gene, adjusted p-values for each cluster
head(results_WM,3)
##         Cluster gene_id Cluster_LR Cluster_logCPM Cluster_logFC Cluster_PValue
## PLEKHH1      WM PLEKHH1   933.0597       14.00538      1.725143  6.385387e-205
## ATP1A3       WM  ATP1A3   893.8334       14.88552     -1.285877  2.149690e-196
## SNCG         WM    SNCG   770.7204       14.60578     -1.597352  1.253467e-169
##           Cluster_FDR
## PLEKHH1 5.287100e-202
## ATP1A3  8.899716e-194
## SNCG    3.459569e-167

With high_low = "both", we can further filter genes to visualize highly and lowly abundant SVGs.

results_WM_both <- top_results(cluster_results = cluster_results, 
                            cluster = "WM", high_low = "both")

Here we present the highly abundant cluster SVGs; i.e., SVGs with higher expression in WM compared to the rest of the tissue.

head(results_WM_both$high_genes,3)
##         Cluster gene_id Cluster_LR Cluster_logCPM Cluster_logFC Cluster_PValue
## PLEKHH1      WM PLEKHH1   933.0597       14.00538      1.725143  6.385387e-205
## SLAIN1       WM  SLAIN1   660.5151       14.07035      1.407386  1.154292e-145
## CMTM5        WM   CMTM5   617.2949       13.99113      1.479957  2.898118e-136
##           Cluster_FDR
## PLEKHH1 5.287100e-202
## SLAIN1  1.592923e-143
## CMTM5   3.428059e-134

We visualize the lowly abundant cluster SVGs; i.e., SVGs with lower expression in WM compared to the rest of the tissue.

head(results_WM_both$low_genes,3)
##         Cluster gene_id Cluster_LR Cluster_logCPM Cluster_logFC Cluster_PValue
## ATP1A3       WM  ATP1A3   893.8334       14.88552     -1.285877  2.149690e-196
## SNCG         WM    SNCG   770.7204       14.60578     -1.597352  1.253467e-169
## PRKAR1B      WM PRKAR1B   703.9422       14.89235     -1.135665  4.153923e-155
##           Cluster_FDR
## ATP1A3  8.899716e-194
## SNCG    3.459569e-167
## PRKAR1B 8.598622e-153

Visualize the gene expression of top three genes for layer WM.

# SVGs with higher abundance in WM, than in non-WM tissue
feature_high <- rownames(results_WM_both$high_genes)[seq_len(3)]
# SVGs with lower abundance in WM, than in non-WM tissue
feature_low <- rownames(results_WM_both$low_genes)[seq_len(3)]
plot_list_high <- list(); plot_list_low <- list()
## Sample names
samples <- unique(colData(spe.combined)$sample_id)
for(j in seq_along(samples)){
    ## Subset spe for each sample j
    spe_j <- spe.combined[, colData(spe.combined)$sample_id == samples[j]]
    ## Gene expression plots with top highly abundant cluster SVGs for spe_j
    plot_list_high[[j]] <- FeaturePlot(spe_j, feature_high, 
                                coordinates = coordinates,
                                spatial_cluster = spatial_cluster, 
                                linewidth = 0.6,
                                cluster = 'WM', Annotated_cluster = TRUE,
                                legend_cluster = TRUE, title = TRUE)
    ## Gene expression plots with top lowly abundant cluster SVGs for spe_j
    plot_list_low[[j]] <- FeaturePlot(spe_j, feature_low, 
                                coordinates = coordinates,
                                spatial_cluster = spatial_cluster, 
                                linewidth = 0.6,
                                cluster = 'WM', Annotated_cluster = TRUE,
                                legend_cluster = TRUE, title = TRUE)
}
## Warning in get_plot_component(plot, "guide-box"): Multiple components found;
## returning the first one. To return all, use `return_all = TRUE`.
## Warning in get_plot_component(plot, "guide-box"): Multiple components found;
## returning the first one. To return all, use `return_all = TRUE`.
## Warning in get_plot_component(plot, "guide-box"): Multiple components found;
## returning the first one. To return all, use `return_all = TRUE`.
## Warning in get_plot_component(plot, "guide-box"): Multiple components found;
## returning the first one. To return all, use `return_all = TRUE`.
## Warning in get_plot_component(plot, "guide-box"): Multiple components found;
## returning the first one. To return all, use `return_all = TRUE`.
## Warning in get_plot_component(plot, "guide-box"): Multiple components found;
## returning the first one. To return all, use `return_all = TRUE`.
# Expression plots for SVGs with higher abundance in WM, than in non-WM tissue
patchwork::wrap_plots(plot_list_high, ncol=1)

# Expression plots for SVGs with lower abundance in WM, than in non-WM tissue
patchwork::wrap_plots(plot_list_low, ncol=1)

Sample-specific covariates (e.g., batch effects)

If sample-specific covariates, such as batch effects, are available, we can account for them in DESpace. The adjustment works as in the edgeR original framework: the mean of the negative binomial model is expressed as a function of spatial clusters, and additional nuisance covariates; differential testing is then performed on spatial clusters only, to identify SVGs. Note that sample-specific covariates can be used instead of samples, but a joint modelling of both samples and (sample-specific) covariates is not possible because the two variables are nested.

To show an example application, we artificially separate samples in 2 batches:

spe.combined$batch_id = ifelse(spe.combined$sample_id == "151507", "batch_1", "batch_2")

table(spe.combined$batch_id, spe.combined$sample_id)
##          
##           151507 151669 151673
##   batch_1   4174      0      0
##   batch_2      0   3635   3601

Analyses are performed, as explained above, in Section 5; yet, when running DESpace_test, we set the sample_col to batch_id:

set.seed(123)
batch_results <- DESpace_test(spe = spe.combined,
                                spatial_cluster = spatial_cluster,
                                sample_col = 'batch_id',
                                replicates = TRUE)

Session info

sessionInfo()
## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Etc/UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] SpatialExperiment_1.15.1    SingleCellExperiment_1.27.2
##  [3] SummarizedExperiment_1.35.5 Biobase_2.67.0             
##  [5] GenomicRanges_1.57.2        GenomeInfoDb_1.41.2        
##  [7] IRanges_2.39.2              S4Vectors_0.43.2           
##  [9] BiocGenerics_0.53.0         MatrixGenerics_1.17.1      
## [11] matrixStats_1.4.1           ggplot2_3.5.1              
## [13] DESpace_1.7.0               BiocStyle_2.35.0           
## 
## loaded via a namespace (and not attached):
##   [1] later_1.3.2              BiocIO_1.17.0            bitops_1.0-9            
##   [4] filelock_1.0.3           fields_16.3              tibble_3.2.1            
##   [7] polyclip_1.10-7          XML_3.99-0.17            lifecycle_1.0.4         
##  [10] rstatix_0.7.2            edgeR_4.3.21             doParallel_1.0.17       
##  [13] lattice_0.22-6           MASS_7.3-61              backports_1.5.0         
##  [16] magrittr_2.0.3           limma_3.61.12            plotly_4.10.4           
##  [19] sass_0.4.9               rmarkdown_2.28           jquerylib_0.1.4         
##  [22] yaml_2.3.10              httpuv_1.6.15            spam_2.11-0             
##  [25] sessioninfo_1.2.2        cowplot_1.1.3            DBI_1.2.3               
##  [28] buildtools_1.0.0         RColorBrewer_1.1-3       golem_0.5.1             
##  [31] maps_3.4.2               abind_1.4-8              zlibbioc_1.51.2         
##  [34] purrr_1.0.2              RCurl_1.98-1.16          tweenr_2.0.3            
##  [37] rappdirs_0.3.3           GenomeInfoDbData_1.2.13  ggrepel_0.9.6           
##  [40] irlba_2.3.5.1            maketools_1.3.1          codetools_0.2-20        
##  [43] DelayedArray_0.31.14     DT_0.33                  scuttle_1.15.5          
##  [46] ggforce_0.4.2            tidyselect_1.2.1         UCSC.utils_1.1.0        
##  [49] farver_2.1.2             viridis_0.6.5            ScaledMatrix_1.13.0     
##  [52] shinyWidgets_0.8.7       BiocFileCache_2.15.0     GenomicAlignments_1.41.0
##  [55] jsonlite_1.8.9           BiocNeighbors_2.1.0      Formula_1.2-5           
##  [58] scater_1.33.4            iterators_1.0.14         foreach_1.5.2           
##  [61] tools_4.4.1              ggnewscale_0.5.0         Rcpp_1.0.13             
##  [64] glue_1.8.0               gridExtra_2.3            SparseArray_1.5.45      
##  [67] xfun_0.48                dplyr_1.1.4              withr_3.0.2             
##  [70] BiocManager_1.30.25      fastmap_1.2.0            fansi_1.0.6             
##  [73] rsvd_1.0.5               digest_0.6.37            R6_2.5.1                
##  [76] mime_0.12                colorspace_2.1-1         RSQLite_2.3.7           
##  [79] config_0.3.2             utf8_1.2.4               tidyr_1.3.1             
##  [82] generics_0.1.3           data.table_1.16.2        rtracklayer_1.65.0      
##  [85] httr_1.4.7               htmlwidgets_1.6.4        S4Arrays_1.5.11         
##  [88] pkgconfig_2.0.3          gtable_0.3.6             blob_1.2.4              
##  [91] XVector_0.45.0           sys_3.4.3                htmltools_0.5.8.1       
##  [94] carData_3.0-5            dotCall64_1.2            scales_1.3.0            
##  [97] png_0.1-8                attempt_0.3.1            knitr_1.48              
## [100] rjson_0.2.23             curl_5.2.3               cachem_1.1.0            
## [103] stringr_1.5.1            BiocVersion_3.21.1       concaveman_1.1.0        
## [106] vipor_0.4.7              parallel_4.4.1           AnnotationDbi_1.69.0    
## [109] restfulr_0.0.15          pillar_1.9.0             grid_4.4.1              
## [112] vctrs_0.6.5              promises_1.3.0           ggpubr_0.6.0            
## [115] BiocSingular_1.23.0      car_3.1-3                dbplyr_2.5.0            
## [118] beachmat_2.23.0          xtable_1.8-4             beeswarm_0.4.0          
## [121] paletteer_1.6.0          evaluate_1.0.1           magick_2.8.5            
## [124] cli_3.6.3                locfit_1.5-9.10          compiler_4.4.1          
## [127] Rsamtools_2.21.2         rlang_1.1.4              crayon_1.5.3            
## [130] ggsignif_0.6.4           labeling_0.4.3           rematch2_2.1.2          
## [133] ggbeeswarm_0.7.2         stringi_1.8.4            viridisLite_0.4.2       
## [136] BiocParallel_1.39.0      assertthat_0.2.1         munsell_0.5.1           
## [139] Biostrings_2.75.0        lazyeval_0.2.2           V8_6.0.0                
## [142] Matrix_1.7-1             ExperimentHub_2.13.1     benchmarkme_1.0.8       
## [145] patchwork_1.3.0          bit64_4.5.2              KEGGREST_1.45.1         
## [148] statmod_1.5.0            shiny_1.9.1              highr_0.11              
## [151] AnnotationHub_3.15.0     broom_1.0.7              memoise_2.0.1           
## [154] bslib_0.8.0              benchmarkmeData_1.0.4    bit_4.5.0               
## [157] spatialLIBD_1.17.8

References

Maynard, Collado-Torres, K. R. 2021. Transcriptome-scale spatial gene expression in the human dorsolateral prefrontal cortex.” Nature Neurosciencei 24: 425–36. https://doi.org/10.1038/s41593-020-00787-0.
Pham, Duy Truong, Xiao Tan, Jun Xu, Laura F Grice, Pui Yeng Lam, Arti Raghubar, Jana Vukovic, Marc J Ruitenberg, and Quan Hoang Nguyen. 2020. stLearn: integrating spatial location, tissue morphology and gene expression to find cell types, cell-cell interactions and spatial trajectories within undissociated tissues.” Bioinformatics 36 (7): 2293–94. https://doi.org/doi:10.1093/bioinformatics/btz914.
Robinson, M. D, D. J McCarthy, and G. K. Smyth. 2009. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data.” Bioinformatics 26 (1): 139–40. https://doi.org/10.1093/bioinformatics/btp616.
Zhao, E, M. R Stone, X Ren, J Guenthoer, K. S Smythe, T Pulliam, S. R Williams, et al. 2021. Spatial transcriptomics at subspot resolution with BayesSpace.” Nature Biotechnology 39 (June): 1375–84. https://doi.org/10.1038/s41587-021-00935-2.