DESpace is an intuitive framework for identifying spatially variable (SV) genes (SVGs) via edgeR (Robinson, McCarthy, and Smyth 2009), one of the most common methods for performing differential expression analyses.
Based on pre-annotated spatial clusters as summarized spatial information, DESpace models gene expression using a negative binomial (NB), via edgeR (Robinson, McCarthy, and Smyth 2009), with spatial clusters as covariates. SV genes (SVGs) are then identified by testing the significance of spatial clusters.
Our approach assumes that the spatial structure can be summarized by spatial clusters, which should reproduce the key features of the tissue (e.g., white matter and layers in brain cortex). A significant test of these covariates indicates that space influences gene expression, hence identifying spatially variable genes.
Our model is flexible and robust, and is significantly faster than the most SV methods. Furthermore, to the best of our knowledge, it is the only SV approach that allows: - performing a SV test on each individual spatial cluster, hence identifying the key regions affected by spatial variability; - jointly fitting multiple samples, targeting genes with consistent spatial patterns across biological replicates.
Below, we illustrate en example usage of the package.
DESpace
is implemented as a R package within
Bioconductor, which is the main venue for omics analyses, and we use
various other Bioconductor packages (e.g., SpatialLIBD, and edgeR).
DESpace
package is available on Bioconductor and can be
installed with the following command:
if (!requireNamespace("BiocManager", quietly = TRUE)) {
install.packages("BiocManager")
}
BiocManager::install("DESpace")
## Check that you have a valid Bioconductor installation
BiocManager::valid()
The development version of DESpace
can also be installed
from the Bioconductor-devel branch or from GitHub.
To access the R code used in the vignettes, type:
Questions relative to DESpace should be reported as a new issue at BugReports.
To cite DESpace, type:
Load R packages:
As an example dataset, we consider a human dorsolateral pre-frontal cortex (DLPFC) spatial transcriptomics dataset from the 10x Genomics Visium platform, including three neurotypical adult donors (i.e., biological replicates), with four images per subject (Maynard 2021). The full dataset consists of 12 samples, which can be accessed via spatialLIBD Bioconductor package.
Here, we consider a subset of the original data, consisting of three
biological replicates: 1 image for each of the three brain subjects.
Initially, in Section 3 individual sample , we fit our approach
on a single sample, whose data is stored in spe3
whereas
all 3 samples will later be jointly used in Section 4 Multiple
samples.
# Connect to ExperimentHub
ehub <- ExperimentHub::ExperimentHub()
# Download the full real data (about 2.1 GB in RAM) use:
spe_all <- spatialLIBD::fetch_data(type = "spe", eh = ehub)
# Create three spe objects, one per sample:
spe1 <- spe_all[, colData(spe_all)$sample_id == '151507']
spe2 <- spe_all[, colData(spe_all)$sample_id == '151669']
spe3 <- spe_all[, colData(spe_all)$sample_id == '151673']
rm(spe_all)
# Select small set of random genes for faster runtime in this example
set.seed(123)
sel_genes <- sample(dim(spe1)[1],2000)
spe1 <- spe1[sel_genes,]
spe2 <- spe2[sel_genes,]
spe3 <- spe3[sel_genes,]
# For covenience, we use “gene names” instead of “gene ids”:
rownames(spe1) <- rowData(spe1)$gene_name
rownames(spe2) <- rowData(spe2)$gene_name
rownames(spe3) <- rowData(spe3)$gene_name
# Specify column names of spatial coordinates in colData(spe)
coordinates <- c("array_row", "array_col")
# Specify column names of spatial clusters in colData(spe)
spatial_cluster <- 'layer_guess_reordered'
The spatial tissues of each sample were manually annotated in the
original manuscript (Maynard 2021), and
spots were labeled into one of the following categories: white matter
(WM) and layers 1 to 6. The manual annotations are stored in column
layer_guess_reordered
of the colData
, while
columns array_col
and array_row
provide the
spatial coordinates of spots.
# We select a subset of columns
keep_col <- c(coordinates,spatial_cluster,"expr_chrM_ratio","cell_count")
head(colData(spe3)[keep_col])
## DataFrame with 6 rows and 5 columns
## array_row array_col layer_guess_reordered expr_chrM_ratio
## <integer> <integer> <factor> <numeric>
## AAACAAGTATCTCCCA-1 50 102 Layer3 0.166351
## AAACAATCTACTAGCA-1 3 43 Layer1 0.122376
## AAACACCAATAACTGC-1 59 19 WM 0.114089
## AAACAGAGCGACTCCT-1 14 94 Layer3 0.242223
## AAACAGCTTTCAGAAG-1 43 9 Layer5 0.152174
## AAACAGGGTCTATATT-1 47 13 Layer6 0.155095
## cell_count
## <integer>
## AAACAAGTATCTCCCA-1 6
## AAACAATCTACTAGCA-1 16
## AAACACCAATAACTGC-1 5
## AAACAGAGCGACTCCT-1 2
## AAACAGCTTTCAGAAG-1 4
## AAACAGGGTCTATATT-1 6
Quality control (QC) procedures at the spot and gene level aim to remove both low-quality spots, and lowly abundant genes. For QC, we adhere to the instructions from “Orchestrating Spatially Resolved Transcriptomics Analysis with Bioconductor” (OSTA). The library size, UMI counts, ratio of mitochondrial chromosome (chM) expression, and number of cells per spot are used to identify low-quality spots.
# Sample 1:
# Calculate per-spot QC metrics and store in colData
spe1 <- scuttle::addPerCellQC(spe1,)
# Remove combined set of low-quality spots
spe1 <- spe1[, !(colData(spe1)$sum < 10 | # library size
colData(spe1)$detected < 10 | # number of expressed genes
colData(spe1)$expr_chrM_ratio > 0.30| # mitochondrial expression ratio
colData(spe1)$cell_count > 10)] # number of cells per spot
# Sample 2:
# Calculate per-spot QC metrics and store in colData
spe2 <- scuttle::addPerCellQC(spe2,)
# Remove combined set of low-quality spots
spe2 <- spe2[, !(colData(spe2)$sum < 20 |
colData(spe2)$detected < 15 |
colData(spe2)$expr_chrM_ratio > 0.35|
colData(spe2)$cell_count > 8)]
# Sample 3:
spe3 <- scuttle::addPerCellQC(spe3,)
# Remove combined set of low-quality spots
spe3 <- spe3[, !(colData(spe3)$sum < 25 |
colData(spe3)$detected < 25 |
colData(spe3)$expr_chrM_ratio > 0.3|
colData(spe3)$cell_count > 15)]
Then, we discard lowly abundant genes, which were detected in less than 20 spots.
# For each sample i:
for(i in seq_len(3)){
spe_i <- eval(parse(text = paste0("spe", i)))
# Select QC threshold for lowly expressed genes: at least 20 non-zero spots:
qc_low_gene <- rowSums(assays(spe_i)$counts > 0) >= 20
# Remove lowly abundant genes
spe_i <- spe_i[qc_low_gene,]
assign(paste0("spe", i), spe_i)
message("Dimension of spe", i, ": ", dim(spe_i)[1], ", ", dim(spe_i)[2])
}
## Dimension of spe1: 847, 4174
## Dimension of spe2: 868, 3635
## Dimension of spe3: 908, 3601
We fit our approach to discover SVGs in an individual sample. In Section 4 Multiple samples, we will show how to jointly embed multiple replicates.
This framework relies on spatial clusters being accessible and successfully summarizing the primary spatial characteristics of the data. In most datasets, these spatial features are either accessible or can be easily generated with spatial clustering algorithms.
If manual annotations are provided (e.g., annotated by a
pathologist), we can directly use those. With the spe
or
spe
object that contains coordinates of the spot-level
data, we can visualize spatial clusters.
# View LIBD layers for one sample
CD <- as.data.frame(colData(spe3))
ggplot(CD,
aes(x=array_col,y=array_row,
color=factor(layer_guess_reordered))) +
geom_point() +
theme_void() + scale_y_reverse() +
theme(legend.position="bottom") +
labs(color = "", title = paste0("Manually annotated spatial clusters"))
If manual annotations are not available, we can use spatially resolved clustering tools. These methods, by jointly employing spatial coordinates and gene expression data, enable obtaining spatial clusters. Although, in this vignette we use pre-computed manually annotated clusters, below we provide links to two popular spatially resolved clustering tools: BayesSpace (Zhao et al. 2021) and StLearn (Pham et al. 2020).
BayesSpace is a Bioconductor package that provides a Bayesian statistical approach for spatial transcriptomics data clustering (BayesSpace). There is a specific vignette for using BayesSpace on this dataset (human DLPFC) here.
After using BayesSpace, the spatial cluster assignments
(spatial.cluster
) are available in the
colData(spe)
.
StLearn, a python-based package, is designed for spatial transciptomics data. It allows spatially-resolved clustering based on Louvain or k-means (stLearn). There is a tutorial for using StLearn on this dataset (human DLPFC) here.
After running stLearn, we can store results as a csv
file.
{python.reticulate = FALSE stLearn[py multi-sample], eval = FALSE} # Save spatial results obsm.to_csv("stLearn_clusters.csv")
Then, we can load these results in R and store spatial clusters in
the spe
object.
Once we have spatial clusters, we can search for SVGs.
Fit the model via DESpace_test
function. Parameter
spe
specifies the input SpatialExperiment
or
SingleCellExperiment
object, while
spatial_cluster
defines the column names of
colData(spe)
containing spatial clusters. To obtain all
statistics, set verbose
to TRUE
(default
value).
set.seed(123)
results <- DESpace_test(spe = spe3,
spatial_cluster = spatial_cluster,
verbose = TRUE)
## using 'DESpace_test' for spatial gene/pattern detection.
## Filter low quality genes:
## min_counts = 20; min_non_zero_spots = 10.
## The number of genes that pass filtering is 908.
## single sample test
A list of results is returned. The main results of interest are
stored in the gene_results
: a data.fame
, where
columns contain gene names (gene_id
), likelihood ratio test
statistics (LR
), average (across spots) log-2 counts per
million (logCPM
), raw p-values (PValue
) and
Benjamini-Hochberg adjusted p-values (FDR
).
## gene_id LR logCPM PValue FDR
## SNCG SNCG 1302.458 14.36222 3.181142e-278 2.888477e-275
## ATP1A3 ATP1A3 1235.646 14.68926 9.224867e-264 4.188090e-261
## PLEKHH1 PLEKHH1 1045.607 13.69914 1.220571e-222 3.694262e-220
The second element of the results (a DGEList
object
estimated_y
) contains the estimated common dispersion,
which can later be used to speed-up calculation when testing individual
clusters.
The third and forth element of the results (DGEGLM
and
DGELRT
objects) contain full statistics from
edgeR::glmFit
and edgeR::glmLRT
.
## [1] "DGEList"
## attr(,"package")
## [1] "edgeR"
## [1] "DGELRT"
## attr(,"package")
## [1] "edgeR"
## [1] "DGEGLM"
## attr(,"package")
## [1] "edgeR"
Visualize the gene expression of the three most significant genes in
FeaturePlot()
. Note that the gene names in vector
feature
, should also appear in the count matrix’s row
names. Specifying the column names of spatial coordinates of spots is
only necessary when they are not named row
and
col
.
## [1] "SNCG" "ATP1A3" "PLEKHH1"
Additionally, function FeaturePlot()
can draw an outline
around each cluster.
FeaturePlot(spe3, feature,
coordinates = coordinates,
Annotated_cluster = TRUE,
spatial_cluster = spatial_cluster,
cluster = 'all',
legend_cluster = TRUE, title = TRUE)
## Warning in get_plot_component(plot, "guide-box"): Multiple components found;
## returning the first one. To return all, use `return_all = TRUE`.
## Warning: Removed 17 rows containing missing values or values outside the scale range
## (`geom_point()`).
DESpace can also be used to reveal the specific areas of the
tissue affected by spatial variability; i.e., spatial clusters that are
particularly over/under abundant compared to the average. Function
individual_test()
can be used to identify SVGs for each
individual cluster. Parameter spatial_cluster
indicates the
column names of colData(spe)
containing spatial
clusters.
For every spatial cluster we test, edgeR
would normally
re-compute the dispersion estimates based on the specific design of the
test. However, this calculation represents the majority of the overall
computing time. Therefore, to speed-up calculations, we propose to use
the dispersion estimates which were previously computed for the
gene-level tests. Albeit this is an approximation, in our benchmarks, it
leads to comparable performance in terms of sensitivity and specificity.
If you want to use pre-computed gene-level dispersion estimates, set
edgeR_y
to estimated_y
. Alternatively, if you
want to re-compute dispersion estimates (significantly slower, but
marginally more accurate option), leave edgeR_y
empty.
set.seed(123)
cluster_results <- individual_test(spe3,
edgeR_y = results$estimated_y,
spatial_cluster = spatial_cluster)
## Filter low quality genes:
## min_counts = 20; min_non_zero_spots = 10.
## The number of genes that pass filtering is908.
## Pre-processing
## Start modeling
## Returning results
individual_test()
returns a list containing the results
of the individual cluster tests. Similarly to above, for each cluster,
results are reported as a data.fame
, where columns contain
gene names (gene_id
), likelihood ratio test statistics
(LR
), log2-fold changes (logFC
), raw p-values
(PValue
) and Benjamini-Hochberg adjusted p-values
(FDR
). NB: that the logFC
compares each
cluster to the rest of the tissue; e.g., a logFC of 2 for WM test
indicates that the average gene expression in WM is (4 times) higher
than the average gene expression in non-WM tissue.
Visualize results for WM.
## [1] "list"
## [1] "Layer1" "Layer2" "Layer3" "Layer4" "Layer5" "Layer6" "WM"
top_results
function can be used to combine gene-and
cluster-level results. By default, results from
top_results()
report both adjusted p-values and log2-FC;
however, users can also choose to only report either, by specifying
select = "FDR"
or select = "logFC"
. Below,
gene_PValue
and gene_FDR
columns refer to the
gene-level testing, while the subsequent columns indicate the
cluster-specific results.
## gene_id gene_LR gene_logCPM gene_Pvalue gene_FDR Layer1_FDR
## 1 SNCG 1302.458 14.36222 3.181142e-278 2.888477e-275 4.647725e-13
## 2 ATP1A3 1235.646 14.68926 9.224867e-264 4.188090e-261 3.938179e-15
## 3 PLEKHH1 1045.607 13.69914 1.220571e-222 3.694262e-220 6.927321e-15
## Layer2_FDR Layer3_FDR Layer4_FDR Layer5_FDR Layer6_FDR
## 1 2.811936e-06 8.360504e-80 7.834189e-13 2.906336e-26 1.927089e-65
## 2 2.025663e-03 5.124609e-60 7.015626e-12 1.445723e-15 8.018028e-25
## 3 1.211797e-06 6.432818e-44 1.052098e-06 4.485554e-17 3.194246e-01
## WM_FDR Layer1_logFC Layer2_logFC Layer3_logFC Layer4_logFC
## 1 3.682232e-133 0.7714133 0.4604152 -0.8320186 -0.6132229
## 2 2.231319e-182 0.6342124 -0.2225299 -0.5509693 -0.4553823
## 3 7.557305e-209 1.4032918 0.7711747 1.1047706 0.8704022
## Layer5_logFC Layer6_logFC WM_logFC
## 1 -0.5492418 1.0925655 -1.966807
## 2 -0.3252814 0.4761009 -1.728821
## 3 0.8053901 0.1317995 2.320427
## gene_id gene_LR gene_logCPM gene_Pvalue gene_FDR Layer1_FDR
## 1 SNCG 1302.458 14.36222 3.181142e-278 2.888477e-275 4.647725e-13
## 2 ATP1A3 1235.646 14.68926 9.224867e-264 4.188090e-261 3.938179e-15
## 3 PLEKHH1 1045.607 13.69914 1.220571e-222 3.694262e-220 6.927321e-15
## Layer2_FDR Layer3_FDR Layer4_FDR Layer5_FDR Layer6_FDR
## 1 2.811936e-06 8.360504e-80 7.834189e-13 2.906336e-26 1.927089e-65
## 2 2.025663e-03 5.124609e-60 7.015626e-12 1.445723e-15 8.018028e-25
## 3 1.211797e-06 6.432818e-44 1.052098e-06 4.485554e-17 3.194246e-01
## WM_FDR
## 1 3.682232e-133
## 2 2.231319e-182
## 3 7.557305e-209
We can further specify a cluster and check top genes detected by DESpace.
# Check top genes for WM
results_WM <- top_results(cluster_results = cluster_results,
cluster = "WM")
head(results_WM, 3)
## Cluster gene_id Cluster_LR Cluster_logCPM Cluster_logFC Cluster_PValue
## PLEKHH1 WM PLEKHH1 964.7325 13.69914 2.320427 8.323023e-212
## ATP1A3 WM ATP1A3 841.5827 14.68926 -1.728821 4.914799e-185
## SLAIN1 WM SLAIN1 723.0734 13.78573 1.870821 2.873138e-159
## Cluster_FDR
## PLEKHH1 7.557305e-209
## ATP1A3 2.231319e-182
## SLAIN1 8.696031e-157
With high_low
parameter, we can further filter genes to
visualize those with higher (high_low = "high"
) or lower
(high_low = "low"
) average abundance in the specified
cluster, compared to the average abundance in the rest of the tissue. By
default, high_low = “both”
and all results are
provided.
results_WM_both <- top_results(cluster_results = cluster_results,
cluster = "WM",
high_low = "both")
Here we present the highly abundant cluster SVGs; i.e., SVGs with higher expression in WM compared to the rest of the area.
## Cluster gene_id Cluster_LR Cluster_logCPM Cluster_logFC Cluster_PValue
## PLEKHH1 WM PLEKHH1 964.7325 13.69914 2.320427 8.323023e-212
## SLAIN1 WM SLAIN1 723.0734 13.78573 1.870821 2.873138e-159
## CMTM5 WM CMTM5 659.1198 13.65298 2.098647 2.321441e-145
## Cluster_FDR
## PLEKHH1 7.557305e-209
## SLAIN1 8.696031e-157
## CMTM5 4.215737e-143
We visualize the lowly abundant cluster SVGs; i.e., SVGs with lower expression in WM compared to the rest of the area.
## Cluster gene_id Cluster_LR Cluster_logCPM Cluster_logFC Cluster_PValue
## ATP1A3 WM ATP1A3 841.5827 14.68926 -1.728821 4.914799e-185
## PRKAR1B WM PRKAR1B 669.1243 14.72792 -1.505871 1.548960e-147
## NSF WM NSF 628.8597 14.50624 -1.744898 8.847779e-139
## Cluster_FDR
## ATP1A3 2.231319e-182
## PRKAR1B 3.516140e-145
## NSF 1.338964e-136
Visualize the gene expression of the top genes for layer WM. A
cluster outline can be drawn by specifying the column names of clusters
stored in colData(spe)
and the vector of cluster names via
spatial_cluster
and cluster
.
# SVGs with higher than average abundance in WM
feature <- rownames(results_WM_both$high_genes)[seq_len(3)]
FeaturePlot(spe3, feature, spatial_cluster = spatial_cluster,
coordinates = coordinates, cluster = 'WM',
legend_cluster = TRUE, Annotated_cluster = TRUE,
linewidth = 0.6, title = TRUE)
## Warning in get_plot_component(plot, "guide-box"): Multiple components found;
## returning the first one. To return all, use `return_all = TRUE`.
# SVGs with lower than average abundance in WM
feature <- rownames(results_WM_both$low_genes)[seq_len(3)]
FeaturePlot(spe3, feature, spatial_cluster = spatial_cluster,
coordinates = coordinates, cluster = 'WM',
legend_cluster = TRUE, Annotated_cluster = TRUE,
linewidth = 0.6,title = TRUE)
## Warning in get_plot_component(plot, "guide-box"): Multiple components found;
## returning the first one. To return all, use `return_all = TRUE`.
If biological replicates are available, our framework allows jointly modeling them to target SVGs with coherent spatial patterns across samples. This approach may be particularly beneficial when data are characterized by a large degree of (biological and technical) variability, such as in cancer. Importantly, only genes detected (above filtering thresholds) in all samples will be analyzed.
Similar to gene-level testing, the multi-sample extension requires pre-annotated spatial clusters.
If the manual annotation for each sample is available, we can combine
all samples and use manual annotations directly. Note that cluster
labels must be consistent across samples (i.e., WM in sample 1 should
represent the same tissue as WM in sample 2). With the
spe.combined
object that contains coordinates of the
spot-level data, we can visualize spatial clusters.
set.seed(123)
# Use common genes
a <- rownames(counts(spe1));
b <- rownames(counts(spe2));
c <- rownames(counts(spe3))
# find vector of common genes across all samples:
CommonGene <- Reduce(intersect, list(a,b,c))
spe1 <- spe1[CommonGene,]
spe2 <- spe2[CommonGene,]
spe3 <- spe3[CommonGene,]
# Combine three samples
spe.combined <- cbind(spe1, spe2, spe3, deparse.level = 1)
ggplot(as.data.frame(colData(spe.combined)),
aes(x=array_col, y=array_row,
color=factor(layer_guess_reordered))) +
geom_point() +
facet_wrap(~sample_id) +
theme_void() + scale_y_reverse() +
theme(legend.position="bottom") +
labs(color = "", title = "Manually annotated spatial clusters")
Similarly to above, if manual annotations are not available, we can use spatially resolved clustering tools. Both BayesSpace (Zhao et al. 2021) and StLearn (Pham et al. 2020) allow jointly clustering multiple samples. In particular each tool has a specific vignettes for multi-testing clustering: BayesSpace vignettes, and stLearn vignettes.
In our benchmarks, we have noticed that, with both BayesSpace and StLearn, joint spatial clustering of multiple samples is more prone to failure and inaccurate results than spatial clustering of individual samples. Therefore, if multi-sample clustering fails, we suggest trying to cluster individual samples (as in Section 3 Individual sample) and manually match cluster ids across samples, to ensure that “cluster 1” always refers to the same spatial region in all samples.
Once we have spatial clusters for multiple samples, we add them to
colData(spe.combined)
as the column
layer_guess_reordered
and fit the model with spatial
clusters as covariates.
Fit the model via DESpace_test()
. Parameter
spe
specifies the input SpatialExperiment
or
SingleCellExperiment
object, while
spatial_cluster
and sample_col
define the
column names of colData(spe)
containing spatial clusters
and sample ids. With replicates = TRUE
, we fit the
multi-sample model.
The second element of the result (a DGEList
object
estimated_y_multi
) contains the estimated common dispersion
for the multi-sample case.
set.seed(123)
multi_results <- DESpace_test(spe = spe.combined,
spatial_cluster = spatial_cluster,
sample_col = 'sample_id',
replicates = TRUE)
## using 'DESpace_test' for spatial gene/pattern detection.
## Filter low quality genes:
## min_counts = 20; min_non_zero_spots = 10.
## The number of genes that pass filtering is 828.
## multi-sample test
## Repeated column names found in count matrix
A list of results are returned. The main results of interest are
stored in the gene_results
.
## gene_id LR logCPM PValue FDR
## HPCAL1 HPCAL1 2063.906 14.41880 0 0
## NSF NSF 1524.984 14.70442 0 0
## ATP1A3 ATP1A3 1578.194 14.88552 0 0
The second element of the results (a DGEList
object
estimated_y
) contains the estimated common dispersion,
which can later be used to speed-up calculation when testing individual
clusters.
## [1] "DGEList"
## attr(,"package")
## [1] "edgeR"
For each sample, we can visualize the gene expression of the most
significant SVGs. Note that column names of spatial coordinates of spots
should be row
and col
.
## Top three spatially variable genes
feature <- multi_results$gene_results$gene_id[seq_len(3)]; feature
## [1] "HPCAL1" "NSF" "ATP1A3"
## [1] "151507" "151669" "151673"
## Use purrr::map to combine multiple figures
spot_plots <- purrr::map(seq_along(samples), function(j) {
## Subset spe for each sample j
spe_j <- spe.combined[, colData(spe.combined)$sample_id == samples[j] ]
## Store three gene expression plots with gene names in `feature` for spe_j
spot_plots <- FeaturePlot(spe_j, feature,
coordinates = coordinates,
spatial_cluster = spatial_cluster, title = TRUE,
Annotated_cluster = TRUE, legend_cluster = TRUE)
return(spot_plots)
})
patchwork::wrap_plots(spot_plots, ncol=1)
Similarly to what shown in Section 3 Individual sample, our
framework can discover the key SV spatial clusters also when jointly
fitting multiple samples. For a multi-sample testing, set
replicates = TRUE
in individual_test()
.
set.seed(123)
cluster_results <- individual_test(spe.combined,
edgeR_y = multi_results$estimated_y,
replicates = TRUE,
spatial_cluster = spatial_cluster)
## Filter low quality genes:
## min_counts = 20; min_non_zero_spots = 10.
## The number of genes that pass filtering is828.
## Pre-processing
## Start modeling
## Returning results
individual_test()
returns a list containing the results
of individual clusters, specified in cluster
parameter. In
this case, logFC refers to the log2-FC between the average abundance,
across all samples, of a spatial cluster and the average abundance of
all remaining clusters (e.g., WM vs. non-WM tissue).
Visualize results for WM.
## [1] "list"
## [1] "Layer1" "Layer2" "Layer3" "Layer4" "Layer5" "Layer6" "WM"
As above, top_results
function can be used to combine
gene-level and cluster-level results.
merge_res <- top_results(multi_results$gene_results, cluster_results,
select = "FDR")
head(merge_res,3)
## gene_id gene_LR gene_logCPM gene_Pvalue gene_FDR Layer1_FDR Layer2_FDR
## 1 ATP1A3 1578.194 14.88552 0 0 1.970166e-93 2.067235e-02
## 2 HPCAL1 2063.906 14.41880 0 0 4.522810e-15 1.890276e-191
## 3 NSF 1524.984 14.70442 0 0 5.165955e-127 4.084828e-01
## Layer3_FDR Layer4_FDR Layer5_FDR Layer6_FDR WM_FDR
## 1 3.715619e-16 1.925546e-23 1.694696e-46 4.982746e-03 8.899716e-194
## 2 4.927089e-126 7.522529e-51 1.038996e-117 3.528734e-15 8.907694e-39
## 3 4.331610e-01 1.906776e-24 3.585842e-60 9.999424e-02 3.143784e-147
We can further select a cluster of interest, and check the top genes detected in that cluster.
# Check top genes for WM
results_WM <- top_results(cluster_results = cluster_results,
cluster = "WM")
# For each gene, adjusted p-values for each cluster
head(results_WM,3)
## Cluster gene_id Cluster_LR Cluster_logCPM Cluster_logFC Cluster_PValue
## PLEKHH1 WM PLEKHH1 933.0597 14.00538 1.725143 6.385387e-205
## ATP1A3 WM ATP1A3 893.8334 14.88552 -1.285877 2.149690e-196
## SNCG WM SNCG 770.7204 14.60578 -1.597352 1.253467e-169
## Cluster_FDR
## PLEKHH1 5.287100e-202
## ATP1A3 8.899716e-194
## SNCG 3.459569e-167
With high_low = "both"
, we can further filter genes to
visualize highly and lowly abundant SVGs.
results_WM_both <- top_results(cluster_results = cluster_results,
cluster = "WM", high_low = "both")
Here we present the highly abundant cluster SVGs; i.e., SVGs with higher expression in WM compared to the rest of the tissue.
## Cluster gene_id Cluster_LR Cluster_logCPM Cluster_logFC Cluster_PValue
## PLEKHH1 WM PLEKHH1 933.0597 14.00538 1.725143 6.385387e-205
## SLAIN1 WM SLAIN1 660.5151 14.07035 1.407386 1.154292e-145
## CMTM5 WM CMTM5 617.2949 13.99113 1.479957 2.898118e-136
## Cluster_FDR
## PLEKHH1 5.287100e-202
## SLAIN1 1.592923e-143
## CMTM5 3.428059e-134
We visualize the lowly abundant cluster SVGs; i.e., SVGs with lower expression in WM compared to the rest of the tissue.
## Cluster gene_id Cluster_LR Cluster_logCPM Cluster_logFC Cluster_PValue
## ATP1A3 WM ATP1A3 893.8334 14.88552 -1.285877 2.149690e-196
## SNCG WM SNCG 770.7204 14.60578 -1.597352 1.253467e-169
## PRKAR1B WM PRKAR1B 703.9422 14.89235 -1.135665 4.153923e-155
## Cluster_FDR
## ATP1A3 8.899716e-194
## SNCG 3.459569e-167
## PRKAR1B 8.598622e-153
Visualize the gene expression of top three genes for layer WM.
# SVGs with higher abundance in WM, than in non-WM tissue
feature_high <- rownames(results_WM_both$high_genes)[seq_len(3)]
# SVGs with lower abundance in WM, than in non-WM tissue
feature_low <- rownames(results_WM_both$low_genes)[seq_len(3)]
plot_list_high <- list(); plot_list_low <- list()
## Sample names
samples <- unique(colData(spe.combined)$sample_id)
for(j in seq_along(samples)){
## Subset spe for each sample j
spe_j <- spe.combined[, colData(spe.combined)$sample_id == samples[j]]
## Gene expression plots with top highly abundant cluster SVGs for spe_j
plot_list_high[[j]] <- FeaturePlot(spe_j, feature_high,
coordinates = coordinates,
spatial_cluster = spatial_cluster,
linewidth = 0.6,
cluster = 'WM', Annotated_cluster = TRUE,
legend_cluster = TRUE, title = TRUE)
## Gene expression plots with top lowly abundant cluster SVGs for spe_j
plot_list_low[[j]] <- FeaturePlot(spe_j, feature_low,
coordinates = coordinates,
spatial_cluster = spatial_cluster,
linewidth = 0.6,
cluster = 'WM', Annotated_cluster = TRUE,
legend_cluster = TRUE, title = TRUE)
}
## Warning in get_plot_component(plot, "guide-box"): Multiple components found;
## returning the first one. To return all, use `return_all = TRUE`.
## Warning in get_plot_component(plot, "guide-box"): Multiple components found;
## returning the first one. To return all, use `return_all = TRUE`.
## Warning in get_plot_component(plot, "guide-box"): Multiple components found;
## returning the first one. To return all, use `return_all = TRUE`.
## Warning in get_plot_component(plot, "guide-box"): Multiple components found;
## returning the first one. To return all, use `return_all = TRUE`.
## Warning in get_plot_component(plot, "guide-box"): Multiple components found;
## returning the first one. To return all, use `return_all = TRUE`.
## Warning in get_plot_component(plot, "guide-box"): Multiple components found;
## returning the first one. To return all, use `return_all = TRUE`.
If sample-specific covariates, such as batch effects, are available, we can account for them in DESpace. The adjustment works as in the edgeR original framework: the mean of the negative binomial model is expressed as a function of spatial clusters, and additional nuisance covariates; differential testing is then performed on spatial clusters only, to identify SVGs. Note that sample-specific covariates can be used instead of samples, but a joint modelling of both samples and (sample-specific) covariates is not possible because the two variables are nested.
To show an example application, we artificially separate samples in 2 batches:
spe.combined$batch_id = ifelse(spe.combined$sample_id == "151507", "batch_1", "batch_2")
table(spe.combined$batch_id, spe.combined$sample_id)
##
## 151507 151669 151673
## batch_1 4174 0 0
## batch_2 0 3635 3601
Analyses are performed, as explained above, in Section 5; yet, when
running DESpace_test
, we set the sample_col
to
batch_id
:
## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] SpatialExperiment_1.15.1 SingleCellExperiment_1.27.2
## [3] SummarizedExperiment_1.35.5 Biobase_2.67.0
## [5] GenomicRanges_1.57.2 GenomeInfoDb_1.41.2
## [7] IRanges_2.39.2 S4Vectors_0.43.2
## [9] BiocGenerics_0.53.0 MatrixGenerics_1.17.1
## [11] matrixStats_1.4.1 ggplot2_3.5.1
## [13] DESpace_1.7.0 BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] later_1.3.2 BiocIO_1.17.0 bitops_1.0-9
## [4] filelock_1.0.3 fields_16.3 tibble_3.2.1
## [7] polyclip_1.10-7 XML_3.99-0.17 lifecycle_1.0.4
## [10] rstatix_0.7.2 edgeR_4.3.21 doParallel_1.0.17
## [13] lattice_0.22-6 MASS_7.3-61 backports_1.5.0
## [16] magrittr_2.0.3 limma_3.61.12 plotly_4.10.4
## [19] sass_0.4.9 rmarkdown_2.28 jquerylib_0.1.4
## [22] yaml_2.3.10 httpuv_1.6.15 spam_2.11-0
## [25] sessioninfo_1.2.2 cowplot_1.1.3 DBI_1.2.3
## [28] buildtools_1.0.0 RColorBrewer_1.1-3 golem_0.5.1
## [31] maps_3.4.2 abind_1.4-8 zlibbioc_1.51.2
## [34] purrr_1.0.2 RCurl_1.98-1.16 tweenr_2.0.3
## [37] rappdirs_0.3.3 GenomeInfoDbData_1.2.13 ggrepel_0.9.6
## [40] irlba_2.3.5.1 maketools_1.3.1 codetools_0.2-20
## [43] DelayedArray_0.31.14 DT_0.33 scuttle_1.15.5
## [46] ggforce_0.4.2 tidyselect_1.2.1 UCSC.utils_1.1.0
## [49] farver_2.1.2 viridis_0.6.5 ScaledMatrix_1.13.0
## [52] shinyWidgets_0.8.7 BiocFileCache_2.15.0 GenomicAlignments_1.41.0
## [55] jsonlite_1.8.9 BiocNeighbors_2.1.0 Formula_1.2-5
## [58] scater_1.33.4 iterators_1.0.14 foreach_1.5.2
## [61] tools_4.4.1 ggnewscale_0.5.0 Rcpp_1.0.13
## [64] glue_1.8.0 gridExtra_2.3 SparseArray_1.5.45
## [67] xfun_0.48 dplyr_1.1.4 withr_3.0.2
## [70] BiocManager_1.30.25 fastmap_1.2.0 fansi_1.0.6
## [73] rsvd_1.0.5 digest_0.6.37 R6_2.5.1
## [76] mime_0.12 colorspace_2.1-1 RSQLite_2.3.7
## [79] config_0.3.2 utf8_1.2.4 tidyr_1.3.1
## [82] generics_0.1.3 data.table_1.16.2 rtracklayer_1.65.0
## [85] httr_1.4.7 htmlwidgets_1.6.4 S4Arrays_1.5.11
## [88] pkgconfig_2.0.3 gtable_0.3.6 blob_1.2.4
## [91] XVector_0.45.0 sys_3.4.3 htmltools_0.5.8.1
## [94] carData_3.0-5 dotCall64_1.2 scales_1.3.0
## [97] png_0.1-8 attempt_0.3.1 knitr_1.48
## [100] rjson_0.2.23 curl_5.2.3 cachem_1.1.0
## [103] stringr_1.5.1 BiocVersion_3.21.1 concaveman_1.1.0
## [106] vipor_0.4.7 parallel_4.4.1 AnnotationDbi_1.69.0
## [109] restfulr_0.0.15 pillar_1.9.0 grid_4.4.1
## [112] vctrs_0.6.5 promises_1.3.0 ggpubr_0.6.0
## [115] BiocSingular_1.23.0 car_3.1-3 dbplyr_2.5.0
## [118] beachmat_2.23.0 xtable_1.8-4 beeswarm_0.4.0
## [121] paletteer_1.6.0 evaluate_1.0.1 magick_2.8.5
## [124] cli_3.6.3 locfit_1.5-9.10 compiler_4.4.1
## [127] Rsamtools_2.21.2 rlang_1.1.4 crayon_1.5.3
## [130] ggsignif_0.6.4 labeling_0.4.3 rematch2_2.1.2
## [133] ggbeeswarm_0.7.2 stringi_1.8.4 viridisLite_0.4.2
## [136] BiocParallel_1.39.0 assertthat_0.2.1 munsell_0.5.1
## [139] Biostrings_2.75.0 lazyeval_0.2.2 V8_6.0.0
## [142] Matrix_1.7-1 ExperimentHub_2.13.1 benchmarkme_1.0.8
## [145] patchwork_1.3.0 bit64_4.5.2 KEGGREST_1.45.1
## [148] statmod_1.5.0 shiny_1.9.1 highr_0.11
## [151] AnnotationHub_3.15.0 broom_1.0.7 memoise_2.0.1
## [154] bslib_0.8.0 benchmarkmeData_1.0.4 bit_4.5.0
## [157] spatialLIBD_1.17.8