nnSVG
is a method for scalable identification of
spatially variable genes (SVGs) in spatially-resolved transcriptomics
data.
The nnSVG
method is based on nearest-neighbor Gaussian
processes (Datta
et al., 2016, Finley
et al., 2019) and uses the BRISC algorithm (Saha
and Datta, 2018) for model fitting and parameter estimation.
nnSVG
allows identification and ranking of SVGs with
flexible length scales across a tissue slide or within spatial domains
defined by covariates. The method scales linearly with the number of
spatial locations and can be applied to datasets containing thousands or
more spatial locations.
nnSVG
is implemented as an R package within the
Bioconductor framework, and is available from Bioconductor.
More details describing the method are available in our paper, available from Nature Communications.
The following code will install the latest release version of the
nnSVG
package from Bioconductor. Additional details are
shown on the Bioconductor
page.
The latest development version can also be installed from the
devel
version of Bioconductor or from GitHub.
In the examples below, we assume the input data are provided as a SpatialExperiment
Bioconductor object. In this case, the outputs are stored in the
rowData
of the SpatialExperiment
object.
Alternatively, the inputs can also be provided as a numeric matrix of normalized and transformed counts (e.g. log-transformed normalized counts, also known as logcounts) and a numeric matrix of spatial coordinates.
Here we show a short example demonstrating how to run
nnSVG
.
For faster runtime in this example, we subsample the dataset and run
nnSVG
on only a small number of genes. For a full analysis,
the subsampling step can be skipped.
# load example dataset from STexampleData package
# see '?Visium_humanDLPFC' for more details
spe <- Visium_humanDLPFC()
## see ?STexampleData and browseVignettes('STexampleData') for documentation
## downloading 1 resources
## retrieving 1 resource
## loading from cache
## [1] 33538 4992
# preprocessing steps
# keep only spots over tissue
spe <- spe[, colData(spe)$in_tissue == 1]
dim(spe)
## [1] 33538 3639
# skip spot-level quality control, since this has been performed previously
# on this dataset
# filter low-expressed and mitochondrial genes
# using default filtering parameters
spe <- filter_genes(spe)
## Gene filtering: removing mitochondrial genes
## removed 13 mitochondrial genes
## Gene filtering: retaining genes with at least 3 counts in at least 0.5% (n = 19) of spatial locations
## removed 30216 out of 33525 genes due to low expression
# calculate logcounts (log-transformed normalized counts) using scran package
# using library size factors
spe <- computeLibraryFactors(spe)
spe <- logNormCounts(spe)
assayNames(spe)
## [1] "counts" "logcounts"
# select small set of random genes and several known SVGs for
# faster runtime in this example
set.seed(123)
ix_random <- sample(seq_len(nrow(spe)), 10)
known_genes <- c("MOBP", "PCP4", "SNAP25", "HBB", "IGKC", "NPY")
ix_known <- which(rowData(spe)$gene_name %in% known_genes)
ix <- c(ix_known, ix_random)
spe <- spe[ix, ]
dim(spe)
## [1] 16 3639
# run nnSVG
# set seed for reproducibility
set.seed(123)
# using a single thread in this example
spe <- nnSVG(spe)
## Warning in nnSVG(spe): Rows (genes) and/or columns (spots) containing all zero
## counts have been found. Please see examples in tutorial for code to filter out
## zeros and/or low-expressed genes to avoid errors.
## DataFrame with 16 rows and 17 columns
## gene_id gene_name feature_type sigma.sq
## <character> <character> <character> <numeric>
## ENSG00000211592 ENSG00000211592 IGKC Gene Expression 0.565608
## ENSG00000168314 ENSG00000168314 MOBP Gene Expression 1.387394
## ENSG00000122585 ENSG00000122585 NPY Gene Expression 0.285674
## ENSG00000244734 ENSG00000244734 HBB Gene Expression 0.329421
## ENSG00000132639 ENSG00000132639 SNAP25 Gene Expression 0.430040
## ... ... ... ... ...
## ENSG00000130382 ENSG00000130382 MLLT1 Gene Expression 0.00978541
## ENSG00000036672 ENSG00000036672 USP2 Gene Expression 0.00307277
## ENSG00000086232 ENSG00000086232 EIF2AK1 Gene Expression 0.00315783
## ENSG00000106278 ENSG00000106278 PTPRZ1 Gene Expression 0.00279862
## ENSG00000133606 ENSG00000133606 MKRN1 Gene Expression 0.00632241
## tau.sq phi loglik runtime mean var
## <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
## ENSG00000211592 0.455044 20.10862 -4531.64 1.383 0.622937 1.007454
## ENSG00000168314 0.364188 1.10202 -3663.60 0.432 0.805525 1.205673
## ENSG00000122585 0.280173 71.65329 -3995.23 0.642 0.393975 0.567383
## ENSG00000244734 0.353754 27.81410 -4044.96 1.353 0.411262 0.697673
## ENSG00000132639 0.430106 3.03385 -3912.70 0.340 3.451926 0.857922
## ... ... ... ... ... ... ...
## ENSG00000130382 0.283115 50.987930 -2927.61 0.502 0.298698 0.292976
## ENSG00000036672 0.241105 12.538273 -2597.00 0.341 0.248384 0.244218
## ENSG00000086232 0.266973 25.929892 -2781.47 0.342 0.275193 0.270208
## ENSG00000106278 0.367893 9.528574 -3357.32 0.402 0.352159 0.370784
## ENSG00000133606 0.272432 0.082674 -2831.51 0.467 0.295404 0.278806
## spcov prop_sv loglik_lm LR_stat rank pval
## <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
## ENSG00000211592 1.207296 0.554163 -5176.53 1289.775 3 0
## ENSG00000168314 1.462248 0.792080 -5503.33 3679.464 1 0
## ENSG00000122585 1.356646 0.504861 -4131.87 273.278 6 0
## ENSG00000244734 1.395587 0.482191 -4507.99 926.046 4 0
## ENSG00000132639 0.189973 0.499961 -4884.19 1942.986 2 0
## ... ... ... ... ... ... ...
## ENSG00000130382 0.331175 0.03340866 -2929.28 3.35216 12 0.187105881
## ENSG00000036672 0.223173 0.01258413 -2598.08 2.15483 13 0.340473786
## ENSG00000086232 0.204200 0.01169001 -2782.09 1.23716 14 0.538708767
## ENSG00000106278 0.150222 0.00754973 -3357.83 1.01110 15 0.603172909
## ENSG00000133606 0.269169 0.02268095 -2839.08 15.15227 9 0.000512539
## padj
## <numeric>
## ENSG00000211592 0
## ENSG00000168314 0
## ENSG00000122585 0
## ENSG00000244734 0
## ENSG00000132639 0
## ... ...
## ENSG00000130382 0.24947451
## ENSG00000036672 0.41904466
## ENSG00000086232 0.61566716
## ENSG00000106278 0.64338444
## ENSG00000133606 0.00091118
The results are stored in the rowData
of the
SpatialExperiment
object.
The main results of interest are:
LR_stat
: likelihood ratio (LR) statisticsrank
: rank of top SVGs according to LR statisticspval
: p-values from asymptotic chi-squared distribution
with 2 degrees of freedompadj
: p-values adjusted for multiple testing, which can
be used to define a cutoff for statistically significant SVGs
(e.g. padj
<= 0.05)prop_sv
: effect size, defined as proportion of spatial
variance out of total variance##
## FALSE TRUE
## 7 9
## DataFrame with 10 rows and 17 columns
## gene_id gene_name feature_type sigma.sq
## <character> <character> <character> <numeric>
## ENSG00000168314 ENSG00000168314 MOBP Gene Expression 1.38739392
## ENSG00000132639 ENSG00000132639 SNAP25 Gene Expression 0.43003959
## ENSG00000211592 ENSG00000211592 IGKC Gene Expression 0.56560774
## ENSG00000244734 ENSG00000244734 HBB Gene Expression 0.32942113
## ENSG00000183036 ENSG00000183036 PCP4 Gene Expression 0.23102218
## ENSG00000122585 ENSG00000122585 NPY Gene Expression 0.28567360
## ENSG00000129562 ENSG00000129562 DAD1 Gene Expression 0.02389606
## ENSG00000114923 ENSG00000114923 SLC4A3 Gene Expression 0.01147169
## ENSG00000133606 ENSG00000133606 MKRN1 Gene Expression 0.00632241
## ENSG00000143543 ENSG00000143543 JTB Gene Expression 0.07548247
## tau.sq phi loglik runtime mean var
## <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
## ENSG00000168314 0.364188 1.102018 -3663.60 0.432 0.805525 1.205673
## ENSG00000132639 0.430106 3.033847 -3912.70 0.340 3.451926 0.857922
## ENSG00000211592 0.455044 20.108619 -4531.64 1.383 0.622937 1.007454
## ENSG00000244734 0.353754 27.814098 -4044.96 1.353 0.411262 0.697673
## ENSG00000183036 0.452735 8.272277 -4026.22 0.309 0.687961 0.684598
## ENSG00000122585 0.280173 71.653291 -3995.23 0.642 0.393975 0.567383
## ENSG00000129562 0.464723 10.141882 -3842.24 0.433 0.549318 0.489167
## ENSG00000114923 0.237260 12.765632 -2617.36 0.506 0.250768 0.248816
## ENSG00000133606 0.272432 0.082674 -2831.51 0.467 0.295404 0.278806
## ENSG00000143543 0.463557 119.746675 -4036.28 0.573 0.654919 0.539172
## spcov prop_sv loglik_lm LR_stat rank pval
## <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
## ENSG00000168314 1.462248 0.7920804 -5503.33 3679.46397 1 0.00000e+00
## ENSG00000132639 0.189973 0.4999614 -4884.19 1942.98556 2 0.00000e+00
## ENSG00000211592 1.207296 0.5541633 -5176.53 1289.77508 3 0.00000e+00
## ENSG00000244734 1.395587 0.4821910 -4507.99 926.04573 4 0.00000e+00
## ENSG00000183036 0.698656 0.3378716 -4473.57 894.68884 5 0.00000e+00
## ENSG00000122585 1.356646 0.5048609 -4131.87 273.27818 6 0.00000e+00
## ENSG00000129562 0.281410 0.0489053 -3861.98 39.49098 7 2.65854e-09
## ENSG00000114923 0.427112 0.0461207 -2632.02 29.31376 8 4.31119e-07
## ENSG00000133606 0.269169 0.0226810 -2839.08 15.15227 9 5.12539e-04
## ENSG00000143543 0.419503 0.1400315 -4039.07 5.59669 10 6.09109e-02
## padj
## <numeric>
## ENSG00000168314 0.00000e+00
## ENSG00000132639 0.00000e+00
## ENSG00000211592 0.00000e+00
## ENSG00000244734 0.00000e+00
## ENSG00000183036 0.00000e+00
## ENSG00000122585 0.00000e+00
## ENSG00000129562 6.07667e-09
## ENSG00000114923 8.62238e-07
## ENSG00000133606 9.11180e-04
## ENSG00000143543 9.74574e-02
# plot spatial expression of top-ranked SVG
ix <- which(rowData(spe)$rank == 1)
ix_name <- rowData(spe)$gene_name[ix]
ix_name
## [1] "MOBP"
df <- as.data.frame(
cbind(spatialCoords(spe),
expr = counts(spe)[ix, ]))
ggplot(df, aes(x = pxl_col_in_fullres, y = pxl_row_in_fullres, color = expr)) +
geom_point(size = 0.8) +
coord_fixed() +
scale_y_reverse() +
scale_color_gradient(low = "gray90", high = "blue",
trans = "sqrt", breaks = range(df$expr),
name = "counts") +
ggtitle(ix_name) +
theme_bw() +
theme(plot.title = element_text(face = "italic"),
panel.grid = element_blank(),
axis.title = element_blank(),
axis.text = element_blank(),
axis.ticks = element_blank())
nnSVG
can also be run on datasets with known covariates,
such as spatial domains or cell types. In this case, the method will
identify SVGs within regions defined by the selected covariates,
e.g. within spatial domains.
Here we run a short example to demonstrate this functionality.
This example dataset contains several known SVGs within spatial domains. Specifically, we are interested in SVGs with gradients of expression within the CA3 region of the mouse hippocampus. (See dataset help files for more details and references.)
As above, for faster runtime in this example, we subsample a very small subset of the data. For a full analysis, the subsampling step can be skipped.
# load example dataset from STexampleData package
# see '?SlideSeqV2_mouseHPC' for more details
spe <- SlideSeqV2_mouseHPC()
## see ?STexampleData and browseVignettes('STexampleData') for documentation
## downloading 1 resources
## retrieving 1 resource
## loading from cache
## [1] 23264 53208
# preprocessing steps
# remove spots with NA cell type labels
spe <- spe[, !is.na(colData(spe)$celltype)]
dim(spe)
## [1] 23264 15003
##
## Astrocyte CA1 CA3
## 6688 1320 1729
## Cajal_Retzius Choroid Denate
## 60 21 1713
## Endothelial_Stalk Endothelial_Tip Entorihinal
## 96 118 111
## Ependymal Interneuron Microglia_Macrophages
## 94 730 299
## Mural Neurogenesis Neuron.Slc17a6
## 26 47 44
## Oligodendrocyte Polydendrocyte
## 1793 114
# filter low-expressed and mitochondrial genes
# using adjusted filtering parameters for this platform
spe <- filter_genes(
spe,
filter_genes_ncounts = 1,
filter_genes_pcspots = 1,
filter_mito = TRUE
)
## Gene filtering: removing mitochondrial genes
## removed 25 mitochondrial genes
## Gene filtering: retaining genes with at least 1 counts in at least 1% (n = 151) of spatial locations
## removed 14356 out of 23239 genes due to low expression
## [1] 8883 15003
# calculate log-transformed normalized counts using scran package
# using library size normalization
spe <- computeLibraryFactors(spe)
spe <- logNormCounts(spe)
assayNames(spe)
## [1] "counts" "logcounts"
# select small set of random genes and several known SVGs for
# faster runtime in this example
set.seed(123)
ix_random <- sample(seq_len(nrow(spe)), 10)
known_genes <- c("Cpne9", "Rgs14")
ix_known <- which(rowData(spe)$gene_name %in% known_genes)
ix <- c(ix_known, ix_random)
spe <- spe[ix, ]
dim(spe)
## [1] 12 15003
# run nnSVG with covariates
# create model matrix for cell type labels
X <- model.matrix(~ colData(spe)$celltype)
dim(X)
## [1] 15003 17
stopifnot(nrow(X) == ncol(spe))
# set seed for reproducibility
set.seed(123)
# using a single thread in this example
spe <- nnSVG(spe, X = X)
## Warning in nnSVG(spe, X = X): Rows (genes) and/or columns (spots) containing
## all zero counts have been found. Please see examples in tutorial for code to
## filter out zeros and/or low-expressed genes to avoid errors.
## DataFrame with 12 rows and 15 columns
## gene_name sigma.sq tau.sq phi loglik runtime
## <character> <numeric> <numeric> <numeric> <numeric> <numeric>
## Cpne9 Cpne9 6.91474e-04 0.0275984 1.28485e+01 5484.78 11.520
## Rgs14 Rgs14 2.76964e-03 0.0452683 9.57130e-05 1650.39 11.108
## Eogt Eogt 8.52208e-05 0.0198481 7.81821e+01 8083.08 6.563
## Ergic1 Ergic1 2.54389e-04 0.0465677 9.77219e+01 1677.16 6.119
## Zfp330 Zfp330 1.21717e-04 0.0266163 1.54147e+02 5879.76 6.198
## ... ... ... ... ... ... ...
## Hdac11 Hdac11 3.54479e-03 0.1648228 145.3191 -7920.47 8.643
## Nab2 Nab2 3.99607e-05 0.0145549 67.5481 10421.32 6.560
## Senp3 Senp3 7.97641e-05 0.0253771 36.8072 6248.29 7.836
## Fbxo6 Fbxo6 9.53682e-05 0.0253767 168.3914 6243.54 6.218
## Ntng1 Ntng1 2.39135e-03 0.0433036 24.1536 1951.83 8.232
## mean var spcov prop_sv loglik_lm LR_stat rank
## <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
## Cpne9 0.0194924 0.0283971 1.349031 0.02444247 5455.66 58.23907 3
## Rgs14 0.0360053 0.0485672 1.461655 0.05765516 1515.16 270.46142 1
## Eogt 0.0125243 0.0200072 0.737089 0.00427529 8082.32 1.52762 6
## Ergic1 0.0359343 0.0471421 0.443855 0.00543311 1676.61 1.10863 8
## Zfp330 0.0201203 0.0267799 0.548330 0.00455223 5879.05 1.42303 7
## ... ... ... ... ... ... ... ...
## Hdac11 0.1258726 0.1705333 0.473003 0.02105388 -7923.51 6.072531 4
## Nab2 0.0120832 0.0146742 0.523159 0.00273800 10421.82 -1.016708 12
## Senp3 0.0199706 0.0255396 0.447211 0.00313331 6248.60 -0.631574 11
## Fbxo6 0.0188249 0.0255584 0.518763 0.00374403 6243.59 -0.103414 10
## Ntng1 0.0341843 0.0466474 1.430522 0.05233289 1854.40 194.856131 2
## pval padj
## <numeric> <numeric>
## Cpne9 2.25708e-13 9.02833e-13
## Rgs14 0.00000e+00 0.00000e+00
## Eogt 4.65889e-01 8.41541e-01
## Ergic1 5.74466e-01 8.61699e-01
## Zfp330 4.90899e-01 8.41541e-01
## ... ... ...
## Hdac11 0.0480139 0.144042
## Nab2 1.0000000 1.000000
## Senp3 1.0000000 1.000000
## Fbxo6 1.0000000 1.000000
## Ntng1 0.0000000 0.000000
Note that if there are any empty levels in the factor used to create the design matrix, these can be removed when creating the design matrix, as follows:
##
## FALSE TRUE
## 9 3
## DataFrame with 10 rows and 15 columns
## gene_name sigma.sq tau.sq phi loglik runtime
## <character> <numeric> <numeric> <numeric> <numeric> <numeric>
## Rgs14 Rgs14 2.76964e-03 0.0452683 9.57130e-05 1650.39 11.108
## Ntng1 Ntng1 2.39135e-03 0.0433036 2.41536e+01 1951.83 8.232
## Cpne9 Cpne9 6.91474e-04 0.0275984 1.28485e+01 5484.78 11.520
## Hdac11 Hdac11 3.54479e-03 0.1648228 1.45319e+02 -7920.47 8.643
## Cul1 Cul1 4.19373e-04 0.0688365 9.03189e+01 -1259.18 7.217
## Eogt Eogt 8.52208e-05 0.0198481 7.81821e+01 8083.08 6.563
## Zfp330 Zfp330 1.21717e-04 0.0266163 1.54147e+02 5879.76 6.198
## Ergic1 Ergic1 2.54389e-04 0.0465677 9.77219e+01 1677.16 6.119
## Gclc Gclc 1.22209e-04 0.0345001 1.19977e+02 3941.26 6.179
## Fbxo6 Fbxo6 9.53682e-05 0.0253767 1.68391e+02 6243.54 6.218
## mean var spcov prop_sv loglik_lm LR_stat rank
## <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
## Rgs14 0.0360053 0.0485672 1.461655 0.05765516 1515.16 270.4614241 1
## Ntng1 0.0341843 0.0466474 1.430522 0.05233289 1854.40 194.8561307 2
## Cpne9 0.0194924 0.0283971 1.349031 0.02444247 5455.66 58.2390704 3
## Hdac11 0.1258726 0.1705333 0.473003 0.02105388 -7923.51 6.0725314 4
## Cul1 0.0548957 0.0696223 0.373046 0.00605542 -1260.02 1.6933562 5
## Eogt 0.0125243 0.0200072 0.737089 0.00427529 8082.32 1.5276157 6
## Zfp330 0.0201203 0.0267799 0.548330 0.00455223 5879.05 1.4230346 7
## Ergic1 0.0359343 0.0471421 0.443855 0.00543311 1676.61 1.1086290 8
## Gclc 0.0256188 0.0347265 0.431513 0.00352978 3941.30 -0.0749057 9
## Fbxo6 0.0188249 0.0255584 0.518763 0.00374403 6243.59 -0.1034144 10
## pval padj
## <numeric> <numeric>
## Rgs14 0.00000e+00 0.00000e+00
## Ntng1 0.00000e+00 0.00000e+00
## Cpne9 2.25708e-13 9.02833e-13
## Hdac11 4.80139e-02 1.44042e-01
## Cul1 4.28837e-01 8.41541e-01
## Eogt 4.65889e-01 8.41541e-01
## Zfp330 4.90899e-01 8.41541e-01
## Ergic1 5.74466e-01 8.61699e-01
## Gclc 1.00000e+00 1.00000e+00
## Fbxo6 1.00000e+00 1.00000e+00
# plot spatial expression of top-ranked SVG
ix <- which(rowData(spe)$rank == 1)
ix_name <- rowData(spe)$gene_name[ix]
ix_name
## [1] "Rgs14"
df <- as.data.frame(
cbind(spatialCoords(spe),
expr = counts(spe)[ix, ]))
ggplot(df, aes(x = xcoord, y = ycoord, color = expr)) +
geom_point(size = 0.1) +
coord_fixed() +
scale_color_gradient(low = "gray90", high = "blue",
trans = "sqrt", breaks = range(df$expr),
name = "counts") +
ggtitle(ix_name) +
theme_bw() +
theme(plot.title = element_text(face = "italic"),
panel.grid = element_blank(),
axis.title = element_blank(),
axis.text = element_blank(),
axis.ticks = element_blank())
For datasets consisting of multiple biological samples (e.g. multiple 10x Genomics Visium capture areas, or multiple tissue samples from other platforms), we can run nnSVG once per sample and combine the results by averaging the ranks of the SVGs identified per sample.
For this example, we use a dataset from the locus coeruleus region in postmortem human brain samples measured with the 10x Genomics Visium platform, which is available in the WeberDivechaLCdata package and is described in our paper Weber and Divecha et al. 2023.
# load example dataset from WeberDivechaLCdata package
# see '?WeberDivechaLCdata_Visium' for more details
spe <- WeberDivechaLCdata_Visium()
## see ?WeberDivechaLCdata and browseVignettes('WeberDivechaLCdata') for documentation
## downloading 1 resources
## retrieving 1 resource
## loading from cache
## [1] 23728 20380
##
## Br6522_LC_1_round1 Br6522_LC_2_round1 Br8153_LC_round2 Br2701_LC_round2
## 2979 3054 2208 2809
## Br6522_LC_round3 Br8079_LC_round3 Br2701_LC_round3 Br8153_LC_round3
## 2313 2473 2342 2202
This dataset contains a complex experimental design consisting of multiple samples (Visium capture areas) and multiple parts (contiguous tissue areas) for some of these samples, as described in our paper Weber and Divecha et al. 2023.
For the simplified example here, we select two samples. For a dataset with more than two samples, this example can be extended by including more samples within the loops.
# select sample IDs for example
samples_keep <- c("Br6522_LC_1_round1", "Br6522_LC_2_round1")
# subset SpatialExperiment object to keep only these samples
spe <- spe[, colData(spe)$sample_id %in% samples_keep]
# remove empty levels from factor of sample IDs
colData(spe)$sample_id <- droplevels(colData(spe)$sample_id)
# check
dim(spe)
## [1] 23728 6033
##
## Br6522_LC_1_round1 Br6522_LC_2_round1
## 2979 3054
# preprocessing steps
# keep only spots over tissue
spe <- spe[, colData(spe)$in_tissue == 1]
dim(spe)
## [1] 23728 6033
# filter low-expressed genes
# filter out genes with extremely low expression
# using simple threshold on total UMI counts summed across all spots
# approximate threshold of 10 UMI counts per sample
n_umis <- 20
ix_low_genes <- rowSums(counts(spe)) < n_umis
table(ix_low_genes)
## ix_low_genes
## FALSE TRUE
## 12859 10869
## [1] 12859 6033
# filter any new zeros created after filtering low-expressed genes
# remove genes with zero expression
ix_zero_genes <- rowSums(counts(spe)) == 0
table(ix_zero_genes)
## ix_zero_genes
## FALSE
## 12859
## [1] 12859 6033
## ix_zero_spots
## FALSE
## 6033
## [1] 12859 6033
Run nnSVG once per sample and store the resulting lists of top SVGs. Note that we perform additional gene filtering for nnSVG individually per sample, and re-calculate logcounts after the gene filtering for each sample.
In this dataset, mitochondrial genes are of specific biological
interest (see our paper Weber and
Divecha et al. 2023 for details), so we do not filter these genes
out (filter_mito = FALSE
). In most datasets, mitochondrial
genes are not of biological interest and should be filtered out
(filter_mito = TRUE
).
For faster runtime in this example, we also subsample a small number
of genes before running nnSVG
, as in the examples above.
For a full analysis, the subsampling step should be skipped.
# run nnSVG once per sample and store lists of top SVGs
sample_ids <- levels(colData(spe)$sample_id)
res_list <- as.list(rep(NA, length(sample_ids)))
names(res_list) <- sample_ids
for (s in seq_along(sample_ids)) {
# select sample
ix <- colData(spe)$sample_id == sample_ids[s]
spe_sub <- spe[, ix]
dim(spe_sub)
# run nnSVG filtering for mitochondrial genes and low-expressed genes
# (note: set 'filter_mito = TRUE' in most datasets)
spe_sub <- filter_genes(
spe_sub,
filter_genes_ncounts = 3,
filter_genes_pcspots = 0.5,
filter_mito = FALSE
)
# remove any zeros introduced by filtering
ix_zeros <- colSums(counts(spe_sub)) == 0
if (sum(ix_zeros) > 0) {
spe_sub <- spe_sub[, !ix_zeros]
}
dim(spe_sub)
# re-calculate logcounts after filtering
spe_sub <- computeLibraryFactors(spe_sub)
spe_sub <- logNormCounts(spe_sub)
# subsample small set of random genes for faster runtime in this example
# (same genes for each sample)
# (note: skip this step in a full analysis)
if (s == 1) {
set.seed(123)
ix_random <- sample(seq_len(nrow(spe_sub)), 30)
genes_random <- rownames(spe_sub)[ix_random]
}
spe_sub <- spe_sub[rownames(spe_sub) %in% genes_random, ]
dim(spe_sub)
# run nnSVG
set.seed(123)
spe_sub <- nnSVG(spe_sub)
# store results for this sample
res_list[[s]] <- rowData(spe_sub)
}
## Gene filtering: retaining genes with at least 3 counts in at least 0.5% (n = 15) of spatial locations
## removed 10420 out of 12859 genes due to low expression
## Warning in nnSVG(spe_sub): Rows (genes) and/or columns (spots) containing all
## zero counts have been found. Please see examples in tutorial for code to filter
## out zeros and/or low-expressed genes to avoid errors.
## Gene filtering: retaining genes with at least 3 counts in at least 0.5% (n = 16) of spatial locations
## removed 11790 out of 12859 genes due to low expression
## Warning in nnSVG(spe_sub): Rows (genes) and/or columns (spots) containing all
## zero counts have been found. Please see examples in tutorial for code to filter
## out zeros and/or low-expressed genes to avoid errors.
Now that we have run nnSVG once per sample, we can combine the results across multiple samples by averaging the ranks of the SVGs. This approach was applied in our recent paper Weber and Divecha et al. 2023.
# combine results for multiple samples by averaging gene ranks
# to generate an overall ranking
# number of genes that passed filtering (and subsampling) for each sample
sapply(res_list, nrow)
## Br6522_LC_1_round1 Br6522_LC_2_round1
## 30 14
# match results from each sample and store in matching rows
res_ranks <- matrix(NA, nrow = nrow(spe), ncol = length(sample_ids))
rownames(res_ranks) <- rownames(spe)
colnames(res_ranks) <- sample_ids
for (s in seq_along(sample_ids)) {
stopifnot(colnames(res_ranks)[s] == sample_ids[s])
stopifnot(colnames(res_ranks)[s] == names(res_list)[s])
rownames_s <- rownames(res_list[[s]])
res_ranks[rownames_s, s] <- res_list[[s]][, "rank"]
}
# remove genes that were filtered out in all samples
ix_allna <- apply(res_ranks, 1, function(r) all(is.na(r)))
res_ranks <- res_ranks[!ix_allna, ]
dim(res_ranks)
## [1] 30 2
# calculate average ranks
# note missing values due to filtering for samples
avg_ranks <- rowMeans(res_ranks, na.rm = TRUE)
# calculate number of samples where each gene is within top 100 ranked SVGs
# for that sample
n_withinTop100 <- apply(res_ranks, 1, function(r) sum(r <= 100, na.rm = TRUE))
# summary table
df_summary <- data.frame(
gene_id = names(avg_ranks),
gene_name = rowData(spe)[names(avg_ranks), "gene_name"],
gene_type = rowData(spe)[names(avg_ranks), "gene_type"],
overall_rank = rank(avg_ranks),
average_rank = unname(avg_ranks),
n_withinTop100 = unname(n_withinTop100),
row.names = names(avg_ranks)
)
# sort by average rank
df_summary <- df_summary[order(df_summary$average_rank), ]
head(df_summary)
## gene_id gene_name gene_type overall_rank
## ENSG00000104435 ENSG00000104435 STMN2 protein_coding 1
## ENSG00000180176 ENSG00000180176 TH protein_coding 2
## ENSG00000197756 ENSG00000197756 RPL37A protein_coding 3
## ENSG00000005486 ENSG00000005486 RHBDD2 protein_coding 4
## ENSG00000064393 ENSG00000064393 HIPK2 protein_coding 5
## ENSG00000171724 ENSG00000171724 VAT1L protein_coding 6
## average_rank n_withinTop100
## ENSG00000104435 1.0 2
## ENSG00000180176 2.5 2
## ENSG00000197756 3.0 2
## ENSG00000005486 4.0 2
## ENSG00000064393 5.5 2
## ENSG00000171724 7.0 2
# top n genes
# (note: NAs in this example due to subsampling genes for faster runtime)
top100genes <- df_summary$gene_name[1:100]
# summary table of "replicated" SVGs (i.e. genes that are highly ranked in at
# least x samples)
df_summaryReplicated <- df_summary[df_summary$n_withinTop100 >= 2, ]
# re-calculate rank within this set
df_summaryReplicated$overall_rank <- rank(df_summaryReplicated$average_rank)
dim(df_summaryReplicated)
## [1] 14 6
## gene_id gene_name gene_type overall_rank
## ENSG00000104435 ENSG00000104435 STMN2 protein_coding 1
## ENSG00000180176 ENSG00000180176 TH protein_coding 2
## ENSG00000197756 ENSG00000197756 RPL37A protein_coding 3
## ENSG00000005486 ENSG00000005486 RHBDD2 protein_coding 4
## ENSG00000064393 ENSG00000064393 HIPK2 protein_coding 5
## ENSG00000171724 ENSG00000171724 VAT1L protein_coding 6
## average_rank n_withinTop100
## ENSG00000104435 1.0 2
## ENSG00000180176 2.5 2
## ENSG00000197756 3.0 2
## ENSG00000005486 4.0 2
## ENSG00000064393 5.5 2
## ENSG00000171724 7.0 2
## gene_id gene_name gene_type overall_rank
## ENSG00000104435 ENSG00000104435 STMN2 protein_coding 1
## ENSG00000180176 ENSG00000180176 TH protein_coding 2
## ENSG00000197756 ENSG00000197756 RPL37A protein_coding 3
## ENSG00000005486 ENSG00000005486 RHBDD2 protein_coding 4
## ENSG00000064393 ENSG00000064393 HIPK2 protein_coding 5
## average_rank n_withinTop100
## ENSG00000104435 1.0 2
## ENSG00000180176 2.5 2
## ENSG00000197756 3.0 2
## ENSG00000005486 4.0 2
## ENSG00000064393 5.5 2
ix_gene <- which(rowData(spe)$gene_name == df_summary[2, "gene_name"])
df <- as.data.frame(cbind(
colData(spe),
spatialCoords(spe),
gene = counts(spe)[ix_gene, ]
))
ggplot(df, aes(x = pxl_col_in_fullres, y = pxl_row_in_fullres, color = gene)) +
facet_wrap(~ sample_id, nrow = 1, scales = "free") +
geom_point(size = 0.6) +
scale_color_gradient(low = "gray80", high = "red", trans = "sqrt",
name = "counts", breaks = range(df$gene)) +
scale_y_reverse() +
ggtitle(paste0(rowData(spe)$gene_name[ix_gene], " expression")) +
theme_bw() +
theme(aspect.ratio = 1,
panel.grid = element_blank(),
plot.title = element_text(face = "italic"),
axis.title = element_blank(),
axis.text = element_blank(),
axis.ticks = element_blank())
This section describes some common errors and solutions. Feedback on any additional issues is welcome to be provided as GitHub issues.
If your dataset contains rows (genes) or columns (spots) with all zero counts and you have not performed filtering for zeros and/or low-expressed genes, you may see an error similar to the following:
Timing stopped at: 0.068 0.003 0.072
Error: BiocParallel errors
1 remote errors, element index: 2
33536 unevaluated and other errors
first remote error:
Error in BRISC_estimation(coords = coords, y = y_i, x = X, cov.model = "exponential", : c++ error: dpotrf failed
After performing filtering for low-expressed genes and/or low-quality spots, it is also possible that new rows (genes) or columns (spots) of all zero counts may have been introduced. In this case, you may see the same error.
The following example code can be used to check for and remove any rows (genes) or columns (spots) with all zero counts.
# filter any new zeros created after filtering low-expressed genes
# remove genes with zero expression
ix_zero_genes <- rowSums(counts(spe)) == 0
table(ix_zero_genes)
if (sum(ix_zero_genes) > 0) {
spe <- spe[!ix_zero_genes, ]
}
dim(spe)
# remove spots with zero expression
ix_zero_spots <- colSums(counts(spe)) == 0
table(ix_zero_spots)
if (sum(ix_zero_spots) > 0) {
spe <- spe[, !ix_zero_spots]
}
dim(spe)
The nnSVG()
function also checks for rows (genes) or
columns (spots) of all zero counts, and will return a warning if these
exist.
The nnSVG model requires some minimum number of spots for reliable model fitting and parameter estimation. Running nnSVG with too few spots can return errors from the BRISC model fitting functions.
As an approximate guide, we recommend at least 100 spots (spatial locations).
In datasets with multiple samples, the minimum number of spots applies per sample.
In models with covariates for spatial domains, some minimum number of spots per spatial domain is also required. In this case, if the dataset contains too few spots, one possible solution can be to combine spatial domains into larger domains, e.g. if the main biological interest is one domain vs. all the rest (where the rest can be combined into one domain).
In datasets with multiple samples, nnSVG can be run once per sample and the results combined by averaging the ranks of the identified SVGs per sample, as demonstrated in the example above.
In this case, ensure that:
## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] WeberDivechaLCdata_1.7.0 ggplot2_3.5.1
## [3] nnSVG_1.11.0 scran_1.34.0
## [5] scuttle_1.16.0 STexampleData_1.13.3
## [7] ExperimentHub_2.15.0 AnnotationHub_3.15.0
## [9] BiocFileCache_2.15.0 dbplyr_2.5.0
## [11] SpatialExperiment_1.16.0 SingleCellExperiment_1.28.0
## [13] SummarizedExperiment_1.36.0 Biobase_2.67.0
## [15] GenomicRanges_1.59.0 GenomeInfoDb_1.43.0
## [17] IRanges_2.41.0 S4Vectors_0.44.0
## [19] BiocGenerics_0.53.1 generics_0.1.3
## [21] MatrixGenerics_1.19.0 matrixStats_1.4.1
## [23] BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] DBI_1.2.3 pbapply_1.7-2 rlang_1.1.4
## [4] magrittr_2.0.3 compiler_4.4.1 RSQLite_2.3.7
## [7] png_0.1-8 vctrs_0.6.5 rdist_0.0.5
## [10] pkgconfig_2.0.3 crayon_1.5.3 fastmap_1.2.0
## [13] magick_2.8.5 XVector_0.46.0 labeling_0.4.3
## [16] utf8_1.2.4 BRISC_1.0.6 rmarkdown_2.28
## [19] UCSC.utils_1.2.0 purrr_1.0.2 bit_4.5.0
## [22] xfun_0.48 bluster_1.17.0 zlibbioc_1.52.0
## [25] cachem_1.1.0 beachmat_2.23.0 jsonlite_1.8.9
## [28] blob_1.2.4 highr_0.11 DelayedArray_0.33.1
## [31] BiocParallel_1.41.0 irlba_2.3.5.1 parallel_4.4.1
## [34] cluster_2.1.6 R6_2.5.1 bslib_0.8.0
## [37] limma_3.63.0 jquerylib_0.1.4 Rcpp_1.0.13
## [40] knitr_1.48 Matrix_1.7-1 igraph_2.1.1
## [43] tidyselect_1.2.1 abind_1.4-8 yaml_2.3.10
## [46] codetools_0.2-20 curl_5.2.3 lattice_0.22-6
## [49] tibble_3.2.1 withr_3.0.2 KEGGREST_1.47.0
## [52] evaluate_1.0.1 Biostrings_2.75.0 pillar_1.9.0
## [55] BiocManager_1.30.25 filelock_1.0.3 BiocVersion_3.21.1
## [58] munsell_0.5.1 scales_1.3.0 glue_1.8.0
## [61] metapod_1.14.0 maketools_1.3.1 tools_4.4.1
## [64] BiocNeighbors_2.1.0 sys_3.4.3 ScaledMatrix_1.14.0
## [67] locfit_1.5-9.10 RANN_2.6.2 buildtools_1.0.0
## [70] grid_4.4.1 colorspace_2.1-1 AnnotationDbi_1.69.0
## [73] edgeR_4.4.0 GenomeInfoDbData_1.2.13 BiocSingular_1.23.0
## [76] cli_3.6.3 rsvd_1.0.5 rappdirs_0.3.3
## [79] fansi_1.0.6 S4Arrays_1.6.0 dplyr_1.1.4
## [82] gtable_0.3.6 sass_0.4.9 digest_0.6.37
## [85] SparseArray_1.6.0 dqrng_0.4.1 farver_2.1.2
## [88] rjson_0.2.23 memoise_2.0.1 htmltools_0.5.8.1
## [91] lifecycle_1.0.4 httr_1.4.7 mime_0.12
## [94] statmod_1.5.0 bit64_4.5.2