SpotSweeper is an R package for spatial transcriptomics data quality control (QC). It provides functions for detecting and visualizing spot-level local outliers and artifacts using spatially aware methods. The package is designed to work with SpatialExperiment objects and is compatible with data from 10x Genomics Visium and other spatial transcriptomics platforms.
Currently, the only way to install SpotSweeper is to install the development version from GitHub:
if (!requireNamespace("remotes", quietly = TRUE)) install.packages("remotes")
remotes::install_github("MicTott/SpotSweeper")
Once accepted in Bioconductor, SpotSweeper will be installable using:
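A minimal sketch of the usual Bioconductor installation route, assuming the package is published there under the name SpotSweeper:

```r
# assumes the package is available from Bioconductor under the name "SpotSweeper"
if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
}
BiocManager::install("SpotSweeper")
```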
Here we’ll walk you through the standard workflow for using SpotSweeper to detect and visualize local outliers in spatial transcriptomics data. We’ll use the Visium_humanDLPFC dataset from the STexampleData package, which is a SpatialExperiment object.
Because local outliers will be saved in the colData of the SpatialExperiment object, we’ll first view the colData and drop out-of-tissue spots before calculating quality control (QC) metrics and running SpotSweeper.
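A sketch of this step, assuming the dataset is loaded with the Visium_humanDLPFC() accessor from STexampleData and that in_tissue is coded 1 for spots that overlap tissue:

```r
library(SpotSweeper)
library(SpatialExperiment) # attaches SummarizedExperiment, which provides colData()
library(STexampleData)

# load the Visium human DLPFC example dataset
spe <- Visium_humanDLPFC()

# view the column data before running SpotSweeper
colnames(colData(spe))

# drop out-of-tissue spots (assumes in_tissue == 1 marks tissue spots)
spe <- spe[, spe$in_tissue == 1]
```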
## see ?STexampleData and browseVignettes('STexampleData') for documentation
## downloading 1 resources
## retrieving 1 resource
## loading from cache
## [1] "barcode_id" "sample_id" "in_tissue" "array_row" "array_col"
## [6] "ground_truth" "reference" "cell_count"
Calculating QC metrics using scuttle
We’ll use the scuttle package to calculate QC metrics. To do this, we’ll first need to change the rownames from gene IDs to gene names. We’ll then identify the mitochondrial transcripts and calculate QC metrics for each spot using scuttle::addPerCellQCMetrics.
# change from gene id to gene names
rownames(spe) <- rowData(spe)$gene_name
# identifying the mitochondrial transcripts
is.mito <- rownames(spe)[grepl("^MT-", rownames(spe))]
# calculating QC metrics for each spot using scuttle
spe <- scuttle::addPerCellQCMetrics(spe, subsets = list(Mito = is.mito))
colnames(colData(spe))
## [1] "barcode_id" "sample_id" "in_tissue"
## [4] "array_row" "array_col" "ground_truth"
## [7] "reference" "cell_count" "sum"
## [10] "detected" "subsets_Mito_sum" "subsets_Mito_detected"
## [13] "subsets_Mito_percent" "total"
Identifying local outliers using SpotSweeper
We can now use SpotSweeper to identify local outliers in the spatial transcriptomics data. We’ll use the localOutliers function to detect local outliers based on the number of unique detected genes, the total library size, and the percent of total reads that are mitochondrial. These methods assume a normal distribution, so we’ll use the log-transformed sum of the counts and the log-transformed number of detected genes. For mitochondrial percent, we’ll use the raw mitochondrial percentage.
# library size
spe <- localOutliers(spe,
    metric = "sum",
    direction = "lower",
    log = TRUE
)
# unique genes
spe <- localOutliers(spe,
    metric = "detected",
    direction = "lower",
    log = TRUE
)
# mitochondrial percent
spe <- localOutliers(spe,
    metric = "subsets_Mito_percent",
    direction = "higher",
    log = FALSE
)
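To build intuition for what a local outlier is, here is a toy sketch in base R. It is not SpotSweeper's implementation: each spot's value is compared with its k nearest spatial neighbors using a robust z-score (median and MAD), and spots far below their neighborhood are flagged. The function name local_z, the simulated grid, k = 8, and the cutoff of 3 are all illustrative assumptions.

```r
# Toy illustration only: a local robust z-score on simulated spots.
# local_z(), k = 8, and the cutoff of 3 are assumptions for this sketch,
# not SpotSweeper internals or defaults.
set.seed(1)

# simulate a 20 x 20 grid of spots with a log-scale library size
coords <- as.matrix(expand.grid(x = 1:20, y = 1:20))
log_sum <- rnorm(nrow(coords), mean = 8, sd = 0.2)

# plant one low-quality spot in the middle of the grid
bad <- which(coords[, "x"] == 10 & coords[, "y"] == 10)
log_sum[bad] <- 5

local_z <- function(values, coords, k = 8) {
    d <- as.matrix(dist(coords)) # pairwise distances between spots
    vapply(seq_along(values), function(i) {
        nn <- order(d[i, ])[seq_len(k + 1)] # the spot itself plus k neighbors
        neighborhood <- values[nn]
        (values[i] - median(neighborhood)) / mad(neighborhood)
    }, numeric(1))
}

z <- local_z(log_sum, coords)

# analogous to direction = "lower": flag spots far below their neighborhood
outlier <- z < -3
which(outlier)
```

Because each spot is judged against its own neighborhood rather than a global threshold, a spot in a region with naturally low counts is not flagged merely for being low relative to the whole tissue.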
The localOutliers function automatically outputs the results to the colData with the naming convention X_outliers, where X is the name of the input colData column. With log = TRUE, the log-transformed metric is also stored (for example, sum_log), which we use for plotting below. We can then combine all outliers into a single column called local_outliers in the colData of the SpatialExperiment object.
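A minimal sketch of the combining step, continuing with the spe object from the calls above and assuming the three outlier columns can be coerced to logical:

```r
# flag a spot as a local outlier if any of the three metrics flags it
spe$local_outliers <- as.logical(spe$sum_outliers) |
    as.logical(spe$detected_outliers) |
    as.logical(spe$subsets_Mito_percent_outliers)
```

Using a logical OR keeps every spot flagged by at least one metric; a stricter rule (for example, requiring two metrics to agree) would need `&` or a row sum instead.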
We can visualize the local outliers using the plotQC function. This function uses the escheR package to create a scatter plot of the specified metric and highlights the local outliers in red. Here, we’ll visualize local outliers of library size, unique genes, mitochondrial percent, and finally, all local outliers. We’ll then arrange these plots in a grid using ggpubr::ggarrange.
library(escheR)
## Loading required package: ggplot2
library(ggpubr)
# library size
p1 <- plotQC(spe,
    metric = "sum_log",
    outliers = "sum_outliers", point_size = 1.1
) +
    ggtitle("Library Size")
# unique genes
p2 <- plotQC(spe,
    metric = "detected_log",
    outliers = "detected_outliers", point_size = 1.1
) +
    ggtitle("Unique Genes")
# mitochondrial percent
p3 <- plotQC(spe,
    metric = "subsets_Mito_percent",
    outliers = "subsets_Mito_percent_outliers", point_size = 1.1
) +
    ggtitle("Mitochondrial Percent")
# all local outliers
p4 <- plotQC(spe,
    metric = "sum_log",
    outliers = "local_outliers", point_size = 1.1, stroke = 0.75
) +
    ggtitle("All Local Outliers")
# plot
plot_list <- list(p1, p2, p3, p4)
ggarrange(
    plotlist = plot_list,
    ncol = 2, nrow = 2,
    common.legend = FALSE
)
Removing technical artifacts using SpotSweeper
# load in DLPFC sample with hangnail artifact
data(DLPFC_artifact)
spe <- DLPFC_artifact
# inspect colData before artifact detection
colnames(colData(spe))
## [1] "sample_id" "in_tissue" "array_row"
## [4] "array_col" "key" "sum_umi"
## [7] "sum_gene" "expr_chrM" "expr_chrM_ratio"
## [10] "ManualAnnotation" "subject" "region"
## [13] "sex" "age" "diagnosis"
## [16] "sample_id_complete" "count" "sizeFactor"
Technical artifacts can commonly be visualized by standard QC metrics, including library size, unique genes, or mitochondrial percentage. We can first visualize the technical artifacts using the plotQC function. This function plots the Visium spots with the specified QC metric. We’ll then again arrange these plots using ggpubr::ggarrange.
# library size
p1 <- plotQC(spe,
    metric = "sum_umi",
    outliers = NULL, point_size = 1.1
) +
    ggtitle("Library Size")
# unique genes
p2 <- plotQC(spe,
    metric = "sum_gene",
    outliers = NULL, point_size = 1.1
) +
    ggtitle("Unique Genes")
# mitochondrial percent
p3 <- plotQC(spe,
    metric = "expr_chrM_ratio",
    outliers = NULL, point_size = 1.1
) +
    ggtitle("Mitochondrial Percent")
# plot
plot_list <- list(p1, p2, p3)
ggarrange(
    plotlist = plot_list,
    ncol = 3, nrow = 1,
    common.legend = FALSE
)
Identifying artifacts using SpotSweeper
We can then use the findArtifacts function to identify artifacts in the spatial transcriptomics data. This function identifies technical artifacts based on the first principal component of the local variance of the specified QC metric (mito_percent) at multiple neighborhood sizes (n_rings = 5). Currently, k-means clustering is used to separate technical-artifact spots from high-quality Visium spots. Similar to localOutliers, the findArtifacts function then outputs the results to the colData.
# find artifacts using SpotSweeper
spe <- findArtifacts(spe,
    mito_percent = "expr_chrM_ratio",
    mito_sum = "expr_chrM",
    n_rings = 5,
    name = "artifact"
)
# check that "artifact" is now in colData
colnames(colData(spe))
## [1] "sample_id" "in_tissue" "array_row"
## [4] "array_col" "key" "sum_umi"
## [7] "sum_gene" "expr_chrM" "expr_chrM_ratio"
## [10] "ManualAnnotation" "subject" "region"
## [13] "sex" "age" "diagnosis"
## [16] "sample_id_complete" "count" "sizeFactor"
## [19] "expr_chrM_ratio_log" "coords" "k6"
## [22] "k18" "k36" "k60"
## [25] "k90" "artifact"
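The idea described above can be sketched on simulated data in base R. This is a toy illustration under stated assumptions, not the findArtifacts implementation: local variance of a mitochondrial-ratio-like metric is computed at several neighborhood sizes, the first principal component summarizes it, and k-means with two centers separates the spots. The neighborhood sizes use 3n(n + 1) for n = 1 to 5 (ring n of a hexagonal grid adds 6n spots), which is consistent with the k6, k18, k36, k60, and k90 column names in the output above. The helper local_var, the simulated square grid, the log transform, and the rule for labeling the artifact cluster are all assumptions of this sketch.

```r
# Toy illustration only; helper names and simulated data are assumptions,
# not SpotSweeper internals.
set.seed(42)

# neighborhood sizes for n_rings = 5: ring n of a hexagonal grid adds 6n spots
n_rings <- 5
k_sizes <- 3 * seq_len(n_rings) * (seq_len(n_rings) + 1)
k_sizes # 6 18 36 60 90

# simulate a 30 x 30 grid; the left strip (x <= 5) mimics an artifact region
coords <- as.matrix(expand.grid(x = 1:30, y = 1:30))
in_artifact <- coords[, "x"] <= 5
mito_ratio <- rnorm(nrow(coords), mean = 0.15, sd = 0.01)
mito_ratio[in_artifact] <- rnorm(sum(in_artifact), mean = 0.30, sd = 0.10)

# variance of the metric within each spot's k nearest neighbors (plus itself)
d <- as.matrix(dist(coords))
local_var <- function(values, d, k) {
    vapply(seq_along(values), function(i) {
        var(values[order(d[i, ])[seq_len(k + 1)]])
    }, numeric(1))
}
var_mat <- sapply(k_sizes, function(k) local_var(mito_ratio, d, k))
colnames(var_mat) <- paste0("k", k_sizes)

# first principal component of the multi-scale (log) local variance
pc1 <- prcomp(log(var_mat), center = TRUE, scale. = TRUE)$x[, 1]

# two-cluster k-means on PC1; label the cluster with the higher mean
# local variance as the artifact (a labeling rule assumed for this sketch)
km <- kmeans(pc1, centers = 2, nstart = 10)
clust_var <- tapply(rowMeans(var_mat), km$cluster, mean)
artifact <- km$cluster == unname(which.max(clust_var))
table(artifact, in_artifact)
```

In this toy setup, spots just outside the simulated strip tend to be flagged as well, because their larger neighborhoods overlap the high-variance region. Combining several neighborhood sizes trades sharp boundaries for more stable detection.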
We can visualize the artifacts using the plotQC function, which uses the escheR package to highlight the spots flagged as artifacts.
plotQC(spe,
    metric = "expr_chrM_ratio",
    outliers = "artifact", point_size = 1.1
) +
    ggtitle("Hangnail artifact")
# Session information
## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] ggpubr_0.6.0 escheR_1.7.0
## [3] ggplot2_3.5.1 STexampleData_1.13.3
## [5] SpatialExperiment_1.16.0 SingleCellExperiment_1.28.0
## [7] SummarizedExperiment_1.36.0 Biobase_2.67.0
## [9] GenomicRanges_1.59.0 GenomeInfoDb_1.43.0
## [11] IRanges_2.41.0 S4Vectors_0.44.0
## [13] MatrixGenerics_1.19.0 matrixStats_1.4.1
## [15] ExperimentHub_2.15.0 AnnotationHub_3.15.0
## [17] BiocFileCache_2.15.0 dbplyr_2.5.0
## [19] BiocGenerics_0.53.1 generics_0.1.3
## [21] SpotSweeper_1.3.0 BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] DBI_1.2.3 rlang_1.1.4 magrittr_2.0.3
## [4] spatialEco_2.0-2 compiler_4.4.1 RSQLite_2.3.7
## [7] png_0.1-8 vctrs_0.6.5 pkgconfig_2.0.3
## [10] crayon_1.5.3 fastmap_1.2.0 backports_1.5.0
## [13] magick_2.8.5 XVector_0.46.0 labeling_0.4.3
## [16] scuttle_1.16.0 utf8_1.2.4 rmarkdown_2.28
## [19] UCSC.utils_1.2.0 purrr_1.0.2 bit_4.5.0
## [22] xfun_0.48 zlibbioc_1.52.0 cachem_1.1.0
## [25] beachmat_2.23.0 jsonlite_1.8.9 blob_1.2.4
## [28] highr_0.11 DelayedArray_0.33.1 BiocParallel_1.41.0
## [31] terra_1.7-83 broom_1.0.7 parallel_4.4.1
## [34] R6_2.5.1 bslib_0.8.0 car_3.1-3
## [37] jquerylib_0.1.4 Rcpp_1.0.13 knitr_1.48
## [40] Matrix_1.7-1 tidyselect_1.2.1 abind_1.4-8
## [43] yaml_2.3.10 codetools_0.2-20 curl_5.2.3
## [46] lattice_0.22-6 tibble_3.2.1 withr_3.0.2
## [49] KEGGREST_1.47.0 evaluate_1.0.1 Biostrings_2.75.0
## [52] pillar_1.9.0 BiocManager_1.30.25 filelock_1.0.3
## [55] carData_3.0-5 BiocVersion_3.21.1 munsell_0.5.1
## [58] scales_1.3.0 glue_1.8.0 maketools_1.3.1
## [61] tools_4.4.1 BiocNeighbors_2.1.0 sys_3.4.3
## [64] ggsignif_0.6.4 buildtools_1.0.0 cowplot_1.1.3
## [67] grid_4.4.1 tidyr_1.3.1 AnnotationDbi_1.69.0
## [70] colorspace_2.1-1 GenomeInfoDbData_1.2.13 Formula_1.2-5
## [73] cli_3.6.3 rappdirs_0.3.3 fansi_1.0.6
## [76] viridisLite_0.4.2 S4Arrays_1.6.0 dplyr_1.1.4
## [79] gtable_0.3.6 rstatix_0.7.2 sass_0.4.9
## [82] digest_0.6.37 SparseArray_1.6.0 farver_2.1.2
## [85] rjson_0.2.23 memoise_2.0.1 htmltools_0.5.8.1
## [88] lifecycle_1.0.4 httr_1.4.7 mime_0.12
## [91] bit64_4.5.2 MASS_7.3-61