Getting started with ‘SpotSweeper’

Introduction

SpotSweeper is an R package for spatial transcriptomics data quality control (QC). It provides functions for detecting and visualizing spot-level local outliers and artifacts using spatially-aware methods. The package is designed to work with SpatialExperiment objects, and is compatible with data from 10X Genomics Visium and other spatial transcriptomics platforms.

Installation

SpotSweeper is currently available only as a development version, which can be installed from GitHub as follows:

if (!requireNamespace("remotes", quietly = TRUE)) install.packages("remotes")
remotes::install_github("MicTott/SpotSweeper")

Once accepted in Bioconductor, SpotSweeper will be installable using:

if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
}

BiocManager::install("SpotSweeper")

Spot-level local outlier detection

Loading example data

Here we’ll walk you through the standard workflow for using ‘SpotSweeper’ to detect and visualize local outliers in spatial transcriptomics data. We’ll use the Visium_humanDLPFC dataset from the STexampleData package, which is a SpatialExperiment object.

Because local outliers will be saved in the colData of the SpatialExperiment object, we’ll first view the colData and drop out-of-tissue spots before calculating quality control (QC) metrics and running SpotSweeper.

library(SpotSweeper)

# load the Maynard et al. DLPFC dataset
spe <- STexampleData::Visium_humanDLPFC()
## see ?STexampleData and browseVignettes('STexampleData') for documentation
## downloading 1 resources
## retrieving 1 resource
## loading from cache
# show column data before SpotSweeper
colnames(colData(spe))
## [1] "barcode_id"   "sample_id"    "in_tissue"    "array_row"    "array_col"   
## [6] "ground_truth" "reference"    "cell_count"
# drop out-of-tissue spots
spe <- spe[, spe$in_tissue == 1]

Calculating QC metrics using scuttle

We’ll use the scuttle package to calculate QC metrics. To do this, we first need to change the rownames from gene IDs to gene names. We’ll then identify the mitochondrial transcripts and calculate QC metrics for each spot using scuttle::addPerCellQCMetrics.

# change from gene id to gene names
rownames(spe) <- rowData(spe)$gene_name

# identifying the mitochondrial transcripts
is.mito <- rownames(spe)[grepl("^MT-", rownames(spe))]

# calculating QC metrics for each spot using scuttle
spe <- scuttle::addPerCellQCMetrics(spe, subsets = list(Mito = is.mito))
colnames(colData(spe))
##  [1] "barcode_id"            "sample_id"             "in_tissue"            
##  [4] "array_row"             "array_col"             "ground_truth"         
##  [7] "reference"             "cell_count"            "sum"                  
## [10] "detected"              "subsets_Mito_sum"      "subsets_Mito_detected"
## [13] "subsets_Mito_percent"  "total"
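
To make the new columns concrete, the following base R sketch recomputes the three metrics used later in this workflow (sum, detected, and subsets_Mito_percent) on a small count matrix. The matrix, gene names, and spot names are made up for illustration, and the code only mirrors the definitions of these metrics; it is not how scuttle computes them internally.

```r
# Toy count matrix: 5 genes (rows) x 3 spots (columns); all values are made up
counts <- matrix(
    c(10, 0, 5,
       2, 0, 2,
       4, 1, 0,
       4, 3, 3,
       0, 6, 0),
    nrow = 5, byrow = TRUE,
    dimnames = list(
        c("GENE1", "GENE2", "GENE3", "MT-CO1", "MT-ND1"),
        c("spot1", "spot2", "spot3")
    )
)

# Flag mitochondrial genes by their "MT-" prefix, as in the workflow above
is_mito <- grepl("^MT-", rownames(counts))

# Library size: total counts per spot
lib_size <- colSums(counts)

# Detected genes: number of genes with a non-zero count per spot
detected <- colSums(counts > 0)

# Mitochondrial percent: share of each spot's counts that come from MT- genes
mito_percent <- 100 * colSums(counts[is_mito, , drop = FALSE]) / lib_size

data.frame(
    sum = lib_size,
    detected = detected,
    subsets_Mito_percent = mito_percent
)
```

In this toy example, spot2 has a small library in which 90% of the counts are mitochondrial, which is the kind of spot the QC steps below are meant to catch.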

Identifying local outliers using SpotSweeper

We can now use SpotSweeper to identify local outliers in the spatial transcriptomics data. We’ll use the localOutliers function to detect local outliers based on the number of unique detected genes, the total library size, and the percentage of total reads that are mitochondrial. These methods assume a normal distribution, so we’ll use the log-transformed sum of the counts and the log-transformed number of detected genes. For mitochondrial percent, we’ll use the raw mitochondrial percentage.

# library size
spe <- localOutliers(spe,
    metric = "sum",
    direction = "lower",
    log = TRUE
)

# unique genes
spe <- localOutliers(spe,
    metric = "detected",
    direction = "lower",
    log = TRUE
)

# mitochondrial percent
spe <- localOutliers(spe,
    metric = "subsets_Mito_percent",
    direction = "higher",
    log = FALSE
)
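
The idea behind a local outlier can be sketched in base R: each spot is compared with its nearest spatial neighbors instead of with the whole sample. The following is a conceptual illustration only, not SpotSweeper’s implementation. The helper local_z, the neighborhood size k = 18, the cutoff of 3, and the simulated data are all assumptions made for this example.

```r
set.seed(42)

# Simulated 15 x 15 grid of spots (coordinates and counts are made up)
coords <- expand.grid(x = 1:15, y = 1:15)
n_spots <- nrow(coords)

# Library size with a mild left-to-right gradient, so "low" depends on location
lib_size <- round(exp(rnorm(n_spots, mean = 7 + 0.05 * coords$x, sd = 0.1)))

# Plant one spot whose library size is far below that of its neighbors
planted <- which(coords$x == 12 & coords$y == 8)
lib_size[planted] <- 200

# Hypothetical helper: robust z-score of each spot against its k nearest
# spatial neighbors (median and MAD of the neighbors, excluding the spot)
local_z <- function(values, coords, k = 18) {
    d <- as.matrix(dist(coords))
    vapply(seq_along(values), function(i) {
        neighbors <- order(d[i, ])[2:(k + 1)] # position 1 is the spot itself
        (values[i] - median(values[neighbors])) / mad(values[neighbors])
    }, numeric(1))
}

# Log-transform first so the values are closer to normally distributed
z <- local_z(log(lib_size), coords)

# Flag only strongly negative z-scores, analogous to direction = "lower"
is_low_outlier <- z < -3
which(is_low_outlier)
```

Because the median and MAD are robust statistics, the planted spot barely shifts the reference values of the spots around it, so it is flagged without dragging its neighbors along.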

The localOutliers function automatically saves its results to the colData using the naming convention X_outliers, where X is the name of the input metric. We can then combine all outliers into a single column called local_outliers in the colData of the SpatialExperiment object.

# combine all outliers into "local_outliers" column
spe$local_outliers <- as.logical(spe$sum_outliers) |
    as.logical(spe$detected_outliers) |
    as.logical(spe$subsets_Mito_percent_outliers)
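
Once the combined column exists, a natural next step is to count the flagged spots and drop them. The sketch below uses a small made-up data.frame in place of colData(spe) so that it runs on its own. With the real object, the equivalent subsetting would be spe[, !spe$local_outliers], since spots are stored as columns of a SpatialExperiment.

```r
# Made-up stand-in for colData(spe): one row per spot
qc <- data.frame(
    sum_outliers = c(FALSE, TRUE, FALSE, FALSE, FALSE),
    detected_outliers = c(FALSE, TRUE, FALSE, TRUE, FALSE),
    subsets_Mito_percent_outliers = c(FALSE, FALSE, FALSE, FALSE, TRUE)
)

# A spot is a local outlier if any single metric flags it
qc$local_outliers <- qc$sum_outliers |
    qc$detected_outliers |
    qc$subsets_Mito_percent_outliers

# Count flagged versus retained spots
table(qc$local_outliers)

# Keep only the spots that passed every check
qc_filtered <- qc[!qc$local_outliers, ]
nrow(qc_filtered)
```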

Visualizing local outliers

We can visualize the local outliers using the plotQCmetrics function. This function creates a scatter plot of the specified metric and highlights the local outliers in red using the escheR package. Here, we’ll plot the log-transformed library size and highlight all local outliers. The outliers for each individual metric (library size, unique genes, mitochondrial percent) can be plotted the same way and arranged in a grid using ggpubr::ggarrange.

library(escheR)
## Loading required package: ggplot2
# all local outliers
plotQCmetrics(spe,
    metric = "sum_log", outliers = "local_outliers", point_size = 1.1,
    stroke = 0.75
) +
    ggtitle("All Local Outliers")

Removing technical artifacts using SpotSweeper

Loading example data

# load in DLPFC sample with hangnail artifact
data(DLPFC_artifact)
spe <- DLPFC_artifact

# inspect colData before artifact detection
colnames(colData(spe))
##  [1] "sample_id"          "in_tissue"          "array_row"         
##  [4] "array_col"          "key"                "sum_umi"           
##  [7] "sum_gene"           "expr_chrM"          "expr_chrM_ratio"   
## [10] "ManualAnnotation"   "subject"            "region"            
## [13] "sex"                "age"                "diagnosis"         
## [16] "sample_id_complete" "count"              "sizeFactor"

Visualizing technical artifacts

Technical artifacts can commonly be visualized by standard QC metrics, including library size, unique genes, or mitochondrial percentage. We can first visualize the technical artifacts using the plotQCmetrics function. In this sample, we can clearly see a hangnail artifact on the right side of the tissue section in the mitochondrial ratio plot.

plotQCmetrics(spe,
    metric = "expr_chrM_ratio",
    outliers = NULL, point_size = 1.1
) +
    ggtitle("Mitochondrial Percent")

Identifying artifacts using SpotSweeper

We can then use the findArtifacts function to identify artifacts in the spatial transcriptomics data. This function identifies technical artifacts based on the first principal component of the local variance of the specified QC metric (mito_percent) at several neighborhood sizes (n_order = 5). Currently, k-means clustering is used to separate technical-artifact spots from high-quality Visium spots. Similar to localOutliers, the findArtifacts function then saves its results to the colData.

# find artifacts using SpotSweeper
spe <- findArtifacts(spe,
    mito_percent = "expr_chrM_ratio",
    mito_sum = "expr_chrM",
    n_order = 5,
    name = "artifact"
)
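
The steps described above (local variance at several neighborhood sizes, a first principal component, then two-group k-means) can be sketched in base R on simulated data. This is a simplified illustration under stated assumptions, not the findArtifacts implementation: the grid, the simulated mitochondrial ratio, the neighborhood sizes, and the rule for deciding which cluster is the artifact are all choices made for this example.

```r
set.seed(1)

# Simulated 20 x 20 grid of spots; all values are made up
coords <- expand.grid(x = 1:20, y = 1:20)
n_spots <- nrow(coords)

# Mitochondrial ratio: stable in most of the tissue, but higher and much
# noisier in a strip along the right edge (a stand-in for an artifact)
in_strip <- coords$x >= 17
mito_ratio <- rnorm(n_spots, mean = 0.10, sd = 0.01)
mito_ratio[in_strip] <- rnorm(sum(in_strip), mean = 0.30, sd = 0.10)

# Local variance of the metric over each spot and its k nearest neighbors
d <- as.matrix(dist(coords))
local_variance <- function(k) {
    vapply(seq_len(n_spots), function(i) {
        neighborhood <- order(d[i, ])[seq_len(k + 1)] # spot plus k neighbors
        var(mito_ratio[neighborhood])
    }, numeric(1))
}

# One column of local variances per neighborhood size
sizes <- c(6, 18, 36)
local_vars <- sapply(sizes, local_variance)
colnames(local_vars) <- paste0("k", sizes)

# The first principal component summarizes the variance across all sizes
pc1 <- prcomp(local_vars, center = TRUE, scale. = TRUE)$x[, 1]

# Two-group k-means on PC1; call the higher-variance group the artifact
km <- kmeans(pc1, centers = 2, nstart = 10)
high_var_cluster <- which.max(tapply(local_vars[, "k18"], km$cluster, mean))
artifact <- km$cluster == high_var_cluster

# Compare the calls with the simulated strip
table(called = artifact, simulated = in_strip)
```

Spots just outside the strip also have neighborhoods that mix both regimes, so their local variance is elevated and some of them may be called along with the strip; using several neighborhood sizes is one way to balance sensitivity against this kind of boundary effect.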

# check that "artifact" is now in colData
colnames(colData(spe))
##  [1] "sample_id"           "in_tissue"           "array_row"          
##  [4] "array_col"           "key"                 "sum_umi"            
##  [7] "sum_gene"            "expr_chrM"           "expr_chrM_ratio"    
## [10] "ManualAnnotation"    "subject"             "region"             
## [13] "sex"                 "age"                 "diagnosis"          
## [16] "sample_id_complete"  "count"               "sizeFactor"         
## [19] "expr_chrM_ratio_log" "coords"              "k6"                 
## [22] "k18"                 "k36"                 "k60"                
## [25] "k90"                 "artifact"
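
The new columns k6 through k90 line up with the number of spots contained in the first one to five concentric rings around a spot on a hexagonal grid such as the one used by Visium. This reading is inferred from the column names and from n_order = 5, not taken from the package documentation, so treat it as an assumption. The arithmetic can be checked as follows:

```r
# On a hexagonal grid, ring i around a spot contains 6 * i spots, so the
# total within n rings is 6 * (1 + 2 + ... + n) = 3 * n * (n + 1)
n_order <- 5
ring_sizes <- 6 * seq_len(n_order)
neighborhood_sizes <- cumsum(ring_sizes)

neighborhood_sizes
# [1]  6 18 36 60 90

# Column names built from these sizes
paste0("k", neighborhood_sizes)
```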

Visualizing artifacts

We can visualize the detected artifacts with the plotQCmetrics function, which uses the escheR package, by passing the new artifact column as the outliers argument.

plotQCmetrics(spe,
    metric = "expr_chrM_ratio",
    outliers = "artifact", point_size = 1.1
) +
    ggtitle("Hangnail artifact")

Session information

utils::sessionInfo()
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Etc/UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] escheR_1.7.0                ggplot2_3.5.1              
##  [3] STexampleData_1.14.0        SpatialExperiment_1.17.0   
##  [5] SingleCellExperiment_1.29.1 SummarizedExperiment_1.37.0
##  [7] Biobase_2.67.0              GenomicRanges_1.59.1       
##  [9] GenomeInfoDb_1.43.2         IRanges_2.41.2             
## [11] S4Vectors_0.45.2            MatrixGenerics_1.19.0      
## [13] matrixStats_1.4.1           ExperimentHub_2.15.0       
## [15] AnnotationHub_3.15.0        BiocFileCache_2.15.0       
## [17] dbplyr_2.5.0                BiocGenerics_0.53.3        
## [19] generics_0.1.3              SpotSweeper_1.3.2          
## [21] BiocStyle_2.35.0           
## 
## loaded via a namespace (and not attached):
##  [1] DBI_1.2.3               rlang_1.1.4             magrittr_2.0.3         
##  [4] spatialEco_2.0-2        compiler_4.4.2          RSQLite_2.3.9          
##  [7] png_0.1-8               vctrs_0.6.5             pkgconfig_2.0.3        
## [10] crayon_1.5.3            fastmap_1.2.0           magick_2.8.5           
## [13] XVector_0.47.0          scuttle_1.17.0          labeling_0.4.3         
## [16] utf8_1.2.4              rmarkdown_2.29          UCSC.utils_1.3.0       
## [19] purrr_1.0.2             bit_4.5.0.1             xfun_0.49              
## [22] zlibbioc_1.52.0         cachem_1.1.0            beachmat_2.23.2        
## [25] jsonlite_1.8.9          blob_1.2.4              DelayedArray_0.33.3    
## [28] BiocParallel_1.41.0     terra_1.7-83            parallel_4.4.2         
## [31] R6_2.5.1                bslib_0.8.0             jquerylib_0.1.4        
## [34] Rcpp_1.0.13-1           knitr_1.49              Matrix_1.7-1           
## [37] tidyselect_1.2.1        abind_1.4-8             yaml_2.3.10            
## [40] codetools_0.2-20        curl_6.0.1              lattice_0.22-6         
## [43] tibble_3.2.1            withr_3.0.2             KEGGREST_1.47.0        
## [46] evaluate_1.0.1          Biostrings_2.75.1       pillar_1.9.0           
## [49] BiocManager_1.30.25     filelock_1.0.3          BiocVersion_3.21.1     
## [52] munsell_0.5.1           scales_1.3.0            glue_1.8.0             
## [55] maketools_1.3.1         tools_4.4.2             BiocNeighbors_2.1.1    
## [58] sys_3.4.3               buildtools_1.0.0        grid_4.4.2             
## [61] AnnotationDbi_1.69.0    colorspace_2.1-1        GenomeInfoDbData_1.2.13
## [64] cli_3.6.3               rappdirs_0.3.3          fansi_1.0.6            
## [67] S4Arrays_1.7.1          viridisLite_0.4.2       dplyr_1.1.4            
## [70] gtable_0.3.6            sass_0.4.9              digest_0.6.37          
## [73] SparseArray_1.7.2       rjson_0.2.23            farver_2.1.2           
## [76] memoise_2.0.1           htmltools_0.5.8.1       lifecycle_1.0.4        
## [79] httr_1.4.7              mime_0.12               bit64_4.5.2            
## [82] MASS_7.3-61