As the number of cells that are routinely interrogated in single-cell experiments increases rapidly and can easily reach hundreds of thousands, the computational resource requirements also grow larger. Several approaches have been proposed to either subsample or aggregate cells in order to reduce the size of the data and enable the application of standard analysis procedures. One such approach is geometric sketching - subsampling in a density-aware manner in such a way that densely populated regions of the gene expression space are subsampled more aggressively, while a larger fraction of cells are retained in sparsely populated regions. In addition to reducing the size of the data set, this often also increases the relative representation of rare cell types in the subsampled data set.
Several tools have been developed for applying sketching to
single-cell (or other) data sets, but not all of them are easily
applicable from R. The sketchR
package implements an
R/Bioconductor interface to some of the most popular python packages for
geometric sketching, allowing them to be directly incorporated into
Bioconductor-based single-cell analysis workflows. The interaction with
python is managed using the basilisk
package, which
automatically takes care of generating and activating a suitable conda
environment with the required packages.
This vignette showcases the main functionalities of the
sketchR
package, and illustrates how it can be used to
generate a subsample of a dataset using the geometric
sketching/subsampling algorithms and implementations proposed by Hie et al. (2019) and Song et al. (2022), as well as create a set of
diagnostic plots.
sketchR
can be installed from Bioconductor using the
following code:
We start by loading the required packages and preparing an example data set.
suppressPackageStartupMessages({
library(sketchR)
library(TENxPBMCData)
library(scuttle)
library(scran)
library(scater)
library(SingleR)
library(celldex)
library(cowplot)
library(SummarizedExperiment)
library(SingleCellExperiment)
library(beachmat.hdf5)
})
We will use the PBMC3k data set from the TENxPBMCData Bioconductor package for illustration. The chunk below prepares the data set by calculating log-transformed normalized counts, finding highly variable genes, performing dimensionality reduction and predicting cell type labels using the SingleR package.
## Load data
pbmc3k <- TENxPBMCData::TENxPBMCData(dataset = "pbmc3k")
#> see ?TENxPBMCData and browseVignettes('TENxPBMCData') for documentation
#> downloading 1 resources
#> retrieving 1 resource
#> loading from cache
## Set row and column names
colnames(pbmc3k) <- paste0("Cell", seq_len(ncol(pbmc3k)))
rownames(pbmc3k) <- scuttle::uniquifyFeatureNames(
ID = SummarizedExperiment::rowData(pbmc3k)$ENSEMBL_ID,
names = SummarizedExperiment::rowData(pbmc3k)$Symbol_TENx
)
## Normalize and log-transform counts
pbmc3k <- scuttle::logNormCounts(pbmc3k)
## Find highly variable genes
dec <- scran::modelGeneVar(pbmc3k)
top.hvgs <- scran::getTopHVGs(dec, n = 2000)
## Perform dimensionality reduction
set.seed(100)
pbmc3k <- scater::runPCA(pbmc3k, subset_row = top.hvgs)
pbmc3k <- scater::runTSNE(pbmc3k, dimred = "PCA")
## Predict cell type labels
ref_monaco <- celldex::MonacoImmuneData()
pred_monaco_main <- SingleR::SingleR(test = pbmc3k, ref = ref_monaco,
labels = ref_monaco$label.main)
pbmc3k$labels_main <- pred_monaco_main$labels
dim(pbmc3k)
#> [1] 32738 2700
The geosketch()
function performs geometric sketching by
calling the geosketch
python package. The
output is a vector of indices that can be used to subset the full
dataset. The provided seed will be propagated to the python code to
achieve reproducibility.
idx800gs <- geosketch(SingleCellExperiment::reducedDim(pbmc3k, "PCA"),
N = 800, seed = 123)
#> + /github/home/.cache/R/basilisk/1.19.0/0/bin/conda create --yes --prefix /github/home/.cache/R/basilisk/1.19.0/sketchR/1.3.0/universal 'python=3.11.10' --quiet -c bioconda -c conda-forge --override-channels
#> + /github/home/.cache/R/basilisk/1.19.0/0/bin/conda install --yes --prefix /github/home/.cache/R/basilisk/1.19.0/sketchR/1.3.0/universal 'python=3.11.10' -c bioconda -c conda-forge --override-channels
#> + /github/home/.cache/R/basilisk/1.19.0/0/bin/conda install --yes --prefix /github/home/.cache/R/basilisk/1.19.0/sketchR/1.3.0/universal -c bioconda -c conda-forge 'python=3.11.10' '_libgcc_mutex=0.1' '_openmp_mutex=4.5' 'anndata=0.10.9' 'array-api-compat=1.9' 'brotli=1.1.0' 'brotli-bin=1.1.0' 'bzip2=1.0.8' 'c-ares=1.33.1' 'ca-certificates=2024.8.30' 'cached-property=1.5.2' 'cached_property=1.5.2' 'certifi=2024.8.30' 'colorama=0.4.6' 'contourpy=1.3.0' 'cycler=0.12.1' 'exceptiongroup=1.2.2' 'fonttools=4.54.1' 'freetype=2.12.1' 'get-annotations=0.1.2' 'h5py=3.11.0' 'hdf5=1.14.3' 'icu=75.1' 'joblib=1.4.2' 'keyutils=1.6.1' 'kiwisolver=1.4.7' 'krb5=1.21.3' 'lcms2=2.15' 'ld_impl_linux-64=2.43' 'legacy-api-wrap=1.4' 'lerc=4.0.0' 'libaec=1.1.3' 'libblas=3.9.0' 'libbrotlicommon=1.1.0' 'libbrotlidec=1.1.0' 'libbrotlienc=1.1.0' 'libcblas=3.9.0' 'libcurl=8.10.1' 'libdeflate=1.18' 'libedit=3.1.20191231' 'libev=4.33' 'libexpat=2.6.3' 'libffi=3.4.2' 'libgcc=14.1.0' 'libgcc-ng=14.1.0' 'libgfortran=14.1.0' 'libgfortran-ng=14.1.0' 'libgfortran5=14.1.0' 'libgomp=14.1.0' 'libhwloc=2.11.1' 'libiconv=1.17' 'libjpeg-turbo=2.1.5.1' 'liblapack=3.9.0' 'libllvm14=14.0.6' 'libnghttp2=1.58.0' 'libnsl=2.0.1' 'libopenblas=0.3.27' 'libpng=1.6.44' 'libsqlite=3.46.1' 'libssh2=1.11.0' 'libstdcxx=14.1.0' 'libstdcxx-ng=14.1.0' 'libtiff=4.5.1' 'libuuid=2.38.1' 'libwebp-base=1.4.0' 'libxcb=1.15' 'libxcrypt=4.4.36' 'libxml2=2.12.7' 'libzlib=1.3.1' 'llvmlite=0.43.0' 'matplotlib-base=3.9.2' 'munkres=1.1.4' 'natsort=8.4.0' 'ncurses=6.5' 'networkx=3.3' 'numba=0.60.0' 'numpy=1.26.4' 'openjpeg=2.5.0' 'openssl=3.3.2' 'packaging=24.1' 'pandas=2.2.3' 'patsy=0.5.6' 'pillow=10.0.0' 'pip=24.2' 'pthread-stubs=0.4' 'pynndescent=0.5.13' 'pyparsing=3.1.4' 'python=3.11.10' 'python-dateutil=2.9.0' 'python-tzdata=2024.2' 'python_abi=3.11' 'pytz=2024.1' 'qhull=2020.2' 'readline=8.2' 'scanpy=1.10.3' 'scikit-learn=1.5.2' 'scipy=1.13.1' 'seaborn=0.13.2' 'seaborn-base=0.13.2' 'session-info=1.0.0' 'setuptools=75.1.0' 'six=1.16.0' 'statsmodels=0.14.4' 'stdlib-list=0.10.0' 'tbb=2021.13.0' 'threadpoolctl=3.5.0' 'tk=8.6.13' 'tqdm=4.66.5' 'tzdata=2024b' 'umap-learn=0.5.6' 'wheel=0.44.0' 'xorg-libxau=1.0.11' 'xorg-libxdmcp=1.1.5' 'xz=5.2.6' 'zstd=1.5.6' --override-channels
head(idx800gs)
#> [1] 1 5 8 11 15 16
length(idx800gs)
#> [1] 800
Similarly, the scsampler()
function calls the
scSampler
python package
to perform subsampling.
idx800scs <- scsampler(SingleCellExperiment::reducedDim(pbmc3k, "PCA"),
N = 800, seed = 123)
head(idx800scs)
#> [1] 1079 644 1494 1278 5 391
length(idx800scs)
#> [1] 800
To illustrate the result of the subsampling, we plot the tSNE representation of the original data as well as the two subsets (using the tSNE coordinates derived from the full dataset).
cowplot::plot_grid(
scater::plotTSNE(pbmc3k, colour_by = "labels_main"),
scater::plotTSNE(pbmc3k[, idx800gs], colour_by = "labels_main"),
scater::plotTSNE(pbmc3k[, idx800scs], colour_by = "labels_main")
)
We can also illustrate the relative abundance of each cell type in the full data and in the subsets, respectively.
sketchR
provides a convenient function to plot the
Hausdorff distance between the full dataset and the subsample, for a
range of sketch sizes (to make this plot reproducible, we use
set.seed
before the call).
sessionInfo()
#> R version 4.4.2 (2024-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] beachmat.hdf5_1.5.1 cowplot_1.1.3
#> [3] celldex_1.16.0 SingleR_2.9.4
#> [5] scater_1.35.0 ggplot2_3.5.1
#> [7] scran_1.35.0 scuttle_1.17.0
#> [9] TENxPBMCData_1.24.0 HDF5Array_1.35.2
#> [11] rhdf5_2.51.1 DelayedArray_0.33.3
#> [13] SparseArray_1.7.2 S4Arrays_1.7.1
#> [15] abind_1.4-8 Matrix_1.7-1
#> [17] SingleCellExperiment_1.29.1 SummarizedExperiment_1.37.0
#> [19] Biobase_2.67.0 GenomicRanges_1.59.1
#> [21] GenomeInfoDb_1.43.2 IRanges_2.41.2
#> [23] S4Vectors_0.45.2 BiocGenerics_0.53.3
#> [25] generics_0.1.3 MatrixGenerics_1.19.0
#> [27] matrixStats_1.4.1 sketchR_1.3.0
#> [29] BiocStyle_2.35.0 rmarkdown_2.29
#>
#> loaded via a namespace (and not attached):
#> [1] sys_3.4.3 jsonlite_1.8.9
#> [3] magrittr_2.0.3 ggbeeswarm_0.7.2
#> [5] gypsum_1.3.0 farver_2.1.2
#> [7] vctrs_0.6.5 memoise_2.0.1
#> [9] DelayedMatrixStats_1.29.0 htmltools_0.5.8.1
#> [11] AnnotationHub_3.15.0 curl_6.0.1
#> [13] BiocNeighbors_2.1.2 Rhdf5lib_1.29.0
#> [15] sass_0.4.9 alabaster.base_1.7.2
#> [17] bslib_0.8.0 basilisk_1.19.0
#> [19] httr2_1.0.7 cachem_1.1.0
#> [21] buildtools_1.0.0 igraph_2.1.2
#> [23] mime_0.12 lifecycle_1.0.4
#> [25] pkgconfig_2.0.3 rsvd_1.0.5
#> [27] R6_2.5.1 fastmap_1.2.0
#> [29] GenomeInfoDbData_1.2.13 digest_0.6.37
#> [31] colorspace_2.1-1 AnnotationDbi_1.69.0
#> [33] dqrng_0.4.1 irlba_2.3.5.1
#> [35] ExperimentHub_2.15.0 RSQLite_2.3.9
#> [37] beachmat_2.23.5 labeling_0.4.3
#> [39] filelock_1.0.3 httr_1.4.7
#> [41] compiler_4.4.2 bit64_4.5.2
#> [43] withr_3.0.2 BiocParallel_1.41.0
#> [45] viridis_0.6.5 DBI_1.2.3
#> [47] alabaster.ranges_1.7.0 alabaster.schemas_1.7.0
#> [49] rappdirs_0.3.3 bluster_1.17.0
#> [51] tools_4.4.2 vipor_0.4.7
#> [53] beeswarm_0.4.0 glue_1.8.0
#> [55] rhdf5filters_1.19.0 grid_4.4.2
#> [57] Rtsne_0.17 cluster_2.1.8
#> [59] gtable_0.3.6 BiocSingular_1.23.0
#> [61] ScaledMatrix_1.15.0 metapod_1.15.0
#> [63] XVector_0.47.1 ggrepel_0.9.6
#> [65] BiocVersion_3.21.1 pillar_1.10.0
#> [67] limma_3.63.2 dplyr_1.1.4
#> [69] BiocFileCache_2.15.0 lattice_0.22-6
#> [71] bit_4.5.0.1 tidyselect_1.2.1
#> [73] locfit_1.5-9.10 maketools_1.3.1
#> [75] Biostrings_2.75.3 knitr_1.49
#> [77] gridExtra_2.3 edgeR_4.5.1
#> [79] xfun_0.49 statmod_1.5.0
#> [81] UCSC.utils_1.3.0 yaml_2.3.10
#> [83] evaluate_1.0.1 codetools_0.2-20
#> [85] tibble_3.2.1 alabaster.matrix_1.7.4
#> [87] BiocManager_1.30.25 cli_3.6.3
#> [89] reticulate_1.40.0 munsell_0.5.1
#> [91] jquerylib_0.1.4 Rcpp_1.0.13-1
#> [93] dir.expiry_1.15.0 dbplyr_2.5.0
#> [95] png_0.1-8 parallel_4.4.2
#> [97] blob_1.2.4 basilisk.utils_1.19.0
#> [99] sparseMatrixStats_1.19.0 alabaster.se_1.7.0
#> [101] viridisLite_0.4.2 scales_1.3.0
#> [103] purrr_1.0.2 crayon_1.5.3
#> [105] rlang_1.1.4 KEGGREST_1.47.0