---
title: "Subsampling single-cell data sets with sketchR"
date: "`r Sys.Date()`"
output: rmarkdown::html_document
vignette: >
    %\VignetteIndexEntry{sketchR}
    %\VignetteEncoding{UTF-8}
    %\VignetteEngine{knitr::rmarkdown}
editor_options: 
    chunk_output_type: console
bibliography:
    bibliography.bib
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  eval = TRUE,
  crop = NULL
)
library(BiocStyle)
```

## Introduction

As the number of cells that are routinely interrogated in single-cell 
experiments increases rapidly and can easily reach hundreds of thousands, the 
computational resource requirements also grow larger. Several approaches have 
been proposed to either subsample or aggregate cells in order to reduce the 
size of the data and enable the application of standard analysis procedures. 
One such approach is geometric sketching - subsampling in a density-aware 
manner in such a way that densely populated regions of the 
gene expression space are subsampled more aggressively, while a larger 
fraction of cells are retained in sparsely populated regions. In addition to 
reducing the size of the data set, this often also increases the relative 
representation of rare cell types in the subsampled data set.

Several tools have been developed for applying sketching to single-cell 
(or other) data sets, but not all of them are easily applicable from R.
The `sketchR` package implements an R/Bioconductor interface to some of 
the most popular python packages for geometric sketching, allowing them to be 
directly incorporated into Bioconductor-based single-cell analysis workflows. 
The interaction with python is managed using the `basilisk` package, which 
automatically takes care of generating and activating a suitable conda 
environment with the required packages. 

This vignette showcases the main functionalities of the `sketchR` package, 
and illustrates how it can be used to generate a subsample of a dataset using 
the geometric sketching/subsampling algorithms and implementations proposed by 
@Hie2019-geosketch and @Song2022-scsampler, as well as create a set of 
diagnostic plots. 

## Installation

`sketchR` can be installed from Bioconductor using the following code: 

```{r}
#| eval: false

if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("sketchR")
```

## Preparation

We start by loading the required packages and preparing an example data set. 

```{r setup}
suppressPackageStartupMessages({
    library(sketchR)
    library(TENxPBMCData)
    library(scuttle)
    library(scran)
    library(scater)
    library(SingleR)
    library(celldex)
    library(cowplot)
    library(SummarizedExperiment)
    library(SingleCellExperiment)
    library(beachmat.hdf5)
})
```

We will use the PBMC3k data set from the `r Biocpkg("TENxPBMCData")` 
Bioconductor package for illustration. The chunk below prepares the data set
by calculating log-transformed normalized counts, finding highly variable 
genes, performing dimensionality reduction and predicting cell type labels 
using the `r Biocpkg("SingleR")` package.

```{r}
## Load data
pbmc3k <- TENxPBMCData::TENxPBMCData(dataset = "pbmc3k")

## Set row and column names
colnames(pbmc3k) <- paste0("Cell", seq_len(ncol(pbmc3k)))
rownames(pbmc3k) <- scuttle::uniquifyFeatureNames(
    ID = SummarizedExperiment::rowData(pbmc3k)$ENSEMBL_ID,
    names = SummarizedExperiment::rowData(pbmc3k)$Symbol_TENx
)

## Normalize and log-transform counts
pbmc3k <- scuttle::logNormCounts(pbmc3k)

## Find highly variable genes
dec <- scran::modelGeneVar(pbmc3k)
top.hvgs <- scran::getTopHVGs(dec, n = 2000)

## Perform dimensionality reduction
set.seed(100)
pbmc3k <- scater::runPCA(pbmc3k, subset_row = top.hvgs)
pbmc3k <- scater::runTSNE(pbmc3k, dimred = "PCA")

## Predict cell type labels
ref_monaco <- celldex::MonacoImmuneData()
pred_monaco_main <- SingleR::SingleR(test = pbmc3k, ref = ref_monaco, 
                                     labels = ref_monaco$label.main)
pbmc3k$labels_main <- pred_monaco_main$labels

dim(pbmc3k)
```

## Subsampling

The `geosketch()` function performs geometric sketching by calling the 
`geosketch` [python package](https://github.com/brianhie/geosketch). 
The output is a vector of indices that can be used to subset the full 
dataset. The provided seed will be propagated to the python code to 
achieve reproducibility. 

```{r}
idx800gs <- geosketch(SingleCellExperiment::reducedDim(pbmc3k, "PCA"), 
                      N = 800, seed = 123)
head(idx800gs)
length(idx800gs)
```

Similarly, the `scsampler()` function calls the `scSampler` 
[python package](https://github.com/SONGDONGYUAN1994/scsampler) to 
perform subsampling. 

```{r}
idx800scs <- scsampler(SingleCellExperiment::reducedDim(pbmc3k, "PCA"), 
                       N = 800, seed = 123)
head(idx800scs)
length(idx800scs)
```

To illustrate the result of the subsampling, we plot the tSNE 
representation of the original data as well as the two subsets (using the 
tSNE coordinates derived from the full dataset).

```{r, fig.width = 10, fig.height = 8}
cowplot::plot_grid(
    scater::plotTSNE(pbmc3k, colour_by = "labels_main"),
    scater::plotTSNE(pbmc3k[, idx800gs], colour_by = "labels_main"),
    scater::plotTSNE(pbmc3k[, idx800scs], colour_by = "labels_main")
)
```

We can also illustrate the relative abundance of each cell type in the 
full data and in the subsets, respectively. 

```{r, fig.width = 6, fig.height = 8}
compareCompositionPlot(SummarizedExperiment::colData(pbmc3k), 
                       idx = list(geosketch = idx800gs,
                                  scSampler = idx800scs), 
                       column = "labels_main")
```

## Diagnostic plots

`sketchR` provides a convenient function to plot the Hausdorff distance 
between the full dataset and the subsample, for a range of sketch 
sizes (to make this plot reproducible, we use `set.seed` before the
call).

```{r}
set.seed(123)
hausdorffDistPlot(mat = SingleCellExperiment::reducedDim(pbmc3k, "PCA"), 
                  Nvec = c(400, 800, 2000),
                  Nrep = 3, methods = c("geosketch", "scsampler", "uniform"))
```

## Session info

```{r}
sessionInfo()
```

## References