---
title: "GSVA on single-cell RNA-seq data"
author:
- name: Robert Castelo
  affiliation:
  - &idupf Dept. of Medicine and Life Sciences, Universitat Pompeu Fabra, Barcelona, Spain
  email: robert.castelo@upf.edu
- name: Axel Klenk
  affiliation: *idupf
  email: axel.klenk@gmail.com
- name: Justin Guinney 
  affiliation: 
  - Tempus Labs, Inc.
  email: justin.guinney@tempus.com
abstract: >
  Here we illustrate how to use GSVA with single-cell RNA sequencing (scRNA-seq)
  data.
date: "`r BiocStyle::doc_date()`"
package: "`r pkg_ver('GSVA')`"
vignette: >
  %\VignetteIndexEntry{GSVA on single-cell RNA-seq data}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
  %\VignetteKeywords{GeneExpression, Microarray, RNAseq, GeneSetEnrichment, Pathway}
output:
  BiocStyle::html_document:
    toc: true
    toc_float: true
    number_sections: true
    fig_captions: yes
bibliography: GSVA.bib
---

**License**: `r packageDescription("GSVA")[["License"]]`

```{r setup, include=FALSE}
options(width=80)
knitr::opts_chunk$set(collapse=TRUE,
                      message=FALSE,
                      warning=FALSE,
                      comment="",
                      fig.align="center",
                      fig.wide=TRUE)
```

# Introduction

GSVA provides now specific support for single-cell data in the algorithm
that runs through the `gsvaParam()` parameter constructor, and originally
described in the publication by @haenzelmann2013gsva. At the moment, this
specific support consists of the following features:

  * The input expression data can be stored in different types of data
    containers prepared to store sparse single-cell data. These types of
    sparse data containers can be broadly categorized in those that only
    store the expression values, and those that may store additional row
    and column metadata. The currently available value-only containers for
    input are `dgCMatrix`, `SVT_SparseArray`, `HDF5Matrix` and `DelayedMatrix`.
    The currently available container for single-cell data that allows one
    to input additional row and column metadata is a `SingleCellExperiment`
    object.
  * While the input single-cell data is always sparse, the output of enrichment
    scores will be always dense, and therefore, the container storing those
    scores will be different from the input data, typically a `matrix` or a
    dense `DelayedMatrix` object using an `HDF5Matrix` backend. The latter will
    be particularly used when the total number of values exceeds 2^31, which is
    the largest 32-bit standard integer value in R.
  * By default, when the input expression data is stored in a sparse data
    container, as it typically happens with single-cell data, then a sparse
    regime of the GSVA algorithm will run, if GSVA is the chosen method, by
    which nonzero values are treated differently from zero values, leading to
    slightly different results than those obtained by applying the classical
    GSVA algorithm. If we set the parameter `sparse=FALSE` in the call to
    `gsvaParam()`, the classical GSVA algorithm will be used, which for a
    typical single-cell data set will result in longer running times and larger
    memory consumption than running it in the default sparse regime for this
    type of data.
  * The GSVA algorithm can be run either at once through a call to `gsva()`
    with a parameter object or in three steps: (1) row normalization with
    `gsvaRowNorm()`; (2) column rank transformation with `gsvaColRanks()`;
    and (3) column enrichment scores calculation with `gsvaColScores()`.
    Splitting the GSVA algorithm into these three steps allows one to distribute
    and balance the computational load of the algorithm in a high-performance
    computing (HPC) environment with multiple nodes, and to reuse the output of
    the first two steps, which are independent of the gene sets, to calculate
    enrichment scores for different collections of gene sets, without having to
    repeat the first two steps.

In what follows, we will illustrate the use of GSVA on a publicly available
single-cell transcriptomics data set of peripheral blood mononuclear cells
(PBMCs) published by @zheng2017massively.

# Import data

We import the PBMC data using the `r Biocpkg("TENxPBMCData")` package, as a
`r Biocpkg("SingleCellExperiment")` object.


```{r, message=FALSE, warning=FALSE}
library(SingleCellExperiment)
library(TENxPBMCData)

sce <- TENxPBMCData(dataset="pbmc4k")
sce
```

# Quality control and pre-processing

Here, we perform a quality control (QC) and pre-processing steps using the
package `r Biocpkg("scrapper")` [@lun2022powering]. We start identifying
mitochondrial genes.

```{r, message=FALSE, warning=FALSE}
library(scrapper)

is_mito <- grepl("^MT-", rowData(sce)$Symbol_TENx)
table(is_mito)
```
Calculate QC metrics and filter out low-quality cells.

```{r}
sce <- quickRnaQc.se(sce, subsets=list(mito=is_mito))
sce <- sce[, sce$keep]
dim(sce)
```
We filter out genes that are expressed in less than 1% of the cells.

```{r}
cellsxgene <- rowSums(counts(sce) > 0)
sce <- sce[cellsxgene > floor(ncol(sce)*0.01), ]
dim(sce)
```
Calculate library size factors and normalized units of expression in
logarithmic scale.

```{r}
sce <- normalizeRnaCounts.se(sce, size.factors=sce$sum)
assayNames(sce)
```

# Annotate cell types using GSVA

Here, we illustrate how to annotate cell types in the PBMC data using GSVA and
a collection of relevant gene sets.

## Read gene sets in GMT format

First, we fetch a collection of 22 leukocyte gene set signatures, containing
a total 547 genes, which should help to distinguish among 22 mature human
hematopoietic cell type populations isolated from peripheral blood or
*in vitro* culture conditions, including seven T cell types: naïve and memory B
cells, plasma cells, NK cell, and myeloid subsets. These gene sets have been
used in the benchmarking publication by @diaz2019evaluation, and were
originally compiled by the [CIBERSORT](https://cibersortx.stanford.edu)
developers, where they called it the LM22 signature [@newman2015robust]. The
LM22 signature is stored in the `r Biocpkg("GSVAdata")` experiment data
package as a compressed text file in
[GMT format](https://www.genepattern.org/file-formats-guide/#GMT), which can
be read into R using the `readGMT()` function from the `r Biocpkg("GSVA")`
package, which will return the gene sets, by default, into a
`GeneSetCollection` object, defined in the `r Biocpkg("GSEABase")` package.
This default argument can be changed to return the gene sets into a base
`list` object by setting `valueType="list"` in the call to `readGMT()`.

```{r, message=FALSE, warning=FALSE}
library(GSEABase)
library(GSVA)

fname <- file.path(system.file("extdata", package="GSVAdata"),
                   "pbmc_cell_type_gene_set_signatures.gmt.gz")
gsets <- readGMT(fname)
gsets
```

## Add gene identifier type metadata

Note that while gene identifers in the `sce` object correspond to
[Ensembl stable identifiers](https://www.ensembl.org/info/genome/stable_ids/index.html)
(`ENSG...`), the gene identifiers in the gene sets are
[HGNC](https://www.genenames.org) gene symbols. This, in principle, precludes
matching directly what gene in the single-cell data object `sce` corresponds to
what gene set in the `GeneSetCollection` object `gsets`. However, the
`r Biocpkg("GSVA")` package can do that matching as long as the appropriate
metadata is present in both objects.

In the case of a `GeneSetCollection` object, its `geneIdType` metadata slot
stores the type of gene identifier. In the case of a `SingleCellExperiment`
object, such as the previous `sce` object, such metadata is not present.
However, using the function `gsvaAnnotation()` from the `r Biocpkg("GSVA")`
package, and the helper function `ENSEMBLIdentifier()` from the
`r Biocpkg("GSEABase")` package, we add such metadata to the `sce` object as
follows.

```{r}
gsvaAnnotation(sce) <- ENSEMBLIdentifier("org.Hs.eg.db")
gsvaAnnotation(sce)
```

## Build parameter object

We first build a parameter object using the function `gsvaParam()`. By
default, the expression values in the `logcounts` assay will be selected for
downstream analysis.

```{r}
gsvapar <- gsvaParam(sce, gsets)
gsvapar
```

## Calculate GSVA scores

While at this point, we could already run the entire GSVA algorithm with a call
to the `gsva(gsvapar)` function. We show here how to do it in three steps.
First we calculate row-normalized expression values using the function
`gsvaRowNorm()`, which if, as in this example, the given input is a
`SingleCellExperiment` object, then the output will be the same object with
an additional assay called `gsvarownr` containing the row-normalized expression
values.

```{r}
gsvarownorm <- gsvaRowNorm(gsvapar)
gsvarownorm
assayNames(gsvarownorm)
```
Second, we calculate GSVA column rank values using the function
`gsvaColRanks()`, which takes as input the output of `gsvaRowNorm()`, and returns
the column rank values in a new assay called `gsvaranks`, if the input is a
`SingleCellExperiment` object.

```{r}
gsvacolranks <- gsvaColRanks(gsvarownorm)
gsvacolranks
assayNames(gsvacolranks)
```
Third, we finally calculate the GSVA scores using the output of
`gsvaColRanks()` as input to the function `gsvaColScores()`. By default, this
function will calculate the scores for all gene sets specified in the input
parameter object given in the call to `gsvaRowNorm()`.

```{r}
es <- gsvaColScores(gsvacolranks)
es
```
However, we could calculate the scores for another collection of gene sets,
without having to calculate the column ranks again, by giving this other
collection of gene sets as second argument to the call to `gsvaColScores()`.

```{r, eval=FALSE}
es2 <- gsvaColScores(gsvacolranks, alternative_gsets)
```

## Using GSVA scores to assign cell types

Following @amezquita2020orchestrating, and some of the steps described in
"Chapter 5 Clustering" of the first version of the
[OSCA book](https://bioconductor.org/books/3.16/OSCA.basic/clustering.html),
we use GSVA scores to build a nearest-neighbor graph of the cells using the
function `makeSNNGraph()` from the `r Biocpkg("bluster")` package. The
parameter `k` in the call to `makeSNNGraph()` specifies the number of nearest
neighbors to consider during graph construction, and here we set `k=20` because
it leads to a number of clusters close to the expected number of cell types.

```{r, message=FALSE, warning=FALSE}
library(bluster)

g <- makeSNNGraph(t(assay(es)), k=20)
```
Second, we use the function `cluster_walktrap()` from the `r CRANpkg("igraph")`
package [@csardi2006igraph], to cluster cells by finding densely connected
subgraphs. We store the resulting vector of cluster indicator values into the
`sce` object using the function `colLabels()`.


```{r, message=FALSE, warning=FALSE}
library(igraph)

colLabels(es) <- factor(cluster_walktrap(g)$membership)
table(colLabels(es))
```
Similarly to @diaz2019evaluation, we apply a simple cell type assignment
algorithm, which consists of selecting at each cell the gene set with highest
GSVA score, tallying the selected gene sets per cluster, and assigning to the
cluster the most frequent gene set, storing that assignment into the `sce`
object with the function `colLabels()`.

```{r}
whmax <- apply(assay(es), 2, which.max)
gsxlab <- split(rownames(es)[whmax], colLabels(es))
gsxlab <- names(sapply(sapply(gsxlab, table), which.max))
colLabels(es) <- factor(gsub("[0-9]\\.", "", gsxlab))[colLabels(es)]
table(colLabels(es))
```
We can visualize the cell type assignments by projecting cells dissimilarity in
two dimensions with a principal components analysis (PCA) on the GSVA scores,
and coloring cells using the previously assigned clusters.

```{r scpcaclusters, echo=TRUE, message=FALSE, warning=FALSE, fig.height=5, fig.width=6, out.width="600px", fig.cap="Cell type assignments of PBMC scRNA-seq data, based on GSVA scores."}
library(RColorBrewer)

res <- prcomp(assay(es))
varexp <- res$sdev^2 / sum(res$sdev^2)
nclusters <- nlevels(colLabels(es))
hmcol <- colorRampPalette(brewer.pal(nclusters, "Set1"))(nclusters)
par(mar=c(4, 5, 1, 1))
plot(res$rotation[, 1], res$rotation[, 2], col=hmcol[colLabels(es)], pch=19,
     xlab=sprintf("PCA 1 (%.0f%%)", varexp[1]*100),
     ylab=sprintf("PCA 2 (%.0f%%)", varexp[2]*100),
     las=1, cex.axis=1.2, cex.lab=1.5)
mask <- colLabels(es) == "NK_CELLS_RESTING"
points(res$rotation[mask, 1], res$rotation[mask, 2], ## show the overlap better
       col=hmcol[colLabels(es)[mask]], pch=19)
legend("bottomright", gsub("_", " ", levels(colLabels(es))), fill=hmcol, inset=0.01)
```
Finally, if we want to better understand why a specific cell type is annotated
to a given cell, we can use the `gsvaEnrichment()` function, which will show
a GSEA enrichment plot. This function takes as input the output of
`gsvaRanks()`, a given column (cell) in the input single-cell data, and a given
gene set. In Figure \@ref(fig:gsvaenrichment) below, we show such a plot for the
first cell annotated to the monocytes cell type.

```{r gsvaenrichment, echo=TRUE, fig.height=5, fig.width=5, out.width="600px", fig.cap="GSVA enrichment plot of the EOSINOPHILS gene set in the expression profile of the first cell annotated to that cell type."}
firstmonocytecell <- which(colLabels(es) == "MONOCYTES")[1]
par(mar=c(4, 5, 1, 1))
gsvaEnrichment(gsvacolranks, column=firstmonocytecell, geneSet="MONOCYTES",
               cex.axis=1.2, cex.lab=1.5, plot="ggplot")
```
In the previous call to `gsvaEnrichment()` we used the argument `plot="ggplot"`
to produce a plot with the [ggplot2](https://cran.r-project.org/package=ggplot2)
package. By default, if we call `gsvaEnrichment()` interactively, it will
produce a plot using "base R", but either when we do it non-interactively, or
when we set `plot="no"` it will return a `data.frame` object with the enrichment
data.

# Benchmarking

We are still benchmarking and testing this version of GSVA for single-cell
data. If you encounter problems or have suggestions, do not hesitate to
contact us by opening an [issue](https://github.com/rcastelo/GSVA/issues)
in the GSVA GitHub repo.

# Session information {.unnumbered}
Here is the output of `sessionInfo()` on the system on which this document was
compiled running pandoc `r rmarkdown::pandoc_version()`:

```{r session_info, cache=FALSE}
sessionInfo()
```

# References