---
title: "Running GSVA in an HPC environment"
author:
- name: Robert Castelo
  affiliation:
  - &idupf Dept. of Medicine and Life Sciences, Universitat Pompeu Fabra, Barcelona, Spain
  email: robert.castelo@upf.edu
abstract: >
  Here we illustrate how to use GSVA in an HPC environment.
date: "`r BiocStyle::doc_date()`"
package: "`r pkg_ver('GSVA')`"
vignette: >
  %\VignetteIndexEntry{Running GSVA in an HPC environment}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
  %\VignetteKeywords{GeneExpression, Microarray, RNAseq, GeneSetEnrichment, Pathway}
output:
  BiocStyle::html_document:
    toc: true
    toc_float: true
    number_sections: true
    fig_captions: yes
bibliography: GSVA.bib
---

**License**: `r packageDescription("GSVA")[["License"]]`

```{r setup, include=FALSE}
options(width=80)
knitr::opts_chunk$set(collapse=TRUE,
                      message=FALSE,
                      warning=FALSE,
                      comment="",
                      fig.align="center",
                      fig.wide=TRUE)
```

# Introduction

Splitting GSVA calculations in parallel on a multicore shared-memory computer is
directly supported in the functions `gsva()`, `gsvaRowNorm()`, `gsvaColRanks()`
and `gsvaColScores()`, by using the `BPPARAM` parameter with backends such
as `BiocParallel::MulticoreParam()` or `BiocParallel::SnowParam()`. However,
running GSVA in a high-performance computing (HPC) environment, distributing the
computational load across multiple independent nodes that do not share their
main memory, requires using some additional functions for that purpose. In this
vignette we illustrate this use case assuming our HPC environment uses the
[SLURM](https://slurm.schedmd.com) workload manager. Other workload managers
can be used, please consult the documentation of the `r Biocpkg("BiocParallel")`
package for more information, or open an
[issue](https://github.com/rcastelo/GSVA/issues/new/choose) in the GSVA GitHub
repo if you need help with a different workload manager.

In essence, we will be using the three-step GSVA pipeline, illustrated in the
scRNA-seq vignette, with the additional parameters `first` and `last`, which
allow the algorithm to run on a subset of rows (`gsvaRowNorm()`) or columns
(`gsvaColRanks()` and `gsvaColScores()`) of the input data, storing the
intermediate results using on-disk data structures, and merging them into a
final single output object that will store the enrichment scores using also
an on-disk data structure. We will not have to worry about how to split the
input data and merge the intermediate results. These two tasks will be executed
by the functions `gsvaMap()` and `gsvaReduce()`.

# Using a SLURM workload manager

Let's assume we have run the first part of the scRNA-seq vignette, and we have
obtained a `SingleCellExperiment` object called `sce` with filtered and
normalized expression values, and a collection of gene sets stored in an object
called `gsets`.

We first build a parameter object using the function `gsvaParam()`, as we would
normally do.

```{r, eval=FALSE}
gsvapar <- gsvaParam(sce, gsets)
```
We would then build a `BatchtoolsParam` object, as described in the
corresponding
[vignette](https://bioconductor.org/packages/release/bioc/vignettes/BiocParallel/inst/doc/BiocParallel_BatchtoolsParam.html)
of the `r Biocpkg("BiocParallel")` package. The GSVA package, however, provides
a convenience function called `gsvaBatchtoolsSlurmParam()` that will build a
`BatchtoolsParam` object with some sensible settings for running GSVA in an HPC
environment with a SLURM workload manager.

```{r, eval=FALSE}
gsvabtpar <- gsvaBatchtoolsSlurmParam(partition="short", nodes=10,
                                      ncpus_per_task=4)
```
In this example, we are distributing the computational load across 10 nodes, where
in each of them calculations will be done in parallel using 4 CPU cores, sending
the corresponding jobs to a partition called `short`. By default, the function
`gsvaBatchtoolsSlurmParam()` has a first argument `dir="GSVAOUTPUT"`, which will
create a directory called `GSVAOUTPUT` in the current working directory, where
the intermediate results will be stored. The user can change this directory by
specifying a different value for the `dir` argument, but it is important that this
directory is accessible from all nodes in the HPC environment. The user is also
responsible for deleting the contents of this directory after the GSVA calculations
are finished.

The next step is to run the function `gsvaMap()`, which will submit the jobs to the
SLURM workload manager, and return a list object, where each of its elements will be
of the same class as the input expression data, but with the intermediate results
stored in an on-disk data structure.

```{r, eval=FALSE}
gsvamaprnorm <- gsvaMap(gsvaRowNorm, gsvapar, BTPARAM=gsvabtpar)
```
The final step is to merge the intermediate results into a single object, using
the function `gsvaReduce()`, which in this example will return a
`SingleCellExperiment` object with the row-normalized values stored in an
assay called `gsvarnorm`.

```{r, eval=FALSE}
gsvarnorm <- gsvaReduce(gsvamaprnorm)
```
To complete the three-step GSVA pipeline, we would then calculate the column
rank values using the function `gsvaColRanks()`, and finally the GSVA scores
using the function `gsvaColScores()`, as we would normally do in a single-node
environment, but using `gsvaMap()` and `gsvaReduce()`, this time for a more
compact coding in single line of code each.

```{r, eval=FALSE}
gsvaranks <- gsvaReduce(gsvaMap(gsvaColRanks, gsvarnorm, BTPARAM=gsvabtpar))
gsvaes <- gsvaReduce(gsvaMap(gsvaColScores, gsvaranks, BTPARAM=gsvabtpar))
```
Since the calculations of the column ranks and enrichment scores are both done
on the columns of the input data, we can reuse the output of `gsvaMap()`
obtained from the rank calculations, directly to the input of `gsvaMap()` for
the enrichment score calculations, without having to reduce the intermediate
results into a single object, avoiding the overhead of that unnecessary
reduction.

```{r, eval=FALSE}
gsvamapranks <- gsvaMap(gsvaColRanks, gsvarnorm, BTPARAM=gsvabtpar)
gsvaes <- gsvaReduce(gsvaMap(gsvaColScores, gsvamapranks, BTPARAM=gsvabtpar))
```
The current implementation using the `BatchtoolsParam` backend makes the
`gsvaMap()` function blocking, waiting for all jobs to finish before returning
a list object with the intermediate results. There are currently two ways to
circumvent this limitation. The first one is to use a terminal multiplexer such
as [tmux](https://github.com/tmux/tmux/wiki), log interactively into a node of
the HPC environment using a job submission command such as:

```{r, eval=FALSE}
srun --mpi=none --mem=10G --nodes=1 --ntasks-per-node=1 --partition=short --pty bash -i
```
start an R session, call interactively the `gsvaMap()` function and detach the
tmux session to be able to log out and leave the jobs running in the background.
You may reattach to the tmux session later to check the progress of the jobs or
fetch the results. Note that the partition where the interactive job is
submitted must have a walltime limit that is long enough to allow the jobs to finish.

The second way is to submit a job with the R script that calls the `gsvaMap()`
function, which will submit the jobs to the SLURM workload manager.

The current implementation of `gsvaMap()` and `gsvaReduce()` does not yet handle
partial completion of the jobs. So, if any of the jobs fail, the user will have
to resubmit the entire calculation. We are working to enable the resubmission of
only the failed jobs.

# Session information {.unnumbered}
Here is the output of `sessionInfo()` on the system on which this document was
compiled running pandoc `r rmarkdown::pandoc_version()`:

```{r session_info, cache=FALSE}
sessionInfo()
```

# References