Running GSVA in an HPC environment

License: Artistic-2.0

Introduction

Splitting GSVA calculations in parallel on a multicore shared-memory computer is directly supported in the functions gsva(), gsvaRowNorm(), gsvaColRanks() and gsvaColScores(), by using the BPPARAM parameter with backends such as BiocParallel::MulticoreParam() or BiocParallel::SnowParam(). However, running GSVA in a high-performance computing (HPC) environment, distributing the computational load across multiple independent nodes that do not share their main memory, requires using some additional functions for that purpose. In this vignette we illustrate this use case assuming our HPC environment uses the SLURM workload manager. Other workload managers can be used, please consult the documentation of the BiocParallel package for more information, or open an issue in the GSVA GitHub repo if you need help with a different workload manager.

In essence, we will be using the three-step GSVA pipeline, illustrated in the scRNA-seq vignette, with the additional parameters first and last, which allow the algorithm to run on a subset of rows (gsvaRowNorm()) or columns (gsvaColRanks() and gsvaColScores()) of the input data, storing the intermediate results using on-disk data structures, and merging them into a final single output object that will store the enrichment scores using also an on-disk data structure. We will not have to worry about how to split the input data and merge the intermediate results. These two tasks will be executed by the functions gsvaMap() and gsvaReduce().

Using a SLURM workload manager

Let’s assume we have run the first part of the scRNA-seq vignette, and we have obtained a SingleCellExperiment object called sce with filtered and normalized expression values, and a collection of gene sets stored in an object called gsets.

We first build a parameter object using the function gsvaParam(), as we would normally do.

gsvapar <- gsvaParam(sce, gsets)

We would then build a BatchtoolsParam object, as described in the corresponding vignette of the BiocParallel package. The GSVA package, however, provides a convenience function called gsvaBatchtoolsSlurmParam() that will build a BatchtoolsParam object with some sensible settings for running GSVA in an HPC environment with a SLURM workload manager.

gsvabtpar <- gsvaBatchtoolsSlurmParam(partition="short", nodes=10,
                                      ncpus_per_task=4)

In this example, we are distributing the computational load across 10 nodes, where in each of them calculations will be done in parallel using 4 CPU cores, sending the corresponding jobs to a partition called short. By default, the function gsvaBatchtoolsSlurmParam() has a first argument dir="GSVAOUTPUT", which will create a directory called GSVAOUTPUT in the current working directory, where the intermediate results will be stored. The user can change this directory by specifying a different value for the dir argument, but it is important that this directory is accessible from all nodes in the HPC environment. The user is also responsible for deleting the contents of this directory after the GSVA calculations are finished.

The next step is to run the function gsvaMap(), which will submit the jobs to the SLURM workload manager, and return a list object, where each of its elements will be of the same class as the input expression data, but with the intermediate results stored in an on-disk data structure.

gsvamaprnorm <- gsvaMap(gsvaRowNorm, gsvapar, BTPARAM=gsvabtpar)

The final step is to merge the intermediate results into a single object, using the function gsvaReduce(), which in this example will return a SingleCellExperiment object with the row-normalized values stored in an assay called gsvarnorm.

gsvarnorm <- gsvaReduce(gsvamaprnorm)

To complete the three-step GSVA pipeline, we would then calculate the column rank values using the function gsvaColRanks(), and finally the GSVA scores using the function gsvaColScores(), as we would normally do in a single-node environment, but using gsvaMap() and gsvaReduce(), this time for a more compact coding in single line of code each.

gsvaranks <- gsvaReduce(gsvaMap(gsvaColRanks, gsvarnorm, BTPARAM=gsvabtpar))
gsvaes <- gsvaReduce(gsvaMap(gsvaColScores, gsvaranks, BTPARAM=gsvabtpar))

Since the calculations of the column ranks and enrichment scores are both done on the columns of the input data, we can reuse the output of gsvaMap() obtained from the rank calculations, directly to the input of gsvaMap() for the enrichment score calculations, without having to reduce the intermediate results into a single object, avoiding the overhead of that unnecessary reduction.

gsvamapranks <- gsvaMap(gsvaColRanks, gsvarnorm, BTPARAM=gsvabtpar)
gsvaes <- gsvaReduce(gsvaMap(gsvaColScores, gsvamapranks, BTPARAM=gsvabtpar))

The current implementation using the BatchtoolsParam backend makes the gsvaMap() function blocking, waiting for all jobs to finish before returning a list object with the intermediate results. There are currently two ways to circumvent this limitation. The first one is to use a terminal multiplexer such as tmux, log interactively into a node of the HPC environment using a job submission command such as:

srun --mpi=none --mem=10G --nodes=1 --ntasks-per-node=1 --partition=short --pty bash -i

start an R session, call interactively the gsvaMap() function and detach the tmux session to be able to log out and leave the jobs running in the background. You may reattach to the tmux session later to check the progress of the jobs or fetch the results. Note that the partition where the interactive job is submitted must have a walltime limit that is long enough to allow the jobs to finish.

The second way is to submit a job with the R script that calls the gsvaMap() function, which will submit the jobs to the SLURM workload manager.

The current implementation of gsvaMap() and gsvaReduce() does not yet handle partial completion of the jobs. So, if any of the jobs fail, the user will have to resubmit the entire calculation. We are working to enable the resubmission of only the failed jobs.

Session information

Here is the output of sessionInfo() on the system on which this document was compiled running pandoc 3.8.3:

sessionInfo()
R version 4.6.1 (2026-06-24)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 26.04 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.32.so;  LAPACK version 3.12.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: Etc/UTC
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] BiocStyle_2.41.0

loaded via a namespace (and not attached):
 [1] digest_0.6.39       R6_2.6.1            fastmap_1.2.0      
 [4] xfun_0.59           maketools_1.3.2     cachem_1.1.0       
 [7] knitr_1.51          htmltools_0.5.9     rmarkdown_2.31     
[10] buildtools_1.0.0    lifecycle_1.0.5     cli_3.6.6          
[13] sass_0.4.10         jquerylib_0.1.4     compiler_4.6.1     
[16] sys_3.4.3           tools_4.6.1         bslib_0.11.0       
[19] evaluate_1.0.5      yaml_2.3.12         otel_0.2.0         
[22] BiocManager_1.30.27 jsonlite_2.0.0      rlang_1.2.0        

References