License: Artistic-2.0
Splitting GSVA calculations in parallel on a multicore shared-memory
computer is directly supported in the functions gsva(),
gsvaRowNorm(), gsvaColRanks() and
gsvaColScores(), by using the BPPARAM
parameter with backends such as
BiocParallel::MulticoreParam() or
BiocParallel::SnowParam(). However, running GSVA in a
high-performance computing (HPC) environment, distributing the
computational load across multiple independent nodes that do not share
their main memory, requires using some additional functions for that
purpose. In this vignette we illustrate this use case assuming our HPC
environment uses the SLURM
workload manager. Other workload managers can be used, please consult
the documentation of the BiocParallel
package for more information, or open an issue in
the GSVA GitHub repo if you need help with a different workload
manager.
In essence, we will be using the three-step GSVA pipeline,
illustrated in the scRNA-seq vignette, with the additional parameters
first and last, which allow the algorithm to
run on a subset of rows (gsvaRowNorm()) or columns
(gsvaColRanks() and gsvaColScores()) of the
input data, storing the intermediate results using on-disk data
structures, and merging them into a final single output object that will
store the enrichment scores using also an on-disk data structure. We
will not have to worry about how to split the input data and merge the
intermediate results. These two tasks will be executed by the functions
gsvaMap() and gsvaReduce().
Let’s assume we have run the first part of the scRNA-seq vignette,
and we have obtained a SingleCellExperiment object called
sce with filtered and normalized expression values, and a
collection of gene sets stored in an object called
gsets.
We first build a parameter object using the function
gsvaParam(), as we would normally do.
We would then build a BatchtoolsParam object, as
described in the corresponding vignette
of the BiocParallel
package. The GSVA package, however, provides a convenience function
called gsvaBatchtoolsSlurmParam() that will build a
BatchtoolsParam object with some sensible settings for
running GSVA in an HPC environment with a SLURM workload manager.
In this example, we are distributing the computational load across 10
nodes, where in each of them calculations will be done in parallel using
4 CPU cores, sending the corresponding jobs to a partition called
short. By default, the function
gsvaBatchtoolsSlurmParam() has a first argument
dir="GSVAOUTPUT", which will create a directory called
GSVAOUTPUT in the current working directory, where the
intermediate results will be stored. The user can change this directory
by specifying a different value for the dir argument, but
it is important that this directory is accessible from all nodes in the
HPC environment. The user is also responsible for deleting the contents
of this directory after the GSVA calculations are finished.
The next step is to run the function gsvaMap(), which
will submit the jobs to the SLURM workload manager, and return a list
object, where each of its elements will be of the same class as the
input expression data, but with the intermediate results stored in an
on-disk data structure.
The final step is to merge the intermediate results into a single
object, using the function gsvaReduce(), which in this
example will return a SingleCellExperiment object with the
row-normalized values stored in an assay called
gsvarnorm.
To complete the three-step GSVA pipeline, we would then calculate the
column rank values using the function gsvaColRanks(), and
finally the GSVA scores using the function gsvaColScores(),
as we would normally do in a single-node environment, but using
gsvaMap() and gsvaReduce(), this time for a
more compact coding in single line of code each.
gsvaranks <- gsvaReduce(gsvaMap(gsvaColRanks, gsvarnorm, BTPARAM=gsvabtpar))
gsvaes <- gsvaReduce(gsvaMap(gsvaColScores, gsvaranks, BTPARAM=gsvabtpar))Since the calculations of the column ranks and enrichment scores are
both done on the columns of the input data, we can reuse the output of
gsvaMap() obtained from the rank calculations, directly to
the input of gsvaMap() for the enrichment score
calculations, without having to reduce the intermediate results into a
single object, avoiding the overhead of that unnecessary reduction.
gsvamapranks <- gsvaMap(gsvaColRanks, gsvarnorm, BTPARAM=gsvabtpar)
gsvaes <- gsvaReduce(gsvaMap(gsvaColScores, gsvamapranks, BTPARAM=gsvabtpar))The current implementation using the BatchtoolsParam
backend makes the gsvaMap() function blocking, waiting for
all jobs to finish before returning a list object with the intermediate
results. There are currently two ways to circumvent this limitation. The
first one is to use a terminal multiplexer such as tmux, log interactively
into a node of the HPC environment using a job submission command such
as:
start an R session, call interactively the gsvaMap()
function and detach the tmux session to be able to log out and leave the
jobs running in the background. You may reattach to the tmux session
later to check the progress of the jobs or fetch the results. Note that
the partition where the interactive job is submitted must have a
walltime limit that is long enough to allow the jobs to finish.
The second way is to submit a job with the R script that calls the
gsvaMap() function, which will submit the jobs to the SLURM
workload manager.
The current implementation of gsvaMap() and
gsvaReduce() does not yet handle partial completion of the
jobs. So, if any of the jobs fail, the user will have to resubmit the
entire calculation. We are working to enable the resubmission of only
the failed jobs.
Here is the output of sessionInfo() on the system on
which this document was compiled running pandoc 3.8.3:
sessionInfo()
R version 4.6.1 (2026-06-24)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 26.04 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.32.so; LAPACK version 3.12.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
time zone: Etc/UTC
tzcode source: system (glibc)
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] BiocStyle_2.41.0
loaded via a namespace (and not attached):
[1] digest_0.6.39 R6_2.6.1 fastmap_1.2.0
[4] xfun_0.59 maketools_1.3.2 cachem_1.1.0
[7] knitr_1.51 htmltools_0.5.9 rmarkdown_2.31
[10] buildtools_1.0.0 lifecycle_1.0.5 cli_3.6.6
[13] sass_0.4.10 jquerylib_0.1.4 compiler_4.6.1
[16] sys_3.4.3 tools_4.6.1 bslib_0.11.0
[19] evaluate_1.0.5 yaml_2.3.12 otel_0.2.0
[22] BiocManager_1.30.27 jsonlite_2.0.0 rlang_1.2.0