AnVILWorkflow: Run batch analysis workflows including non-R tools leveraging Cloud resources

Overview

The AnVIL project is an analysis, visualization, and informatics cloud-based space for data access, sharing and computing across large genomic-related data sets.

For R users with limited computing resources, we introduce the AnVILWorkflow package. This package allows users to run workflows implemented in Terra without installing software, writing any workflow, or managing cloud resources. Terra is a cloud-based genomics platform whose computing resources rely on Google Cloud Platform (GCP).

Use of this package requires AnVIL and Google cloud computing billing accounts. Consult AnVIL training guides for details on establishing these accounts.

Install and load package

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("AnVILWorkflow")
library(AnVIL)
library(AnVILGCP)
library(AnVILWorkflow)

Google Cloud SDK

If you use AnVILWorkflow within Terra’s RStudio, you need neither extra authentication nor the gcloud SDK. If you use this package locally, it requires the gcloud SDK and the billing account used in Terra. You can install the gcloud SDK by following its official installation guide.

Check whether the gcloud SDK is installed on your system with AnVIL::gcloud_exists(). It must return TRUE before you can use the AnVILWorkflow package.

gcloud_exists()

If it returns FALSE, install the gcloud SDK with the following script:

devtools::install_github("rstudio/cloudml")
cloudml::gcloud_install()

Then authenticate from a shell:

$ gcloud auth login
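
Independently of any package, a quick base-R check can confirm that a gcloud executable is visible on the PATH of the current R session. The sketch below is an illustration only: gcloudOnPath() is a hypothetical helper, not part of AnVIL or AnVILWorkflow, and it does not verify that you are authenticated.

```r
## Hypothetical helper (not part of AnVIL/AnVILWorkflow): checks whether a
## `gcloud` executable can be found on the PATH seen by this R session.
gcloudOnPath <- function() {
    path <- unname(Sys.which("gcloud"))   # typically "" when not found
    nzchar(path)
}

has_gcloud <- gcloudOnPath()
if (!has_gcloud) {
    message("gcloud was not found on PATH; install the gcloud SDK first.")
}
```

If this check fails right after installation, restarting the R session is often enough for the updated PATH to be picked up.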

Create Terra account

You need a Terra account. Once you have your own Terra account, you need two pieces of information to use the AnVILWorkflow package:

  1. The email address linked to your Terra account
  2. Your billing project name

You can set up your working environment using the setCloudEnv() function, as shown below. Replace the input values with your own account information.

accountEmail <- "YOUR_EMAIL@example.com" # placeholder: the email linked to your Terra account
billingProjectName <- "YOUR_BILLING_ACCOUNT"

setCloudEnv(accountEmail = accountEmail, 
            billingProjectName = billingProjectName)
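
Because both values are free-form strings, it can help to sanity-check them before calling setCloudEnv(). The following is a minimal sketch under stated assumptions: checkAccountInfo() is a hypothetical helper, not part of AnVILWorkflow, and its email pattern is deliberately loose.

```r
## Hypothetical helper (not part of AnVILWorkflow): basic sanity checks on the
## two values passed to setCloudEnv().
checkAccountInfo <- function(accountEmail, billingProjectName) {
    stopifnot(
        is.character(accountEmail), length(accountEmail) == 1L,
        is.character(billingProjectName), length(billingProjectName) == 1L
    )
    problems <- character()
    ## Loose pattern: something@something.something, without spaces
    emailPattern <- "^[^@[:space:]]+@[^@[:space:]]+\\.[^@[:space:]]+$"
    if (!grepl(emailPattern, accountEmail))
        problems <- c(problems,
                      "accountEmail does not look like an email address")
    ## Catch an empty value or a placeholder left over from the example
    if (!nzchar(billingProjectName) || grepl("^YOUR_", billingProjectName))
        problems <- c(problems,
                      "billingProjectName is empty or still a placeholder")
    problems   # character(0) means no problem was detected
}

checkAccountInfo("someone@example.org", "my-billing-project")
checkAccountInfo("not-an-email", "YOUR_BILLING_ACCOUNT")
```

The helper returns the list of detected problems instead of stopping, so the caller can decide whether to warn or abort.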

The remainder of this vignette assumes that a Terra account has been established and successfully linked to a Google cloud computing billing account.

Major steps

The table below lists the major functions for the three workflow steps: prepare, run, and check the result.

Steps     Functions         Description
Prepare   cloneWorkspace    Copy the template workspace
          updateInput       Take user’s inputs
Run       runWorkflow       Launch the workflow in Terra
          stopWorkflow      Abort the submission
          monitorWorkflow   Monitor the status of your workflow run
Result    getOutput         List or download your workflow outputs

Example in this vignette: bulk RNAseq analysis

You can find all the workspaces you have access to using the AnVIL::avworkspaces() function. Workspaces manually curated by this package can be listed separately using the availableAnalysis() function. The values in the analysis column can be passed to the analysis argument, which simplifies the cloning process. For this vignette, we use "salmon".

> availableAnalysis()
   analysis       workspaceNamespace                            workspaceName         configuration_namespace              configuration_name
1 bioBakery waldronlab-terra-rstudio mtx_workflow_biobakery_version3_template mtx_workflow_biobakery_version3 mtx_workflow_biobakery_version3
2    salmon  bioconductor-rpci-anvil             Bioconductor-Workflow-DESeq2         bioconductor-rpci-anvil                 AnVILBulkRNASeq
3    pathml waldronlab-terra-rstudio      pathml_stain_normalization_template                          PathML                   Preprocessing
                                                                                             description
1                                                                    Microbiome analysis using bioBakery
2 Trascript quantification from RNAseq using Salmon | Differential gene expression analysis using DESeq2
3                                                            Stain normalization step of PathML pipeline

analysis <- "salmon"

Browse AnVIL resources

AnVILBrowse("malaria")
AnVILBrowse("resistance")
AnVILBrowse("resistance", searchFrom = "workflow")

Setup

Clone workspace

Curated by this package

We refer to the existing workspaces that you have access to and want to use for your analysis as ‘template’ workspaces. The first step of using this package is cloning the template workspace using the cloneWorkspace function. Note that you need to provide a unique name for the cloned workspace through the workspaceName argument. Once you successfully clone the workspace, the function returns the name of the cloned workspace. For example, successful execution with workspaceName = "salmon_test" returns {YOUR_BILLING_ACCOUNT}/salmon_test.

salmonWorkspaceName <- basename(tempfile("salmon_")) # unique workspace name
salmonWorkspaceName
cloneWorkspace(workspaceName = salmonWorkspaceName, analysis = analysis)
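
Names produced by tempfile() are unique but hard to recognize later in the Terra interface. As an alternative, the sketch below builds a name from a prefix, a timestamp, and a short random suffix. makeWorkspaceName() is a hypothetical helper, not part of AnVILWorkflow, and restricting the name to letters, digits, underscores, and dashes is a conservative assumption rather than a documented rule.

```r
## Hypothetical helper (not part of AnVILWorkflow): builds a workspace name
## from a prefix, a timestamp, and a short random suffix.
makeWorkspaceName <- function(prefix) {
    stamp <- format(Sys.time(), "%Y%m%d_%H%M%S")
    suffix <- paste(sample(c(letters, 0:9), 4, replace = TRUE), collapse = "")
    name <- paste(prefix, stamp, suffix, sep = "_")
    ## Conservative assumption: keep only letters, digits, "_" and "-"
    gsub("[^A-Za-z0-9_-]", "_", name)
}

makeWorkspaceName("salmon")
```

The result can be passed to cloneWorkspace() through the workspaceName argument in the same way as salmonWorkspaceName above.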

Any workspace you have access to

If you want to clone any other workspace that you have access to but that is not curated by this package, you can directly enter the name of the target workspace as templateName. For example, to clone the Tumor_Only_CNV workspace:

cnvWorkspaceName <- basename(tempfile("cnv_")) # unique workspace name
cnvWorkspaceName
cloneWorkspace(workspaceName = cnvWorkspaceName,
               templateName = "Tumor_Only_CNV")

Prepare input

Current input

You can review the current inputs using the currentInput function. The code below shows all the required and optional inputs for the workflow.

config <- getWorkflowConfig(workspaceName = salmonWorkspaceName)
current_input <- currentInput(salmonWorkspaceName, config = config)
current_input


Update input

You can modify the inputs of your workflow using the updateInput function. To minimize formatting issues, we recommend making changes in the current input table returned by the currentInput function. Under the default (dry=TRUE), the updated input table is returned without actually updating Terra/AnVIL. Set dry=FALSE to apply the change in Terra/AnVIL.

new_input <- current_input
new_input[4,4] <- "athal_index"
new_input

updateInput(salmonWorkspaceName, inputs = new_input, config = config)
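
Positional indexing such as new_input[4,4] breaks silently if the row order of the table changes. The sketch below selects the row by input name instead and then lists what changed, which is a useful check before setting dry=FALSE. It runs on a mock table: the column names (name, inputType, optional, attribute) and the input names are assumptions made for illustration and may differ from what currentInput() returns.

```r
## Mock input table; the column names and input names are assumptions made
## for illustration and may differ from what currentInput() returns.
current_input <- data.frame(
    name = c("wf.fastqs_1", "wf.fastqs_2", "wf.transcriptome_fasta",
             "wf.transcriptome_index_name"),
    inputType = c("Array[File]", "Array[File]", "File", "String"),
    optional = c(FALSE, FALSE, FALSE, FALSE),
    attribute = c("this.fastqs_1", "this.fastqs_2", "workspace.fasta",
                  "old_index"),
    stringsAsFactors = FALSE
)

## Select the row by input name rather than by position
new_input <- current_input
target <- new_input$name == "wf.transcriptome_index_name"
stopifnot(sum(target) == 1L)          # exactly one matching input
new_input$attribute[target] <- "athal_index"

## Rows whose attribute differs between the two tables
changed <- current_input$attribute != new_input$attribute
data.frame(
    name = current_input$name[changed],
    old = current_input$attribute[changed],
    new = new_input$attribute[changed]
)
```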

Run workflow

You can launch the workflow using the runWorkflow() function. You need to specify the inputName of your workflow. If you don’t provide it, the function returns the list of input names you can use for your workflow.

Example error output when inputName is missing:

runWorkflow(salmonWorkspaceName, config = config)
# You should provide the inputName from the followings:
# [1] "AnVILBulkRNASeq_set"
#> Error in runWorkflow(salmonWorkspaceName):

## Launch the workflow with one of the listed input names
runWorkflow(salmonWorkspaceName,
            inputName = "AnVILBulkRNASeq_set",
            config = config)

Monitor progress

In the table returned by monitorWorkflow, the last three columns (status, succeeded, and failed) show the state of each submission and of its results.

submissions <- monitorWorkflow(workspaceName = salmonWorkspaceName)
submissions
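
Workflows can run for a long time, so it may be convenient to poll the status instead of calling monitorWorkflow by hand. The following is a minimal sketch, not part of AnVILWorkflow: waitForSubmission() is a hypothetical helper, and both the terminal status labels ("Done", "Aborted") and the assumption that the most recent submission is in the first row should be checked against your own table. A mock function stands in for monitorWorkflow so that the example runs without a cloud account.

```r
## Hypothetical helper (not part of AnVILWorkflow). `fetch` is any function
## returning a submissions table with a `status` column; the most recent
## submission is assumed to be in the first row.
waitForSubmission <- function(fetch, interval = 60, maxTries = 30,
                              terminal = c("Done", "Aborted")) {
    for (i in seq_len(maxTries)) {
        submissions <- fetch()
        if (nrow(submissions) > 0L && submissions$status[1] %in% terminal)
            return(submissions[1, , drop = FALSE])
        Sys.sleep(interval)   # wait before asking again
    }
    stop("submission did not reach a terminal status in time")
}

## Mock fetcher standing in for monitorWorkflow(): it reports "Submitted"
## twice, then "Done".
calls <- 0
mockFetch <- function() {
    calls <<- calls + 1
    data.frame(status = if (calls < 3) "Submitted" else "Done",
               succeeded = if (calls < 3) 0 else 1,
               failed = 0)
}

result <- waitForSubmission(mockFetch, interval = 0)
result$status
```

With a real workspace, fetch could be function() monitorWorkflow(workspaceName = salmonWorkspaceName), with an interval long enough to avoid unnecessary requests.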

Abort submission

You can abort the most recently submitted job using the stopWorkflow function. To abort an earlier submission instead, provide its submissionId.

stopWorkflow(salmonWorkspaceName)

Result

The workspace Bioconductor-Workflow-DESeq2 is the template workspace you cloned at the beginning using the analysis = "salmon" argument in the cloneWorkspace() function. This template workspace already has a history of previous submissions, so we will check the example outputs in this workspace.

submissions <- monitorWorkflow(workspaceName = "Bioconductor-Workflow-DESeq2")
submissions

You can check all the output files from the most recent successful submission using the getOutput function. If you specify the submissionId argument, you get the output files of that specific submission.

## Output from a successful submission
successful_submissions <- submissions$submissionId[submissions$succeeded == 1]
out <- getOutput(workspaceName = "Bioconductor-Workflow-DESeq2",
                 submissionId = successful_submissions[1])
head(out)
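
If a workspace has no successful submission yet, successful_submissions[1] is NA, and the resulting error from getOutput can be hard to interpret. The sketch below adds an explicit guard. It runs on a mock table whose column names follow the code above and whose values are invented; firstSuccessful() is a hypothetical helper, not part of AnVILWorkflow, and it treats any submission with at least one succeeded workflow as successful.

```r
## Mock submissions table; column names follow the code above, the values are
## invented for illustration.
submissions <- data.frame(
    submissionId = c("id-003", "id-002", "id-001"),
    status = c("Done", "Done", "Aborted"),
    succeeded = c(0, 1, 0),
    failed = c(1, 0, 0),
    stringsAsFactors = FALSE
)

## Hypothetical helper: first submissionId with at least one succeeded
## workflow, or an informative error when there is none.
firstSuccessful <- function(submissions) {
    ok <- submissions$submissionId[submissions$succeeded >= 1]
    if (length(ok) == 0L)
        stop("no successful submission found in this workspace")
    ok[1]
}

firstSuccessful(submissions)
```

The returned ID can then be passed to getOutput through the submissionId argument.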

Session Info

sessionInfo()
#> R version 4.4.1 (2024-06-14)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] BiocStyle_2.33.1
#> 
#> loaded via a namespace (and not attached):
#>  [1] tidyr_1.3.1          rappdirs_0.3.3       sass_0.4.9          
#>  [4] utf8_1.2.4           generics_0.1.3       futile.options_1.0.1
#>  [7] digest_0.6.37        magrittr_2.0.3       evaluate_1.0.1      
#> [10] fastmap_1.2.0        jsonlite_1.8.9       formatR_1.14        
#> [13] promises_1.3.0       BiocManager_1.30.25  httr_1.4.7          
#> [16] purrr_1.0.2          fansi_1.0.6          rapiclient_0.1.8    
#> [19] httr2_1.0.5          jquerylib_0.1.4      cli_3.6.3           
#> [22] shiny_1.9.1          rlang_1.1.4          futile.logger_1.4.3 
#> [25] AnVIL_1.17.20        cachem_1.1.0         yaml_2.3.10         
#> [28] BiocBaseUtils_1.7.3  tools_4.4.1          parallel_4.4.1      
#> [31] dplyr_1.1.4          httpuv_1.6.15        DT_0.33             
#> [34] lambda.r_1.2.4       buildtools_1.0.0     vctrs_0.6.5         
#> [37] R6_2.5.1             mime_0.12            lifecycle_1.0.4     
#> [40] htmlwidgets_1.6.4    miniUI_0.1.1.1       pkgconfig_2.0.3     
#> [43] pillar_1.9.0         bslib_0.8.0          later_1.3.2         
#> [46] glue_1.8.0           Rcpp_1.0.13          xfun_0.48           
#> [49] tibble_3.2.1         tidyselect_1.2.1     sys_3.4.3           
#> [52] knitr_1.48           AnVILBase_0.99.32    xtable_1.8-4        
#> [55] htmltools_0.5.8.1    rmarkdown_2.28       maketools_1.3.1     
#> [58] compiler_4.4.1