ReUseData is an R/Bioconductor software tool that provides a systematic and versatile approach to standardized and reproducible data management. ReUseData facilitates the transformation of shell or other ad hoc scripts for data preprocessing into workflow-based data recipes. Evaluating a data recipe generates curated data files in their generic formats (e.g., VCF, BED). Both recipes and data are cached using database infrastructure for easy data management and reuse. Prebuilt data recipes are available through the ReUseData portal (“https://rcwl.org/dataRecipes/”) with full annotation and user instructions. Pregenerated data are available through the ReUseData cloud bucket and can be downloaded directly with “getCloudData()”.
This quick start shows the basic use of package functions in two major categories: managing data recipes and managing reusable data.
Details for each section can be found in the companion vignettes for data recipes and reusable data.
All pre-built data recipes are included in the package and can be
easily updated (recipeUpdate), searched (recipeSearch) and loaded
(recipeLoad). Details about data recipes can be found in the vignette
ReUseData_recipe.html.
recipeUpdate(cachePath = "ReUseDataRecipe", force = TRUE)
#> NOTE: existing caches will be removed and regenerated!
#> Updating recipes...
#> STAR_index.R added
#> bowtie2_index.R added
#> echo_out.R added
#> ensembl_liftover.R added
#> gcp_broad_gatk_hg19.R added
#> gcp_broad_gatk_hg38.R added
#> gcp_gatk_mutect2_b37.R added
#> gcp_gatk_mutect2_hg38.R added
#> gencode_annotation.R added
#> gencode_genome_grch38.R added
#> gencode_transcripts.R added
#> hisat2_index.R added
#> reference_genome.R added
#> salmon_index.R added
#> ucsc_database.R added
#>
#> recipeHub with 15 records
#> cache path: /tmp/RtmpchyZGA/cache/ReUseDataRecipe
#> # recipeSearch() to query specific recipes using multipe keywords
#> # recipeUpdate() to update the local recipe cache
#>
#> name
#> BFC16 | STAR_index
#> BFC17 | bowtie2_index
#> BFC18 | echo_out
#> BFC19 | ensembl_liftover
#> BFC20 | gcp_broad_gatk_hg19
#> ... ...
#> BFC26 | gencode_transcripts
#> BFC27 | hisat2_index
#> BFC28 | reference_genome
#> BFC29 | salmon_index
#> BFC30 | ucsc_database
recipeSearch("echo")
#> recipeHub with 1 records
#> cache path: /tmp/RtmpchyZGA/cache/ReUseDataRecipe
#> # recipeSearch() to query specific recipes using multipe keywords
#> # recipeUpdate() to update the local recipe cache
#>
#> name
#> BFC18 | echo_out
echo_out <- recipeLoad("echo_out")
#> Note: you need to assign a name for the recipe: rcpName <- recipeLoad('xx')
#> Data recipe loaded!
#> Use inputs() to check required input parameters before evaluation.
#> Check here: https://rcwl.org/dataRecipes/echo_out.html
#> for user instructions (e.g., eligible input values, data source, etc.)
We can install cwltool first to make sure a cwl-runner is available.
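The installation call itself is not shown here. As a minimal sketch using only base R, the check below looks for a CWL runner on the PATH before any recipe is evaluated. The variable names are illustrative, and the suggestion to use Rcwl::install_cwltool() is an assumption about the helper this step refers to, not something confirmed by the text above.

```r
# Look for a CWL runner executable on the PATH (base R only).
runners <- Sys.which(c("cwltool", "cwl-runner"))
has_runner <- any(nzchar(runners))

if (has_runner) {
  message("CWL runner found: ", runners[nzchar(runners)][1])
} else {
  # Assumption: Rcwl provides an installer helper for cwltool.
  message("No CWL runner on PATH; consider Rcwl::install_cwltool().")
}
```

Running the check first avoids a failed recipe evaluation later when no runner is available.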
A data recipe can be evaluated by assigning values to the recipe
parameters. getData runs the recipe as a CWL script internally and
generates the data of interest, together with annotation files, for
future reuse.
Rcwl::inputs(echo_out)
#> inputs:
#> input (input) (string):
#> outfile (outfile) (string):
echo_out$input <- "Hello World!"
echo_out$outfile <- "outfile"
outdir <- file.path(tempdir(), "SharedData")
res <- getData(echo_out,
               outdir = outdir,
               notes = c("echo", "hello", "world", "txt"))
#> INFO Final process status is success
res$out
#> [1] "/tmp/RtmpchyZGA/SharedData/outfile.txt"
readLines(res$out)
#> [1] "Print the input: Hello World!"
A data recipe can be created from scratch or by converting an
existing shell script for data processing: specify the input
parameters and the output globbing patterns with the recipeMake
function.
script <- system.file("extdata", "echo_out.sh", package = "ReUseData")
rcp <- recipeMake(shscript = script,
                  paramID = c("input", "outfile"),
                  paramType = c("string", "string"),
                  outputID = "echoout",
                  outputGlob = "*.txt")
Rcwl::inputs(rcp)
#> inputs:
#> input (string):
#> outfile (string):
Rcwl::outputs(rcp)
#> outputs:
#> echoout:
#> type: File[]
#> outputBinding:
#> glob: '*.txt'
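The contents of echo_out.sh are not shown in this quick start. The sketch below writes a hypothetical stand-in script from R and runs it with bash, to illustrate how two string parameters and a "*.txt" output glob could fit together. The script body, the positional-argument convention ($1, $2) and all variable names are assumptions chosen to match the output printed earlier; the packaged script may differ.

```r
# Hypothetical stand-in for echo_out.sh; the packaged script may differ.
script_lines <- c(
  "#!/bin/bash",
  "input=$1",
  "outfile=$2",
  "echo \"Print the input: $input\" > \"$outfile.txt\""
)

# Write the script into a scratch directory.
workdir <- tempfile("echo_demo_")
dir.create(workdir)
script_path <- file.path(workdir, "echo_out.sh")
writeLines(script_lines, script_path)

# Run it with two positional arguments, mirroring
# paramID = c("input", "outfile") in the recipeMake call.
out_prefix <- file.path(workdir, "outfile")
status <- system2("bash", c(shQuote(script_path),
                            shQuote("Hello World!"),
                            shQuote(out_prefix)))

# A "*.txt" glob would match the file the script just wrote.
out_files <- list.files(workdir, pattern = "\\.txt$", full.names = TRUE)
print(readLines(out_files))
```

A script like this is the kind of input that the shscript argument of recipeMake is given, with each script argument declared in paramID and the files to collect declared in outputGlob.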
The data generated by evaluating data recipes are automatically
annotated and tracked with user-specified keywords and time/date tags.
They use a cache system similar to the one for recipes, so users can
easily update (dataUpdate), search (dataSearch) and use (toList) them.
Pre-generated data files from existing data recipes are saved in a
Google Cloud bucket, ready to be queried (dataSearch(cloud=TRUE)) and
downloaded (getCloudData) to the local cache system with annotations.
dh <- dataUpdate(dir = outdir)
dataSearch(c("echo", "hello"))
dataNames(dh)
dataParams(dh)
dataNotes(dh)
toList(dh, format="json", file = file.path(outdir, "data.json"))
dh <- dataUpdate(dir = outdir, cloud = TRUE)
getCloudData(dh[2], outdir = outdir)
sessionInfo()
#> R version 4.4.2 (2024-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] ReUseData_1.7.0 Rcwl_1.23.0 S4Vectors_0.45.2
#> [4] BiocGenerics_0.53.3 generics_0.1.3 yaml_2.3.10
#> [7] BiocStyle_2.35.0
#>
#> loaded via a namespace (and not attached):
#> [1] dir.expiry_1.15.0 xfun_0.49 bslib_0.8.0
#> [4] htmlwidgets_1.6.4 visNetwork_2.1.2 lattice_0.22-6
#> [7] batchtools_0.9.17 vctrs_0.6.5 tools_4.4.2
#> [10] curl_6.0.1 base64url_1.4 parallel_4.4.2
#> [13] tibble_3.2.1 RSQLite_2.3.9 blob_1.2.4
#> [16] RcwlPipelines_1.23.0 pkgconfig_2.0.3 R.oo_1.27.0
#> [19] Matrix_1.7-1 data.table_1.16.4 checkmate_2.3.2
#> [22] dbplyr_2.5.0 RColorBrewer_1.1-3 lifecycle_1.0.4
#> [25] git2r_0.35.0 compiler_4.4.2 progress_1.2.3
#> [28] codetools_0.2-20 httpuv_1.6.15 htmltools_0.5.8.1
#> [31] sys_3.4.3 buildtools_1.0.0 sass_0.4.9
#> [34] pillar_1.10.0 later_1.4.1 crayon_1.5.3
#> [37] jquerylib_0.1.4 R.utils_2.12.3 BiocParallel_1.41.0
#> [40] cachem_1.1.0 mime_0.12 basilisk_1.19.0
#> [43] brew_1.0-10 tidyselect_1.2.1 digest_0.6.37
#> [46] stringi_1.8.4 purrr_1.0.2 dplyr_1.1.4
#> [49] maketools_1.3.1 fastmap_1.2.0 grid_4.4.2
#> [52] cli_3.6.3 magrittr_2.0.3 DiagrammeR_1.0.11
#> [55] withr_3.0.2 prettyunits_1.2.0 filelock_1.0.3
#> [58] promises_1.3.2 backports_1.5.0 rappdirs_0.3.3
#> [61] bit64_4.5.2 httr_1.4.7 rmarkdown_2.29
#> [64] bit_4.5.0.1 reticulate_1.40.0 png_0.1-8
#> [67] R.methodsS3_1.8.2 hms_1.1.3 memoise_2.0.1
#> [70] shiny_1.10.0 evaluate_1.0.1 knitr_1.49
#> [73] basilisk.utils_1.19.0 BiocFileCache_2.15.0 rlang_1.1.4
#> [76] Rcpp_1.0.13-1 xtable_1.8-4 glue_1.8.0
#> [79] DBI_1.2.3 BiocManager_1.30.25 jsonlite_1.8.9
#> [82] R6_2.5.1