Title: | Reusable and reproducible Data Management |
---|---|
Description: | ReUseData is an _R/Bioconductor_ software tool to provide a systematic and versatile approach for standardized and reproducible data management. ReUseData facilitates transformation of shell or other ad hoc scripts for data preprocessing into workflow-based data recipes. Evaluation of data recipes generates curated data files in their generic formats (e.g., VCF, bed). Both recipes and data are cached using database infrastructure for easy data management and reuse. Prebuilt data recipes are available through the ReUseData portal ("https://rcwl.org/dataRecipes/") with full annotation and user instructions. Pregenerated data are available through the ReUseData cloud bucket and can be downloaded directly with "getCloudData()". |
Authors: | Qian Liu [aut, cre] |
Maintainer: | Qian Liu <[email protected]> |
License: | GPL-3 |
Version: | 1.7.0 |
Built: | 2024-12-04 06:01:43 UTC |
Source: | https://github.com/bioc/ReUseData |
Add annotation or meta information to existing data
annData(
  path,
  notes,
  date = Sys.Date(),
  recursive = TRUE,
  md5 = FALSE,
  skip = "*.md|meta.yml",
  force = FALSE,
  ...
)
path | The data path to annotate. |
notes | User-assigned notes/keywords to annotate the data and be used for keyword matching in |
date | The date of the data. |
recursive | Whether to annotate all data recursively. |
md5 | Whether to generate md5 values for all files. |
skip | Pattern to skip files in the path. |
force | Whether to force regeneration of meta.yml. |
... | The other options from |
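The md5 argument refers to file checksums. As a minimal sketch of what such a value looks like, the following uses base R's tools::md5sum() on a hypothetical temporary file; it illustrates the checksum idea only and is not ReUseData's internal code.

```r
library(tools)

# Hypothetical temporary file standing in for a data file
f <- tempfile(fileext = ".txt")
writeLines("Hello World!", f)

# md5sum() returns a named character vector: names are paths, values are digests
digest <- md5sum(f)
unname(digest)

# Changing the content changes the digest, which is how integrity is verified
writeLines("Hello World?", f)
changed <- md5sum(f)
identical(unname(digest), unname(changed))
```

Comparing a stored digest with a freshly computed one is the usual way to detect a modified or corrupted file.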
dataHub class, constructor, and methods.
dataHub(BFC)

## S4 method for signature 'dataHub'
show(object)

dataNames(object)

dataParams(object)

dataNotes(object)

dataPaths(object)

dataYml(object)

dataTags(object)

## S4 method for signature 'dataHub'
dataTags(object)

dataTags(object, append = TRUE) <- value

## S4 replacement method for signature 'dataHub'
dataTags(object, append = FALSE) <- value

## S4 method for signature 'dataHub,ANY,ANY,ANY'
x[i, j, drop]

## S4 replacement method for signature 'dataHub,ANY,ANY,ANY'
x[i, j] <- value

## S4 method for signature 'dataHub'
c(x, ...)

toList(
  x,
  listNames = NULL,
  format = c("list", "json", "yaml"),
  type = NULL,
  file = character()
)
BFC | A BiocFileCache object created for data and recipes. |
object | A |
append | Whether to append the new tag or replace all tags. |
value | A |
x | A |
i | The integer index of the |
j | Inherited from |
drop | Inherited from |
... | More |
listNames | A vector of names for the output list. |
format | Can be "list", "json" or "yaml". Supports partial matching. Default is "list". |
type | The type of workflow input list, such as cwl. |
file | The file name to save the data list in the required format. The file extension needs to be included, e.g., ".json" or ".yml". |
dataHub: a dataHub object.
dataNames: the names of datasets in the dataHub object.
dataParams: the data recipe parameter values for datasets in the dataHub object.
dataNotes: the notes of datasets in the dataHub object.
dataPaths: the file paths of datasets in the dataHub object.
dataYml: the yaml file paths of datasets in the dataHub object.
dataTags: the tags of datasets in the dataHub object.
toList: a list of datasets in the specified format, and a file if the file argument is specified.
outdir <- file.path(tempdir(), "SharedData")
dataUpdate(outdir, cloud = TRUE)
dd <- dataSearch(c("liftover", "GRCh38"))
dataNames(dd)
dataParams(dd)
dataNotes(dd)
dataTags(dd)
dataYml(dd)
toList(dd)
toList(dd, format = "yaml")
toList(dd, format = "json", file = tempfile())
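The format argument of toList() is documented as supporting partial matching with "list" as the default. The sketch below shows how that behaviour works with base R's match.arg(); pick_format() is a hypothetical helper, and the assumption that toList() resolves format this way is not confirmed by this page.

```r
# Hypothetical helper that mirrors only the 'format' argument of toList()
pick_format <- function(format = c("list", "json", "yaml")) {
  # match.arg() picks the first choice when nothing is supplied and
  # accepts any unambiguous prefix of a choice
  match.arg(format)
}

pick_format()      # no value supplied
pick_format("j")   # prefix of "json"
pick_format("ya")  # prefix of "yaml"
```

Under this assumption, a call such as toList(dd, format = "y") would be equivalent to spelling out "yaml" in full.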
dataSearch search data in local data caching system
dataSearch(keywords = character(), cachePath = "ReUseData")
keywords | Character vector of keywords to be matched to the local datasets. It matches the "notes" when generating the data using |
cachePath | A character string for the data cache. Must match the one specified in |
a dataHub object containing the information about the local data cache, e.g., data name, data path, etc.
dataSearch()
dataSearch(c("gencode"))
dataSearch("#gatk")
Function to update the local data records by reading the yaml files in the specified directory recursively.
dataUpdate(
  dir,
  cachePath = "ReUseData",
  outMeta = FALSE,
  keepTags = TRUE,
  cleanup = FALSE,
  cloud = FALSE,
  remote = FALSE,
  checkData = TRUE,
  duplicate = FALSE
)
dir | A character string for the directory where all data are saved. Data information will be collected recursively within this directory. |
cachePath | A character string specifying the name for the |
outMeta | Logical. If TRUE, a "meta_data.csv" file will be generated in the |
keepTags | Whether to keep previously assigned data tags. Default is TRUE. |
cleanup | Whether to remove any invalid intermediate files. Default is FALSE. When one data recipe (with the same parameter values) is evaluated multiple times, the same data file(s) will match multiple intermediate files (e.g., .yml). |
cloud | Whether to return the pregenerated data from the Google Cloud bucket of ReUseData. Default is FALSE. |
remote | Whether to use the csv file (containing information about pregenerated data on Google Cloud) from GitHub, which is the most up-to-date. Only works when |
checkData | Whether to check that the data (listed as "# output: " in the yml file) exists. If it does not, the dataset is not included in the output csv file. This argument was added for internal testing purposes. |
duplicate | Whether to remove duplicates. If TRUE, older versions of duplicates will be removed. |
Users can directly retrieve information for all available datasets by using meta_data(dir=), which generates a data frame in R with the same information as described above and can be saved out. dataUpdate performs an extra check on all datasets (checking the file path in the "output" column), removes invalid ones, e.g., those with an empty or non-existent file path, and creates a data cache for all valid datasets.
a dataHub object containing the information about the local data cache, e.g., data name, data path, etc.
## Generate data
## Not run:
library(Rcwl)
outdir <- file.path(tempdir(), "SharedData")

echo_out <- recipeLoad("echo_out")
Rcwl::inputs(echo_out)
echo_out$input <- "Hello World!"
echo_out$outfile <- "outfile"
res <- getData(echo_out,
               outdir = outdir,
               notes = c("echo", "hello", "world", "txt"),
               showLog = TRUE)

ensembl_liftover <- recipeLoad("ensembl_liftover")
Rcwl::inputs(ensembl_liftover)
ensembl_liftover$species <- "human"
ensembl_liftover$from <- "GRCh37"
ensembl_liftover$to <- "GRCh38"
res <- getData(ensembl_liftover,
               outdir = outdir,
               notes = c("ensembl", "liftover", "human", "GRCh37", "GRCh38"),
               showLog = TRUE)

## Update data cache (with or without prebuilt data sets from ReUseData cloud bucket)
dataUpdate(dir = outdir)
dataUpdate(dir = outdir, cloud = TRUE)

## newly generated data are now cached and searchable
dataSearch(c("hello", "world"))
dataSearch(c("ensembl", "liftover")) ## both locally generated data and google cloud data!
## End(Not run)
getCloudData Download the pregenerated curated data sets from ReUseData cloud bucket
getCloudData(datahub, outdir = character())
datahub | The |
outdir | The output directory for the data (and concomitant annotation files) to be downloaded. It is recommended to use a new folder under a shared folder for each dataset to be downloaded. |
Data and concomitant annotation files will be downloaded to the user-specified folder that is locally searchable with dataSearch().
outdir <- file.path(tempdir(), "gcpData")
dh <- dataSearch(c("ensembl", "GRCh38"))
dh <- dh[grep("http", dataPaths(dh))]

## download data from google bucket
getCloudData(dh[1], outdir = outdir)

## Update local data caching
dataUpdate(outdir) ## no "cloud=TRUE" here, only showing local data cache

## Now the data is available to use locally
dataSearch(c("ensembl", "GRCh38"))
Evaluation of data recipes to generate curated dataset of interest.
getData(
  rcp,
  outdir,
  prefix = NULL,
  notes = c(),
  conda = FALSE,
  BPPARAM = NULL,
  ...
)
rcp | The data recipe in |
outdir | Character string specifying the directory to store the output files. It will be created automatically if it does not exist or is not provided. |
prefix | Character string specifying the file name of the annotation files (.yml, .cwl, .sh, .md5). |
notes | User-assigned notes/keywords to annotate the data and be used for keyword matching in |
conda | Whether to use conda to install required software when evaluating the data recipe as a CWL workflow. Default is FALSE. |
BPPARAM | The options for |
... | Arguments to be passed into |
The data files and 4 meta files: .cwl: the cwl script that was internally run to get the data; .yml: the input parameter values for the data recipe and user-specified data annotation notes, versions, etc.; .sh: the script for data processing; .md5: checksum file to verify the integrity of generated data files.
## Not run:
library(Rcwl)
outdir <- file.path(tempdir(), "SharedData")

## Example 1
echo_out <- recipeLoad("echo_out")
Rcwl::inputs(echo_out)
echo_out$input <- "Hello World!"
echo_out$outfile <- "outfile"
res <- getData(echo_out,
               outdir = outdir,
               notes = c("echo", "hello", "world", "txt"),
               showLog = TRUE)

## Example 2
ensembl_liftover <- recipeLoad("ensembl_liftover")
Rcwl::inputs(ensembl_liftover)
ensembl_liftover$species <- "human"
ensembl_liftover$from <- "GRCh37"
ensembl_liftover$to <- "GRCh38"
res <- getData(ensembl_liftover,
               outdir = outdir,
               notes = c("ensembl", "liftover", "human", "GRCh37", "GRCh38"),
               showLog = TRUE)
dir(outdir)
## End(Not run)
Functions to generate the meta csv file for local cached dataset.
meta_data(dir = "", cleanup = FALSE, checkData = TRUE)
dir | The path to the shared data folder. |
cleanup | Whether to remove any invalid intermediate files. Default is FALSE. When one data recipe (with the same parameter values) is evaluated multiple times, the same data file(s) will match multiple intermediate files (e.g., .yml). |
checkData | Whether to check that the data (listed as "# output: " in the yml file) exists. If it does not, the dataset is not included in the output csv file. This argument was added for internal testing purposes. |
a data.frame with yml file name, parameter values, data file paths, date, and user-specified notes when generating the data with getData().
outdir <- file.path(tempdir(), "SharedData")
meta_data(outdir)
recipeHub class, constructor, and methods.
recipeHub(BFC)

## S4 method for signature 'recipeHub'
show(object)

## S4 method for signature 'recipeHub,ANY,ANY,ANY'
x[i]

recipeNames(object)
BFC | A BiocFileCache object created for data and recipes. |
object | The |
x | The |
i | The integer index of the |
recipeHub: a recipeHub object.
[: a recipeHub object that was subsetted.
recipeNames: the recipe names for the recipeHub object.
rcps <- recipeSearch(c("gencode"))
## rcp1 <- rcps[1]
## recipeNames(rcp1)
To load data recipe(s) into R environment.
recipeLoad(
  rcp = c(),
  cachePath = "ReUseDataRecipe",
  env = .GlobalEnv,
  return = TRUE
)
rcp | The (vector of) character string of recipe name or file path ( |
cachePath | A character string for the recipe cache. Must match the one specified in |
env | The R environment to export to. Default is |
return | Whether to return the recipe to a user-assigned R object. Default is TRUE, in which case the user needs to assign a variable name to the recipe, e.g., |
A data recipe of cwlProcess S4 class, which is ready to be evaluated in R.
########################
## Load single recipe
########################
library(Rcwl)
recipeUpdate()
recipeSearch("liftover")
rcp <- recipeLoad("ensembl_liftover")
Rcwl::inputs(rcp)
rm(rcp)

gencode_annotation <- recipeLoad("gencode_annotation")
inputs(gencode_annotation)
rm(gencode_annotation)

#########################
## Load multiple recipes
#########################
rcphub <- recipeSearch("gencode")
recipeNames(rcphub)
recipeLoad(recipeNames(rcphub), return=FALSE)
inputs(gencode_transcripts)
Constructor function of data recipe
recipeMake(
  shscript = character(),
  paramID = c(),
  paramType = c(),
  outputID = c(),
  outputType = c("File[]"),
  outputGlob = character(0),
  requireTools = character(0)
)
shscript | Character string. Either the file path to the user-provided shell script or the script content itself, which is to be converted into a data recipe. |
paramID | Character vector. The user-specified parameter ID for the recipe. |
paramType | Character vector specifying the type for each |
outputID | The ID for each output. |
outputType | The output type for each output. |
outputGlob | The glob pattern of output files. E.g., "hg19.*". |
requireTools | The command-line tools to be used for data processing/curation in the user-provided shell script. The value here must exactly match the tool name, e.g., "bwa", "samtools", etc. A particular version of a tool can be specified in the format "tool=version", e.g., "samtools=1.3". |
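The "tool=version" convention for requireTools can be illustrated with a small base R sketch. parse_tools() is a hypothetical helper written only to show how such strings split into a tool name and an optional version; it is not part of ReUseData and says nothing about how recipeMake() handles them internally.

```r
# Hypothetical helper: split "tool" or "tool=version" entries into two columns
parse_tools <- function(x) {
  parts <- strsplit(x, "=", fixed = TRUE)
  data.frame(
    tool = vapply(parts, function(p) p[1], character(1)),
    # Entries without "=" have no pinned version, recorded here as NA
    version = vapply(
      parts,
      function(p) if (length(p) > 1) p[2] else NA_character_,
      character(1)
    ),
    stringsAsFactors = FALSE
  )
}

tools_tbl <- parse_tools(c("bwa", "samtools=1.3"))
tools_tbl
```

The same character vector, c("bwa", "samtools=1.3"), is the form that would be passed to the requireTools argument.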
For parameter types, more details can be found here: "https://www.commonwl.org/v1.2/CommandLineTool.html#CWLType".
recipeMake is a convenience function for wrapping a shell script into a data recipe (in cwlProcess S4 class). Please use Rcwl::cwlProcess for more options and functionalities, especially when the recipe gets complicated, e.g., it needs a docker image for a command-line tool, or one parameter takes multiple types, etc. Refer to this recipe as an example: https://github.com/rworkflow/ReUseDataRecipe/blob/master/reference_genome.R
a data recipe in cwlProcess S4 class with all details about the shell script for data processing/curation, inputs, outputs, required tools and corresponding docker files. It is readily taken by getData() to evaluate the shell scripts included and generate the data locally. Find more details with ?Rcwl::cwlProcess.
## Not run:
library(Rcwl)
##############
### example 1
##############

script <- "
input=$1
outfile=$2
echo \"Print the input: $input\" > $outfile.txt
"
rcp <- recipeMake(shscript = script,
                  paramID = c("input", "outfile"),
                  paramType = c("string", "string"),
                  outputID = "echoout",
                  outputGlob = "*.txt")
inputs(rcp)
outputs(rcp)
rcp$input <- "Hello World!"
rcp$outfile <- "outfile"
res <- getData(rcp, outdir = tempdir(),
               notes = c("echo", "hello", "world", "txt"),
               showLog = TRUE)
readLines(res$out)

##############
### example 2
##############

shfile <- system.file("extdata", "gencode_transcripts.sh", package = "ReUseData")
readLines(shfile)
rcp <- recipeMake(shscript = shfile,
                  paramID = c("species", "version"),
                  paramType = c("string", "string"),
                  outputID = "transcripts",
                  outputGlob = "*.transcripts.fa*",
                  requireTools = c("wget", "gzip", "samtools")
                  )
Rcwl::inputs(rcp)
rcp$species <- "human"
rcp$version <- "42"
res <- getData(rcp,
               outdir = tempdir(),
               notes = c("gencode", "transcripts", "human", "42"),
               showLog = TRUE)
res$output
dir(tempdir())
## End(Not run)
Search existing data recipes.
recipeSearch(keywords = character(), cachePath = "ReUseDataRecipe")
keywords | Character vector of keywords to be matched to the recipe names. If not specified, the function returns the full recipe list. |
cachePath | A character string for the recipe cache. Must match the one specified in |
A recipeHub object.
recipeSearch()
recipeSearch("gencode")
recipeSearch(c("STAR", "index"))
Function to sync and get the most up-to-date and newly added data recipes from the public "rworkflow/ReUseDataRecipe" GitHub repository or a user-specified private GitHub repository.
recipeUpdate(
  cachePath = "ReUseDataRecipe",
  force = FALSE,
  remote = FALSE,
  repos = "rworkflow/ReUseDataRecipe"
)
cachePath | A character string specifying the name for the |
force | Whether to remove the existing recipe cache and regenerate it. Default is FALSE. Only use if any old recipes that have been previously cached locally are updated remotely (on GitHub |
remote | Whether to download the data recipes directly from a GitHub repository. Default is FALSE. |
repos | The GitHub repository containing data recipes that are to be synced to the local cache. Only works when |
a recipeHub object.
## recipeUpdate()
## recipeUpdate(force=TRUE)
## recipeUpdate(force = TRUE, remote = TRUE)