ReUseData provides functionalities to construct workflow-based data
recipes for fully tracked and reproducible data processing. Evaluation
of data recipes generates curated data resources in their generic
formats (e.g., VCF, bed), as well as a YAML manifest file recording the
recipe parameters, data annotations, and data file paths for subsequent
reuse. The datasets are locally cached using a database infrastructure,
where updating and searching of specific data is made easy.
The data reusability is assured through cloud hosting and enhanced interoperability with downstream software tools or analysis workflows. The workflow strategy enables cross platform reproducibility of curated data resources.
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("ReUseData")
Use the development version:
ReUseData core functions for data management
Here we introduce the core functions of ReUseData for data management
and reuse: getData for reproducible data generation, dataUpdate for
syncing and updating the data cache, and dataSearch for multi-keyword
searching of datasets of interest.
First, we can construct data recipes by transforming shell or other
ad hoc data preprocessing scripts into workflow-based data recipes. Some
prebuilt data recipes for public data resources (e.g., downloading,
unzipping and indexing) are available for direct use through the
recipeSearch and recipeLoad functions. We then assign values to the
input parameters and evaluate the recipe to generate the data of
interest.
## set cache in tempdir for test
Sys.setenv(cachePath = file.path(tempdir(), "cache"))
recipeUpdate()
#> Updating recipes...
#> STAR_index.R added
#> bowtie2_index.R added
#> echo_out.R added
#> ensembl_liftover.R added
#> gcp_broad_gatk_hg19.R added
#> gcp_broad_gatk_hg38.R added
#> gcp_gatk_mutect2_b37.R added
#> gcp_gatk_mutect2_hg38.R added
#> gencode_annotation.R added
#> gencode_genome_grch38.R added
#> gencode_transcripts.R added
#> hisat2_index.R added
#> reference_genome.R added
#> salmon_index.R added
#> ucsc_database.R added
#>
#> recipeHub with 15 records
#> cache path: /tmp/RtmpHFQPar/cache/ReUseDataRecipe
#> # recipeSearch() to query specific recipes using multipe keywords
#> # recipeUpdate() to update the local recipe cache
#>
#> name
#> BFC1 | STAR_index
#> BFC2 | bowtie2_index
#> BFC3 | echo_out
#> BFC4 | ensembl_liftover
#> BFC5 | gcp_broad_gatk_hg19
#> ... ...
#> BFC11 | gencode_transcripts
#> BFC12 | hisat2_index
#> BFC13 | reference_genome
#> BFC14 | salmon_index
#> BFC15 | ucsc_database
recipeSearch("echo")
#> recipeHub with 1 records
#> cache path: /tmp/RtmpHFQPar/cache/ReUseDataRecipe
#> # recipeSearch() to query specific recipes using multipe keywords
#> # recipeUpdate() to update the local recipe cache
#>
#> name
#> BFC3 | echo_out
echo_out <- recipeLoad("echo_out")
#> Note: you need to assign a name for the recipe: rcpName <- recipeLoad('xx')
#> Data recipe loaded!
#> Use inputs() to check required input parameters before evaluation.
#> Check here: https://rcwl.org/dataRecipes/echo_out.html
#> for user instructions (e.g., eligible input values, data source, etc.)
inputs(echo_out)
#> inputs:
#> input (input) (string):
#> outfile (outfile) (string):
Users can then assign values to the input parameters and evaluate the
recipe (getData) to generate the data of interest. Users need to specify
an output directory for all files (the desired data file and the
intermediate files that are generated internally, such as workflow
scripts and annotation files). Detailed notes for the data are
encouraged, since they are used for keyword matching in later data
searches.
We can install cwltool first to make sure a cwl-runner is available.
invisible(Rcwl::install_cwltool())
#> + /github/home/.cache/R/basilisk/1.19.0/0/bin/conda create --yes --prefix /github/home/.cache/R/basilisk/1.19.0/Rcwl/1.23.0/env_Rcwl 'python=3.11' --quiet -c conda-forge --override-channels
#> + /github/home/.cache/R/basilisk/1.19.0/0/bin/conda install --yes --prefix /github/home/.cache/R/basilisk/1.19.0/Rcwl/1.23.0/env_Rcwl 'python=3.11' -c conda-forge --override-channels
#> + /github/home/.cache/R/basilisk/1.19.0/0/bin/conda install --yes --prefix /github/home/.cache/R/basilisk/1.19.0/Rcwl/1.23.0/env_Rcwl -c conda-forge 'python=3.11' 'python=3.11' --override-channels
echo_out$input <- "Hello World!"
echo_out$outfile <- "outfile"
outdir <- file.path(tempdir(), "SharedData")
res <- getData(echo_out,
               outdir = outdir,
               notes = c("echo", "hello", "world", "txt"))
#> INFO Final process status is success
The file path to the newly generated dataset can be easily retrieved. It
can also be retrieved using the dataSearch() function with multiple
keywords, but dataUpdate() needs to be run first.
Several files are generated automatically to help track the data recipe
evaluation: a *.sh file recording the original shell script, a *.cwl
file, which is the official workflow script that was internally
submitted for data recipe evaluation, a *.yml file that is part of the
CWL workflow evaluation and also records data annotations, and a *.md5
checksum file to verify the integrity of the generated data file.
list.files(outdir, pattern = "echo")
#> [1] "echo_out_Hello_World!_outfile.cwl" "echo_out_Hello_World!_outfile.md5"
#> [3] "echo_out_Hello_World!_outfile.sh" "echo_out_Hello_World!_outfile.yml"
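The integrity check that the *.md5 file supports can be illustrated with base R alone. The following is a minimal sketch, not the package's own code: the demo file and the helper verify_md5() are hypothetical, and the exact layout of the *.md5 file written by ReUseData is not assumed here.

```r
## Sketch: verify a file against a recorded MD5 checksum (base R only).
## The file below is a stand-in created for illustration,
## not a real recipe output.
data_file <- file.path(tempdir(), "md5_demo.txt")
writeLines("Hello World!", data_file)

## Record the checksum at generation time
recorded <- unname(tools::md5sum(data_file))

## Hypothetical helper: recompute the checksum and compare
verify_md5 <- function(path, expected) {
    identical(unname(tools::md5sum(path)), expected)
}
ok_before <- verify_md5(data_file, recorded)

## Any change to the file contents changes the checksum
writeLines("Hello World?", data_file)
ok_after <- verify_md5(data_file, recorded)
```

Here ok_before is TRUE and ok_after is FALSE, which is the signal that a data file no longer matches what was originally generated.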
The *.yml file contains the recipe input parameters, the file path to
the output file, the notes for the dataset, and an automatically added
date recording when the data was generated. A later data search using
dataSearch() refers to this file for keyword matching.
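To make the idea of multi-keyword matching concrete, here is a small base R sketch. It assumes that a record matches only when every keyword is found in its notes; the helper match_all_keywords() is hypothetical, and the matching rules actually used by dataSearch() (fields searched, case handling) may differ.

```r
## Notes as they might be stored for two datasets
## (taken from the examples in this document)
notes <- c("echo hello world txt", "experiment data")

## Hypothetical helper: TRUE for each record whose text contains
## every keyword (plain substring match, case-sensitive).
match_all_keywords <- function(text, keywords) {
    hits <- vapply(keywords,
                   function(k) grepl(k, text, fixed = TRUE),
                   logical(length(text)))
    ## vapply returns a plain vector when length(text) == 1,
    ## so force a records-by-keywords matrix
    hits <- matrix(hits, nrow = length(text))
    rowSums(hits) == length(keywords)
}

first_query <- match_all_keywords(notes, c("echo", "hello", "world"))
second_query <- match_all_keywords(notes, "experiment")
```

With these inputs, first_query selects only the first record and second_query selects only the second, which is why descriptive notes at generation time make later searches more precise.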
dataUpdate() creates (on first use), syncs and updates the local cache
for curated datasets. It finds and reads all the .yml files recursively
in the provided data folder, creates a cache record for each associated
dataset (including those newly generated with getData()), and updates
the local cache for later data searching and reuse.
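The recursive discovery step described above can be pictured with base R. This is only a sketch of the file search, under the assumption that annotation files are identified by their .yml extension; the folder and file names are made up for illustration, and the caching that dataUpdate() performs afterwards is not reproduced.

```r
## Build a small throwaway folder tree (illustrative paths only)
root <- file.path(tempdir(), "yml_demo")
dir.create(file.path(root, "recipeA"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(root, "recipeB"), showWarnings = FALSE)
invisible(file.create(file.path(root, "recipeA", "a.yml"),
                      file.path(root, "recipeB", "b.yml"),
                      file.path(root, "recipeB", "b.md5")))

## Find every .yml file below the root, at any depth
yml_files <- list.files(root, pattern = "\\.yml$",
                        recursive = TRUE, full.names = TRUE)
found <- basename(yml_files)
```

Only a.yml and b.yml are picked up; the .md5 file is ignored because it does not match the pattern.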
IMPORTANT: It is recommended that users create a dedicated folder for
data archival (e.g., file/path/to/SharedData) that other group members
have access to, and use sub-folders for different kinds of datasets
(e.g., those generated from the same recipe).
(dh <- dataUpdate(dir = outdir))
#>
#> Updating data record...
#> outfile.txt added
#> dataHub with 1 records
#> cache path: /tmp/RtmpHFQPar/cache/ReUseData
#> # dataUpdate() to update the local data cache
#> # dataSearch() to query a specific dataset
#> # Additional information can be retrieved using:
#> # dataNames(), dataParams(), dataNotes(), dataPaths(), dataTag() or mcols()
#>
#> name Path
#> BFC1 | outfile.txt /tmp/RtmpHFQPar/SharedData/outfile.txt
dataUpdate and dataSearch return a dataHub object with a list of all
available or matching datasets.
One can subset the list with [ and use getter functions to retrieve the
annotation information about the data, e.g., data names, parameter
values passed to the recipe, notes, tags, and the corresponding YAML
file.
dh[1]
#> dataHub with 1 records
#> cache path: /tmp/RtmpHFQPar/cache/ReUseData
#> # dataUpdate() to update the local data cache
#> # dataSearch() to query a specific dataset
#> # Additional information can be retrieved using:
#> # dataNames(), dataParams(), dataNotes(), dataPaths(), dataTag() or mcols()
#>
#> name Path
#> BFC1 | outfile.txt /tmp/RtmpHFQPar/SharedData/outfile.txt
## dh["BFC1"]
dh[dataNames(dh) == "outfile.txt"]
#> dataHub with 1 records
#> cache path: /tmp/RtmpHFQPar/cache/ReUseData
#> # dataUpdate() to update the local data cache
#> # dataSearch() to query a specific dataset
#> # Additional information can be retrieved using:
#> # dataNames(), dataParams(), dataNotes(), dataPaths(), dataTag() or mcols()
#>
#> name Path
#> BFC1 | outfile.txt /tmp/RtmpHFQPar/SharedData/outfile.txt
dataNames(dh)
#> [1] "outfile.txt"
dataParams(dh)
#> [1] "input: Hello World!; outfile: outfile"
dataNotes(dh)
#> [1] "echo hello world txt"
dataTags(dh)
#> [1] ""
dataYml(dh)
#> [1] "/tmp/RtmpHFQPar/SharedData/echo_out_Hello_World!_outfile.yml"
ReUseData, as the name suggests, is committed to promoting data reuse.
Data can be prepared in standard input formats (toList), e.g., YAML and
JSON, to be easily integrated into workflow methods that are hosted
locally or on the cloud.
(dh1 <- dataSearch(c("echo", "hello", "world")))
#> dataHub with 1 records
#> cache path: /tmp/RtmpHFQPar/cache/ReUseData
#> # dataUpdate() to update the local data cache
#> # dataSearch() to query a specific dataset
#> # Additional information can be retrieved using:
#> # dataNames(), dataParams(), dataNotes(), dataPaths(), dataTag() or mcols()
#>
#> name Path
#> BFC1 | outfile.txt /tmp/RtmpHFQPar/SharedData/outfile.txt
toList(dh1, listNames = c("input_file"))
#> $input_file
#> [1] "/tmp/RtmpHFQPar/SharedData/outfile.txt"
toList(dh1, format = "yaml", listNames = c("input_file"))
#> [1] "input_file: /tmp/RtmpHFQPar/SharedData/outfile.txt"
toList(dh1, format = "json", file = file.path(tempdir(), "data.json"))
#> File is saved as: "/tmp/RtmpHFQPar/data.json"
#> {
#> "outfile.txt": "/tmp/RtmpHFQPar/SharedData/outfile.txt"
#> }
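The shape of these outputs can be imitated in a few lines of base R, which may help when wiring the paths into a downstream tool by hand. This is a sketch only: the helper as_input_list() is hypothetical and is not how toList() is implemented, and the values are copied from the example output above.

```r
## Stand-in values copied from the example output above
data_names <- "outfile.txt"
data_paths <- "/tmp/RtmpHFQPar/SharedData/outfile.txt"

## Hypothetical helper: attach one name to each data path
as_input_list <- function(paths, list_names) {
    stopifnot(length(paths) == length(list_names))
    stats::setNames(as.list(paths), list_names)
}

## Default naming by dataset name, or renaming to a workflow input name
inputs_default <- as_input_list(data_paths, data_names)
inputs_renamed <- as_input_list(data_paths, "input_file")

## One "key: value" line per entry, mirroring the YAML output above
yaml_lines <- paste0(names(inputs_renamed), ": ", unlist(inputs_renamed))
```

Renaming matters because a workflow usually expects its own input names (here input_file) rather than the file name of the cached dataset.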
Data can also be aggregated from different resources by tagging with specific software tools.
dataSearch()
#> dataHub with 1 records
#> cache path: /tmp/RtmpHFQPar/cache/ReUseData
#> # dataUpdate() to update the local data cache
#> # dataSearch() to query a specific dataset
#> # Additional information can be retrieved using:
#> # dataNames(), dataParams(), dataNotes(), dataPaths(), dataTag() or mcols()
#>
#> name Path
#> BFC1 | outfile.txt /tmp/RtmpHFQPar/SharedData/outfile.txt
dataTags(dh[1]) <- "#gatk"
dataSearch("#gatk")
#> dataHub with 1 records
#> cache path: /tmp/RtmpHFQPar/cache/ReUseData
#> # dataUpdate() to update the local data cache
#> # dataSearch() to query a specific dataset
#> # Additional information can be retrieved using:
#> # dataNames(), dataParams(), dataNotes(), dataPaths(), dataTag() or mcols()
#>
#> name Path
#> BFC1 | outfile.txt /tmp/RtmpHFQPar/SharedData/outfile.txt
The package can also be used to add annotations and notes to existing data resources or experiment data for management. Here we add the existing “exp_data” to the local data repository.
We first add notes to the data, and then update the data repository with information from the new dataset.
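exp_data <- file.path(tempdir(), "exp_data")  ## folder holding the existing data
dir.create(exp_data)
annData(exp_data, notes = c("experiment data"))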
#> meta.yml added
#> [1] "/tmp/RtmpHFQPar/exp_data/meta.yml"
dataUpdate(exp_data)
#>
#> Updating data record...
#> exp_data added
#> dataHub with 2 records
#> cache path: /tmp/RtmpHFQPar/cache/ReUseData
#> # dataUpdate() to update the local data cache
#> # dataSearch() to query a specific dataset
#> # Additional information can be retrieved using:
#> # dataNames(), dataParams(), dataNotes(), dataPaths(), dataTag() or mcols()
#>
#> name Path
#> BFC1 | outfile.txt /tmp/RtmpHFQPar/SharedData/outfile.txt
#> BFC2 | exp_data /tmp/RtmpHFQPar/exp_data
Now our data hub has cached meta information from two different directories, one from a data recipe and one from existing data. Data can be retrieved by keywords.
dataSearch("experiment")
#> dataHub with 1 records
#> cache path: /tmp/RtmpHFQPar/cache/ReUseData
#> # dataUpdate() to update the local data cache
#> # dataSearch() to query a specific dataset
#> # Additional information can be retrieved using:
#> # dataNames(), dataParams(), dataNotes(), dataPaths(), dataTag() or mcols()
#>
#> name Path
#> BFC2 | exp_data /tmp/RtmpHFQPar/exp_data
NOTE: if the argument cloud=TRUE is enabled, dataUpdate() will also
cache the pregenerated data sets (from evaluation of public ReUseData
recipes) that are available on the ReUseData Google bucket and return
them in the dataHub object, where they are fully searchable. Please see
the following section for details.
Using the prebuilt data recipes for curation (e.g., downloading, unzipping, indexing) of commonly used public data resources, we have pregenerated some data sets and put them on the cloud for direct use.
Before searching, one needs to use dataUpdate(cloud=TRUE) to sync the
existing data sets on the cloud; dataSearch() can then be used to search
any available data set, either in the local cache or on the cloud.
gcpdir <- file.path(tempdir(), "gcpData")
dataUpdate(gcpdir, cloud=TRUE)
#>
#> Updating data record...
#> 168e78a05d5d_GRCh38.primary_assembly.genome.fa.1.bt2 added
#> 168e54c85d70_GRCh38.primary_assembly.genome.fa.2.bt2 added
#> 168e3a6dac4f_GRCh38.primary_assembly.genome.fa.3.bt2 added
#> 168e5c714e8d_GRCh38.primary_assembly.genome.fa.4.bt2 added
#> 168e26c41b79_GRCh38.primary_assembly.genome.fa.rev.1.bt2 added
#> 168e62ea45bf_GRCh38.primary_assembly.genome.fa.rev.2.bt2 added
#> 168e2e176ec5_outfile.txt added
#> 168e76a52dec_GRCh37_to_GRCh38.chain added
#> 168e7c49bdda_GRCh37_to_NCBI34.chain added
#> 168e4af04f1f_GRCh37_to_NCBI35.chain added
#> 168e6f795729_GRCh37_to_NCBI36.chain added
#> 168e21f2257a_GRCh38_to_GRCh37.chain added
#> 168e11853f5a_GRCh38_to_NCBI34.chain added
#> 168e6c96301_GRCh38_to_NCBI35.chain added
#> 168e4c964f7b_GRCh38_to_NCBI36.chain added
#> 168e4dae3b25_NCBI34_to_GRCh37.chain added
#> 168e352cad41_NCBI34_to_GRCh38.chain added
#> 168e53299938_NCBI35_to_GRCh37.chain added
#> 168e5806a805_NCBI35_to_GRCh38.chain added
#> 168e3d67d45_NCBI36_to_GRCh37.chain added
#> 168e6446a3c3_NCBI36_to_GRCh38.chain added
#> 168e71a3bb98_GRCm38_to_NCBIM36.chain added
#> 168e323900cd_GRCm38_to_NCBIM37.chain added
#> 168e5341f847_NCBIM36_to_GRCm38.chain added
#> 168e4bc9b12f_NCBIM37_to_GRCm38.chain added
#> 168e46500a96_1000G_omni2.5.b37.vcf.gz added
#> 168e5878a147_1000G_omni2.5.b37.vcf.gz.tbi added
#> 168e2fc9099d_Mills_and_1000G_gold_standard.indels.b37.vcf.gz added
#> 168e7bff1ea3_Mills_and_1000G_gold_standard.indels.b37.vcf.gz.tbi added
#> 168e76a9a1e3_1000G_omni2.5.hg38.vcf.gz added
#> 168e486bef7e_1000G_omni2.5.hg38.vcf.gz.tbi added
#> 168e749f7c00_Mills_and_1000G_gold_standard.indels.hg38.vcf.gz added
#> 168e4b71ff53_Mills_and_1000G_gold_standard.indels.hg38.vcf.gz.tbi added
#> 168e2d99bce_af-only-gnomad.raw.sites.vcf added
#> 168e5110ca8d_af-only-gnomad.raw.sites.vcf.idx added
#> 168e72361acc_Mutect2-exome-panel.vcf.idx added
#> 168e65c3e18d_Mutect2-WGS-panel-b37.vcf added
#> 168e7f283952_Mutect2-WGS-panel-b37.vcf.idx added
#> 168e68db48b8_small_exac_common_3.vcf added
#> 168e620d9f67_small_exac_common_3.vcf.idx added
#> 168e4a188872_1000g_pon.hg38.vcf.gz added
#> 168e58549fe1_1000g_pon.hg38.vcf.gz.tbi added
#> 168e3ffc4e1_af-only-gnomad.hg38.vcf.gz added
#> 168e5b9dc7cc_af-only-gnomad.hg38.vcf.gz.tbi added
#> 168e5f1e02e2_small_exac_common_3.hg38.vcf.gz added
#> 168e5096145c_small_exac_common_3.hg38.vcf.gz.tbi added
#> 168e294c02f2_gencode.v41.annotation.gtf added
#> 168e144ab023_gencode.v42.annotation.gtf added
#> 168e23bfad94_gencode.vM30.annotation.gtf added
#> 168e152aaf7_gencode.vM31.annotation.gtf added
#> 168e18212d68_gencode.v41.transcripts.fa added
#> 168e8065157_gencode.v41.transcripts.fa.fai added
#> 168e72f66690_gencode.v42.transcripts.fa added
#> 168e4a5a2e35_gencode.v42.transcripts.fa.fai added
#> 168e5b48499e_gencode.vM30.pc_transcripts.fa added
#> 168e3ec017bf_gencode.vM30.pc_transcripts.fa.fai added
#> 168e10aa38cc_gencode.vM31.pc_transcripts.fa added
#> 168e33c0eae6_gencode.vM31.pc_transcripts.fa.fai added
#> 168e6e89215c_GRCh38.primary_assembly.genome.fa.1.ht2 added
#> 168eca9576f_GRCh38.primary_assembly.genome.fa.2.ht2 added
#> 168e2a6a8cc9_GRCh38.primary_assembly.genome.fa.3.ht2 added
#> 168e36f510db_GRCh38.primary_assembly.genome.fa.4.ht2 added
#> 168e148d370_GRCh38.primary_assembly.genome.fa.5.ht2 added
#> 168e75dc8c1c_GRCh38.primary_assembly.genome.fa.6.ht2 added
#> 168e39ceaca9_GRCh38.primary_assembly.genome.fa.7.ht2 added
#> 168e52599dfd_GRCh38.primary_assembly.genome.fa.8.ht2 added
#> 168e6812a6e8_GRCh38_full_analysis_set_plus_decoy_hla.fa.fai added
#> 168e1f928e36_GRCh38_full_analysis_set_plus_decoy_hla.fa.amb added
#> 168e5181d750_GRCh38_full_analysis_set_plus_decoy_hla.fa.ann added
#> 168e50edefa0_GRCh38_full_analysis_set_plus_decoy_hla.fa.bwt added
#> 168e1a02d9d_GRCh38_full_analysis_set_plus_decoy_hla.fa.pac added
#> 168e1b9a5fc2_GRCh38_full_analysis_set_plus_decoy_hla.fa.sa added
#> 168e29428f81_GRCh38_full_analysis_set_plus_decoy_hla.fa added
#> 168e59ff27e_GRCh38.primary_assembly.genome.fa.fai added
#> 168e7738278e_GRCh38.primary_assembly.genome.fa.amb added
#> 168e8609263_GRCh38.primary_assembly.genome.fa.ann added
#> 168e563606db_GRCh38.primary_assembly.genome.fa.bwt added
#> 168e20842a80_GRCh38.primary_assembly.genome.fa.pac added
#> 168e1cab4286_GRCh38.primary_assembly.genome.fa.sa added
#> 168e79f5b46f_GRCh38.primary_assembly.genome.fa added
#> 168e21d6d578_hs37d5.fa.fai added
#> 168e34cc6fee_hs37d5.fa.amb added
#> 168e1fc05c7_hs37d5.fa.ann added
#> 168e14cd3c08_hs37d5.fa.bwt added
#> 168e7f269e23_hs37d5.fa.pac added
#> 168e5d444f65_hs37d5.fa.sa added
#> 168e538d53c7_hs37d5.fa added
#> 168efd0d6ef_complete_ref_lens.bin added
#> 168e11053a4b_ctable.bin added
#> 168e42167523_ctg_offsets.bin added
#> 168e1c7a2e5f_duplicate_clusters.tsv added
#> 168e3b6fc714_info.json added
#> 168e790b85fe_mphf.bin added
#> 168e1dc301cf_pos.bin added
#> 168e314c5330_pre_indexing.log added
#> 168e32da32a7_rank.bin added
#> 168e701c9fcc_ref_indexing.log added
#> 168e195efa18_refAccumLengths.bin added
#> 168e526cc0dd_reflengths.bin added
#> 168e419e771c_refseq.bin added
#> 168e6a4ce9b8_seq.bin added
#> 168e540cee7a_versionInfo.json added
#> 168e5d38d6de_salmon_index added
#> 168e138f7939_chrLength.txt added
#> 168e59ace0f9_chrName.txt added
#> 168e5470fe6d_chrNameLength.txt added
#> 168e1bf00b9c_chrStart.txt added
#> 168e2fe2e7d4_exonGeTrInfo.tab added
#> 168e74f528ed_exonInfo.tab added
#> 168e389b4e22_geneInfo.tab added
#> 168e29d89c43_Genome added
#> 168e16cbfe65_genomeParameters.txt added
#> 168e6d67be10_Log.out added
#> 168e2bd4a20a_SA added
#> 168e2b993a6d_SAindex added
#> 168e6c8e5c34_sjdbInfo.txt added
#> 168e918f170_sjdbList.fromGTF.out.tab added
#> 168e7f268e34_sjdbList.out.tab added
#> 168e7c5f3323_transcriptInfo.tab added
#> 168e1a1e2bbb_GRCh38.GENCODE.v42_100 added
#> 168e413d0358_knownGene_hg38.sql added
#> 168e18d96182_knownGene_hg38.txt added
#> 168e558df2d0_refGene_hg38.sql added
#> 168e3a488956_refGene_hg38.txt added
#> 168e369c6351_knownGene_mm39.sql added
#> 168e6da4600_knownGene_mm39.txt added
#> 168e6d22bbfe_refGene_mm39.sql added
#> 168e26b9031e_refGene_mm39.txt added
#> dataHub with 130 records
#> cache path: /tmp/RtmpHFQPar/cache/ReUseData
#> # dataUpdate() to update the local data cache
#> # dataSearch() to query a specific dataset
#> # Additional information can be retrieved using:
#> # dataNames(), dataParams(), dataNotes(), dataPaths(), dataTag() or mcols()
#>
#> name
#> BFC1 | outfile.txt
#> BFC2 | exp_data
#> BFC3 | GRCh38.primary_assembly.genome.fa.1.bt2
#> BFC4 | GRCh38.primary_assembly.genome.fa.2.bt2
#> BFC5 | GRCh38.primary_assembly.genome.fa.3.bt2
#> ... ...
#> BFC126 | refGene_hg38.txt
#> BFC127 | knownGene_mm39.sql
#> BFC128 | knownGene_mm39.txt
#> BFC129 | refGene_mm39.sql
#> BFC130 | refGene_mm39.txt
#> Path
#> BFC1 /tmp/RtmpHFQPar/SharedData/outfile.txt
#> BFC2 /tmp/RtmpHFQPar/exp_data
#> BFC3 https://storage.googleapis.com/reusedata/bowtie2_index/GRCh38.pr...
#> BFC4 https://storage.googleapis.com/reusedata/bowtie2_index/GRCh38.pr...
#> BFC5 https://storage.googleapis.com/reusedata/bowtie2_index/GRCh38.pr...
#> ... ...
#> BFC126 https://storage.googleapis.com/reusedata/ucsc_database/refGene_h...
#> BFC127 https://storage.googleapis.com/reusedata/ucsc_database/knownGene...
#> BFC128 https://storage.googleapis.com/reusedata/ucsc_database/knownGene...
#> BFC129 https://storage.googleapis.com/reusedata/ucsc_database/refGene_m...
#> BFC130 https://storage.googleapis.com/reusedata/ucsc_database/refGene_m...
If the data of interest already exists on the cloud, then getCloudData
will download it directly to your computer. Add it to the local caching
system using dataUpdate() for later use.
(dh <- dataSearch(c("ensembl", "GRCh38")))
#> dataHub with 8 records
#> cache path: /tmp/RtmpHFQPar/cache/ReUseData
#> # dataUpdate() to update the local data cache
#> # dataSearch() to query a specific dataset
#> # Additional information can be retrieved using:
#> # dataNames(), dataParams(), dataNotes(), dataPaths(), dataTag() or mcols()
#>
#> name
#> BFC10 | GRCh37_to_GRCh38.chain
#> BFC14 | GRCh38_to_GRCh37.chain
#> BFC15 | GRCh38_to_NCBI34.chain
#> BFC16 | GRCh38_to_NCBI35.chain
#> BFC17 | GRCh38_to_NCBI36.chain
#> BFC19 | NCBI34_to_GRCh38.chain
#> BFC21 | NCBI35_to_GRCh38.chain
#> BFC23 | NCBI36_to_GRCh38.chain
#> Path
#> BFC10 https://storage.googleapis.com/reusedata/ensembl_liftover/GRCh37...
#> BFC14 https://storage.googleapis.com/reusedata/ensembl_liftover/GRCh38...
#> BFC15 https://storage.googleapis.com/reusedata/ensembl_liftover/GRCh38...
#> BFC16 https://storage.googleapis.com/reusedata/ensembl_liftover/GRCh38...
#> BFC17 https://storage.googleapis.com/reusedata/ensembl_liftover/GRCh38...
#> BFC19 https://storage.googleapis.com/reusedata/ensembl_liftover/NCBI34...
#> BFC21 https://storage.googleapis.com/reusedata/ensembl_liftover/NCBI35...
#> BFC23 https://storage.googleapis.com/reusedata/ensembl_liftover/NCBI36...
getCloudData(dh[1], outdir = gcpdir)
#> Data is downloaded:
#> /tmp/RtmpHFQPar/gcpData/GRCh37_to_GRCh38.chain
Now we create the data cache with only local data files, and we can see that the downloaded data is available.
dataUpdate(gcpdir) ## Update local data cache (without cloud data)
#>
#> Updating data record...
#> GRCh37_to_GRCh38.chain added
#> dataHub with 131 records
#> cache path: /tmp/RtmpHFQPar/cache/ReUseData
#> # dataUpdate() to update the local data cache
#> # dataSearch() to query a specific dataset
#> # Additional information can be retrieved using:
#> # dataNames(), dataParams(), dataNotes(), dataPaths(), dataTag() or mcols()
#>
#> name
#> BFC1 | outfile.txt
#> BFC2 | exp_data
#> BFC3 | GRCh38.primary_assembly.genome.fa.1.bt2
#> BFC4 | GRCh38.primary_assembly.genome.fa.2.bt2
#> BFC5 | GRCh38.primary_assembly.genome.fa.3.bt2
#> ... ...
#> BFC127 | knownGene_mm39.sql
#> BFC128 | knownGene_mm39.txt
#> BFC129 | refGene_mm39.sql
#> BFC130 | refGene_mm39.txt
#> BFC131 | GRCh37_to_GRCh38.chain
#> Path
#> BFC1 /tmp/RtmpHFQPar/SharedData/outfile.txt
#> BFC2 /tmp/RtmpHFQPar/exp_data
#> BFC3 https://storage.googleapis.com/reusedata/bowtie2_index/GRCh38.pr...
#> BFC4 https://storage.googleapis.com/reusedata/bowtie2_index/GRCh38.pr...
#> BFC5 https://storage.googleapis.com/reusedata/bowtie2_index/GRCh38.pr...
#> ... ...
#> BFC127 https://storage.googleapis.com/reusedata/ucsc_database/knownGene...
#> BFC128 https://storage.googleapis.com/reusedata/ucsc_database/knownGene...
#> BFC129 https://storage.googleapis.com/reusedata/ucsc_database/refGene_m...
#> BFC130 https://storage.googleapis.com/reusedata/ucsc_database/refGene_m...
#> BFC131 /tmp/RtmpHFQPar/gcpData/GRCh37_to_GRCh38.chain
dataSearch() ## data is available locally!!!
#> dataHub with 131 records
#> cache path: /tmp/RtmpHFQPar/cache/ReUseData
#> # dataUpdate() to update the local data cache
#> # dataSearch() to query a specific dataset
#> # Additional information can be retrieved using:
#> # dataNames(), dataParams(), dataNotes(), dataPaths(), dataTag() or mcols()
#>
#> name
#> BFC1 | outfile.txt
#> BFC2 | exp_data
#> BFC3 | GRCh38.primary_assembly.genome.fa.1.bt2
#> BFC4 | GRCh38.primary_assembly.genome.fa.2.bt2
#> BFC5 | GRCh38.primary_assembly.genome.fa.3.bt2
#> ... ...
#> BFC127 | knownGene_mm39.sql
#> BFC128 | knownGene_mm39.txt
#> BFC129 | refGene_mm39.sql
#> BFC130 | refGene_mm39.txt
#> BFC131 | GRCh37_to_GRCh38.chain
#> Path
#> BFC1 /tmp/RtmpHFQPar/SharedData/outfile.txt
#> BFC2 /tmp/RtmpHFQPar/exp_data
#> BFC3 https://storage.googleapis.com/reusedata/bowtie2_index/GRCh38.pr...
#> BFC4 https://storage.googleapis.com/reusedata/bowtie2_index/GRCh38.pr...
#> BFC5 https://storage.googleapis.com/reusedata/bowtie2_index/GRCh38.pr...
#> ... ...
#> BFC127 https://storage.googleapis.com/reusedata/ucsc_database/knownGene...
#> BFC128 https://storage.googleapis.com/reusedata/ucsc_database/knownGene...
#> BFC129 https://storage.googleapis.com/reusedata/ucsc_database/refGene_m...
#> BFC130 https://storage.googleapis.com/reusedata/ucsc_database/refGene_m...
#> BFC131 /tmp/RtmpHFQPar/gcpData/GRCh37_to_GRCh38.chain
The data supports user-friendly discovery and access through the
ReUseData portal, where detailed instructions are provided for
straightforward incorporation into data analysis pipelines run on local
computing nodes, web resources, and cloud computing platforms (e.g.,
Terra, CGC).
Here we provide a function meta_data() to create a data frame that
contains all information about the data sets in the specified file path
(searched recursively), including the annotation file ($yml column),
parameter values for the recipe ($params column), data file path
($output column), keywords for the data file (notes column), date of
data generation (date column), and any tag if available (tag column).
Use cleanup = TRUE to clean up any invalid or expired/older intermediate
files.
sessionInfo()
#> R version 4.4.1 (2024-06-14)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] ReUseData_1.7.0 Rcwl_1.23.0 S4Vectors_0.45.0
#> [4] BiocGenerics_0.53.1 generics_0.1.3 yaml_2.3.10
#> [7] BiocStyle_2.35.0
#>
#> loaded via a namespace (and not attached):
#> [1] tidyselect_1.2.1 dplyr_1.1.4 blob_1.2.4
#> [4] filelock_1.0.3 R.utils_2.12.3 fastmap_1.2.0
#> [7] BiocFileCache_2.15.0 promises_1.3.0 digest_0.6.37
#> [10] base64url_1.4 mime_0.12 lifecycle_1.0.4
#> [13] RSQLite_2.3.7 magrittr_2.0.3 compiler_4.4.1
#> [16] rlang_1.1.4 sass_0.4.9 progress_1.2.3
#> [19] tools_4.4.1 utf8_1.2.4 data.table_1.16.2
#> [22] knitr_1.48 prettyunits_1.2.0 brew_1.0-10
#> [25] htmlwidgets_1.6.4 bit_4.5.0 curl_5.2.3
#> [28] reticulate_1.39.0 RColorBrewer_1.1-3 batchtools_0.9.17
#> [31] BiocParallel_1.41.0 purrr_1.0.2 withr_3.0.2
#> [34] sys_3.4.3 R.oo_1.27.0 grid_4.4.1
#> [37] fansi_1.0.6 git2r_0.35.0 xtable_1.8-4
#> [40] cli_3.6.3 rmarkdown_2.28 DiagrammeR_1.0.11
#> [43] crayon_1.5.3 httr_1.4.7 visNetwork_2.1.2
#> [46] DBI_1.2.3 cachem_1.1.0 parallel_4.4.1
#> [49] BiocManager_1.30.25 basilisk_1.19.0 vctrs_0.6.5
#> [52] Matrix_1.7-1 jsonlite_1.8.9 dir.expiry_1.15.0
#> [55] hms_1.1.3 bit64_4.5.2 maketools_1.3.1
#> [58] jquerylib_0.1.4 RcwlPipelines_1.23.0 glue_1.8.0
#> [61] codetools_0.2-20 stringi_1.8.4 later_1.3.2
#> [64] tibble_3.2.1 pillar_1.9.0 basilisk.utils_1.19.0
#> [67] rappdirs_0.3.3 htmltools_0.5.8.1 R6_2.5.1
#> [70] dbplyr_2.5.0 evaluate_1.0.1 shiny_1.9.1
#> [73] lattice_0.22-6 R.methodsS3_1.8.2 png_0.1-8
#> [76] backports_1.5.0 memoise_2.0.1 httpuv_1.6.15
#> [79] bslib_0.8.0 Rcpp_1.0.13-1 checkmate_2.3.2
#> [82] xfun_0.49 buildtools_1.0.0 pkgconfig_2.0.3