Organizing files on a local machine can be cumbersome. This is especially true for local copies of remote resources, which may periodically need to be downloaded again to stay up to date. BiocFileCache is designed to help manage local and remote resource files stored locally. It provides a convenient location to organize files, and once a file has been added to the cache, the package provides functions to determine whether a remote resource is out of date and requires a new download.
BiocFileCache is a Bioconductor package and can be installed through BiocManager::install().
if (!"BiocManager" %in% rownames(installed.packages()))
    install.packages("BiocManager")
BiocManager::install("BiocFileCache", dependencies=TRUE)
After the package is installed, it can be loaded into the R workspace with library(BiocFileCache).
The initial step in using BiocFileCache to manage files is to create a cache object specifying a location. We will create a temporary directory for use with the examples in this vignette. If a path is not specified upon creation, the default location is a BiocFileCache directory in the typical user cache directory, as defined by tools::R_user_dir("BiocFileCache", which="cache").
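As a minimal sketch of the step described above, the following creates a cache in a temporary location and inspects it. The variable names path and bfc are illustrative, and ask = FALSE is assumed here so that the directory is created without an interactive prompt.

```r
library(BiocFileCache)

# Create a cache in a throwaway location; ask = FALSE suppresses the
# interactive prompt before the directory is created.
path <- tempfile()
bfc <- BiocFileCache(path, ask = FALSE)

bfccache(bfc)   # directory backing the cache
bfccount(bfc)   # number of resources tracked so far
```

A persistent cache would be created the same way, with a stable directory in place of tempfile().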
One use for BiocFileCache is to save local copies of remote resources. The benefits of this approach include reproducibility, faster access, and access (once cached) without need for an internet connection. An example is an Ensembl GTF file (also available via AnnotationHub):
## paste to avoid long line in vignette
url <- paste(
    "ftp://ftp.ensembl.org/pub/release-71/gtf",
    "homo_sapiens/Homo_sapiens.GRCh37.71.gtf.gz",
    sep="/")
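For a system-wide cache, simply load the BiocFileCache package and ask for the local resource path (rpath) of the resource, e.g., path <- bfcrpath(BiocFileCache(), url).
Use the path returned by bfcrpath() as usual, e.g., as the input to rtracklayer::import.gff().
A more compact use, the first or any time, is to nest the calls: rtracklayer::import.gff(bfcrpath(BiocFileCache(), url)).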
Ensembl releases do not change with time, so there is no need to check whether the cached resource needs to be updated.
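The add-on-first-use behaviour of bfcrpath() can be tried without network access. The sketch below is illustrative only: a small local file stands in for the remote GTF file, and a temporary cache stands in for the system-wide one, so src and bfc are assumptions made for the example rather than part of the workflow above.

```r
library(BiocFileCache)

# Stand-ins (assumptions for this sketch): a temporary cache instead of the
# system-wide one, and a small local file instead of the Ensembl URL.
bfc <- BiocFileCache(tempfile(), ask = FALSE)
src <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line"), src)

# First call: the resource is not tracked yet, so it is added to the cache.
path1 <- bfcrpath(bfc, src)

# Later calls: the resource is found by name and the same path is returned.
path2 <- bfcrpath(bfc, src)

identical(unname(path1), unname(path2))
readLines(path1)
```

With a real URL in place of src, the first call would download the file and later calls would reuse the cached copy.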
One might use BiocFileCache to cache results from experimental analysis. The rname field can hold descriptive metadata to help manage collections of resources, without relying on cryptic file naming conventions.
Here we create or use a local file cache in the directory in which we are doing our analysis.
We perform our analysis…
suppressPackageStartupMessages({
    library(DESeq2)
    library(airway)
})
data(airway)
dds <- DESeqDataSet(airway, design = ~ cell + dex)
result <- DESeq(dds)
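…and then save our result in a location provided by BiocFileCache, e.g., by calling saveRDS() on the path returned by bfcnew().
Retrieve the result at a later date by passing the path returned by bfcrpath() to readRDS().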
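The save-and-retrieve steps can be sketched as follows. This is an illustration under stated assumptions: a small data frame stands in for the DESeq2 result, a temporary directory stands in for the analysis directory, and the rname used below is a label chosen for the example.

```r
library(BiocFileCache)

# Stand-ins (assumptions for this sketch): a temporary directory instead of
# the analysis directory, and a small data frame instead of the DESeq2 result.
bfc <- BiocFileCache(tempfile("my-experiment-"), ask = FALSE)
result <- data.frame(gene = c("g1", "g2"), padj = c(0.01, 0.20))

# bfcnew() reserves a path in the cache under a descriptive rname;
# the caller is responsible for writing the file to that path.
rname <- "airway / DESeq standard analysis"
savepath <- bfcnew(bfc, rname, ext = ".rds")
saveRDS(result, savepath)

# Later: look the resource up by its rname and read it back.
restored <- readRDS(bfcrpath(bfc, rname))
identical(result, restored)
```

Looking resources up by rname works best when each rname is unique within the cache.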
One might imagine the following workflow:
suppressPackageStartupMessages({
    library(BiocFileCache)
    library(rtracklayer)
})
# load the cache
path <- file.path(tempdir(), "tempCacheDir")
bfc <- BiocFileCache(path)
# the web resource of interest
url <- "ftp://ftp.ensembl.org/pub/release-71/gtf/homo_sapiens/Homo_sapiens.GRCh37.71.gtf.gz"
# check if url is being tracked
res <- bfcquery(bfc, url, exact=TRUE)
if (bfccount(res) == 0L) {
    # if it is not in cache, add
    ans <- bfcadd(bfc, rname="ensembl, homo sapien", fpath=url)
} else {
    # if it is in cache, get path to load
    rid = res$rid
    ans <- bfcrpath(bfc, rids = rid)
    # check to see if the resource needs to be updated
    check <- bfcneedsupdate(bfc, rid)
    # check can be NA if it cannot be determined, choose how to handle
    if (is.na(check)) check <- TRUE
    if (check) {
        ans <- bfcdownload(bfc, rid)
    }
}
# ans is the path of the file to load
ans
# we know because we search for the url that the file is a .gtf.gz,
# if we searched on other terms we can use 'bfcpath' to see the
# original fpath to know the appropriate load/read/import method
bfcpath(bfc, names(ans))
temp = GTFFile(ans)
info = import(temp)
#
# A simpler test to see if something is in the cache
# and if not start tracking it is using `bfcrpath`
#
suppressPackageStartupMessages({
    library(BiocFileCache)
    library(rtracklayer)
})
# load the cache
path <- file.path(tempdir(), "tempCacheDir")
bfc <- BiocFileCache(path, ask=FALSE)
# the web resources of interest
url <- "ftp://ftp.ensembl.org/pub/release-71/gtf/homo_sapiens/Homo_sapiens.GRCh37.71.gtf.gz"
url2 <- "ftp://ftp.ensembl.org/pub/release-71/gtf/rattus_norvegicus/Rattus_norvegicus.Rnor_5.0.71.gtf.gz"
# if not in cache will download and create new entry
pathsToLoad <- bfcrpath(bfc, c(url, url2))
## adding rname 'ftp://ftp.ensembl.org/pub/release-71/gtf/homo_sapiens/Homo_sapiens.GRCh37.71.gtf.gz'
## adding rname 'ftp://ftp.ensembl.org/pub/release-71/gtf/rattus_norvegicus/Rattus_norvegicus.Rnor_5.0.71.gtf.gz'
pathsToLoad
## BFC1
## "/tmp/RtmpgSNcjm/tempCacheDir/16a86c989488_Homo_sapiens.GRCh37.71.gtf.gz"
## BFC2
## "/tmp/RtmpgSNcjm/tempCacheDir/16a86140666b_Rattus_norvegicus.Rnor_5.0.71.gtf.gz"
# now load files as you see fit
info = import(GTFFile(pathsToLoad[1]))
class(info)
## [1] "GRanges"
## attr(,"package")
## [1] "GenomicRanges"
summary(info)
## [1] "GRanges object with 2253155 ranges and 12 metadata columns"
A package may want to use BiocFileCache to manage remote data. The following example code provides some best-practice guidelines.
Presumably, the cache could be accessed from a variety of places within code, examples, and vignettes. It is therefore desirable to have a wrapper around the BiocFileCache constructor. The following is a suggested example for a package called MyNewPackage:
.get_cache <-
    function()
{
    cache <- tools::R_user_dir("MyNewPackage", which="cache")
    BiocFileCache::BiocFileCache(cache)
}
Essentially this will create a unique cache for the package. If run interactively, the user will have the option to permanently create the package cache, else a temporary directory will be used.
Managing remote resources then involves a function that queries the cache to see whether the resource has already been added. If it has not, the function adds it to the cache; if it has, the function checks whether the file needs to be updated.
download_data_file <-
    function( verbose = FALSE )
{
    fileURL <- "http://a_path_to/someremotefile.tsv.gz"
    bfc <- .get_cache()
    rid <- bfcquery(bfc, "geneFileV2", "rname")$rid
    if (!length(rid)) {
        if( verbose )
            message( "Downloading GENE file" )
        rid <- names(bfcadd(bfc, "geneFileV2", fileURL ))
    }
    if (!isFALSE(bfcneedsupdate(bfc, rid)))
        bfcdownload(bfc, rid)
    bfcrpath(bfc, rids = rid)
}
A case has been identified where it may be desired to do some processing of web-based resources before saving the resource in the cache. This can be done through specific options of the bfcadd() and bfcdownload() functions.

1. bfcadd() using the download=FALSE argument.
2. bfcdownload() using the FUN argument.

The FUN argument is the name of a function to be applied before saving the downloaded file into the cache. The default is file.rename, simply moving the downloaded file into the cache. A user-supplied function must take ONLY two arguments. When invoked, the arguments will be:

1. character(1) A temporary file containing the resource as retrieved from the web.
2. character(1) The BiocFileCache location where the processed file should be saved.

The function should return TRUE on success or a character(1) description of the failure on error. As an example:
url <- "http://bioconductor.org/packages/stats/bioc/BiocFileCache/BiocFileCache_stats.tab"
headFile <-                         # how to process file before caching
    function(from, to)
{
    dat <- readLines(from)
    writeLines(head(dat), to)
    TRUE
}
rid <- bfcquery(bfc, url, "fpath")$rid
if (!length(rid))                   # not in cache, add but do not download
    rid <- names(bfcadd(bfc, url, download = FALSE))
update <- bfcneedsupdate(bfc, rid)  # TRUE if newly added or stale
if (!isFALSE(update))               # download & process
    bfcdownload(bfc, rid, ask = FALSE, FUN = headFile)
## Warning in readLines(from): incomplete final line found on
## '/tmp/RtmpgSNcjm/tempCacheDir/file16a82b722501'
## BFC3
## "/tmp/RtmpgSNcjm/tempCacheDir/16a81804c354_BiocFileCache_stats.tab"
rpath <- bfcrpath(bfc, rids=rid) # path to processed result
readLines(rpath) # read processed result
## [1] "Year\tMonth\tNb_of_distinct_IPs\tNb_of_downloads"
## [2] "2024\tJan\t25214\t52681"
## [3] "2024\tFeb\t23028\t49660"
## [4] "2024\tMar\t28213\t78340"
## [5] "2024\tApr\t35612\t80526"
## [6] "2024\tMay\t28584\t48808"
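Because a FUN callback is an ordinary two-argument function, it can be exercised on temporary files before it is handed to bfcdownload(). The sketch below does this for a function with the same shape as the example above. The name firstLines and the file contents are assumptions made for illustration, and no cache or network access is involved.

```r
# A callback with the (from, to) signature described above: keep only the
# first six lines (the default of head()) of the downloaded file.
firstLines <- function(from, to) {
    dat <- readLines(from)
    writeLines(head(dat), to)
    TRUE
}

# Simulate the two paths that would be supplied at download time.
from <- tempfile()   # stands in for the temporary download
to <- tempfile()     # stands in for the location inside the cache
writeLines(sprintf("row %d", 1:20), from)

status <- firstLines(from, to)
isTRUE(status)           # the callback signals success with TRUE
length(readLines(to))    # number of lines kept in the processed file
```

Testing the callback in isolation makes it easier to separate processing errors from download errors.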
Note: By default bfcadd uses the web file name as the saved local file. If the processing step involves saving the data in a different format, use the bfcadd argument ext to assign an extension that identifies the type of file that was saved. For example:
url = "http://httpbin.org/get"
bfcadd(bfc, "myfile", url, download=FALSE)
# would save a file `<uniqueid>_get` in the cache
bfcadd(bfc, "myfile", url, download=FALSE, ext=".Rdata")
# would save a file `<uniqueid>_get.Rdata` in the cache
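The effect of ext can be inspected directly, since bfcadd() with download=FALSE is expected only to register the resource. A minimal sketch, assuming a temporary cache (bfc is created here just for the example) and the same rname and URL as above:

```r
library(BiocFileCache)

# Temporary cache used only for this illustration.
bfc <- BiocFileCache(tempfile(), ask = FALSE)
url <- "http://httpbin.org/get"

# Register the resource twice without downloading, once without and once
# with an extension, and compare the file names chosen in the cache.
plain <- bfcadd(bfc, "myfile", url, download = FALSE)
with_ext <- bfcadd(bfc, "myfile", url, download = FALSE, ext = ".Rdata")

basename(plain)
basename(with_ext)
```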
BiocFileCache uses the HEAD and GET functions from the CRAN package httr for accessing web resources. This can be problematic if operating behind a proxy. The easiest solution is to pass the proxy information to httr::set_config().
proxy <- httr::use_proxy("http://my_user:my_password@myproxy:8080")
## or
proxy <- httr::use_proxy(Sys.getenv('http_proxy'))
httr::set_config(proxy)
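When the proxy is taken from the environment, it can be worth guarding against an unset variable. The helper below is a hypothetical convenience wrapper (configure_proxy is not part of httr or BiocFileCache); it only calls httr::set_config() when a proxy URL is actually available.

```r
# Hypothetical helper: apply proxy settings only when a proxy URL is known.
# Returns TRUE if a proxy was configured and FALSE otherwise.
configure_proxy <- function(proxy_url = Sys.getenv("http_proxy")) {
    if (!nzchar(proxy_url))
        return(FALSE)
    httr::set_config(httr::use_proxy(proxy_url))
    TRUE
}

configure_proxy("")   # returns FALSE: no proxy URL was supplied
```

Calling configure_proxy() with no arguments picks up the http_proxy environment variable, if it is set.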
A cache may need to be shared across multiple users on a system, which can lead to permission errors. To allow access for multiple users, create a group that the users belong to and that the cache belongs to. The permissions of potentially two files need to be altered, depending on what you would like individuals to be able to do with the cache. A read-only cache requires manually changing the BiocFileCache.sqlite.LOCK file so that its group permissions are g+rw. To allow users to download files to the shared cache, both the BiocFileCache.sqlite.LOCK file and the BiocFileCache.sqlite file need group permissions of g+rw. Please search online for how to create a user group on your system of interest. To find the location of the cache, so that you can change the group and file permissions, run the following in R if you used the default location: tools::R_user_dir("BiocFileCache", which="cache"), or, if you created a unique location, something like the following: bfc = BiocFileCache(cache="someUniquelocation"); bfccache(bfc).
For quick reference, on Linux you will use chown currentuser:newgroup to change the group and chmod to change the file permissions: chmod 660 or chmod g+rw should set the correct permissions.
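For readers who prefer to stay in R, the file modes can also be set with base R's Sys.chmod(). The following is a sketch, assuming a Unix-like system. It operates on a temporary cache so that nothing in a real cache is modified, and changing the group itself still has to be done outside R.

```r
library(BiocFileCache)

# Temporary cache used for illustration; for a real shared cache,
# call bfccache() on that cache instead.
bfc <- BiocFileCache(tempfile(), ask = FALSE)
cache_dir <- bfccache(bfc)

# The two files whose permissions control shared access; keep only
# those that exist on disk.
db_files <- file.path(
    cache_dir,
    c("BiocFileCache.sqlite", "BiocFileCache.sqlite.LOCK"))
db_files <- db_files[file.exists(db_files)]

# 660 = read/write for owner and group, no access for others.
Sys.chmod(db_files, mode = "0660", use_umask = FALSE)
file.mode(db_files)
```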
It is our hope that this package allows for easier management of local and remote resources.
sessionInfo()
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] rtracklayer_1.67.0 GenomicRanges_1.59.1 GenomeInfoDb_1.43.2
## [4] IRanges_2.41.2 S4Vectors_0.45.2 BiocGenerics_0.53.3
## [7] generics_0.1.3 dplyr_1.1.4 BiocFileCache_2.15.0
## [10] dbplyr_2.5.0 BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] SummarizedExperiment_1.37.0 rjson_0.2.23
## [3] xfun_0.49 bslib_0.8.0
## [5] lattice_0.22-6 Biobase_2.67.0
## [7] vctrs_0.6.5 tools_4.4.2
## [9] bitops_1.0-9 parallel_4.4.2
## [11] curl_6.0.1 tibble_3.2.1
## [13] RSQLite_2.3.9 blob_1.2.4
## [15] pkgconfig_2.0.3 Matrix_1.7-1
## [17] lifecycle_1.0.4 GenomeInfoDbData_1.2.13
## [19] compiler_4.4.2 Rsamtools_2.23.1
## [21] Biostrings_2.75.3 codetools_0.2-20
## [23] htmltools_0.5.8.1 sys_3.4.3
## [25] buildtools_1.0.0 sass_0.4.9
## [27] RCurl_1.98-1.16 yaml_2.3.10
## [29] pillar_1.10.0 crayon_1.5.3
## [31] jquerylib_0.1.4 BiocParallel_1.41.0
## [33] DelayedArray_0.33.3 cachem_1.1.0
## [35] abind_1.4-8 tidyselect_1.2.1
## [37] digest_0.6.37 purrr_1.0.2
## [39] restfulr_0.0.15 maketools_1.3.1
## [41] grid_4.4.2 fastmap_1.2.0
## [43] SparseArray_1.7.2 cli_3.6.3
## [45] magrittr_2.0.3 S4Arrays_1.7.1
## [47] XML_3.99-0.17 utf8_1.2.4
## [49] withr_3.0.2 filelock_1.0.3
## [51] UCSC.utils_1.3.0 bit64_4.5.2
## [53] rmarkdown_2.29 XVector_0.47.0
## [55] httr_1.4.7 matrixStats_1.4.1
## [57] bit_4.5.0.1 memoise_2.0.1
## [59] evaluate_1.0.1 knitr_1.49
## [61] BiocIO_1.17.1 rlang_1.1.4
## [63] glue_1.8.0 DBI_1.2.3
## [65] BiocManager_1.30.25 jsonlite_1.8.9
## [67] R6_2.5.1 MatrixGenerics_1.19.0
## [69] GenomicAlignments_1.43.0 zlibbioc_1.52.0