---
title: "BiocFileCache: Managing File Resources Across Sessions"
author: Lori Shepherd
output:
  BiocStyle::html_document:
    toc: true
    toc_depth: 2
vignette: >
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteIndexEntry{1. BiocFileCache Overview: Managing File Resources Across Sessions}
  %\VignetteEncoding{UTF-8}
  %\VignetteDepends{rtracklayer}
---

```{r setup, echo=FALSE}
knitr::opts_chunk$set(collapse=TRUE)
```

# Overview

Organization of files on a local machine can be cumbersome. This is especially
true for local copies of remote resources that may periodically require a new
download to have the most updated information available. [BiocFileCache][] is
designed to help manage both local files and local copies of remote resources.
It provides a convenient location to organize files and, once a file is added
to the cache, the package provides functions to determine if remote resources
are out of date and require a new download.

## Installation and Loading

`BiocFileCache` is a _Bioconductor_ package and can be installed through
`BiocManager::install()`.

```{r, eval = FALSE}
if (!"BiocManager" %in% rownames(installed.packages()))
    install.packages("BiocManager")
BiocManager::install("BiocFileCache", dependencies=TRUE)
```

After the package is installed, it can be loaded into the _R_ workspace with

```{r, library, results='hide', warning=FALSE, message=FALSE}
library(BiocFileCache)
```

## Creating / Loading the Cache

The initial step to utilizing [BiocFileCache][] in managing files is to create a
cache object specifying a location. We will create a temporary directory for use
with examples in this vignette. If a path is not specified upon creation, the
default location is the `BiocFileCache` directory within the typical user cache
directory, as given by `tools::R_user_dir("BiocFileCache", which="cache")`.
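As a minimal sketch of where that default would be on a given system, the
following uses only base _R_. It assumes R 4.0.0 or later (where
`tools::R_user_dir()` is available) and that the package name passed is
`"BiocFileCache"`, as in the Group Cache Access section. The variable name
`default_cache` is purely illustrative. The chunk is not evaluated because the
result differs between machines.

```{r eval=FALSE}
# Directory that would be used when no path is given (assumption: the
# package name supplied to R_user_dir() is "BiocFileCache")
default_cache <- tools::R_user_dir("BiocFileCache", which = "cache")
default_cache

# Has a default cache already been created on this machine?
dir.exists(default_cache)
```

The examples in this vignette do not touch that location; they use a
temporary directory instead.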
```{r, create}
path <- tempfile()
bfc <- BiocFileCache(path, ask = FALSE)
```

If the path location exists and has been utilized to store files previously, the
previous object will be loaded with any files saved to the cache. If the path
location does not exist, the user will be prompted to create the new directory.
If the session is not interactive (so the user cannot be prompted) or the user
declines to create the directory, a temporary directory will be used.

Some utility functions to examine the cache are:

* `bfccache(bfc)`
* `length(bfc)`
* `show(bfc)`
* `bfcinfo(bfc)`

`bfccache()` will show the cache path.
**NOTE**: Because we are using temporary directories, your path location will
be different from the one shown.

```{r, cacheloc}
bfccache(bfc)
length(bfc)
```

`length()` on a `BiocFileCache` will show the number of files currently being
tracked by the `BiocFileCache`. For more detailed information on what is stored
in the `BiocFileCache` object, there is a show method which will display the
object, object class, cache path, and number of items currently being tracked.

```{r, bfcshow}
bfc
```

`bfcinfo()` will list a table of `BiocFileCache` resource files being tracked in
the cache. It returns a [dplyr][] object of class `tbl_sqlite`.

```{r, bfcinfo}
bfcinfo(bfc)
```

The table of resource files includes the following information:

* `rid`: resource id. Autogenerated. This is a unique identifier automatically
  generated when a resource is added to the cache.
* `rname`: resource name. This is given by the user when a resource is added to
  the cache. It does not have to be unique and can be updated at any time. We
  recommend descriptive key words and identifiers.
* `create_time`: The date and time a resource is added to the cache.
* `access_time`: The date and time a resource is utilized within the cache. The
  access time is updated when the resource is updated or downloaded.
* `rpath`: resource path. This is the path to the local file.
* `rtype`: resource type.
  Either "local" or "web", indicating if the resource has a remote origin.
* `fpath`: If rtype is "web", this is the link to the remote resource. It will
  be utilized to download the remote data.
* `last_modified_time`: For a remote resource, the last_modified (if available)
  information for the local copy of the data. This information is checked
  against the remote resource to determine if the local copy is stale and needs
  to be updated. If it is not available or your resource is not a remote
  resource, the last modified time will be marked as NA.
* `etag`: For a remote resource, the etag (if available) information for the
  local copy of the data. This information is checked against the remote
  resource to determine if the local copy is stale and needs to be updated. If
  it is not available or your resource is not a remote resource, the etag will
  be marked as NA.
* `expires`: For a remote resource, the expires (if available) information for
  the local copy of the data. This information is checked against `Sys.time()`
  to determine if the local copy needs to be updated. If it is not available or
  your resource is not a remote resource, the expires time will be marked as NA.

Now that we have created the cache object and location, let's explore adding
files that the cache will manage!

## Adding / Tracking Resources

With a `BiocFileCache` object and cache location in place, files can be added to
the cache for tracking. There are two functions to add a resource to the cache:

* `bfcnew()`
* `bfcadd()`

The difference between the two: `bfcnew()` creates an entry for a resource and
returns a filepath to save to. As there are many types of data that can be saved
in many different ways, `bfcnew()` allows you to save any _R_ data object in the
appropriate manner and still be able to track the saved file. `bfcadd()` should
be utilized when a file already exists or a remote resource is being accessed.

`bfcnew()` takes the `BiocFileCache` object and a user-specified `rname` and
returns a path location to save data to. Optionally, you can add the file
extension if you know the type of file that will be saved:

```{r, bfcnew}
savepath <- bfcnew(bfc, "NewResource", ext=".RData")
savepath

## now we can use that path in any save function
m <- matrix(1:12, nrow=3)
save(m, file=savepath)

## and that file will be tracked in the cache
bfcinfo(bfc)
```

`bfcadd()` is for existing files or remote resources. The user will still
specify an `rname` of their choosing but must also specify a path to a local
file or web resource as `fpath`. If no `fpath` is given, the default is to
assume the `rname` is also the path location. If the `fpath` is a local file,
the `action` argument gives the user a few options: `copy` the existing file
into the cache directory, `move` the existing file into the cache directory, or
leave the file wherever it is on the local system and still track it through the
cache object (`asis`). `copy` and `move` will rename the file to the generated
cache file path. If the `fpath` is a remote source, a download will be
attempted; if it is successful, the file is saved in the cache location and
tracked in the cache object. The original source will be added to the cache
information as `fpath`. If the user does not want the remote resource to be
downloaded initially, the argument `download=FALSE` may be used to delay the
download but still add the resource to the cache. Relative path locations may
also be used, specified with `rtype = "relative"`. This will store a relative
location for the file within the cache; only the actions `copy` and `move` are
available for relative paths.
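
The two less common options described above, `rtype = "relative"` and
`download=FALSE`, can be sketched as follows. This is an illustrative sketch
rather than part of the evaluated vignette: it uses a separate throwaway cache
(the name `demo_bfc` and the resource names are hypothetical) and is not
evaluated, so the resource ids of `bfc` used in later sections are unaffected.

```{r eval=FALSE}
library(BiocFileCache)

# Separate temporary cache, so the vignette's `bfc` is left untouched
demo_bfc <- BiocFileCache(tempfile(), ask = FALSE)

# Local file copied into the cache and stored with a relative path
fl <- tempfile(); file.create(fl)
rel <- bfcadd(demo_bfc, "Demo_relative", fl, rtype = "relative", action = "copy")

# Remote resource registered now; the download is delayed
later <- bfcadd(demo_bfc, "Demo_deferred",
                fpath = "http://httpbin.org/get", download = FALSE)

# Both entries are tracked, although only one has a local file so far
length(demo_bfc)
bfcinfo(demo_bfc)
```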

First let's use local files:

```{r, bfcadd}
fl1 <- tempfile(); file.create(fl1)
add2 <- bfcadd(bfc, "Test_addCopy", fl1)                 # copy
# returns filepath being tracked in cache
add2
# the name is the unique rid in the cache
rid2 <- names(add2)

fl2 <- tempfile(); file.create(fl2)
add3 <- bfcadd(bfc, "Test2_addMove", fl2, action="move") # move
rid3 <- names(add3)

fl3 <- tempfile(); file.create(fl3)
add4 <- bfcadd(bfc, "Test3_addAsis", fl3, rtype="local",
               action="asis")                            # reference
rid4 <- names(add4)

file.exists(fl1)    # TRUE - copied from original location
file.exists(fl2)    # FALSE - moved from original location
file.exists(fl3)    # TRUE - left asis, original location tracked
```

Now let's add some examples with remote sources:

```{r, bfcaddremote}
url <- "http://httpbin.org/get"
add5 <- bfcadd(bfc, "TestWeb", fpath=url)
rid5 <- names(add5)

url2 <- "https://bioconductor.org/packages/stats/bioc/BiocFileCache/BiocFileCache_2024_stats.tab"
add6 <- bfcadd(bfc, "TestWeb", fpath=url2)
rid6 <- names(add6)

# add a remote resource but don't initially download
add7 <- bfcadd(bfc, "TestNoDweb", fpath=url2, download=FALSE)
rid7 <- names(add7)

# let's look at our BiocFileCache object now
bfc
bfcinfo(bfc)
```

Now that we are tracking resources, let's explore accessing their information!

### Caveat

By default, files have a unique identifier added to the start of the original
file name (identifier_originalName) when added to the cache, to allow for
multiple versions of the same file name. This default behavior can be overridden
with the `fname` argument of `bfcadd()` or `bfcnew()`. `fname` takes one of two
options: `unique` or `exact`. The `unique` option is the default and adds a
unique identifier to the original file name. The `exact` option does not add a
unique identifier, so the file is stored under exactly the original file name.

## Investigating / Accessing Resources

Before exploring individual resources, we introduce a helper function.
Most of the functions provided require the unique rid[s] assigned to a
resource. `bfcadd()` and `bfcnew()` return the path as a named character vector;
the name of the character vector is the rid. However, you may want to access a
resource that you added some time ago.

* `bfcquery()`

`bfcquery()` will take in a key word and search across the `rname`, `rpath`, and
`fpath` for any matching entries. The columns that are searched can be
controlled with the argument `field`.

```{r, bfcquery}
bfcquery(bfc, "Web")

bfcquery(bfc, "copy")

q1 <- bfcquery(bfc, "BiocFileCache")
q1
class(q1)
```

As you can see above, `bfcquery()` returns an object of class `tbl_sql`, which
can be investigated further utilizing methods for this class, such as those in
the package `dplyr`. The `rid` is in the first column of the table and can be
used in other functions. To get a quick count of how many objects in the cache
matched the query, use `bfccount()`.

```{r, bfccount}
bfccount(q1)
```

* `[`

`[` allows for subsetting of the BiocFileCache object. The output will be a
BiocFileSubCache object. Users will still be able to query, remove (from the
subset object only), and access resources of the subset; however, the resources
cannot be updated.

```{r, bfcsubset}
bfcsubWeb <- bfc[paste0("BFC", 5:6)]
bfcsubWeb
bfcinfo(bfcsubWeb)
```

There are three methods for retrieving the `BiocFileCache` resource path
location.

* `[[`
* `bfcpath()`
* `bfcrpath()`

`[[` will access the `rpath` saved in the `BiocFileCache`. Retrieving this
location returns the path to the local version of the resource, allowing the
user to then use this path in whichever load/read method is most appropriate for
the resource. `bfcpath()` and `bfcrpath()` both return a named character vector
giving the local file that can be used for retrieval. `bfcpath()` requires
`rids` while `bfcrpath()` can use `rids` or `rnames` (but not both).
`bfcrpath()` can also be used to add a resource into the cache when `rnames`
are specified; if an element in `rnames` is not found, it will try to add it to
the cache with `bfcadd()`.

```{r, bfcbracket}
bfc[["BFC2"]]
bfcpath(bfc, "BFC2")
bfcpath(bfc, "BFC5")
bfcrpath(bfc, rids="BFC5")
bfcrpath(bfc)
bfcrpath(bfc, c("http://httpbin.org/get","Test3_addAsis"))
```

Managing remote resources locally involves knowing when to update the local copy
of the data.

* `bfcneedsupdate()`

`bfcneedsupdate()` compares the etag and last_modified time of the local copy of
the data with the etag and last_modified time of the remote resource, and also
checks an expires time. The cache saves this information when the web resource
is initially added. The expires time is checked against the current `Sys.time()`
to see if the local resource has expired. If so, the resource is deemed to need
an update; if the expires time is unavailable or has not yet passed, the etag
and last_modified time are checked. The etag information is used definitively if
it is available; if it is not available, the last_modified time is checked. If
the resource does not have a last_modified tag either, the result is
undetermined (`NA`). If the resource has not been downloaded yet, the result is
`TRUE`.

**Note:** This function does not automatically download the remote source if it
is out of date. Please see `bfcdownload()`.

```{r, bfcneedsupdate}
bfcneedsupdate(bfc, "BFC5")
bfcneedsupdate(bfc, "BFC6")
bfcneedsupdate(bfc)
```

## Updating Resource Entries or Local Copy of Remote Data

Just as you could access the `rpath`, the local resource path can be set with

* `[[<-`

The new file must exist in order to replace the current one in the
`BiocFileCache`. If the user wishes to rename a file, they must first make a
copy of the file (or touch the new path).

```{r, bfcrename}
fileBeingReplaced <- bfc[[rid3]]
fileBeingReplaced

# fl3 was created when we were adding resources
fl3

bfc[[rid3]] <- fl3
bfc[[rid3]]
```

The user may also wish to change the `rname` or `fpath` associated with a
resource in addition to the `rpath`.
This can be done with

* `bfcupdate()`

Again, if changing the `rpath`, the file must exist. If an `fpath` is being
updated, the data will be downloaded and the user will be prompted to overwrite
the current file specified in `rpath`. If the user does not want to be prompted
about overwriting files, `ask=FALSE` may be used.

```{r, bfcupdate}
bfcinfo(bfc, "BFC1")
bfcupdate(bfc, "BFC1", rname="FirstEntry")
bfcinfo(bfc, "BFC1")
```

Now let's update a web resource:

```{r, bfcupdateremote}
suppressPackageStartupMessages({
    library(dplyr)
})
bfcinfo(bfc, "BFC6") %>% select(rid, rpath, fpath)
bfcupdate(bfc, "BFC6", fpath=url, rname="Duplicate", ask=FALSE)
bfcinfo(bfc, "BFC6") %>% select(rid, rpath, fpath)
```

Lastly, remote resources may require an update if the data is out of date (see
`bfcneedsupdate()`). The `bfcdownload()` function will attempt to download from
the original resource saved in the cache as `fpath` and overwrite the
out-of-date file at `rpath`.

* `bfcdownload()`

The following checks whether the resource needs updating, and then performs the
update:

```{r, bfcdownload}
rid <- "BFC5"
# isFALSE() ignores the rid name on the result, so only a definite FALSE
# skips the download; 'TRUE' or 'NA' triggers it
test <- !isFALSE(bfcneedsupdate(bfc, rid))
if (test)
    bfcdownload(bfc, rid, ask=FALSE)
```

## Adding MetaData

The following functions are provided for metadata:

* `bfcmeta()<-`
* `bfcmeta()`
* `bfcmetalist()`
* `bfcmetaremove()`

Additional metadata can be added as `data.frame`s that become tables in the SQL
database. The `data.frame` must contain a column `rid` that matches the `rid`
column in the cache. Any metadata added will then be displayed when accessing
the cache. Metadata is added with `bfcmeta()<-`. A table `name` must be provided
as an argument. Users can add multiple metadata tables as long as the names are
unique. Tables may be appended to or overwritten using the additional arguments
`append=TRUE` or `overwrite=TRUE`.
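
The `append=TRUE` and `overwrite=TRUE` arguments mentioned above can be
sketched as follows. This is an unevaluated illustration on a separate
throwaway cache; the names `demo_bfc`, `demoMeta`, and the `note` column are
hypothetical, and the sketch assumes the arguments are passed in the
`bfcmeta()<-` call alongside `name`, as the text describes.

```{r eval=FALSE}
library(BiocFileCache)

# Separate temporary cache with two placeholder entries
demo_bfc <- BiocFileCache(tempfile(), ask = FALSE)
p1 <- bfcnew(demo_bfc, "first")
p2 <- bfcnew(demo_bfc, "second")

# Initial metadata table; the rid column must match rids in the cache
meta1 <- data.frame(rid = names(p1), note = "initial",
                    stringsAsFactors = FALSE)
bfcmeta(demo_bfc, name = "demoMeta") <- meta1

# Add rows to the existing table
meta2 <- data.frame(rid = names(p2), note = "appended",
                    stringsAsFactors = FALSE)
bfcmeta(demo_bfc, name = "demoMeta", append = TRUE) <- meta2

# Replace the whole table with the original single row
bfcmeta(demo_bfc, name = "demoMeta", overwrite = TRUE) <- meta1
bfcmeta(demo_bfc, name = "demoMeta")
```

Without one of these arguments, reusing an existing table name is expected to
fail, which is why the table names must otherwise be unique.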

```{r, bfcmetadata}
names(bfcinfo(bfc))
meta <- as.data.frame(list(rid=bfcrid(bfc)[1:3], idx=1:3))
bfcmeta(bfc, name="resourceData") <- meta
names(bfcinfo(bfc))
```

The metadata tables that exist can be listed with `bfcmetalist()` and can be
retrieved with `bfcmeta()`.

```{r, bfcmetalist}
bfcmetalist(bfc)
bfcmeta(bfc, name="resourceData")
```

Lastly, metadata can be removed with `bfcmetaremove()`.

```{r, bfcmetaremove}
bfcmetaremove(bfc, name="resourceData")
```

**Note:** Most functions have a shorthand form: if no BiocFileCache object is
specified, they operate on the default cache `BiocFileCache()`. This shorthand
is not available for `bfcmeta()<-`. A BiocFileCache object must always be
assigned to a variable first and that variable passed into the function.

Example of ERROR:

```{r eval=FALSE}
bfcmeta(name="resourceData") <- meta
Error in bfcmeta(name = "resourceData") <- meta :
  target of assignment expands to non-language object
```

Correct implementation:

```{r eval=FALSE}
bfc <- BiocFileCache()
bfcmeta(bfc, name="resourceData") <- meta
```

All other functions have a default: if the BiocFileCache object is missing, they
operate on the default cache `BiocFileCache()`.

## Removing Resources

Now that we have added resources, it is also possible to remove a resource.

* `bfcremove()`

When you remove a resource from the cache, the local file is also deleted, but
only if it is stored in the cache directory as given by `bfccache(bfc)`. If it
is a path to a file somewhere else on the user's system, the entry is removed
from the `BiocFileCache` object but the file is not deleted.

```{r, bfcremove}
# let's remind ourselves of our object
bfc

bfcremove(bfc, "BFC6")
bfcremove(bfc, "BFC1")

# let's look at our BiocFileCache object now
bfc
```

There is another helper function that may be of use:

* `bfcsync()`

This function will check two things: 1.
If any `rpath` cannot be found (this would occur if `bfcnew()` is used and the path was not used to save an object) 2. If there are files in the cache directory (`bfccache(bfc)`) that are not being tracked by the `BiocFileCache` object

```{r, bfcsync}
# create a new entry that hasn't been used
path <- bfcnew(bfc, "UseMe")
rmMe <- names(path)

# We also have a file not being tracked because we updated rpath
bfcsync(bfc)

# you can suppress the messages and just have a TRUE/FALSE
bfcsync(bfc, FALSE)

#
# Let's do some cleaning to have a synced object
#
bfcremove(bfc, rmMe)
unlink(fileBeingReplaced)

bfcsync(bfc)
```

## Exporting and Importing Cache

There is a helper function to export a BiocFileCache and associated files as a
tar or zip archive, as well as the corresponding import function.

* `exportbfc()`
* `importbfc()`

The `exportbfc()` function takes a BiocFileCache object or subsetted object and
creates a tar or zip archive that can then be shared with collaborators on
different computer systems. The user can choose where the archive is created
with `outputFile`; by default it is created in the current working directory
with the name `BiocFileCacheExport.tar`. By default a tar archive is created,
but the user can create a zip archive instead using the argument
`outputMethod="zip"`. Any additional arguments to `utils::zip` or `utils::tar`
may also be supplied. The following are some example calls:

```{r eval=FALSE}
# export entire biocfilecache
exportbfc(bfc)

# export the first 4 entries of biocfilecache
# as a compressed tar
exportbfc(bfc, rids=paste0("BFC", 1:4),
          outputFile="BiocFileCacheExport.tar.gz", compression="gzip")

# export the subsetted object of web resources as zip
sub1 <- bfc[bfcrid(bfcquery(bfc, "web", field='rtype'))]
exportbfc(sub1, outputFile = "BiocFileCacheExportWeb.zip",
          outputMethod="zip")
```

Once the archive is inflated on a user's system, it provides a fully functional
copy of the sent cache.
The archive can be extracted manually and the path used in the constructor
`BiocFileCache()`, or for convenience the function `importbfc()` may be
utilized. The `importbfc()` function takes a path to the appropriate tar or zip
file, the argument `archiveMethod` indicating if `untar` or `unzip` should be
used (the default is untar), a path to where the archive should be extracted as
`exdir`, and any additional arguments to the `utils::untar` and `utils::unzip`
methods. The function will extract the files and load the associated
BiocFileCache object into the R session. The following are example calls to load
the exported objects from the examples above:

```{r eval=FALSE}
bfc <- importbfc("BiocFileCacheExport.tar")

bfc2 <- importbfc("BiocFileCacheExport.tar.gz", compression="gzip")

bfc3 <- importbfc("BiocFileCacheExportWeb.zip", archiveMethod="unzip")
```

## Creating a Cache from Existing Data

The following helper function converts existing data to a BiocFileCache:

* `makeBiocFileCacheFromDataFrame`

This function may take a while to run if there are a lot of resources; however,
if the BiocFileCache is stored in a permanent location it will only need to be
run once.

### Create a BiocFileCache from an Existing data.frame

`makeBiocFileCacheFromDataFrame` takes an existing data.frame and creates a
BiocFileCache object. The cache location can be specified by the `cache`
argument. The `cache` must not already exist and the user will be prompted to
create the location. If the user answers 'N', the cache will be created in a
temporary directory and this function will have to be run again in a new R
session. The original data.frame must contain the required BiocFileCache columns
`rtype`, `rpath`, and `fpath` as described in section 1.2 "Creating / Loading
the Cache". The optional columns `rname`, `last_modified_time`, `etag` and
`expires` may also be specified in the original data.frame; they are not
required and will be populated with defaults if missing.
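
Before converting a data.frame, it can help to confirm that the required
columns are present. The following is a minimal base _R_ sketch of such a
check, not a function provided by the package; the data.frame `candidate`, its
URLs, and the variable names are hypothetical.

```{r eval=FALSE}
# Column names taken from the description above
required <- c("rtype", "rpath", "fpath")
optional <- c("rname", "last_modified_time", "etag", "expires")

# Hypothetical input with one extra column
candidate <- data.frame(
    rtype = c("web", "web"),
    rpath = c(NA_character_, NA_character_),
    fpath = c("https://example.org/a.txt", "https://example.org/b.txt"),
    keywords = c("first", "second"),
    stringsAsFactors = FALSE
)

# Required columns that are absent (empty when the input is usable)
missing_required <- setdiff(required, names(candidate))
missing_required

# Columns that are neither required nor optional
extra_cols <- setdiff(names(candidate), c(required, optional))
extra_cols
```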

For resources with `rtype="local"`, `actionLocal` controls whether the local
copy of the file is copied or moved to the cache location, or left asis on the
local system. A local copy of the file must exist if the resource is identified
as `rtype="local"`. For resources with `rtype="web"`, `actionWeb` controls
whether the local copy of the remote file is copied or moved to the cache
location. It is a requirement of BiocFileCache that all remote resources
download their local copy to the cache location. A local copy of the file does
not have to exist and can be downloaded into the cache at a later time.

Any additional columns of the original data.frame, besides the required or
optional BiocFileCache columns, are separated and added to the BiocFileCache as
a metadata table with the name given as `metadataName`. See section 1.6 on
"Adding MetaData".

The following is an example data.frame with the minimal columns 'rtype',
'rpath', and 'fpath' and one additional column, 'keywords', that will become
metadata. The 'rpath' can be `NA` as these are remote resources (`rtype='web'`)
that have not been downloaded yet.

```{r, mock}
tbl <- data.frame(rtype=c("web","web"),
                  rpath=c(NA_character_,NA_character_),
                  fpath=c("http://httpbin.org/get",
                          "https://en.wikipedia.org/wiki/Bioconductor"),
                  keywords = c("httpbin", "wiki"), stringsAsFactors=FALSE)
tbl
```

```{r eval=FALSE}
newbfc <- makeBiocFileCacheFromDataFrame(tbl,
                                         cache=file.path(tempdir(),"BFC"),
                                         actionWeb="copy",
                                         actionLocal="copy",
                                         metadataName="resourceMetadata")
```

## Cleaning or Removing Cache

Finally, there are two functions involved with cleaning or deleting the cache:

* `cleanbfc()`
* `removebfc()`

`cleanbfc()` will evaluate the resources in the `BiocFileCache` object and
determine which, if any, have not been created, redownloaded, or updated in a
specified number of days.
If `ask=TRUE`, the user is asked, for each entry above that threshold, whether
it should be removed from the cache object and the file deleted (the file is
only deleted if it is in the `bfccache(bfc)` location). If `ask=FALSE`, it does
not ask about each file and automatically removes the entry and deletes the
file. The default number of days is 120. If a resource has not needed any
updates, this function could give a false positive. It also does not take into
account how many times the resource was loaded by retrieving the path (i.e. via
`[[`, `bfcpath()`, `bfcrpath()`), so it may not be an accurate indication of how
often the resource is utilized. Please use this function with caution.

```{r eval=FALSE}
cleanbfc(bfc)
```

`removebfc()` will remove the `BiocFileCache` completely from the system. Any
files saved in the `bfccache(bfc)` directory will also be deleted.

```{r eval=FALSE}
removebfc(bfc)
```

**Note** Use with caution!

# Access Behind a Proxy

BiocFileCache uses the CRAN package `httr` functions `HEAD` and `GET` for
accessing web resources. This can be problematic if operating behind a proxy.
The easiest solution is to call `httr::set_config()` with the proxy information.

```{r eval=FALSE}
proxy <- httr::use_proxy("http://my_user:my_password@myproxy:8080")
## or
proxy <- httr::use_proxy(Sys.getenv('http_proxy'))
httr::set_config(proxy)
```

# Group Cache Access

A cache may need to be shared across multiple users on a system. By default this
leads to permissions errors. To allow access for multiple users, create a group
that the users belong to and that the cache belongs to. Permissions of
potentially two files need to be altered depending on what you would like
individuals to be able to accomplish with the cache. A read-only cache will
require manual manipulation of the BiocFileCache.sqlite.LOCK file so that the
group permissions are `g+rw`. To allow users to download files to the shared
cache, both the BiocFileCache.sqlite.LOCK file and the BiocFileCache.sqlite file
will need group permissions of `g+rw`.
Please consult the documentation for your system on how to create a user
group. To find the location of the cache in order to change the group and file
permissions, you may run the following in R if you used the default location:
`tools::R_user_dir("BiocFileCache", which="cache")` or, if you created a unique
location, something like the following:
`bfc = BiocFileCache(cache="someUniquelocation"); bfccache(bfc)`. For quick
reference, on Linux you will use `chown currentuser:newgroup` to change the
group and `chmod` to change the file permissions: `chmod 660` or `chmod g+rw`
should set the correct permissions.

# Summary

It is our hope that this package allows for easier management of local and
remote resources.

# SessionInfo

```{r, sessioninfo}
sessionInfo()
```

[BiocFileCache]: https://bioconductor.org/packages/BiocFileCache
[dplyr]: https://cran.r-project.org/package=dplyr