--- title: "BiocFileCache Use Cases" author: Lori Shepherd output: BiocStyle::html_document: toc: true toc_depth: 2 vignette: > %\VignetteEngine{knitr::rmarkdown} %\VignetteIndexEntry{2. BiocFileCache: Use Cases} %\VignetteEncoding{UTF-8} %\VignetteDepends{rtracklayer} --- ```{r setup, echo=FALSE} knitr::opts_chunk$set(collapse=TRUE) ``` # Overview Organization of files on a local machine can be cumbersome. This is especially true for local copies of remote resources that may periodically require a new download to have the most updated information available. [BiocFileCache][] is designed to help manage local and remote resource files stored locally. It provides a convenient location to organize files and once added to the cache management, the package provides functions to determine if remote resources are out of date and require a new download. ## Installation and Loading `BiocFileCache` is a _Bioconductor_ package and can be installed through `BiocManager::install()`. ```{r, eval = FALSE} if (!"BiocManager" %in% rownames(installed.packages())) install.packages("BiocManager") BiocManager::install("BiocFileCache", dependencies=TRUE) ``` After the package is installed, it can be loaded into _R_ workspace by ```{r, results='hide', warning=FALSE, message=FALSE} library(BiocFileCache) ``` ## Creating / Loading the Cache The initial step to utilizing [BiocFileCache][] in managing files is to create a cache object specifying a location. We will create a temporary directory for use with examples in this vignette. If a path is not specified upon creation, the default location is a directory `~/.BiocFileCache` in the typical user cache directory as defined by `tools::R_user_dir("", which="cache")`. ```{r} path <- tempfile() bfc <- BiocFileCache(path, ask = FALSE) ``` # Use Cases ## Local cache of an internet resource One use for [BiocFileCache][] is to save local copies of remote resources. The benefits of this approach include reproducibility, faster access, and access (once cached) without need for an internet connection. An example is an Ensembl GTF file (also available via [AnnotationHub][]) ```{r, url} ## paste to avoid long line in vignette url <- paste( "ftp://ftp.ensembl.org/pub/release-71/gtf", "homo_sapiens/Homo_sapiens.GRCh37.71.gtf.gz", sep="/") ``` For a system-wide cache, simply load the [BiocFileCache][] package and ask for the local resource path (`rpath`) of the resource. ```{r, eval=FALSE} library(BiocFileCache) bfc <- BiocFileCache() path <- bfcrpath(bfc, url) ``` Use the path returned by `bfcrpath()` as usual, e.g., ```{r, eval=FALSE} gtf <- rtracklayer::import.gff(path) ``` A more compact use, the first or any time, is ```{r, eval=FALSE} gtf <- rtracklayer::import.gff(bfcrpath(BiocFileCache(), url)) ``` Ensembl releases do not change with time, so there is no need to check whether the cached resource needs to be updated. ## Cache of experimental computations One might use [BiocFileCache][] to cache results from experimental analysis. The `rname` field provides an opportunity to provide descriptive metadata to help manage collections of resources, without relying on cryptic file naming conventions. Here we create or use a local file cache in the directory in which we are doing our analysis. ```{r, eval=FALSE} library(BiocFileCache) bfc <- BiocFileCache("~/my-experiment/results") ``` We perform our analysis... ```{r, eval=FALSE} suppressPackageStartupMessages({ library(DESeq2) library(airway) }) data(airway) dds <- DESeqDataData(airway, design = ~ cell + dex) result <- DESeq(dds) ``` ...and then save our result in a location provided by [BiocFileCache][]. ```{r, eval=FALSE} saveRDS(result, bfcnew(bfc, "airway / DESeq standard analysis")) ``` Retrieve the result at a later date ```{r, eval=FALSE} result <- readRDS(bfcrpath(bfc, "airway / DESeq standard analysis")) ``` One might imagine the following workflow: ```{r eval=FALSE} suppressPackageStartupMessages({ library(BiocFileCache) library(rtracklayer) }) # load the cache path <- file.path(tempdir(), "tempCacheDir") bfc <- BiocFileCache(path) # the web resource of interest url <- "ftp://ftp.ensembl.org/pub/release-71/gtf/homo_sapiens/Homo_sapiens.GRCh37.71.gtf.gz" # check if url is being tracked res <- bfcquery(bfc, url, exact=TRUE) if (bfccount(res) == 0L) { # if it is not in cache, add ans <- bfcadd(bfc, rname="ensembl, homo sapien", fpath=url) } else { # if it is in cache, get path to load rid = res$rid ans <- bfcrpath(bfc, rid) # check to see if the resource needs to be updated check <- bfcneedsupdate(bfc, rid) # check can be NA if it cannot be determined, choose how to handle if (is.na(check)) check <- TRUE if (check){ ans < - bfcdownload(bfc, rid) } } # ans is the path of the file to load ans # we know because we search for the url that the file is a .gtf.gz, # if we searched on other terms we can use 'bfcpath' to see the # original fpath to know the appropriate load/read/import method bfcpath(bfc, names(ans)) temp = GTFFile(ans) info = import(temp) ``` ```{r, ensemblremote, eval=TRUE} # # A simpler test to see if something is in the cache # and if not start tracking it is using `bfcrpath` # suppressPackageStartupMessages({ library(BiocFileCache) library(rtracklayer) }) # load the cache path <- file.path(tempdir(), "tempCacheDir") bfc <- BiocFileCache(path, ask=FALSE) # the web resources of interest url <- "ftp://ftp.ensembl.org/pub/release-71/gtf/homo_sapiens/Homo_sapiens.GRCh37.71.gtf.gz" url2 <- "ftp://ftp.ensembl.org/pub/release-71/gtf/rattus_norvegicus/Rattus_norvegicus.Rnor_5.0.71.gtf.gz" # if not in cache will download and create new entry pathsToLoad <- bfcrpath(bfc, c(url, url2)) pathsToLoad # now load files as see fit info = import(GTFFile(pathsToLoad[1])) class(info) summary(info) ``` ```{r eval=FALSE} # # One could also imagine the following: # library(BiocFileCache) # load the cache bfc <- BiocFileCache() # # Do some work! # # add a location in the cache filepath <- bfcnew(bfc, "R workspace") save(list = ls(), file=filepath) # now the R workspace is being tracked in the cache ``` ## Cache to manage package data A package may desire to use BiocFileCache to manage remote data. The following is example code providing some best practice guidelines. 1. Creating the cache Assumingly, the cache could potentially be called in a variety of places within code, examples, and vignette. It is desirable to have a wrapper to the BiocFileCache constructor. The following is a suggested example for a package called `MyNewPackage`: ```{r, eval=FALSE} .get_cache <- function() { cache <- tools::R_user_dir("MyNewPackage", which="cache") BiocFileCache::BiocFileCache(cache) } ``` Essentially this will create a unique cache for the package. If run interactively, the user will have the option to permanently create the package cache, else a temporary directory will be used. 2. Resources in the cache Managing remote resources then involves a function that will query to see if the resource has been added, if it is not it will add to the cache and if it has it checks if the file needs to be updated. ```{r, eval=FALSE} download_data_file <- function( verbose = FALSE ) { fileURL <- "http://a_path_to/someremotefile.tsv.gz" bfc <- .get_cache() rid <- bfcquery(bfc, "geneFileV2", "rname")$rid if (!length(rid)) { if( verbose ) message( "Downloading GENE file" ) rid <- names(bfcadd(bfc, "geneFileV2", fileURL )) } if (!isFALSE(bfcneedsupdate(bfc, rid))) bfcdownload(bfc, rid) bfcrpath(bfc, rids = rid) } ``` ## Processing web resources before caching A case has been identified where it may be desired to do some processing of web-based resources before saving the resource in the cache. This can be done through specific options of the `bfcadd()` and `bfcdownload()` functions. 1. Add the resource with `bfcadd()` using the `download=FALSE` argument. 2. Download the resource with `bfcdownload()` using the `FUN` argument. The `FUN` argument is the name of a function to be applied before saving the downloaded file into the cache. The default is `file.rename`, simply copying the downloaded file into the cache. A user-supplied function must take ONLY two arguments. When invoked, the arguments will be: 1. `character(1)` A temporary file containing the resource as retrieved from the web. 2. `character(1)` The BiocFileCache location where the processed file should be saved. The function should return a `TRUE` on success or a `character(1)` description for failure on error. As an example: ```{r, preprocess} url <- "http://bioconductor.org/packages/stats/bioc/BiocFileCache/BiocFileCache_stats.tab" headFile <- # how to process file before caching function(from, to) { dat <- readLines(from) writeLines(head(dat), to) TRUE } rid <- bfcquery(bfc, url, "fpath")$rid if (!length(rid)) # not in cache, add but do not download rid <- names(bfcadd(bfc, url, download = FALSE)) update <- bfcneedsupdate(bfc, rid) # TRUE if newly added or stale if (!isFALSE(update)) # download & process bfcdownload(bfc, rid, ask = FALSE, FUN = headFile) rpath <- bfcrpath(bfc, rids=rid) # path to processed result readLines(rpath) # read processed result ``` Note: By default bfcadd uses the webfile name as the saved local file. If the processing step involves saving the data in a different format, utilize the bfcadd argument `ext` to assign an extension to identify the type of file that was saved. For example ``` url = "http://httpbin.org/get" bfcadd("myfile", url, download=FALSE) # would save a file `_get` in the cache bfcadd("myfile", url, download=FALSE, ext=".Rdata") # would save a file `_get.Rdata` in the cache ``` # Access Behind a Proxy BiocFileCache uses CRAN package `httr` functions `HEAD` and `GET` for accessing web resources. This can be problematic if operating behind a proxy. The easiest solution is to set the `httr::set_config` with the proxy information. ``` proxy <- httr::use_proxy("http://my_user:my_password@myproxy:8080") ## or proxy <- httr::use_proxy(Sys.getenv('http_proxy')) httr::set_config(proxy) ``` # Group Cache Access The situation may occur where a cache is desired to be shared across multiple users on a system. This presents permissions errors. To allow access to multiple users create a group that the users belong to and that the cache belongs too. Permissions of potentially two files need to be altered depending on what you would like individuals to be able to accomplish with the cache. A read-only cache will require manual manipulatios of the BiocFileCache.sqlite.LOCK so that the group permissions are `g+rw`. To allow users to download files to the shared cache, both the BiocFileCache.sqlite.LOCK file and the BiocFileCache.sqlite file will need group permissions to `g+rw`. Please google how to create a user group for your system of interest. To find the location of the cache to be able to change the group and file permissions, you may run the following in R if you used the default location: `tools::R_user_dir("BiocFileCache", which="cache")` or if you created a unique location, something like the following: `bfc = BiocFileCache(cache="someUniquelocation"); bfccache(bfc)`. For quick reference in linux you will use `chown currentuser:newgroup` to change the group and `chmod` to change the file permissions: `chmod 660` or `chmod g+rw` should accomplish the correct permissions. # Summary It is our hope that this package allows for easier management of local and remote resources. # SessionInfo ```{r, sessioninfo} sessionInfo() ``` [BiocFileCache]: https://bioconductor.org/packages/BiocFileCache [dplyr]: https://cran.r-project.org/package=dplyr