---
title: Saving objects to artifacts and back again
author:
- name: Aaron Lun
  email: infinite.monkeys.with.keyboards@gmail.com
package: alabaster.base
date: "Revised: February 1, 2024"
output:
  BiocStyle::html_document
vignette: >
  %\VignetteIndexEntry{Saving and loading artifacts}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, echo=FALSE}
library(BiocStyle)
self <- Biocpkg("alabaster.base");
knitr::opts_chunk$set(error=FALSE, warning=FALSE, message=FALSE)
```

# Introduction

The `r self` package (and its family) implements methods to save common Bioconductor objects to file artifacts and load them back into R.
This aims to provide a functional equivalent to RDS-based serialization that is:

- More stable to changes in the class definition.
  Such changes would typically require costly `updateObject()` operations at best, or invalidate RDS files at worst.
- More interoperable with other analysis frameworks.
  All artifacts are saved in standard formats (e.g., JSON, HDF5) and can be easily parsed by applications in other languages by following the [relevant specifications](https://github.com/ArtifactDB/takane).
- More modular, with each object split into multiple artifacts.
  This enables parts of the object to be loaded into memory according to each application's needs.
  Parts can also be updated cheaply on disk without rewriting all files.

# Quick start

To demonstrate, let's mock up a `DataFrame` object from the `r Biocpkg("S4Vectors")` package.

```{r}
library(S4Vectors)
df <- DataFrame(X=1:10, Y=letters[1:10])
df
```

We'll save this `DataFrame` to a directory:

```{r}
tmp <- tempfile()
library(alabaster.base)
saveObject(df, tmp)
```

And read it back in:

```{r}
readObject(tmp)
```

# Class-specific methods

Each class implements a saving and reading method for use by the _alabaster_ framework.
The saving method (for the `saveObject()` generic) will save the object to one or more files inside a user-specified directory:

```{r}
tmp <- tempfile()
saveObject(df, tmp)
list.files(tmp, recursive=TRUE)
```

Conversely, the reading function will - as the name suggests - load the object back into memory, given the path to its directory.
The correct loading function for each class is automatically called by the `readObject()` function:

```{r}
readObject(tmp)
```

`r self` also provides a `validateObject()` function, which checks that each object's on-disk representation follows its associated specification.
For all **alabaster**-supported objects, this validates the file contents against the [**takane** specifications](https://github.com/ArtifactDB/takane) - successful validation provides guarantees for readers like `readObject()` and [**dolomite-base**](https://github.com/ArtifactDB/dolomite-base) (for Python).
In fact, `saveObject()` will automatically run `validateObject()` on the directory to ensure compliance.

```{r}
validateObject(tmp)
```

`r self` itself supports a small set of classes from the `r Biocpkg("S4Vectors")` package;
support for additional classes can be found in other packages like `r Biocpkg("alabaster.ranges")` and `r Biocpkg("alabaster.se")`.
Third-party developers can also add support for their own classes by defining new methods; see the [_Extensions_](extensions.html) vignette for details.

# Operating on directories

Users can freely rename or relocate directories, and `readObject()` will still work.
For example, we can copy the entire directory to a new file system and everything will still be correctly referenced within the directory.

The simplest way to share objects is to just `zip` or `tar` the saved directory for _ad hoc_ distribution.
For more serious applications, `r self` can be used in conjunction with storage systems like AWS S3 for large-scale distribution.

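As a rough sketch of the _ad hoc_ route, the chunk below bundles a saved directory into a gzipped tarball with base R's `tar()` and restores it in a different location with `untar()`.
The helper `bundleDirectory()` is a hypothetical name made up for this illustration and is not part of `r self`; any archiving tool that preserves the directory's internal layout should work equally well.

```{r}
library(S4Vectors)
library(alabaster.base)

# Hypothetical helper (not part of alabaster.base): archive a saved
# directory using paths relative to its parent, so that the tarball
# unpacks cleanly anywhere.
bundleDirectory <- function(dir, archive) {
    old <- setwd(dirname(dir))
    on.exit(setwd(old))
    tar(archive, files=basename(dir), compression="gzip")
    archive
}

# Save an object and bundle its directory.
src <- tempfile()
saveObject(DataFrame(X=1:10, Y=letters[1:10]), src)
archive <- bundleDirectory(src, tempfile(fileext=".tar.gz"))

# Unpack the archive somewhere else and read the object back.
dest <- tempfile()
dir.create(dest)
untar(archive, exdir=dest)
roundtrip <- readObject(file.path(dest, basename(src)))
roundtrip
```

Archiving relative to the parent directory keeps absolute temporary paths out of the tarball, which is what allows the unpacked copy to be read from any location.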
```{r}
tmp <- tempfile()
saveObject(df, tmp)

tmp2 <- tempfile()
file.rename(tmp, tmp2)
readObject(tmp2)
```

That said, it is unwise to manipulate the files inside the directory created by `saveObject()`.
Reading functions will usually depend on specific file names or subdirectory structures within the directory, and fiddling with them may cause unexpected results.
Advanced users can exploit this structure by loading individual components from subdirectories when the full object is not required:

```{r}
# Creating a nested DF to be a little spicy:
df2 <- DataFrame(Z=factor(1:5), AA=I(DataFrame(B=runif(5), C=rnorm(5))))
tmp <- tempfile()
meta2 <- saveObject(df2, tmp)

# Now reading in the nested DF:
list.files(tmp, recursive=TRUE)
readObject(file.path(tmp, "other_columns/1"))
```

# Extending to new classes

The _alabaster_ framework is easily extended to new classes by:

1. Writing a method for `saveObject()`.
   This should accept an instance of the object and a path to a directory, and save the contents of the object inside the directory.
   It should also produce an `OBJECT` file that specifies the type of the object, e.g., `data_frame`, `hdf5_sparse_matrix`.
2. Writing a function for `readObject()` and registering it with `registerReadObjectFunction()` (or, for core Bioconductor classes, by requesting a change to the default registry in `r self`).
   This should accept a path to a directory and read its contents to reconstruct the object.
   The registered type should be the same as that used in the `OBJECT` file.
3. Writing a function for `validateObject()` and registering it with `registerValidateObjectFunction()`.
   This should accept a path to a directory and read its contents to determine if it is a valid on-disk representation.
   The registered type should be the same as that used in the `OBJECT` file.
   - (optional) Developers can alternatively formalize the on-disk representation by adding a specification to the [**takane**](https://github.com/ArtifactDB/takane) repository.
     This aims to provide C++-based validators for each representation, allowing us to enforce consistency across multiple languages (e.g., Python).
     Any **takane** validator is automatically used by `validateObject()`, so no registration is required.

To illustrate, let's extend _alabaster_ to the `dgTMatrix` from the `r CRANpkg("Matrix")` package.
First, the saving method:

```{r}
library(Matrix)
setMethod("saveObject", "dgTMatrix", function(x, path, ...) {
    # Create a directory to stash our contents.
    dir.create(path)

    # Saving a DataFrame with the triplet data (0-based indices).
    df <- DataFrame(i = x@i, j = x@j, x = x@x)
    write.csv(df, file.path(path, "matrix.csv"), row.names=FALSE)

    # Adding some more information.
    write(dim(x), file=file.path(path, "dimensions.txt"), ncolumns=1)

    # Creating an object file.
    saveObjectFile(path, "triplet_sparse_matrix")
})
```

And now the reading and validation methods.
The registration is usually done in the extension package's `.onLoad()` function.

```{r}
readSparseTripletMatrix <- function(path, metadata, ...)
{
    df <- read.table(file.path(path, "matrix.csv"), header=TRUE, sep=",")
    dims <- readLines(file.path(path, "dimensions.txt"))

    # Indices are stored as 0-based, so we convert them back to 1-based.
    sparseMatrix(
        i=df$i + 1L,
        j=df$j + 1L,
        x=df$x,
        dims=as.integer(dims),
        repr="T"
    )
}
registerReadObjectFunction("triplet_sparse_matrix", readSparseTripletMatrix)

validateSparseTripletMatrix <- function(path, metadata) {
    df <- read.table(file.path(path, "matrix.csv"), header=TRUE, sep=",")
    dims <- as.integer(readLines(file.path(path, "dimensions.txt")))

    # Checking that the 0-based indices lie within the matrix dimensions.
    stopifnot(is.integer(df$i), all(df$i >= 0 & df$i < dims[1]))
    stopifnot(is.integer(df$j), all(df$j >= 0 & df$j < dims[2]))
    stopifnot(is.numeric(df$x))
}
registerValidateObjectFunction("triplet_sparse_matrix", validateSparseTripletMatrix)
```

Let's run them and see how it works:

```{r}
x <- sparseMatrix(
    i=c(1,2,3,5,6),
    j=c(3,6,1,3,8),
    x=runif(5),
    dims=c(10, 10),
    repr="T"
)
x

tmp <- tempfile()
saveObject(x, tmp)
list.files(tmp, recursive=TRUE)
readObject(tmp)
```

For more complex objects that are composed of multiple smaller "child" objects, developers should consider saving each of their children in subdirectories of `path`.
This can be achieved by calling `altSaveObject()` and `altReadObject()` in the saving and reading functions, respectively.
(We use the `alt*` versions of these functions to respect application overrides; see below.)

# Creating applications

Developers can also create applications that customize the machinery of the _alabaster_ framework for specific needs.
In most cases, this involves storing more metadata to describe the object in more detail.
For example, we might want to remember the identity of the author for each object.
This is achieved by creating an application-specific saving generic with the same signature as `saveObject()`:

```{r}
setGeneric("appSaveObject", function(x, path, ...) {
    ans <- standardGeneric("appSaveObject")

    # File names with leading underscores are reserved for application-specific
    # use, so they won't clash with anything produced by saveObject.
    metapath <- file.path(path, "_metadata.json")
    write(jsonlite::toJSON(ans, auto_unbox=TRUE), file=metapath)
})

setMethod("appSaveObject", "ANY", function(x, path, ...) {
    saveObject(x, path, ...) # does the real work
    list(authors=I(Sys.info()[["user"]])) # adds the desired metadata
})

# We can specialize the behavior for specific classes like DataFrames.
setMethod("appSaveObject", "DFrame", function(x, path, ...) {
    ans <- callNextMethod()
    ans$columns <- I(colnames(x))
    ans
})
```

Applications should call `altSaveObjectFunction()` to instruct `altSaveObject()` to use this new generic.
This ensures that the customizations are applied to all child objects, such as the nested `DataFrame` below.

```{r}
# Create a friendly user-visible function to handle the generic override; this
# is reversed on function exit to avoid interfering with other applications.
saveForApplication <- function(x, path, ...) {
    old <- altSaveObjectFunction(appSaveObject)
    on.exit(altSaveObjectFunction(old))
    altSaveObject(x, path, ...)
}

# Saving our mocked up DataFrame with our overrides active.
df2 <- DataFrame(Z=factor(1:5), AA=I(DataFrame(B=runif(5), C=rnorm(5))))
tmp <- tempfile()
saveForApplication(df2, tmp)

# Both the parent and child DataFrames have new metadata.
cat(readLines(file.path(tmp, "_metadata.json")), sep="\n")
cat(readLines(file.path(tmp, "other_columns/1/_metadata.json")), sep="\n")
```

The reading function can be similarly overridden by setting `altReadObjectFunction()` to instruct all `altReadObject()` calls to use the override.
This allows applications to, e.g., do something with the metadata that we just added.

```{r}
# Defining the override for altReadObject().
appReadObject <- function(path, metadata=NULL, ...) {
    if (is.null(metadata)) {
        metadata <- readObjectFile(path)
    }

    # Print custom message based on the type and application-specific metadata.
    appmeta <- jsonlite::fromJSON(file.path(path, "_metadata.json"))
    cat("I am a ", metadata$type, " created by ", appmeta$authors[1], ".\n", sep="")
    if (metadata$type == "data_frame") {
        all.cols <- paste(appmeta$columns, collapse=", ")
        cat("I have the following columns: ", all.cols, ".\n", sep="")
    }

    readObject(path, metadata=metadata, ...)
}

# Creating a user-friendly function to set the override before the read.
readForApplication <- function(path, metadata=NULL, ...) {
    old <- altReadObjectFunction(appReadObject)
    on.exit(altReadObjectFunction(old))
    altReadObject(path, metadata, ...)
}

# This diverts to the override with printing of custom messages.
readForApplication(tmp)
```

By overriding the saving and reading process for one or more classes, each application can customize the behavior of _alabaster_ to its own needs.

# Session information {-}

```{r}
sessionInfo()
```