The alabaster.base package (and its family) implements methods to save common Bioconductor objects to file artifacts and load them back into R. This aims to provide a functional equivalent to RDS-based serialization that is:
updateObject()
operations at best,
or invalidate RDS files at worst.To demonstrate, let’s mock up a DataFrame
object from
the S4Vectors
package.
## DataFrame with 10 rows and 2 columns
## X Y
## <integer> <character>
## 1 1 a
## 2 2 b
## 3 3 c
## 4 4 d
## 5 5 e
## 6 6 f
## 7 7 g
## 8 8 h
## 9 9 i
## 10 10 j
We’ll save this DataFrame
to a directory:
And read it back in:
## DataFrame with 10 rows and 2 columns
## X Y
## <integer> <character>
## 1 1 a
## 2 2 b
## 3 3 c
## 4 4 d
## 5 5 e
## 6 6 f
## 7 7 g
## 8 8 h
## 9 9 i
## 10 10 j
Each class implements a saving and reading method for use by the
alabaster framework. The saving method (for the
saveObject()
generic) will save the object to one or more
files inside a user-specified directory:
## [1] "OBJECT" "basic_columns.h5"
Conversely, the reading function will - as the name suggests - load
the object back into memory, given the path to its directory. The
correct loading function for each class is automatically called by the
readObject()
function:
## DataFrame with 10 rows and 2 columns
## X Y
## <integer> <character>
## 1 1 a
## 2 2 b
## 3 3 c
## 4 4 d
## 5 5 e
## 6 6 f
## 7 7 g
## 8 8 h
## 9 9 i
## 10 10 j
alabaster.base
also provides a validateObject()
function, which checks
that each object’s on-disk representation follows its associated
specification. For all alabaster-supported objects,
this validates the file contents against the takane
specifications - successful validation provides guarantees for
readers like readObject()
and dolomite-base
(for Python). In fact, saveObject()
will automatically run
validateObject()
on the directory to ensure compliance.
alabaster.base itself supports a small set of classes from the S4Vectors packages; support for additional classes can be found in other packages like alabaster.ranges and alabaster.se. Third-party developers can also add support for their own classes by defining new methods, see the Extensions vignette for details.
Users can move freely rename or relocate directories and
readObject()
function will still work. For example, we can
easily copy the entire directory to a new file system and everything
will still be correctly referenced within the directory. The simplest
way to share objects is to just zip
or tar
the
staging directory for ad hoc distribution. For more serious
applications, alabaster.base
can be used in conjunction with storage systems like AWS S3 for
large-scale distribution.
## [1] TRUE
## DataFrame with 10 rows and 2 columns
## X Y
## <integer> <character>
## 1 1 a
## 2 2 b
## 3 3 c
## 4 4 d
## 5 5 e
## 6 6 f
## 7 7 g
## 8 8 h
## 9 9 i
## 10 10 j
That said, it is unwise to manipulate the files inside the directory
created by saveObject()
. Reading functions will usually
depend on specific file names or subdirectory structures within the
directory, and fiddling with them may cause unexpected results. Advanced
users can exploit this by loading components from subdirectories if only
the full object is not required:
# Creating a nested DF to be a little spicy:
df2 <- DataFrame(Z=factor(1:5), AA=I(DataFrame(B=runif(5), C=rnorm(5))))
tmp <- tempfile()
meta2 <- saveObject(df2, tmp)
# Now reading in the nested DF:
list.files(tmp, recursive=TRUE)
## [1] "OBJECT" "basic_columns.h5"
## [3] "other_columns/1/OBJECT" "other_columns/1/basic_columns.h5"
## DataFrame with 5 rows and 2 columns
## B C
## <numeric> <numeric>
## 1 0.0196919 0.1261901
## 2 0.7982964 -0.8294884
## 3 0.7786083 -0.5711057
## 4 0.6334135 -0.0311062
## 5 0.0557657 0.2537396
The alabaster framework is easily extended to new classes by:
saveObject()
. This should accept
an instance of the object and a path to a directory, and save the
contents of the object inside the directory. It should also produce an
OBJECT
file that specifies the type of the object, e.g.,
data_frame
, hdf5_sparse_matrix
.readObject()
and registering it
with registerReadObjectFunction()
(or, for core
Bioconductor classes, by requesting a change to the default registry in
alabaster.base).
This should accept a path to a directory and read its contents to
reconstruct the object. The registered type should be the same as that
used in the OBJECT
file.validateObject()
and registering
it with registerValidateObjectFunction()
. This should
accept a path to a directory and read its contents to determine if it is
a valid on-disk representation. The registered type should be the same
as that used in the OBJECT
file.
validateObject()
so no registration
is required.To illustrate, let’s extend alabaster to the
dgTMatrix
from the Matrix
package. First, the saving method:
library(Matrix)
setMethod("saveObject", "dgTMatrix", function(x, path, ...) {
# Create a directory to stash our contents.
dir.create(path)
# Saving a DataFrame with the triplet data.
df <- DataFrame(i = x@i, j = x@j, x = x@x)
write.csv(df, file.path(path, "matrix.csv"), row.names=FALSE)
# Adding some more information.
write(dim(x), file=file.path(path, "dimensions.txt"), ncol=1)
# Creating an object file.
saveObjectFile(path, "triplet_sparse_matrix")
})
And now the reading and validation methods. The registration is
usually done in the extension package’s onLoad()
function.
readSparseTripletMatrix <- function(path, metadata, ...) {
df <- read.table(file.path(path, "matrix.csv"), header=TRUE, sep=",")
dims <- readLines(file.path(path, "dimensions.txt"))
sparseMatrix(
i=df$i + 1L,
j=df$j + 1L,
x=df$x,
dims=as.integer(dims),
repr="T"
)
}
registerReadObjectFunction("triplet_sparse_matrix", readSparseTripletMatrix)
validateSparseTripletMatrix <- function(path, metadata) {
df <- read.table(file.path(path, "matrix.csv"), header=TRUE, sep=",")
dims <- as.integer(readLines(file.path(path, "dimensions.txt")))
stopifnot(is.integer(df$i), all(df$i >= 0 & df$i < dims[1]))
stopifnot(is.integer(df$j), all(df$j >= 0 & df$j < dims[2]))
stopifnot(is.numeric(df$x))
}
registerValidateObjectFunction("triplet_sparse_matrix", validateSparseTripletMatrix)
## NULL
Let’s run them and see how it works:
## 10 x 10 sparse Matrix of class "dgTMatrix"
##
## [1,] . . 0.32918510 . . . . . . .
## [2,] . . . . . 0.8271621 . . . .
## [3,] 0.2751764 . . . . . . . . .
## [4,] . . . . . . . . . .
## [5,] . . 0.02819651 . . . . . . .
## [6,] . . . . . . . 0.542808 . .
## [7,] . . . . . . . . . .
## [8,] . . . . . . . . . .
## [9,] . . . . . . . . . .
## [10,] . . . . . . . . . .
## [1] "OBJECT" "dimensions.txt" "matrix.csv"
## 10 x 10 sparse Matrix of class "dgTMatrix"
##
## [1,] . . 0.32918510 . . . . . . .
## [2,] . . . . . 0.8271621 . . . .
## [3,] 0.2751764 . . . . . . . . .
## [4,] . . . . . . . . . .
## [5,] . . 0.02819651 . . . . . . .
## [6,] . . . . . . . 0.542808 . .
## [7,] . . . . . . . . . .
## [8,] . . . . . . . . . .
## [9,] . . . . . . . . . .
## [10,] . . . . . . . . . .
For more complex objects that are composed of multiple smaller
“child” objects, developers should consider saving each of their
children in subdirectories of path
. This can be achieved by
calling altSaveObject()
and altReadObject()
in
the saving and reading functions, respectively. (We use the
alt*
versions of these functions to respect application
overrides, see below.)
Developers can also create applications that customize the machinery
of the alabaster framework for specific needs. In most cases,
this involves storing more metadata to describe the object in more
detail. For example, we might want to remember the identity of the
author for each object. This is achieved by creating an
application-specific saving generic with the same signature as
saveObject()
:
setGeneric("appSaveObject", function(x, path, ...) {
ans <- standardGeneric("appSaveObject")
# File names with leading underscores are reserved for application-specific
# use, so they won't clash with anything produced by saveObject.
metapath <- file.path(path, "_metadata.json")
write(jsonlite::toJSON(ans, auto_unbox=TRUE), file=metapath)
})
## [1] "appSaveObject"
setMethod("appSaveObject", "ANY", function(x, path, ...) {
saveObject(x, path, ...) # does the real work
list(authors=I(Sys.info()[["user"]])) # adds the desired metadata
})
# We can specialize the behavior for specific classes like DataFrames.
setMethod("appSaveObject", "DFrame", function(x, path, ...) {
ans <- callNextMethod()
ans$columns <- I(colnames(x))
ans
})
Applications should call altSaveObjectFunction()
to
instruct altSaveObject()
to use this new generic. This
ensures that the customizations are applied to all child objects, such
as the nested DataFrame
below.
# Create a friendly user-visible function to handle the generic override; this
# is reversed on function exit to avoid interfering with other applications.
saveForApplication <- function(x, path, ...) {
old <- altSaveObjectFunction(appSaveObject)
on.exit(altSaveObjectFunction(old))
altSaveObject(x, path, ...)
}
# Saving our mocked up DataFrame with our overrides active.
df2 <- DataFrame(Z=factor(1:5), AA=I(DataFrame(B=runif(5), C=rnorm(5))))
tmp <- tempfile()
saveForApplication(df2, tmp)
# Both the parent and child DataFrames have new metadata.
cat(readLines(file.path(tmp, "_metadata.json")), sep="\n")
## {"authors":["root"],"columns":["Z","AA"]}
## {"authors":["root"],"columns":["B","C"]}
The reading function can be similarly overridden by setting
altReadObjectFunction()
to instruct all
altReadObject()
calls to use the override. This allows
applications to, e.g., do something with the metadata that we just
added.
# Defining the override for altReadObject().
appReadObject <- function(path, metadata=NULL, ...) {
if (is.null(metadata)) {
metadata <- readObjectFile(path)
}
# Print custom message based on the type and application-specific metadata.
appmeta <- jsonlite::fromJSON(file.path(path, "_metadata.json"))
cat("I am a ", metadata$type, " created by ", appmeta$authors[1], ".\n", sep="")
if (metadata$type == "data_frame") {
all.cols <- paste(appmeta$columns, collapse=", ")
cat("I have the following columns: ", all.cols, ".\n", sep="")
}
readObject(path, metadata=metadata, ...)
}
# Creating a user-friendly function to set the override before the read.
readForApplication <- function(path, metadata=NULL, ...) {
old <- altReadObjectFunction(appReadObject)
on.exit(altReadObjectFunction(old))
altReadObject(path, metadata, ...)
}
# This diverts to the override with printing of custom messages.
readForApplication(tmp)
## I am a data_frame created by root.
## I have the following columns: Z, AA.
## I am a data_frame created by root.
## I have the following columns: B, C.
## DataFrame with 5 rows and 2 columns
## Z AA
## <factor> <DataFrame>
## 1 1 0.246322:-1.210172
## 2 2 0.154906:-0.429744
## 3 3 0.915335:-0.633189
## 4 4 0.239906: 0.734313
## 5 5 0.705503: 1.126401
By overriding the saving and reading process for one or more classes, each application can customize the behavior of alabaster to their own needs.
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] Matrix_1.7-1 alabaster.base_1.7.2 S4Vectors_0.45.2
## [4] BiocGenerics_0.53.3 generics_0.1.3 BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] jsonlite_1.8.9 compiler_4.4.2 BiocManager_1.30.25
## [4] crayon_1.5.3 Rcpp_1.0.13-1 rhdf5filters_1.19.0
## [7] alabaster.matrix_1.7.4 jquerylib_0.1.4 IRanges_2.41.2
## [10] yaml_2.3.10 fastmap_1.2.0 lattice_0.22-6
## [13] XVector_0.47.0 R6_2.5.1 S4Arrays_1.7.1
## [16] knitr_1.49 DelayedArray_0.33.3 MatrixGenerics_1.19.0
## [19] maketools_1.3.1 bslib_0.8.0 rlang_1.1.4
## [22] HDF5Array_1.35.2 cachem_1.1.0 xfun_0.49
## [25] sass_0.4.9 sys_3.4.3 SparseArray_1.7.2
## [28] cli_3.6.3 Rhdf5lib_1.29.0 zlibbioc_1.52.0
## [31] digest_0.6.37 grid_4.4.2 alabaster.schemas_1.7.0
## [34] rhdf5_2.51.1 lifecycle_1.0.4 evaluate_1.0.1
## [37] buildtools_1.0.0 abind_1.4-8 rmarkdown_2.29
## [40] matrixStats_1.4.1 tools_4.4.2 htmltools_0.5.8.1