The DelayedArray framework currently supports a small number of on-disk backends: HDF5 (via the HDF5Array package), GDS (via the GDSArray package), and VCF (via the VCFArray package). This can be extended to support other on-disk backends. In theory, it should be possible to implement a DelayedArray backend for any file format that has the capability to store array data with fast random access.
Let’s assume that the ADS format (Array Data Store) is such a format (this is a made-up format for the purpose of this vignette only). Implementing a DelayedArray backend for ADS files should typically be done in a dedicated package (say ADSArray) that depends on the DelayedArray package.
The ADSArray package will need to implement:
A low-level class for representing a reference to an array located in an ADS file. We’ll refer to this class as “the seed class” and will name it ADSArraySeed.
Two high-level classes that derive from DelayedArray: ADSArray and ADSMatrix. Only the latter is needed if the ADS format only supports 2-dimensional arrays.
A “realization sink” class if you also want to support realization of DelayedArray objects as ADSArray objects. This is not documented yet.
The rest of this document covers the above topics in greater detail. Some familiarity with writing R packages is assumed. Don’t hesitate to look at the source of the HDF5Array package for a real example of a DelayedArray on-disk backend implementation.
A “seed object” should store at least the path or URL to the file. If the file format allows storing more than one array per file, then the seed object should also store any additional information needed to locate a particular array in the file.
The definition of the seed class will look something like this:
setClass("ADSArraySeed",
    contains="Array",
    slots=c(
        filepath="character"
        ## ... plus any additional slots needed
        ## ... to locate the array in the file
    )
)
The filepath slot should be a single string that contains the absolute path to the ADS file so the object doesn’t break when the user changes the working directory (e.g. with setwd()).
Note that storing an open connection to the file should be avoided because connections don’t work properly in the context of a fork (e.g. when processing the seed object in parallel) and tend to break when serializing the object.
It is highly recommended to provide a “seed constructor” e.g.:
ADSArraySeed <- function(filepath, other args)
{
    ## sanity checks
    ...
    filepath <- file_path_as_absolute(filepath)
    ...
    new("ADSArraySeed", filepath=filepath, other args)
}
Note that file_path_as_absolute() is defined in the tools package so it needs to be imported by adding the following to the NAMESPACE file of the ADSArray package:
importFrom(tools, file_path_as_absolute)
and adding tools to the Imports field of the DESCRIPTION file of the package.
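To illustrate why the constructor stores an absolute path, here is a minimal sketch that uses only base R. The ToyPathSeed class is hypothetical and, unlike a real seed class, does not extend Array; an empty temporary file stands in for an ADS file.

```r
library(methods)
library(tools)

## Hypothetical toy seed class with a single slot (assumption: a real seed
## class would extend "Array" and carry additional slots).
setClass("ToyPathSeed", slots=c(filepath="character"))

ToyPathSeed <- function(filepath)
{
    ## Sanity check: 'filepath' must be a single non-NA string.
    if (!(is.character(filepath) && length(filepath) == 1L && !is.na(filepath)))
        stop("'filepath' must be a single string")
    ## file_path_as_absolute() fails if the file does not exist.
    new("ToyPathSeed", filepath=file_path_as_absolute(filepath))
}

## Create an empty stand-in file, then construct the seed from a *relative*
## path while the working directory is the file's directory.
path <- tempfile(fileext=".ads")
file.create(path)
old_wd <- setwd(dirname(path))
seed <- ToyPathSeed(basename(path))
setwd(old_wd)

## The seed still points to the file after the working directory changed.
file.exists(seed@filepath)
```

Had the constructor stored the relative path as supplied, the last call would only succeed while the working directory remained unchanged.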
Seed objects are expected to comply with the “seed contract” i.e. to support dim(), dimnames(), and extract_array(). This is normally done by implementing methods for these generics, but, as we will see below, explicitly defining dim() or dimnames() methods is rarely needed.
For example, the dim() method for ADSArraySeed objects could look like this:
### An implementation that extracts the dimensions from the file
### each time the method is called.
setMethod("dim", "ADSArraySeed",
    function(x)
    {
        ## - open the connection to the file
        ## - on.exit(close the connection)
        ## - extract the dimensions and return them in an integer vector
    }
)
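The open/on.exit/extract pattern outlined above can be sketched in base R. Since ADS is a made-up format, this sketch assumes a hypothetical plain-text layout in which the first line of the file holds the dimensions separated by spaces; read_toy_dim() is an illustrative helper, not part of any package.

```r
## Hypothetical helper: reads the dimensions from the first line of a toy
## text file (assumed layout, not a real file format).
read_toy_dim <- function(filepath)
{
    con <- file(filepath, open="r")   # open the connection on demand
    on.exit(close(con))               # close it on exit, even after an error
    header <- readLines(con, n=1L)
    as.integer(strsplit(header, " ", fixed=TRUE)[[1L]])
}

## Write a toy file describing a 3 x 4 dataset, then query its dimensions.
path <- tempfile()
writeLines(c("3 4", "1 2 3 4 5 6 7 8 9 10 11 12"), path)
toy_dim <- read_toy_dim(path)
```

Because the connection only lives for the duration of the call, nothing connection-related needs to be stored in the seed object.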
Note that the above dim() method consults the ADS file each time it’s called. However, this can be avoided by adding a dim (and dimnames) slot (of type integer for dim, of type list for dimnames) to the ADSArraySeed class and populating it at construction time, so this information is retrieved from the file only once. With this approach dim() and dimnames() work out-of-the-box on ADSArraySeed objects i.e. there is no need to define dim() and dimnames() methods for these objects. This is because the dim() and dimnames() primitive functions in base R return the content of these slots if present.
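The slot-based approach can be tried out with a minimal sketch in base R. ToyDimSeed is a hypothetical class that does not extend Array, and the path and dimensions are hard-coded here instead of being read from a file at construction time.

```r
library(methods)

## Hypothetical toy seed class with 'dim' and 'dimnames' slots.
setClass("ToyDimSeed",
    slots=c(
        filepath="character",
        dim="integer",
        dimnames="list"
    )
)

## In a real constructor these values would be extracted from the file once.
x <- new("ToyDimSeed",
         filepath="/path/to/some/file.ads",   # placeholder path, never opened
         dim=c(3L, 4L),
         dimnames=list(c("a", "b", "c"), NULL))

## No dim() or dimnames() methods were defined, yet both calls return the
## content of the corresponding slots.
dim(x)
dimnames(x)
```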
If the ADS format does not allow storage of the dimnames, then there is no need to implement a dimnames() method or to add a dimnames slot to the ADSArraySeed class. Calling dimnames(x) will then simply return NULL for any ADSArraySeed object x.
If the ADS format allows storage of the dimnames, make sure that dimnames() always returns them in the standard form, that is:
The dimnames must be returned as NULL (if the dataset has no dimnames) or as an ordinary list with one list element per dimension in the dataset.
Each element in the returned list is either NULL or a character vector whose length is the extent of the dataset along the corresponding dimension. It is particularly important to make sure that the vectors in the list returned by dimnames() are character vectors. Other types like factors or integer vectors are not allowed and will break downstream code.
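One way to enforce the standard form is to normalize whatever the file reader returns before storing it in the seed. The helper below is a hypothetical base R sketch (normalize_toy_dimnames() does not exist in any package) that coerces each non-NULL element to a character vector, checks its length, and returns NULL when no dimension has names.

```r
## Hypothetical helper: puts raw dimnames in the standard form.
normalize_toy_dimnames <- function(dimnames, dim)
{
    if (is.null(dimnames))
        return(NULL)
    stopifnot(is.list(dimnames), length(dimnames) == length(dim))
    ans <- lapply(seq_along(dim),
        function(along) {
            dn <- dimnames[[along]]
            if (is.null(dn))
                return(NULL)
            dn <- as.character(dn)   # e.g. factor or integer -> character
            if (length(dn) != dim[[along]])
                stop("dimnames along dimension ", along,
                     " have the wrong length")
            dn
        })
    ## A list of NULLs means the dataset has no dimnames at all.
    if (all(vapply(ans, is.null, logical(1))))
        return(NULL)
    ans
}

## Raw dimnames made of a factor and an integer vector.
raw_dimnames <- list(factor(c("a", "b")), 101:103)
std_dimnames <- normalize_toy_dimnames(raw_dimnames, dim=c(2L, 3L))
```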
extract_array() is a generic function defined in the DelayedArray package:
library(DelayedArray)
?extract_array
It takes 2 arguments: x and index.
x is the seed object to extract array values from.
index must be an unnamed list of subscripts as positive integer vectors, one vector per seed dimension. Empty and missing subscripts (represented by integer(0) and NULL list elements, respectively) are allowed. The subscripts in index can contain duplicated indices. They cannot contain NAs or non-positive values.
The extract_array() method must return an ordinary array of the appropriate type (i.e. integer, double, etc…). For example, if x is an ADSArraySeed object representing an M x N on-disk matrix of complex numbers, extract_array(x, list(NULL, 2L)) must return its 2nd column as an M x 1 ordinary matrix of type complex.
Note that the extract_array() method needs to support empty and missing subscripts e.g. extract_array(x, list(NULL, integer(0))) must return an M x 0 matrix of type complex and extract_array(x, list(integer(0), integer(0))) a 0 x 0 matrix of type complex. This last edge case is important because the type() and show() methods for DelayedArray objects rely on it to work. More precisely, once the extract_array() method supports an index with empty integer vectors, the following should work:
seed <- ADSArraySeed(...)
M <- DelayedArray(seed)
type(M)
show(M)
Finally, note that subscripts are allowed to contain duplicated indices so things like extract_array(seed, list(c(1:3, 3:1), 2L)) need to be supported.
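The contract described above can be illustrated on an ordinary in-memory array. The sketch below is base R only: toy_extract_array() is a hypothetical stand-in that mimics what an extract_array() method is expected to return, with an ordinary complex matrix playing the role of the on-disk data.

```r
## Hypothetical stand-in for an extract_array() method, operating on an
## ordinary array instead of a seed object.
toy_extract_array <- function(x, index)
{
    stopifnot(is.list(index), length(index) == length(dim(x)))
    ## Replace missing (NULL) subscripts with the full range along that
    ## dimension.
    subscripts <- lapply(seq_along(index),
        function(along) {
            i <- index[[along]]
            if (is.null(i)) seq_len(dim(x)[[along]]) else i
        })
    ## drop=FALSE preserves the number of dimensions of the result.
    do.call(`[`, c(list(x), subscripts, list(drop=FALSE)))
}

## A 3 x 2 matrix of complex numbers (M = 3, N = 2).
x <- matrix(complex(real=1:6, imaginary=0), nrow=3)

col2  <- toy_extract_array(x, list(NULL, 2L))                 # M x 1
empty <- toy_extract_array(x, list(NULL, integer(0)))         # M x 0
none  <- toy_extract_array(x, list(integer(0), integer(0)))   # 0 x 0
dups  <- toy_extract_array(x, list(c(1:3, 3:1), 2L))          # 6 x 1
```

In all four cases the result is an ordinary matrix of type complex, including the 0 x 0 case.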
Make sure the NAMESPACE file of the ADSArray package contains at least the following imports:
import(methods)
importFrom(tools, file_path_as_absolute)
import(BiocGenerics)
import(S4Vectors)
import(IRanges)
import(DelayedArray)
Unless you have a good reason for it, don’t try to selectively import things from the methods, BiocGenerics, S4Vectors, IRanges, and DelayedArray packages. This will only complicate maintenance of the ADSArray package in the long run and has no real benefits (contrary to popular belief).
Add methods, BiocGenerics, and DelayedArray to the Depends field of the DESCRIPTION file of the package, and tools, S4Vectors, and IRanges to its Imports field.
Make sure to export the ADSArraySeed class, its constructor, and the dim, dimnames, and extract_array methods.
At this point, you should be able to wrap an ADSArraySeed object seed in a DelayedArray object with DelayedArray(seed), and this should return a fully functional DelayedArray object.
These classes are not strictly needed but add a nice level of convenience.
An ADSArray or ADSMatrix object is a DelayedArray derivative that doesn’t carry delayed operations yet. As soon as the user starts operating on it, it is degraded to a DelayedArray instance.
The ADSArray and ADSMatrix classes should extend the DelayedArray and DelayedMatrix classes, respectively, without adding any slot to them.
So just:
setClass("ADSArray",
    contains="DelayedArray",
    representation(seed="ADSArraySeed")
)
We’ll define the ADSMatrix class later.
Add a DelayedArray() method for ADSArraySeed objects that does:
setMethod("DelayedArray", "ADSArraySeed",
    function(seed) new_DelayedArray(seed, Class="ADSArray")
)
Now you should be able to construct an ADSArray object with:
DelayedArray(ADSArraySeed(...))
The ADSArray constructor should just do that:
ADSArray <- function(filepath, other args)
    DelayedArray(ADSArraySeed(filepath, other args))
However, it’s also nice to be able to pass an ADSArraySeed object to this constructor (with ADSArray(seed)). This can easily be supported with something like:
### Works directly on an ADSArraySeed object, in which case it must be
### called with a single argument.
ADSArray <- function(filepath, other args)
{
    if (is(filepath, "ADSArraySeed")) {
        if (!(missing(other arg1) && missing(other arg2) && ...))
            stop(wmsg("ADSArray() must be called with a single argument ",
                      "when passed an ADSArraySeed object"))
        seed <- filepath
    } else {
        seed <- ADSArraySeed(filepath, other args)
    }
    DelayedArray(seed)
}
setClass("ADSMatrix", contains=c("ADSArray", "DelayedMatrix"))
Define a matrixClass() method for ADSArray objects as follows:
setMethod("matrixClass", "ADSArray", function(x) "ADSMatrix")
matrixClass() is a generic function defined in the DelayedArray package. When passed an ADSArraySeed object, the low-level constructor new_DelayedArray() (see above) will generally return an ADSArray instance, except when the ADSArraySeed object is 2-dimensional, in which case it needs to return an ADSMatrix instance. It obtains the name of the class of the object to return ("ADSMatrix" in this case) by calling matrixClass().
Also coercion from ADSArray to ADSMatrix needs to be supported with:
setAs("ADSArray", "ADSMatrix", function(from) new("ADSMatrix", from))
This coercion will make sure that the end user gets the following error when trying to coerce an ADSArray object that is not 2-dimensional to ADSMatrix:
as(x, "ADSMatrix")
# Error in validObject(.Object) : invalid class "ADSMatrix" object:
# 'x' must have exactly 2 dimensions
Without the above coercion method, as(x, "ADSMatrix") would silently return an invalid ADSMatrix object.
The user should not be able to degrade an ADSMatrix object to an ADSArray object so as(x, "ADSArray", strict=TRUE) should fail or be a no-op when x is an ADSMatrix object. The easiest (and recommended) way to achieve this is to define the following coercion method:
setAs("ADSMatrix", "ADSArray", function(from) from) # no-op
It is possible, and encouraged, to override current DelayedArray block-processed operations (e.g. max, colSums, %*%, etc…) with optimized backend-specific methods. For example, let’s imagine that ADS files have the capability to store some precomputed stats about the dataset. Then one could define a fast max() method for ADSArray objects with something like:
setMethod("max", "ADSArraySeed",
    function(x, na.rm=FALSE)
    {
        ## get the precomputed max from the file
    }
)
setMethod("max", "ADSArray",
    function(x, na.rm=FALSE) max(x@seed, na.rm=na.rm)
)
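The idea of answering max() from precomputed stats can be sketched in base R. Everything here is an assumption made for illustration: an RDS file stands in for an ADS file and stores both the data and a precomputed max, and ToyStatsSeed is a hypothetical class that does not extend Array. Note that readRDS() loads the whole file, whereas a real backend would read only the stored statistic.

```r
library(methods)

## Stand-in for an ADS file: an RDS file holding the data together with a
## precomputed max (assumed layout, not a real format).
path <- tempfile(fileext=".rds")
m <- matrix(c(5, 2, 9, 1, 7, 3), nrow=2)
saveRDS(list(data=m, stats=list(max=max(m))), path)

## Hypothetical toy seed class.
setClass("ToyStatsSeed", slots=c(filepath="character"))

## max() method that returns the stored statistic instead of scanning the
## data. 'na.rm' is ignored in this sketch because the stat is precomputed.
setMethod("max", "ToyStatsSeed",
    function(x, ..., na.rm=FALSE)
    {
        readRDS(x@filepath)$stats$max
    }
)

seed <- new("ToyStatsSeed", filepath=path)
seed_max <- max(seed)
```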
Note that delayed operations like setting dimnames on an ADSArray object (with dimnames(A) <- new_dimnames) or transposing an ADSMatrix object (with M2 <- t(M)) will degrade the object to a DelayedArray or DelayedMatrix instance, causing max(A) and max(M2) to use the far less efficient block-processed max() method defined for DelayedArray objects. There is clearly room for improvement here and work will be done in the near future to make the max() method (and other block-processed methods) for DelayedArray objects try to take advantage of the backend-specific methods whenever it can.
However, in the meantime, backend authors should resist the temptation to override the dimnames<-() and t() methods for DelayedArray objects with backend-specific methods that modify the seed. This would be a violation of the “never touch the seed” principle which is central to the DelayedArray framework. More precisely, no matter what delayed operations are performed on a DelayedArray object, the seeds of the result should always be identical to the original seeds (e.g. seed(t(M)) should always be identical to seed(M)).
Make sure to export the ADSArray and ADSMatrix classes, the ADSArray constructor, the coerce methods, and any backend-specific method.
Install the ADSArray package and load it in a fresh R session:
library(ADSArray)
... coming soon ...