Saving and reloading DelayedArray objects

Motivation

The chihaya package saves DelayedArray objects for efficient, portable and stable reproduction of delayed operations in a new R session or other programming frameworks.

  • Portability. We provide a file specification in a standard format (HDF5) that enables other languages to easily interpret and reproduce the delayed operations.
  • Stability. By converting DelayedArray operations into our specification, we provide a layer of protection against changes in the S4 class structure that would invalidate serialized RDS objects.
  • Efficiency. We avoid any realization of the delayed operations, enabling quick saving and avoiding loss of data structure in the seed (e.g., sparsity).

Check out the specification for more details.

Quick start

Make a DelayedArray object with some operations:

library(DelayedArray)
x <- DelayedArray(matrix(runif(1000), ncol=10))
x <- x[11:15,] / runif(5) 
x <- log2(x + 1)
x
## <5 x 10> DelayedMatrix object of type "double":
##           [,1]      [,2]      [,3] ...      [,9]     [,10]
## [1,] 3.1892406 3.6005533 2.8672847   . 2.5606825 0.4006983
## [2,] 0.6848353 0.4640816 0.3398459   . 0.8913409 0.3815078
## [3,] 1.1551322 1.1473146 1.2479248   . 1.0237410 0.2752791
## [4,] 0.3725146 1.1186413 0.4230907   . 1.5618270 0.8849816
## [5,] 2.1491976 2.2959302 2.2487783   . 1.7864836 1.6093835
showtree(x)
## 5x10 double: DelayedMatrix object
## └─ 5x10 double: Stack of 2 unary iso op(s)
##    └─ 5x10 double: Unary iso op with args
##       └─ 5x10 double: Subset
##          └─ 100x10 double: [seed] matrix object

Save it into a HDF5 file with saveDelayed():

library(chihaya)
tmp <- tempfile(fileext=".h5")
saveDelayed(x, tmp)
rhdf5::h5ls(tmp)
##                            group    name       otype  dclass      dim
## 0                              / delayed   H5I_GROUP                 
## 1                       /delayed    base H5I_DATASET   FLOAT    ( 0 )
## 2                       /delayed  method H5I_DATASET  STRING    ( 0 )
## 3                       /delayed    seed   H5I_GROUP                 
## 4                  /delayed/seed  method H5I_DATASET  STRING    ( 0 )
## 5                  /delayed/seed    seed   H5I_GROUP                 
## 6             /delayed/seed/seed   along H5I_DATASET INTEGER    ( 0 )
## 7             /delayed/seed/seed  method H5I_DATASET  STRING    ( 0 )
## 8             /delayed/seed/seed    seed   H5I_GROUP                 
## 9        /delayed/seed/seed/seed   index   H5I_GROUP                 
## 10 /delayed/seed/seed/seed/index       0 H5I_DATASET INTEGER        5
## 11       /delayed/seed/seed/seed    seed   H5I_GROUP                 
## 12  /delayed/seed/seed/seed/seed    data H5I_DATASET   FLOAT 100 x 10
## 13  /delayed/seed/seed/seed/seed  native H5I_DATASET INTEGER    ( 0 )
## 14            /delayed/seed/seed    side H5I_DATASET  STRING    ( 0 )
## 15            /delayed/seed/seed   value H5I_DATASET   FLOAT        5
## 16                 /delayed/seed    side H5I_DATASET  STRING    ( 0 )
## 17                 /delayed/seed   value H5I_DATASET   FLOAT    ( 0 )

And then load it back in later:

y <- loadDelayed(tmp)
y
## <5 x 10> DelayedMatrix object of type "double":
##           [,1]      [,2]      [,3] ...      [,9]     [,10]
## [1,] 3.1892406 3.6005533 2.8672847   . 2.5606825 0.4006983
## [2,] 0.6848353 0.4640816 0.3398459   . 0.8913409 0.3815078
## [3,] 1.1551322 1.1473146 1.2479248   . 1.0237410 0.2752791
## [4,] 0.3725146 1.1186413 0.4230907   . 1.5618270 0.8849816
## [5,] 2.1491976 2.2959302 2.2487783   . 1.7864836 1.6093835

Of course, this is not a particularly interesting case as we end up saving the original array inside our HDF5 file anyway. The real fun begins when you have some more interesting seeds.

More interesting seeds

We can use the delayed nature of the operations to avoid breaking sparsity. For example:

library(Matrix)
x <- rsparsematrix(1000, 1000, density=0.01)
x <- DelayedArray(x) + runif(1000)

tmp <- tempfile(fileext=".h5")
saveDelayed(x, tmp)
rhdf5::h5ls(tmp)
##            group     name       otype  dclass   dim
## 0              /  delayed   H5I_GROUP              
## 1       /delayed    along H5I_DATASET INTEGER ( 0 )
## 2       /delayed   method H5I_DATASET  STRING ( 0 )
## 3       /delayed     seed   H5I_GROUP              
## 4  /delayed/seed     data H5I_DATASET   FLOAT 10000
## 5  /delayed/seed dimnames   H5I_GROUP              
## 6  /delayed/seed  indices H5I_DATASET INTEGER 10000
## 7  /delayed/seed   indptr H5I_DATASET INTEGER  1001
## 8  /delayed/seed    shape H5I_DATASET INTEGER     2
## 9       /delayed     side H5I_DATASET  STRING ( 0 )
## 10      /delayed    value H5I_DATASET   FLOAT  1000
file.info(tmp)[["size"]]
## [1] 101782
# Compared to a dense array.
tmp2 <- tempfile(fileext=".h5")
out <- HDF5Array::writeHDF5Array(x, tmp2, "data")
file.info(tmp2)[["size"]]
## [1] 279928
# Loading it back in.
y <- loadDelayed(tmp)
showtree(y)
## 1000x1000 double: DelayedMatrix object
## └─ 1000x1000 double: Unary iso op with args
##    └─ 1000x1000 double, sparse: [seed] dgCMatrix object

We can also store references to external files, thus avoiding data duplication:

library(HDF5Array)
test <- HDF5Array(tmp2, "data")
stuff <- log2(test + 1)
stuff
## <1000 x 1000> DelayedMatrix object of type "double":
##               [,1]       [,2]       [,3] ...     [,999]    [,1000]
##    [1,]  0.6609491  0.6609491  0.6609491   .  0.6609491  0.6609491
##    [2,]  0.1282267  0.1282267  0.1282267   .  0.1282267  0.1282267
##    [3,]  0.4803945  0.4803945  0.4803945   .  0.4803945  0.4803945
##    [4,]  0.1038005  0.1038005  0.1038005   .  0.1038005  0.1038005
##    [5,]  0.7387484  0.7387484  0.7387484   .  0.7387484  0.7387484
##     ...          .          .          .   .          .          .
##  [996,] 0.28253275 0.28253275 0.28253275   . 0.28253275 0.28253275
##  [997,] 0.45990290 0.45990290 0.45990290   . 0.45990290 0.45990290
##  [998,] 0.00830843 0.00830843 0.00830843   . 0.00830843 0.00830843
##  [999,] 0.36752085 0.36752085 0.36752085   . 0.36752085 0.36752085
## [1000,] 0.49238432 0.49238432 0.49238432   . 0.49238432 0.49238432
tmp <- tempfile(fileext=".h5")
saveDelayed(stuff, tmp)
rhdf5::h5ls(tmp)
##                 group       name       otype  dclass   dim
## 0                   /    delayed   H5I_GROUP              
## 1            /delayed       base H5I_DATASET   FLOAT ( 0 )
## 2            /delayed     method H5I_DATASET  STRING ( 0 )
## 3            /delayed       seed   H5I_GROUP              
## 4       /delayed/seed     method H5I_DATASET  STRING ( 0 )
## 5       /delayed/seed       seed   H5I_GROUP              
## 6  /delayed/seed/seed dimensions H5I_DATASET INTEGER     2
## 7  /delayed/seed/seed       file H5I_DATASET  STRING ( 0 )
## 8  /delayed/seed/seed       name H5I_DATASET  STRING ( 0 )
## 9  /delayed/seed/seed     sparse H5I_DATASET INTEGER ( 0 )
## 10 /delayed/seed/seed       type H5I_DATASET  STRING ( 0 )
## 11      /delayed/seed       side H5I_DATASET  STRING ( 0 )
## 12      /delayed/seed      value H5I_DATASET   FLOAT ( 0 )
file.info(tmp)[["size"]] # size of the delayed operations + pointer to the actual file
## [1] 49642
y <- loadDelayed(tmp)
y
## <1000 x 1000> DelayedMatrix object of type "double":
##               [,1]       [,2]       [,3] ...     [,999]    [,1000]
##    [1,]  0.6609491  0.6609491  0.6609491   .  0.6609491  0.6609491
##    [2,]  0.1282267  0.1282267  0.1282267   .  0.1282267  0.1282267
##    [3,]  0.4803945  0.4803945  0.4803945   .  0.4803945  0.4803945
##    [4,]  0.1038005  0.1038005  0.1038005   .  0.1038005  0.1038005
##    [5,]  0.7387484  0.7387484  0.7387484   .  0.7387484  0.7387484
##     ...          .          .          .   .          .          .
##  [996,] 0.28253275 0.28253275 0.28253275   . 0.28253275 0.28253275
##  [997,] 0.45990290 0.45990290 0.45990290   . 0.45990290 0.45990290
##  [998,] 0.00830843 0.00830843 0.00830843   . 0.00830843 0.00830843
##  [999,] 0.36752085 0.36752085 0.36752085   . 0.36752085 0.36752085
## [1000,] 0.49238432 0.49238432 0.49238432   . 0.49238432 0.49238432

Session information

sessionInfo()
## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Etc/UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] HDF5Array_1.33.8      rhdf5_2.49.0          chihaya_1.7.0        
##  [4] DelayedArray_0.31.14  SparseArray_1.5.45    S4Arrays_1.5.11      
##  [7] abind_1.4-8           IRanges_2.39.2        S4Vectors_0.43.2     
## [10] MatrixGenerics_1.17.1 matrixStats_1.4.1     BiocGenerics_0.51.3  
## [13] Matrix_1.7-1          BiocStyle_2.33.1     
## 
## loaded via a namespace (and not attached):
##  [1] jsonlite_1.8.9      compiler_4.4.1      BiocManager_1.30.25
##  [4] crayon_1.5.3        Rcpp_1.0.13         rhdf5filters_1.17.0
##  [7] jquerylib_0.1.4     yaml_2.3.10         fastmap_1.2.0      
## [10] lattice_0.22-6      R6_2.5.1            XVector_0.45.0     
## [13] knitr_1.48          maketools_1.3.1     bslib_0.8.0        
## [16] rlang_1.1.4         cachem_1.1.0        xfun_0.48          
## [19] sass_0.4.9          sys_3.4.3           cli_3.6.3          
## [22] Rhdf5lib_1.27.0     zlibbioc_1.51.2     digest_0.6.37      
## [25] grid_4.4.1          lifecycle_1.0.4     evaluate_1.0.1     
## [28] buildtools_1.0.0    rmarkdown_2.28      tools_4.4.1        
## [31] htmltools_0.5.8.1