Saving arrays to artifacts and back again

Overview

The alabaster.matrix package implements methods to save matrix-like objects to file artifacts and load them back into R. See the alabaster.base package for more details on the motivation and the alabaster framework.

Quick start

Given an array-like object, we can use saveObject() to save it inside a staging directory:

library(Matrix)
y <- rsparsematrix(1000, 100, density=0.05)

library(alabaster.matrix)
tmp <- tempfile()
saveObject(y, tmp)

list.files(tmp, recursive=TRUE)
## [1] "OBJECT"    "matrix.h5"

We then load it back into our R session with readObject(). This creates an HDF5-backed S4 array that can be easily coerced into the desired format, e.g., a dgCMatrix.

roundtrip <- readObject(tmp)
class(roundtrip)
## [1] "ReloadedMatrix"
## attr(,"package")
## [1] "alabaster.matrix"
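
The coercion mentioned above can be sketched as follows. This is a minimal, self-contained example that assumes the ReloadedMatrix inherits the DelayedMatrix-to-dgCMatrix coercion provided by DelayedArray; the variable names (x, path, reloaded, sparse) are chosen for illustration only.

```r
library(Matrix)
library(DelayedArray)      # attached so that its coercion methods are available
library(alabaster.matrix)

# Recreate a small sparse matrix and save it, as in the quick start.
x <- rsparsematrix(50, 10, density=0.1)
path <- tempfile()
saveObject(x, path)

# Reload as a file-backed matrix, then realize it in memory as a dgCMatrix.
reloaded <- readObject(path)
sparse <- as(reloaded, "dgCMatrix")
class(sparse)

# Compare the realized values against the original matrix.
all.equal(as.matrix(sparse), as.matrix(x), check.attributes=FALSE)
```

Realizing the full matrix in memory is only sensible when it fits comfortably in RAM; otherwise, it may be preferable to keep working with the file-backed object.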

This process is supported for all base arrays, Matrix objects and DelayedArray objects.
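
As an illustration of the same workflow on a base array, here is a small sketch using an ordinary 3-dimensional array. The names arr, arr.dir and arr.back are hypothetical, and the exact files written to the directory may differ from those shown for the sparse matrix above.

```r
library(DelayedArray)      # attached so that array methods for the reloaded object are available
library(alabaster.matrix)

# An ordinary dense 3-dimensional array.
arr <- array(runif(24), dim=c(2, 3, 4))

# Save it to a fresh directory and list what was written.
arr.dir <- tempfile()
saveObject(arr, arr.dir)
list.files(arr.dir, recursive=TRUE)

# Read it back and realize it as a base array for comparison.
arr.back <- readObject(arr.dir)
dim(arr.back)
all.equal(as.array(arr.back), arr, check.attributes=FALSE)
```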

Saving delayed operations

For DelayedArrays, we may instead choose to save the delayed operations themselves to file. This creates an HDF5 file following the chihaya format, containing the delayed operations rather than the results of their evaluation.

library(DelayedArray)
y <- DelayedArray(rsparsematrix(1000, 100, 0.05))
y <- log1p(abs(y) / 1:100) # adding some delayed ops.

tmp <- tempfile()
saveObject(y, tmp, DelayedArray.preserve.ops=TRUE)

# Inspecting the HDF5 file reveals many delayed operations:
rhdf5::h5ls(file.path(tmp, "array.h5"))
##                            group          name       otype  dclass   dim
## 0                              / delayed_array   H5I_GROUP              
## 1                 /delayed_array        method H5I_DATASET  STRING ( 0 )
## 2                 /delayed_array          seed   H5I_GROUP              
## 3            /delayed_array/seed         along H5I_DATASET INTEGER ( 0 )
## 4            /delayed_array/seed        method H5I_DATASET  STRING ( 0 )
## 5            /delayed_array/seed          seed   H5I_GROUP              
## 6       /delayed_array/seed/seed        method H5I_DATASET  STRING ( 0 )
## 7       /delayed_array/seed/seed          seed   H5I_GROUP              
## 8  /delayed_array/seed/seed/seed     by_column H5I_DATASET INTEGER ( 0 )
## 9  /delayed_array/seed/seed/seed          data H5I_DATASET   FLOAT  5000
## 10 /delayed_array/seed/seed/seed      dimnames   H5I_GROUP              
## 11 /delayed_array/seed/seed/seed       indices H5I_DATASET INTEGER  5000
## 12 /delayed_array/seed/seed/seed        indptr H5I_DATASET INTEGER   101
## 13 /delayed_array/seed/seed/seed         shape H5I_DATASET INTEGER     2
## 14           /delayed_array/seed          side H5I_DATASET  STRING ( 0 )
## 15           /delayed_array/seed         value H5I_DATASET INTEGER  1000
# And indeed, we can recover those same operations.
readObject(tmp)
## <1000 x 100> sparse ReloadedMatrix object of type "double":
##           [,1]   [,2]   [,3] ...       [,99]      [,100]
##    [1,]      0      0      0   .   0.0000000   0.0000000
##    [2,]      0      0      0   .   0.0000000   0.0000000
##    [3,]      0      0      0   .   0.2725684   0.0000000
##    [4,]      0      0      0   .   0.0000000   0.0000000
##    [5,]      0      0      0   .   0.0000000   0.0000000
##     ...      .      .      .   .           .           .
##  [996,]      0      0      0   . 0.000000000 0.000000000
##  [997,]      0      0      0   . 0.000000000 0.000000000
##  [998,]      0      0      0   . 0.006002358 0.000000000
##  [999,]      0      0      0   . 0.000000000 0.000000000
## [1000,]      0      0      0   . 0.000000000 0.008265744

This allows users to avoid evaluation of the operations when saving objects, which may improve efficiency, e.g., by avoiding loss of sparsity or casting to a larger type.
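
To make the point about type casting more concrete, here is a hedged sketch with an integer seed and a delayed division that yields double-precision values. The names counts, scaled, ops.dir and scaled.back are illustrative, and the example assumes that this simple scalar operation can be preserved when saving; it does not measure on-disk sizes.

```r
library(DelayedArray)
library(alabaster.matrix)

# Integer-valued seed wrapped in a DelayedArray.
counts <- DelayedArray(matrix(rpois(2000, lambda=5), nrow=100))
type(counts)

# A delayed division; the result is double-precision once evaluated.
scaled <- counts / 2
type(scaled)

# Save the operation itself rather than the evaluated double-precision values.
ops.dir <- tempfile()
saveObject(scaled, ops.dir, DelayedArray.preserve.ops=TRUE)

# The reloaded object should evaluate to the same values.
scaled.back <- readObject(ops.dir)
all.equal(as.matrix(scaled.back), as.matrix(scaled), check.attributes=FALSE)
```

Here, the integer seed is what needs to be stored, with the division re-applied lazily after reloading.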

Session information

sessionInfo()
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Etc/UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] DelayedArray_0.33.3    SparseArray_1.7.2      S4Arrays_1.7.1        
##  [4] abind_1.4-8            IRanges_2.41.2         S4Vectors_0.45.2      
##  [7] MatrixGenerics_1.19.0  matrixStats_1.4.1      BiocGenerics_0.53.3   
## [10] generics_0.1.3         alabaster.matrix_1.7.4 alabaster.base_1.7.2  
## [13] Matrix_1.7-1           BiocStyle_2.35.0      
## 
## loaded via a namespace (and not attached):
##  [1] jsonlite_1.8.9          compiler_4.4.2          BiocManager_1.30.25    
##  [4] crayon_1.5.3            Rcpp_1.0.13-1           rhdf5filters_1.19.0    
##  [7] jquerylib_0.1.4         yaml_2.3.10             fastmap_1.2.0          
## [10] lattice_0.22-6          XVector_0.47.0          R6_2.5.1               
## [13] knitr_1.49              maketools_1.3.1         bslib_0.8.0            
## [16] rlang_1.1.4             HDF5Array_1.35.2        cachem_1.1.0           
## [19] xfun_0.49               sass_0.4.9              sys_3.4.3              
## [22] cli_3.6.3               Rhdf5lib_1.29.0         zlibbioc_1.52.0        
## [25] digest_0.6.37           grid_4.4.2              alabaster.schemas_1.7.0
## [28] rhdf5_2.51.0            lifecycle_1.0.4         evaluate_1.0.1         
## [31] buildtools_1.0.0        rmarkdown_2.29          tools_4.4.2            
## [34] htmltools_0.5.8.1