Using HDF5-backed matrices with beachmat

Overview

beachmat.hdf5 provides a C++ API to extract numeric data from the HDF5-backed matrices of the HDF5Array package. It extends the beachmat package to the matrix representations in the tatami_hdf5 library. By including this package, users and developers can enable tatami-compatible C++ code to operate natively on file-backed data via the HDF5 C library.

For users

Users can simply load the package in their R session:

library(beachmat.hdf5)

This will automatically extend beachmat’s functionality to HDF5Array matrices. Any package code based on beachmat will now be able to access HDF5 data natively without any further work.
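As a minimal sketch of what this looks like in practice, the snippet below writes a small matrix to a temporary HDF5 file and hands the resulting HDF5Matrix to beachmat::initializeCpp(), which is what beachmat-based packages call internally. The file location, dataset name ("counts") and matrix contents are illustrative assumptions, not part of the package.

```r
library(beachmat.hdf5)
library(HDF5Array)

# Illustrative data: a small dense matrix saved to a temporary HDF5 file.
x <- matrix(runif(200), nrow = 20, ncol = 10)
h5 <- writeHDF5Array(x, filepath = tempfile(fileext = ".h5"), name = "counts")

# With beachmat.hdf5 loaded, initializeCpp() dispatches to the HDF5-aware
# method and returns an external pointer to a C++ matrix representation.
ptr <- beachmat::initializeCpp(h5)
typeof(ptr)
```

Users would not normally call initializeCpp() themselves; it is shown here only to make the dispatch visible.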

For developers

Developers should read the beachmat developer guide if they have not done so already.

Developers can import beachmat.hdf5 in their packages to guarantee native support for HDF5Array classes. Doing so registers additional initializeCpp() methods that initialize the appropriate C++ representations for these classes. This adds more dependencies to the package, which may or may not be acceptable; some developers may prefer to leave the choice to the user or to hide it behind an optional parameter to reduce the installation burden (e.g., if HDF5-backed matrices are not expected to be a common input in the package workflow).
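One way to hide the dependency behind an optional parameter is sketched below. The function name prepare_matrix() and the argument use.hdf5 are hypothetical and exist only for illustration; the sketch assumes beachmat.hdf5 is listed under Suggests rather than Imports in the downstream package.

```r
# Hypothetical helper in a downstream package (names are illustrative).
prepare_matrix <- function(x, use.hdf5 = TRUE) {
    if (use.hdf5) {
        # Loading the namespace is enough to register the extra
        # initializeCpp() methods; nothing needs to be attached.
        # If the package is not installed, beachmat's own methods are used.
        requireNamespace("beachmat.hdf5", quietly = TRUE)
    }
    beachmat::initializeCpp(x)
}

# Works for ordinary matrices as well as HDF5-backed ones.
ptr <- prepare_matrix(matrix(runif(20), nrow = 4, ncol = 5))
```

With this arrangement, users without beachmat.hdf5 installed still get a working function, at the cost of slower access for HDF5-backed inputs.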

Note that beachmat by itself already works with HDF5Matrix, H5SparseMatrix, etc. objects even without loading beachmat.hdf5. However, this is less efficient because package C++ code must call back into R to extract the matrix data via DelayedArray::extract_array() and friends. Importing beachmat.hdf5 provides native support without any calls to R functions.
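For context, the R-level extraction that the fallback relies on can be called directly, as in the rough sketch below. It assumes that the extract_array() generic is available from the packages attached alongside HDF5Array; the file, dataset name and row selection are illustrative.

```r
library(HDF5Array)

# Illustrative HDF5-backed matrix in a temporary file.
x <- matrix(runif(200), nrow = 20, ncol = 10)
h5 <- writeHDF5Array(x, filepath = tempfile(fileext = ".h5"), name = "counts")

# Pull the first five rows and all columns into an ordinary in-memory array.
# A NULL entry in the index list means "take everything" along that dimension.
block <- extract_array(h5, list(1:5, NULL))
dim(block)
```

Each such call goes through the R interpreter, which is the overhead that the native methods in beachmat.hdf5 avoid.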

In-memory caching

The initializeCpp() methods for the HDF5Array classes have an optional memorize= parameter. If this is TRUE, the entire matrix is loaded from the HDF5 file into memory and stored in a global cache on first use. Any subsequent calls to initializeCpp() on the same matrix instance will re-use the cached value.

In-memory caching is intended for functions or workflows that need to iterate through the matrix multiple times. By setting memorize=TRUE, developers can pay an up-front loading cost to avoid the repeated penalty of disk access on subsequent iterations. This assumes that the matrix is small enough to hold in memory.
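A minimal sketch of the caching behaviour described above follows. The data, file path and dataset name are illustrative assumptions, and the comments restate the documented behaviour rather than verify it.

```r
library(beachmat.hdf5)
library(HDF5Array)

# Illustrative HDF5-backed matrix in a temporary file.
x <- matrix(runif(200), nrow = 20, ncol = 10)
h5 <- writeHDF5Array(x, filepath = tempfile(fileext = ".h5"), name = "counts")

# First call: the matrix is read from disk into memory and cached.
ptr1 <- beachmat::initializeCpp(h5, memorize = TRUE)

# Later calls on the same matrix are expected to re-use the cached copy
# instead of reading from the file again.
ptr2 <- beachmat::initializeCpp(h5, memorize = TRUE)
```

Whether the up-front load pays off depends on how many passes the downstream code makes over the matrix.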

For long-running analyses, users may call beachmat::flushMemoryCache() to clear the cache.
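As a rough illustration, a session that processes several datasets in turn might clear the cache after each one so that earlier matrices do not accumulate in memory. The loop, the temporary files and the placeholder analysis step are assumptions made for this sketch.

```r
library(beachmat.hdf5)
library(HDF5Array)

# Two small stand-in datasets; a real analysis would use existing files.
paths <- c(tempfile(fileext = ".h5"), tempfile(fileext = ".h5"))

for (p in paths) {
    mat <- writeHDF5Array(matrix(runif(100), nrow = 10, ncol = 10),
                          filepath = p, name = "counts")
    ptr <- beachmat::initializeCpp(mat, memorize = TRUE)

    # ... run the multi-pass analysis on `ptr` here ...

    # Drop our own reference and clear the global cache before moving on.
    rm(ptr)
    beachmat::flushMemoryCache()
}
```

Because the cache is global, flushing it also discards matrices cached by other code in the same session.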

Session information

sessionInfo()
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Etc/UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] beachmat.hdf5_1.5.1 knitr_1.49          BiocStyle_2.35.0   
## 
## loaded via a namespace (and not attached):
##  [1] Matrix_1.7-1          jsonlite_1.8.9        compiler_4.4.2       
##  [4] BiocManager_1.30.25   crayon_1.5.3          Rcpp_1.0.13-1        
##  [7] rhdf5filters_1.19.0   jquerylib_0.1.4       IRanges_2.41.2       
## [10] yaml_2.3.10           fastmap_1.2.0         lattice_0.22-6       
## [13] R6_2.5.1              XVector_0.47.0        S4Arrays_1.7.1       
## [16] generics_0.1.3        BiocGenerics_0.53.3   DelayedArray_0.33.3  
## [19] MatrixGenerics_1.19.0 maketools_1.3.1       bslib_0.8.0          
## [22] rlang_1.1.4           cachem_1.1.0          HDF5Array_1.35.2     
## [25] xfun_0.49             sass_0.4.9            sys_3.4.3            
## [28] SparseArray_1.7.2     cli_3.6.3             Rhdf5lib_1.29.0      
## [31] zlibbioc_1.52.0       digest_0.6.37         grid_4.4.2           
## [34] rhdf5_2.51.0          lifecycle_1.0.4       S4Vectors_0.45.2     
## [37] evaluate_1.0.1        buildtools_1.0.0      beachmat_2.23.4      
## [40] abind_1.4-8           stats4_4.4.2          rmarkdown_2.29       
## [43] matrixStats_1.4.1     tools_4.4.2           htmltools_0.5.8.1