beachmat.hdf5 provides a C++ API to extract numeric data from HDF5-backed matrices from the HDF5Array package. This extends the beachmat package to the matrix representations in the tatami_hdf5 library. By including this package, users and developers can enable tatami-compatible C++ code to operate natively on file-backed data via the HDF5 C library.
Users can simply load the package in their R session:
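
A minimal sketch of that step, assuming beachmat.hdf5 is already installed (for example, from Bioconductor via BiocManager::install()):

```r
# Attaching the package registers its initializeCpp() methods with beachmat.
library(beachmat.hdf5)
```
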
This will automatically extend beachmat’s functionality to HDF5Array matrices. Any package code based on beachmat will now be able to access HDF5 data natively without any further work.
Developers should read the beachmat developer guide if they have not done so already.
Developers can import beachmat.hdf5 in their packages to guarantee native support for HDF5Array classes. This registers more initializeCpp() methods that initialize the appropriate C++ representations for these classes. Of course, this adds some more dependencies to the package, which may or may not be acceptable; some developers may prefer to leave this choice to the user or hide it behind an optional parameter to reduce the installation burden (e.g., if HDF5-backed matrices are not expected to be a common input in the package workflow).
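
One way to hide the dependency behind an optional parameter is sketched below. This is an illustration only: the helper name enableNativeHdf5() and its use.hdf5= argument are hypothetical and not part of beachmat or beachmat.hdf5. It assumes beachmat.hdf5 is listed under Suggests rather than Imports, so it is loaded only when it happens to be installed.

```r
# Hypothetical helper: the function name and the 'use.hdf5' argument are
# illustrative and are not provided by beachmat or beachmat.hdf5.
enableNativeHdf5 <- function(use.hdf5 = TRUE) {
    if (!use.hdf5) {
        # The caller opted out, so beachmat.hdf5 is never touched.
        return(FALSE)
    }
    # requireNamespace() loads the package if it is installed and reports
    # failure with FALSE rather than an error if it is not.
    isTRUE(requireNamespace("beachmat.hdf5", quietly = TRUE))
}

# Opting out always returns FALSE, regardless of what is installed.
enableNativeHdf5(use.hdf5 = FALSE)
```

If the helper returns FALSE, the calling code can simply proceed as usual and rely on beachmat's default handling of the matrix.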
It’s worth noting that beachmat by itself will already work with HDF5Matrix, H5SparseMatrix, etc. objects even without loading beachmat.hdf5. However, this is less efficient because any package C++ code needs to go back into R to extract the matrix data via DelayedArray::extract_array() and friends. Importing beachmat.hdf5 provides native support without the need for calls to R functions.
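
As a rough sketch of what this looks like from the R side, the example below writes a small matrix to a temporary HDF5 file and initializes its C++ representation. The matrix contents, the dataset name "counts" and the temporary file are made up for illustration; what happens to the resulting pointer depends on the developer's own C++ code, which is not shown.

```r
library(beachmat.hdf5)
library(HDF5Array)

# Write a small dense matrix to a temporary HDF5 file.
# writeHDF5Array() returns an HDF5Matrix pointing at the new dataset.
mat <- matrix(runif(200), nrow = 20, ncol = 10)
hmat <- writeHDF5Array(mat, filepath = tempfile(fileext = ".h5"), name = "counts")

# With beachmat.hdf5 loaded, this dispatches to the native HDF5 method.
# The result is an external pointer intended for tatami-compatible C++ code.
ptr <- beachmat::initializeCpp(hmat)
```
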
The initializeCpp() methods for the HDF5Array classes have an optional memorize= parameter. If this is TRUE, the entire matrix is loaded from the HDF5 file into memory and stored in a global cache on first use. Any subsequent calls to initializeCpp() on the same matrix instance will re-use the cached value.
In-memory caching is intended for functions or workflows that need to iterate through the matrix multiple times. By setting memorize=TRUE, developers can pay an up-front loading cost to avoid the repeated penalty of disk access on subsequent iterations. Obviously, this assumes that the matrix is still small enough that an in-memory store is feasible.
For long-running analyses, users may call beachmat::flushMemoryCache() to clear the cache.
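
The caching workflow described above might look as follows. This is a sketch under the same assumptions as any toy example: the matrix, the dataset name "counts" and the temporary file are invented for illustration, and the matrix is deliberately small enough to fit in memory.

```r
library(beachmat.hdf5)
library(HDF5Array)

# Create a small HDF5-backed matrix in a temporary file.
mat <- matrix(runif(200), nrow = 20, ncol = 10)
hmat <- writeHDF5Array(mat, filepath = tempfile(fileext = ".h5"), name = "counts")

# First call: the matrix is loaded from the HDF5 file and cached.
ptr1 <- beachmat::initializeCpp(hmat, memorize = TRUE)

# Second call on the same matrix instance: intended to re-use the cache
# rather than reading the file again.
ptr2 <- beachmat::initializeCpp(hmat, memorize = TRUE)

# Clear the global cache once the cached matrices are no longer needed.
beachmat::flushMemoryCache()
```
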
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] beachmat.hdf5_1.5.1 knitr_1.49 BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] Matrix_1.7-1 jsonlite_1.8.9 compiler_4.4.2
## [4] BiocManager_1.30.25 crayon_1.5.3 Rcpp_1.0.13-1
## [7] rhdf5filters_1.19.0 jquerylib_0.1.4 IRanges_2.41.0
## [10] yaml_2.3.10 fastmap_1.2.0 lattice_0.22-6
## [13] R6_2.5.1 XVector_0.47.0 S4Arrays_1.7.1
## [16] generics_0.1.3 BiocGenerics_0.53.1 DelayedArray_0.33.1
## [19] MatrixGenerics_1.19.0 maketools_1.3.1 bslib_0.8.0
## [22] rlang_1.1.4 cachem_1.1.0 HDF5Array_1.35.1
## [25] xfun_0.49 sass_0.4.9 sys_3.4.3
## [28] SparseArray_1.7.1 cli_3.6.3 Rhdf5lib_1.29.0
## [31] zlibbioc_1.52.0 digest_0.6.37 grid_4.4.2
## [34] rhdf5_2.51.0 lifecycle_1.0.4 S4Vectors_0.45.0
## [37] evaluate_1.0.1 buildtools_1.0.0 beachmat_2.23.0
## [40] abind_1.4-8 stats4_4.4.2 rmarkdown_2.29
## [43] matrixStats_1.4.1 tools_4.4.2 htmltools_0.5.8.1