h5mread
is an R/Bioconductor package that allows fast and memory-efficient
loading of HDF5 data into R. The main function in the package is
h5mread()
which allows reading arbitrary data from an HDF5
dataset into R, similarly to what the h5read()
function
from the rhdf5
package does. In the case of h5mread()
, the implementation
has been optimized to make it as fast and memory-efficient as
possible.
In addition to the h5mread()
function, the package also
provides the following low-level functionality:
h5dim()
and h5chunkdim()
;
Utility functions to manipulate the dimnames of an HDF5 dataset;
H5File objects to facilitate access to remote HDF5 files like files stored in an Amazon S3 bucket.
Note that the primary use case for the h5mread package is to support higher-level functionality implemented in the HDF5Array package.
Like any other Bioconductor package, h5mread
should always be installed with BiocManager::install()
:
if (!require("BiocManager", quietly=TRUE))
install.packages("BiocManager")
BiocManager::install("h5mread")
Load the package:
h5mread()
is an efficient and flexible alternative to
rhdf5::h5read()
.
Note that we’ll use writeHDF5Array()
from the HDF5Array
package to conveniently create the HDF5 datasets used in the examples
below.
Create a 70,000 x 1,500 random dataset:
set.seed(2009)
m0 <- matrix(runif(105e6), ncol=1500) # 70,000 x 1,500 matrix
temp0_h5 <- tempfile(fileext=".h5")
HDF5Array::writeHDF5Array(m0, temp0_h5, "m0", chunkdim=c(100, 100))
Load 1,000 random rows from the HDF5 dataset:
## group name otype dclass dim
## 0 / m0 H5I_DATASET FLOAT 70000 x 1500
nrow0 <- h5dim(temp0_h5, "m0")[[1L]]
starts <- list(sample(nrow0, 1000), NULL)
m <- h5mread(temp0_h5, "m0", starts=starts)
See ?h5mread
for more information and additional
examples.
Sanity check:
The HDF5 format doesn’t natively support sparse data representation. However, because it stores the data in compressed chunks, sparse data gets efficiently compressed in the HDF5 file.
The more sparse the data, the smaller the resulting HDF5 file:
a1 <- poissonSparseArray(c(6100, 960, 75), density=0.5)
a2 <- poissonSparseArray(c(6100, 960, 75), density=0.05)
temp1_h5 <- tempfile(fileext=".h5")
HDF5Array::writeHDF5Array(a1, temp1_h5, "a1", chunkdim=c(50, 40, 5), level=5)
temp2_h5 <- tempfile(fileext=".h5")
HDF5Array::writeHDF5Array(a2, temp2_h5, "a2", chunkdim=c(50, 40, 5), level=5)
## [1] 128949150
## [1] 38533754
However, the small size on disk won’t translate into a small size in memory if the data gets loaded back as an ordinary array:
## 1756800224 bytes
To keep memory usage as low as possible, h5mread()
can
load the data in a SparseArray derivative from the SparseArray
package. This is achieved by setting its as.sparse
argument
to TRUE
:
## 351486208 bytes
Note that the data is loaded as a COO_SparseArray object:
## [1] "COO_SparseArray"
## attr(,"package")
## [1] "SparseArray"
See ?h5mread
for more information and additional
examples.
Sanity checks:
The COO_SparseArray class is one of the two SparseArray concrete subclasses defined in the SparseArray package, the other one being SVT_SparseArray. Note that the latter tends to be even more memory-efficient than the former and to achieve better performance overall.
Use coercion to switch back and forth between the two representations:
## [1] "SVT_SparseArray"
## attr(,"package")
## [1] "SparseArray"
About half the memory footprint of the COO_SparseArray representation:
## 188106768 bytes
See ?SparseArray
in the SparseArray
package for more information.
Two convenience functions to obtain the dimensions of an HDF5 dataset as well as the dimensions of its chunks.
See ?h5dim
for more information and some examples.
A small set of low-level utilities is provided to manipulate the dimnames of an HDF5 dataset.
See ?h5writeDimnames
for more information and some
examples.
Use an H5File object to access an HDF5 file stored in an Amazon S3 bucket.
See ?H5File
for more information and some examples.
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.2 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] h5mread_0.99.4 SparseArray_1.7.6 S4Arrays_1.7.3
## [4] IRanges_2.41.3 abind_1.4-8 S4Vectors_0.45.4
## [7] MatrixGenerics_1.19.1 matrixStats_1.5.0 Matrix_1.7-2
## [10] BiocGenerics_0.53.6 generics_0.1.3 rhdf5_2.51.2
## [13] BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] jsonlite_1.9.0 compiler_4.4.2 BiocManager_1.30.25
## [4] crayon_1.5.3 rhdf5filters_1.19.1 jquerylib_0.1.4
## [7] yaml_2.3.10 fastmap_1.2.0 lattice_0.22-6
## [10] R6_2.6.1 XVector_0.47.2 knitr_1.49
## [13] DelayedArray_0.33.6 maketools_1.3.2 bslib_0.9.0
## [16] rlang_1.1.5 cachem_1.1.0 HDF5Array_1.35.15
## [19] xfun_0.51 sass_0.4.9 sys_3.4.3
## [22] cli_3.6.4 Rhdf5lib_1.29.0 digest_0.6.37
## [25] grid_4.4.2 lifecycle_1.0.4 evaluate_1.0.3
## [28] buildtools_1.0.0 rmarkdown_2.29 tools_4.4.2
## [31] htmltools_0.5.8.1