Title: | A fast HDF5 reader |
---|---|
Description: | The main function in the h5mread package is h5mread(), which allows reading arbitrary data from an HDF5 dataset into R, similarly to what the h5read() function from the rhdf5 package does. In the case of h5mread(), the implementation has been optimized to make it as fast and memory-efficient as possible. |
Authors: | Hervé Pagès [aut, cre] |
Maintainer: | Hervé Pagès <[email protected]> |
License: | Artistic-2.0 |
Version: | 0.99.4 |
Built: | 2025-02-21 03:17:56 UTC |
Source: | https://github.com/bioc/h5mread |
Two convenience functions to obtain the dimensions of an HDF5 dataset as well as the dimensions of its chunks.
h5dim(filepath, name, as.integer=TRUE)
h5chunkdim(filepath, name, adjust=FALSE)
filepath |
The path (as a single string) to an HDF5 file. |
name |
The name (as a single string) of a dataset in the HDF5 file. |
as.integer |
By default, the dimensions are returned in an integer vector. Use as.integer=FALSE to return them in a numeric vector instead (useful when some dimensions might exceed .Machine$integer.max). |
adjust |
By default, the chunk dimensions are returned as-is, i.e. as stored in the HDF5 file. You can use adjust=TRUE to adjust them so that they don't extend past the dimensions of the dataset. |
An integer (or numeric) vector of length the number of dimensions of the HDF5 dataset.
test_h5 <- system.file("extdata", "test.h5", package="h5mread")
h5ls(test_h5)

h5dim(test_h5, "m2")
h5chunkdim(test_h5, "m2")

h5dim(test_h5, "a3")
h5chunkdim(test_h5, "a3")
The H5File class provides a formal representation of an HDF5 file (local or remote).
## Constructor function:
H5File(filepath, s3=FALSE, s3credentials=NULL, .no_rhdf5_h5id=FALSE)
filepath |
A single string specifying the path or URL to an HDF5 file. |
s3 |
TRUE or FALSE (the default). Should filepath be treated as the URL to a file stored in an Amazon S3 bucket, rather than the path to a local file? |
s3credentials |
A list of length 3, providing the credentials for accessing
files stored in a private Amazon S3 bucket.
See H5Pset_fapl_ros3 in the rhdf5 package for more information. |
.no_rhdf5_h5id |
For internal use only. Don't use. |
IMPORTANT NOTE ABOUT H5File OBJECTS AND PARALLEL EVALUATION
The short story is that H5File objects cannot be used in the context of parallel evaluation at the moment.
Here is why:
H5File objects contain an identifier to an open connection to the HDF5 file. This identifier becomes invalid in the following two situations:

- After serialization/deserialization, that is, after loading a serialized H5File object with readRDS() or load().

- In the context of parallel evaluation, when using the SnowParam parallelization backend. This is because, unlike the MulticoreParam backend, which uses a system fork, the SnowParam backend uses serialization/deserialization to transmit the object to the workers.
In both cases, the connection to the file is lost and any attempt to read data from the H5File object will fail. Note that the above also happens to any H5File object that got serialized indirectly, i.e., as part of a bigger object. For example, if an HDF5Array object was constructed from an H5File object, then it contains the H5File object, and therefore blockApply(..., BPPARAM=SnowParam(4)) cannot be used on it.
Furthermore, even though an H5File object may sometimes seem to work fine with the MulticoreParam parallelization backend, this is highly unreliable and must be avoided.
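A common workaround, sketched below (this is not an official part of the H5File API, just standard practice), is to transmit the file path, a plain string that serializes safely, and let each worker call h5mread() on the path directly, so that no open HDF5 connection is ever serialized:

```r
library(BiocParallel)

## Each worker receives the path (a plain string, safe to serialize)
## and opens/closes its own connection to the file via h5mread():
FUN <- function(i, filepath, name)
    sum(h5mread::h5mread(filepath, name, list(i, NULL)))

test_h5 <- system.file("extdata", "test.h5", package="h5mread")

## Works with SnowParam because only the string is transmitted:
res <- bplapply(1:150, FUN, test_h5, "m2", BPPARAM=SnowParam(3))
```

The cost is that each worker opens its own connection, but this is what makes the approach robust to serialization.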
An H5File object.
H5Pset_fapl_ros3 in the rhdf5 package for
detailed information about how to pass your S3 credentials
to the s3credentials
argument.
The HDF5Array class defined in the HDF5Array package for representing and operating on a conventional (a.k.a. dense) HDF5 dataset.
The H5SparseMatrix class defined in the HDF5Array package for representing and operating on an HDF5 sparse matrix.
The H5ADMatrix class defined in the HDF5Array
package for representing and operating on the central matrix of an
h5ad
file, or any matrix in its /layers
group.
The TENxMatrix class defined in the HDF5Array package for representing and operating on a 10x Genomics dataset.
The h5mread
function in this package (h5mread)
that is used internally by HDF5Array,
TENxMatrix, and H5ADMatrix
objects, for (almost) all their data reading needs.
h5ls
to list the content of an HDF5 file.
bplapply
,
MulticoreParam
, and
SnowParam
, in the
BiocParallel package.
## ---------------------------------------------------------------------
## A. BASIC USAGE
## ---------------------------------------------------------------------

## With a local file:
test_h5 <- system.file("extdata", "test.h5", package="h5mread")
h5file1 <- H5File(test_h5)
h5ls(h5file1)
path(h5file1)

h5mread(h5file1, "m2", list(1:10, 1:6))
get_h5mread_returned_type(h5file1, "m2")

## With a file stored in an Amazon S3 bucket:
if (Sys.info()[["sysname"]] != "Darwin") {
    public_S3_url <-
        "https://rhdf5-public.s3.eu-central-1.amazonaws.com/rhdf5ex_t_float_3d.h5"
    h5file2 <- H5File(public_S3_url, s3=TRUE)
    h5ls(h5file2)
    h5mread(h5file2, "a1")
    get_h5mread_returned_type(h5file2, "a1")
}

## ---------------------------------------------------------------------
## B. H5File OBJECTS AND PARALLEL EVALUATION
## ---------------------------------------------------------------------

## H5File objects cannot be used in the context of parallel evaluation
## at the moment!

library(BiocParallel)

FUN1 <- function(i, h5file, name)
    sum(h5mread::h5mread(h5file, name, list(i, NULL)))
FUN2 <- function(i, h5file, name)
    sum(h5mread::h5mread(h5file, name, list(i, NULL, NULL)))

## With the SnowParam parallelization backend, the H5File object
## does NOT work on the workers:
## Not run:
## ERROR!
res1 <- bplapply(1:150, FUN1, h5file1, "m2", BPPARAM=SnowParam(3))
## ERROR!
res2 <- bplapply(1:5, FUN2, h5file2, "a1", BPPARAM=SnowParam(3))
## End(Not run)

## With the MulticoreParam parallelization backend, the H5File object
## might seem to work on the workers. However this is highly unreliable
## and must be avoided:
## Not run:
if (.Platform$OS.type != "windows") {
    ## UNRELIABLE!
    res1 <- bplapply(1:150, FUN1, h5file1, "m2", BPPARAM=MulticoreParam(3))
    ## UNRELIABLE!
    res2 <- bplapply(1:5, FUN2, h5file2, "a1", BPPARAM=MulticoreParam(3))
}
## End(Not run)
Like rhdf5::h5ls()
, but works on an
H5File object.
h5ls(file, recursive=TRUE, all=FALSE, datasetinfo=TRUE,
     index_type=h5default("H5_INDEX"), order=h5default("H5_ITER"),
     s3=FALSE, s3credentials=NULL, native=FALSE)
file , recursive , all , datasetinfo , index_type , order , s3 , s3credentials , native
|
See ?rhdf5::h5ls in the rhdf5 package for a description of these arguments. Note that the only difference with rhdf5::h5ls() is that, with h5mread's h5ls(), file can also be an H5File object. |
See ?rhdf5::h5ls
in the rhdf5 package.
test_h5 <- system.file("extdata", "test.h5", package="h5mread")
h5ls(test_h5)

h5file <- H5File(test_h5)
h5ls(h5file)

## See '?H5File' for more examples.
An efficient and flexible alternative to rhdf5::h5read()
.
h5mread(filepath, name, starts=NULL, counts=NULL, noreduce=FALSE,
        as.vector=NA, as.integer=FALSE, as.sparse=FALSE,
        method=0L, use.H5Dread_chunk=FALSE)

get_h5mread_returned_type(filepath, name, as.integer=FALSE)
filepath |
The path (as a single string) to the HDF5 file where the dataset to read from is located, or an H5File object. Note that you must create and use an H5File object if the HDF5 file to access is stored in an Amazon S3 bucket. See ?H5File for more information. Also please note that H5File objects must NOT be used in the context of parallel evaluation at the moment. |
name |
The name of the dataset in the HDF5 file. |
starts , counts
|
The starts and counts arguments are used to specify the array selection. Each of them can be NULL or a list with one list element per dimension in the dataset.

If starts is NULL, the whole dataset is read.

Each list element in starts must be NULL or a vector of valid positive indices along the corresponding dimension in the dataset. A NULL indicates a full selection along the dimension. An empty vector (integer(0)) is accepted and indicates an empty selection along the dimension.

If counts is not NULL, each list element in it must be NULL or a vector of non-negative integers of the same length as the corresponding list element in starts; each start/count pair then selects a contiguous block of positions along the corresponding dimension.

Finally note that when counts is not NULL, the selection is expected to be strictly ascending along each dimension. |
as.vector |
Should the data be returned in a vector instead of an array?
By default (i.e. when set to NA), the data is returned in a vector if the dataset has a single dimension, and in an array otherwise. Set as.vector to TRUE or FALSE to override this. |
as.integer |
If set to TRUE, the data is coerced to the "integer" type while it is read from the file, which is more memory-efficient than reading it first and coercing afterwards. Note that, when as.integer is TRUE, get_h5mread_returned_type() reflects this coercion. |
as.sparse |
By default, h5mread() returns the data in an ordinary (dense) array or vector. Set as.sparse to TRUE to return it in a COO_SparseArray object instead, which reduces memory usage when the data is sparse. |
noreduce , method , use.H5Dread_chunk
|
For testing and advanced usage only. Do not use. |
h5mread()
returns an ordinary array or vector if as.sparse
is FALSE
(the default), and a COO_SparseArray
object if as.sparse
is TRUE
.
get_h5mread_returned_type()
returns the type of the array or
vector that will be returned by h5mread()
.
Equivalent to (but more efficient than):
typeof(h5mread(filepath, name, rep(list(integer(0)), ndim)))
where ndim
is the number of dimensions (a.k.a. rank in
HDF5 jargon) of the dataset.
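For instance, with the 2-dimensional dataset "m2" from the test file shipped with the package (a small sketch of the equivalence; ndim is 2 here):

```r
test_h5 <- system.file("extdata", "test.h5", package="h5mread")

## Type reported without reading any data:
t1 <- get_h5mread_returned_type(test_h5, "m2")

## Same result obtained via an empty read (less efficient):
t2 <- typeof(h5mread(test_h5, "m2", rep(list(integer(0)), 2)))

stopifnot(identical(t1, t2))
```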
H5File objects.
h5read
in the rhdf5 package.
extract_array
in the S4Arrays
package.
COO_SparseArray objects in the SparseArray package.
h5mread_from_reshaped
to read data from a virtually
reshaped HDF5 dataset.
## ---------------------------------------------------------------------
## BASIC EXAMPLES
## ---------------------------------------------------------------------
test_h5 <- system.file("extdata", "test.h5", package="h5mread")
h5ls(test_h5)

m1 <- h5mread(test_h5, "m1")  # 12 x 5 integer matrix
m <- h5mread(test_h5, "m1", starts=list(c(8, 12:7), NULL))
m
## Sanity check:
stopifnot(identical(m1[c(8, 12:7), ], m))

m <- h5mread(test_h5, "m1", starts=list(c(8, 12:7), integer(0)))
m
## Sanity check:
stopifnot(identical(m1[c(8, 12:7), NULL], m))

m2 <- h5mread(test_h5, "m2")  # 4000 x 90 double matrix
m2a <- h5mread(test_h5, "m2", starts=list(31, 1), counts=list(10, 8))
m2a
## Sanity check:
stopifnot(identical(m2[31:40, 1:8], m2a))

m2b <- h5mread(test_h5, "m2", starts=list(31, 1), counts=list(10, 8),
               as.integer=TRUE)
m2b
## Sanity check:
storage.mode(m2a) <- "integer"
stopifnot(identical(m2a, m2b))

a3 <- h5mread(test_h5, "a3")  # 180 x 75 x 4 integer array
starts <- list(c(21, 101), NULL, 3:4)
counts <- list(c( 5,  22), NULL, NULL)
a <- h5mread(test_h5, "a3", starts=starts, counts=counts)
dim(a)
a[1:10, 1:12, ]
## Sanity check:
stopifnot(identical(a3[c(21:25, 101:122), , 3:4, drop=FALSE], a))

## ---------------------------------------------------------------------
## RETURNING THE DATA AS A SPARSE ARRAY
## ---------------------------------------------------------------------
starts <- list(c(21:25, 101:122), NULL, 3:4)
coo <- h5mread(test_h5, "a3", starts=starts, as.sparse=TRUE)
coo
class(coo)  # COO_SparseArray object (see ?COO_SparseArray)
dim(coo)
## Sanity check:
stopifnot(is(coo, "COO_SparseArray"), identical(a, as.array(coo)))

## ---------------------------------------------------------------------
## PERFORMANCE
## ---------------------------------------------------------------------
library(ExperimentHub)
hub <- ExperimentHub()

## With the "sparse" TENxBrainData dataset
## ---------------------------------------
fname0 <- hub[["EH1039"]]
h5ls(fname0)  # all datasets are 1D datasets

index <- list(77 * sample(34088679, 5000, replace=TRUE))
## h5mread() is about 4x faster than h5read():
system.time(a <- h5mread::h5mread(fname0, "mm10/data", index))
system.time(b <- h5read(fname0, "mm10/data", index=index))
stopifnot(identical(a, as.vector(b)))

index <- list(sample(1306127, 7500, replace=TRUE))
## h5mread() is about 14x faster than h5read():
system.time(a <- h5mread::h5mread(fname0, "mm10/barcodes", index))
system.time(b <- h5read(fname0, "mm10/barcodes", index=index))
stopifnot(identical(a, as.vector(b)))

## With the "dense" TENxBrainData dataset
## --------------------------------------
fname1 <- hub[["EH1040"]]
h5ls(fname1)  # "counts" is a 2D dataset

set.seed(33)
index <- list(sample(27998, 300), sample(1306127, 450))
## h5mread() is about 2x faster than h5read():
system.time(a <- h5mread::h5mread(fname1, "counts", index))
system.time(b <- h5read(fname1, "counts", index=index))
stopifnot(identical(a, b))

## Alternatively 'as.sparse=TRUE' can be used to reduce memory usage:
system.time(coo <- h5mread::h5mread(fname1, "counts", index, as.sparse=TRUE))
stopifnot(identical(a, as.array(coo)))

## The bigger the selection, the greater the speedup between
## h5read() and h5mread():
index <- list(sample(27998, 1000), sample(1306127, 1000))
## h5mread() about 4x faster than h5read() (12s vs 48s):
system.time(a <- h5mread::h5mread(fname1, "counts", index))
system.time(b <- h5read(fname1, "counts", index=index))
stopifnot(identical(a, b))

## With 'as.sparse=TRUE' (about the same speed as with 'as.sparse=FALSE'):
system.time(coo <- h5mread::h5mread(fname1, "counts", index, as.sparse=TRUE))
stopifnot(identical(a, as.array(coo)))
An h5mread
wrapper that reads data from a virtually
reshaped HDF5 dataset.
h5mread_from_reshaped(filepath, name, dim, starts, noreduce=FALSE,
                      as.integer=FALSE, method=0L)
filepath |
The path (as a single string) to the HDF5 file where the dataset to read from is located, or an H5File object. Note that you must create and use an H5File object if the HDF5 file to access is stored in an Amazon S3 bucket. See ?H5File for more information. Also please note that H5File objects must NOT be used in the context of parallel evaluation at the moment. |
name |
The name of the dataset in the HDF5 file. |
dim |
A vector of dimensions that describes the virtual reshaping, i.e., the reshaping that is virtually applied upfront to the HDF5 dataset to read from. Note that the HDF5 dataset is treated as read-only, so it never gets effectively reshaped; that is, the dataset dimensions encoded in the HDF5 file are not modified. Also please note that arbitrary reshapings are not supported. Only reshapings that reduce the number of dimensions by collapsing a group of consecutive dimensions into a single dimension are supported. For example, reshaping a 10 x 3 x 5 x 1000 array as a 10 x 15 x 1000 array or as a 150 x 1000 matrix is supported. |
starts |
A multidimensional subsetting index with respect to the reshaped dataset, that is, a list with one list element per dimension in the reshaped dataset. Each list element in starts must be a vector of valid positive indices along the corresponding dimension in the reshaped dataset, or NULL (a NULL indicates a full selection along the dimension). |
noreduce , as.integer , method
|
See ?h5mread in this package for a description of these arguments. |
An array.
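The virtual reshaping follows base R semantics: it is equivalent to assigning the new dimensions to an in-memory copy of the array and then subsetting it. The following base-R sketch (no HDF5 involved; the array mirrors dataset "A1" from the examples below) illustrates the index mapping:

```r
## A 10 x 5 x 7 array, like dataset "A1" in the examples below.
a1 <- array(1:350, c(10, 5, 7))

## Virtually collapse the first 2 dimensions (10 x 5 -> 50):
b <- a1
dim(b) <- c(50, 7)  # no data is moved, only the dim attribute changes

## Reading rows 8:11 of the reshaped dataset corresponds to this
## in-memory subsetting:
x <- b[8:11, ]

## Row 8 of the 50 x 7 matrix maps back to position (i=8, j=1) in the
## original 10 x 5 x 7 array, since the first two dims were collapsed
## in column-major order:
stopifnot(identical(x[1, ], a1[8, 1, ]))
```

h5mread_from_reshaped() performs the same index mapping, but directly on the file, without loading the whole dataset in memory.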
temp_h5 <- tempfile(fileext=".h5")

## ---------------------------------------------------------------------
## BASIC USAGE
## ---------------------------------------------------------------------
a1 <- array(1:350, c(10, 5, 7))
h5write(a1, temp_h5, "A1")

## Collapse the first 2 dimensions:
h5mread_from_reshaped(temp_h5, "A1", dim=c(50, 7),
                      starts=list(8:11, NULL))

## Collapse the last 2 dimensions:
h5mread_from_reshaped(temp_h5, "A1", dim=c(10, 35),
                      starts=list(NULL, 3:11))

a2 <- array(1:150000 + 0.1*runif(150000), c(10, 3, 5, 1000))
h5write(a2, temp_h5, name="A2")

## Collapse the 2nd and 3rd dimensions:
h5mread_from_reshaped(temp_h5, "A2", dim=c(10, 15, 1000),
                      starts=list(NULL, 8:11, 999:1000))

## Collapse the first 3 dimensions:
h5mread_from_reshaped(temp_h5, "A2", dim=c(150, 1000),
                      starts=list(71:110, 999:1000))
h5writeDimnames
and h5readDimnames
can be used to
write/read the dimnames of an HDF5 dataset to/from the HDF5 file.
Note that h5writeDimnames
is used internally by
HDF5Array::writeHDF5Array(x, ..., with.dimnames=TRUE)
to write the dimnames of x
to the HDF5 file together
with the array data.
set_h5dimnames
and get_h5dimnames
are low-level
utilities that can be used to attach existing HDF5 datasets
along the dimensions of a given HDF5 dataset, or to retrieve
the names of the HDF5 datasets that are attached along the
dimensions of a given HDF5 dataset.
h5writeDimnames(dimnames, filepath, name, group=NA, h5dimnames=NULL)
h5readDimnames(filepath, name, as.character=FALSE)

set_h5dimnames(filepath, name, h5dimnames, dry.run=FALSE)
get_h5dimnames(filepath, name)
dimnames |
The dimnames to write to the HDF5 file. Must be supplied as a list (possibly named) with one list element per dimension in the HDF5 dataset specified via the name argument. |
filepath |
For h5writeDimnames() and set_h5dimnames(), the path (as a single string) to the HDF5 file where the dimnames should be written or set. For h5readDimnames() and get_h5dimnames(), the path (as a single string) to the HDF5 file to read the dimnames from. |
name |
For h5writeDimnames() and set_h5dimnames(), the name of the dataset in the HDF5 file whose dimnames should be written or set. For h5readDimnames() and get_h5dimnames(), the name of the dataset in the HDF5 file whose dimnames should be retrieved. |
group |
The name of the HDF5 group where to write the dimnames. When group is set to NA (the default), the group name is generated automatically from name. See the examples below for how to take control of this. |
h5dimnames |
For h5writeDimnames(), the names of the HDF5 datasets where to write the dimnames. If set to NULL (the default), these names are generated automatically. For set_h5dimnames(), the names of the existing HDF5 datasets to attach along the dimensions of dataset name. |
as.character |
Even though the dimnames of an HDF5 dataset are usually stored as datasets of type "character" in the HDF5 file, this is not a requirement. Set as.character to TRUE to make sure that the dimnames returned by h5readDimnames() are character vectors. See the last example below. |
dry.run |
When set to TRUE, set_h5dimnames() doesn't modify the HDF5 file; it only checks that the operation is valid and reports what it would do. |
h5writeDimnames
and set_h5dimnames
return nothing.
h5readDimnames
returns a list (possibly named) with one list
element per dimension in HDF5 dataset name
and containing its
dimnames retrieved from the file.
get_h5dimnames
returns a character vector containing the names
of the HDF5 datasets that are currently set as the dimnames of the
dataset specified in name
. The vector has one element per
dimension in dataset name
. NAs in the vector indicate dimensions
along which nothing is set.
writeHDF5Array
in the HDF5Array
package for a high-level function to write an array-like object
and its dimnames to an HDF5 file.
h5write
in the rhdf5 package that
h5writeDimnames
uses internally to write the dimnames
to the HDF5 file.
h5mread
in this package (h5mread) that
h5readDimnames
uses internally to read the dimnames
from the HDF5 file.
h5ls
to list the content of an HDF5 file.
## ---------------------------------------------------------------------
## BASIC EXAMPLE
## ---------------------------------------------------------------------
library(rhdf5)  # for h5write()

m0 <- matrix(1:60, ncol=5)
colnames(m0) <- LETTERS[1:5]
h5file <- tempfile(fileext=".h5")
h5write(m0, h5file, "M0")  # h5write() ignores the dimnames
h5ls(h5file)

h5writeDimnames(dimnames(m0), h5file, "M0")
h5ls(h5file)

get_h5dimnames(h5file, "M0")
h5readDimnames(h5file, "M0")

## Reconstruct 'm0' from HDF5 file:
m1 <- h5mread(h5file, "M0")
dimnames(m1) <- h5readDimnames(h5file, "M0")
stopifnot(identical(m0, m1))

## Create an HDF5Array object that points to HDF5 dataset M0:
HDF5Array::HDF5Array(h5file, "M0")

## Sanity checks:
stopifnot(
    identical(dimnames(m0), h5readDimnames(h5file, "M0")),
    identical(dimnames(m0), dimnames(HDF5Array::HDF5Array(h5file, "M0")))
)

## ---------------------------------------------------------------------
## SHARED DIMNAMES
## ---------------------------------------------------------------------
## If a collection of HDF5 datasets share the same dimnames, the
## dimnames only need to be written once in the HDF5 file. Then they
## can be attached to the individual datasets with set_h5dimnames():

h5write(array(runif(240), c(12, 5:4)), h5file, "A1")
set_h5dimnames(h5file, "A1", get_h5dimnames(h5file, "M0"))
get_h5dimnames(h5file, "A1")
h5readDimnames(h5file, "A1")
HDF5Array::HDF5Array(h5file, "A1")

h5write(matrix(sample(letters, 60, replace=TRUE), ncol=5), h5file, "A2")
set_h5dimnames(h5file, "A2", get_h5dimnames(h5file, "M0"))
get_h5dimnames(h5file, "A2")
h5readDimnames(h5file, "A2")
HDF5Array::HDF5Array(h5file, "A2")

## Sanity checks:
stopifnot(identical(dimnames(m0), h5readDimnames(h5file, "A1")[1:2]))
stopifnot(identical(dimnames(m0), h5readDimnames(h5file, "A2")))

## ---------------------------------------------------------------------
## USE h5writeDimnames() AFTER A CALL TO writeHDF5Array()
## ---------------------------------------------------------------------
## After calling writeHDF5Array(x, ..., with.dimnames=FALSE) the
## dimnames on 'x' can still be written to the HDF5 file by doing the
## following:

## 1. Write 'm0' to the HDF5 file and ignore the dimnames (for now):
HDF5Array::writeHDF5Array(m0, h5file, "M2", with.dimnames=FALSE)

## 2. Use h5writeDimnames() to write 'dimnames(m0)' to the file and
##    associate them with the "M2" dataset:
h5writeDimnames(dimnames(m0), h5file, "M2")

## 3. Use the HDF5Array() constructor to make an HDF5Array object that
##    points to the "M2" dataset:
HDF5Array::HDF5Array(h5file, "M2")

## Note that at step 2. you can use the extra arguments of
## h5writeDimnames() to take full control of where the dimnames
## should be stored in the file:
HDF5Array::writeHDF5Array(m0, h5file, "M3", with.dimnames=FALSE)
h5writeDimnames(dimnames(m0), h5file, "M3",
                group="a_secret_place", h5dimnames=c("NA", "M3_dim2"))
h5ls(h5file)

## h5readDimnames() and HDF5Array() still "find" the dimnames:
h5readDimnames(h5file, "M3")
HDF5Array::HDF5Array(h5file, "M3")

## Sanity checks:
stopifnot(
    identical(dimnames(m0), h5readDimnames(h5file, "M3")),
    identical(dimnames(m0), dimnames(HDF5Array::HDF5Array(h5file, "M3")))
)

## ---------------------------------------------------------------------
## STORE THE DIMNAMES AS NON-CHARACTER TYPES
## ---------------------------------------------------------------------
HDF5Array::writeHDF5Array(m0, h5file, "M4", with.dimnames=FALSE)
dimnames <- list(1001:1012, as.raw(11:15))
h5writeDimnames(dimnames, h5file, "M4")
h5ls(h5file)

h5readDimnames(h5file, "M4")
h5readDimnames(h5file, "M4", as.character=TRUE)

## Sanity checks:
stopifnot(identical(dimnames, h5readDimnames(h5file, "M4")))
dimnames(m0) <- dimnames
stopifnot(identical(
    dimnames(m0),
    h5readDimnames(h5file, "M4", as.character=TRUE)
))