Reading HDF5 Files In The Cloud

The rhdf5 package provides limited support for read-only access to HDF5 files stored in Amazon S3 buckets. This is implemented via the HDF5 S3 Virtual File Driver and allows access to HDF5 files hosted in both public and private S3 buckets.

Currently only the functions h5ls(), h5dump() and h5read() are supported.

library(rhdf5)

Public S3 Buckets

To access a file in a public Amazon S3 bucket, provide the file’s URL to the file argument. You also need to set the argument s3 = TRUE; otherwise h5ls() will treat the URL as a path on the local disk and fail.

public_S3_url <- "https://rhdf5-public.s3.eu-central-1.amazonaws.com/h5ex_t_array.h5"
h5ls(file = public_S3_url,
     s3 = TRUE)
##   group name       otype dclass dim
## 0     /  DS1 H5I_DATASET  ARRAY   4

The same arguments are also valid for using h5dump() to retrieve the contents of a file.

public_S3_url <- "https://rhdf5-public.s3.eu-central-1.amazonaws.com/h5ex_t_cmpd.h5"
h5dump(file = public_S3_url,
       s3 = TRUE)
## $DS1
##   Serial number          Location Temperature (F) Pressure (inHg)
## 1          1153 Exterior (static)           53.23           24.57
## 2          1184            Intake           55.12           22.95
## 3          1027   Intake manifold          103.55           31.23
## 4          1313  Exhaust manifold         1252.89           84.11
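Because omitting s3 = TRUE makes these functions treat a URL as a local path, it can be convenient to derive the flag from the form of the file argument. The following is a minimal sketch of that idea, not part of rhdf5: the helper names is_remote_h5() and h5ls_auto() are hypothetical, and the URL test is a simple prefix check that assumes remote files are always given as http or https URLs.

```r
# Hypothetical helper (not part of rhdf5): TRUE when `file` looks like an
# http(s) URL, FALSE for anything else, such as a local path.
is_remote_h5 <- function(file) {
    grepl("^https?://", file, ignore.case = TRUE)
}

# Hypothetical wrapper: forwards to rhdf5::h5ls() and sets `s3` based on the
# form of `file`. Further arguments (e.g. s3credentials) pass through `...`.
h5ls_auto <- function(file, ...) {
    rhdf5::h5ls(file = file, s3 = is_remote_h5(file), ...)
}

is_remote_h5("https://rhdf5-public.s3.eu-central-1.amazonaws.com/h5ex_t_cmpd.h5")
## [1] TRUE
is_remote_h5("data/h5ex_t_cmpd.h5")
## [1] FALSE
```

With this in place, h5ls_auto(public_S3_url) would behave like h5ls(public_S3_url, s3 = TRUE), while a local path would be opened as usual. The same pattern could be applied to h5dump() and h5read().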

In addition to examining and reading whole files, we can also extract just a subset, without needing to read or download the entire file. In the example below we use h5ls() to examine a file in an S3 bucket and identify the name of a dataset within it (a1) and the number of dimensions for that dataset (3). We can then use h5read() along with the name and index arguments to read only a subset of the dataset into our R session.

public_S3_url <- 'https://rhdf5-public.s3.eu-central-1.amazonaws.com/rhdf5ex_t_float_3d.h5'
h5ls(file = public_S3_url, s3 = TRUE)
##   group name       otype dclass        dim
## 0     /   a1 H5I_DATASET  FLOAT 5 x 10 x 2
h5read(public_S3_url, 
       name = "a1", 
       index = list(1:2, 3, NULL),
       s3 = TRUE)
## , , 1
## 
##           [,1]
## [1,] 0.2444485
## [2,] 0.3873723
## 
## , , 2
## 
##           [,1]
## [1,] 0.7906603
## [2,] 0.3274960
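To make the index argument easier to read, it may help to compare it with ordinary array indexing in R. The sketch below uses a local stand-in array with the same 5 x 10 x 2 shape as a1; the name a1_local and its values are made up for illustration and are not the contents of the remote file. Each element of the index list selects along one dimension, and NULL selects everything along that dimension, so the result has the same 2 x 1 x 2 shape as the h5read() output above.

```r
# Local stand-in with the same shape as the remote dataset a1 (5 x 10 x 2).
# The values are arbitrary; only the indexing pattern matters here.
a1_local <- array(seq_len(5 * 10 * 2), dim = c(5, 10, 2))

# Equivalent of index = list(1:2, 3, NULL):
# rows 1 to 2, column 3, and everything along the third dimension.
subset_local <- a1_local[1:2, 3, , drop = FALSE]

dim(subset_local)
## [1] 2 1 2

# Number of elements selected, compared with the whole dataset
length(subset_local)
## [1] 4
length(a1_local)
## [1] 100
```

Here only 4 of the 100 elements are selected, which illustrates why reading a subset can be much cheaper than retrieving a whole dataset.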

Private S3 Buckets

To access files in a private Amazon S3 bucket you will need to provide three additional details: the AWS region where the files are hosted, your AWS access key ID, and your AWS secret access key. More information on how to obtain AWS access keys can be found under AWS Security Credentials.

These three values need to be stored in a list, as shown below. Important note: for now they must be in this specific order (region, access key ID, secret access key).

## these are example credentials and will not work
s3_cred <- list(
    aws_region = "eu-central-1",
    access_key_id = "AKIAIOSFODNN7EXAMPLE",
    secret_access_key = "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
)

Finally we pass this list to h5ls() via the s3credentials argument.

private_S3_url <- "https://rhdf5-private.s3.eu-central-1.amazonaws.com/h5ex_t_array.h5"
h5ls(file = private_S3_url,
     s3 = TRUE,
     s3credentials = s3_cred)

The s3credentials argument is used in exactly the same way for h5dump() and h5read().
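Since the order of the list matters and real keys should generally not be written into scripts, one option is to assemble the list from environment variables. This is a sketch under stated assumptions rather than an rhdf5 feature: the helper s3_cred_from_env() is hypothetical, and the variable names AWS_DEFAULT_REGION, AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY are borrowed from the AWS CLI convention; rhdf5 is not documented here as reading them itself.

```r
# Hypothetical helper (not part of rhdf5): build the credentials list from
# environment variables, keeping the order region, key ID, secret key.
# The variable names follow the AWS CLI convention; that choice is an assumption.
s3_cred_from_env <- function() {
    vars <- c(aws_region        = "AWS_DEFAULT_REGION",
              access_key_id     = "AWS_ACCESS_KEY_ID",
              secret_access_key = "AWS_SECRET_ACCESS_KEY")
    values <- Sys.getenv(vars, unset = NA)
    missing <- vars[is.na(values)]
    if (length(missing) > 0) {
        stop("Unset environment variable(s): ", paste(missing, collapse = ", "))
    }
    # The list elements follow the order of `vars`, i.e. the required order
    stats::setNames(as.list(unname(values)), names(vars))
}

# Example values only (these will not work); in practice the variables would
# be set outside the script, e.g. in the shell or in .Renviron.
Sys.setenv(AWS_DEFAULT_REGION    = "eu-central-1",
           AWS_ACCESS_KEY_ID     = "AKIAIOSFODNN7EXAMPLE",
           AWS_SECRET_ACCESS_KEY = "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY")

s3_cred <- s3_cred_from_env()
```

The resulting s3_cred can then be passed to the s3credentials argument of h5ls(), h5dump() or h5read() in the same way as the hand-written list.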

Session Info

sessionInfo()
## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Etc/UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] BiocParallel_1.41.0 ggplot2_3.5.1       dplyr_1.1.4        
## [4] rhdf5_2.51.0        BiocStyle_2.35.0   
## 
## loaded via a namespace (and not attached):
##  [1] bit_4.5.0           gtable_0.3.6        jsonlite_1.8.9     
##  [4] highr_0.11          compiler_4.4.1      BiocManager_1.30.25
##  [7] tidyselect_1.2.1    rhdf5filters_1.18.0 parallel_4.4.1     
## [10] jquerylib_0.1.4     scales_1.3.0        yaml_2.3.10        
## [13] fastmap_1.2.0       R6_2.5.1            labeling_0.4.3     
## [16] generics_0.1.3      knitr_1.48          tibble_3.2.1       
## [19] maketools_1.3.1     munsell_0.5.1       bslib_0.8.0        
## [22] pillar_1.9.0        rlang_1.1.4         utf8_1.2.4         
## [25] cachem_1.1.0        xfun_0.48           sass_0.4.9         
## [28] sys_3.4.3           bit64_4.5.2         cli_3.6.3          
## [31] withr_3.0.2         magrittr_2.0.3      Rhdf5lib_1.28.0    
## [34] digest_0.6.37       grid_4.4.1          lifecycle_1.0.4    
## [37] vctrs_0.6.5         bench_1.1.3         evaluate_1.0.1     
## [40] glue_1.8.0          farver_2.1.2        codetools_0.2-20   
## [43] buildtools_1.0.0    colorspace_2.1-1    fansi_1.0.6        
## [46] rmarkdown_2.28      tools_4.4.1         pkgconfig_2.0.3    
## [49] htmltools_0.5.8.1