Package 'Rarr' reference manual

Title:	Read Zarr Files in R
Description:	The Zarr specification defines a format for chunked, compressed, N-dimensional arrays. Its design allows efficient access to subsets of the stored array, and supports both local and cloud storage systems. Rarr aims to implement this specification in R with minimal reliance on external tools or libraries.
Authors:	Mike Smith [aut, cre]
Maintainer:	Mike Smith <[email protected]>
License:	MIT + file LICENSE
Version:	1.5.0
Built:	2024-06-30 05:27:42 UTC
Source:	https://github.com/bioc/Rarr

Compress and write a single chunk

Description

Compress and write a single chunk

Usage

.compress_and_write_chunk(
  input_chunk,
  chunk_path,
  compressor = use_zlib(),
  data_type_size,
  is_base64 = FALSE
)

Arguments

`input_chunk`	Array containing the chunk data to be compressed. Will be converted to a raw vector before compression.
`chunk_path`	Character string giving the path to the chunk that should be written.
`compressor`	A "compressor" function that returns a list giving the details of the compression tool to apply. See compressors for more details.
`data_type_size`	An integer giving the size of the original datatype. This is passed to the blosc algorithm, which seems to need it to achieve any compression.
`is_base64`	When dealing with Py_unicode strings we convert them to base64 strings for storage in our intermediate R arrays. This argument indicates if base64 is in use, because the conversion to raw in .as_raw should be done differently for base64 strings vs other types.

Value

Returns TRUE if writing is successful. Mostly called for the side-effect of writing the compressed chunk to disk.
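
The steps described above (convert to raw, compress, write to disk) can be sketched with base R alone. This is an illustration under stated assumptions, not the package's internal code: `write_chunk_sketch()` is a hypothetical helper, and `memCompress()` stands in for whichever compressor has been selected.

```r
## Hypothetical helper, not part of Rarr: convert a chunk to raw bytes,
## compress them with base R, and write the result to disk.
write_chunk_sketch <- function(input_chunk, chunk_path) {
  ## serialise the values as little-endian bytes
  raw_chunk <- writeBin(as.vector(input_chunk), raw(), endian = "little")
  ## compress the bytes in memory
  compressed <- memCompress(raw_chunk, type = "gzip")
  ## write the compressed bytes to the chunk file
  writeBin(compressed, chunk_path)
  invisible(TRUE)
}

chunk <- array(1:24, dim = c(2, 3, 4))
chunk_file <- tempfile()
write_chunk_sketch(chunk, chunk_file)

## read the bytes back and confirm the round trip
restored <- memDecompress(
  readBin(chunk_file, what = "raw", n = file.size(chunk_file)),
  type = "gzip"
)
round_trip <- readBin(restored, what = "integer", n = length(chunk), endian = "little")
identical(round_trip, as.vector(chunk))
```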

Create a string of the form `x[idx[[1]], idx[[2]]] <- y` for an array `x` where the number of dimensions is variable.

Description

Create a string of the form x[idx[[1]], idx[[2]]] <- y for an array x where the number of dimensions is variable.

Usage

.create_replace_call(x_name, idx_name, idx_length, y_name)

Arguments

`x_name`	Name of the object to have items replaced
`idx_name`	Name of the list containing the indices
`idx_length`	Length of the list specified in `idx_name`
`y_name`	Name of the object containing the replacement items

Value

A character vector of length one containing the replacement commands. This is expected to be passed to parse() |> eval().
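
To make the idea concrete, here is a minimal sketch of how such a string could be assembled and evaluated. `build_replace_call()` is a hypothetical re-implementation written for this illustration; the package's internal function may differ in its details.

```r
## Hypothetical re-implementation for illustration only.
build_replace_call <- function(x_name, idx_name, idx_length, y_name) {
  ## one "idx[[i]]" term per dimension
  idx <- sprintf("%s[[%d]]", idx_name, seq_len(idx_length))
  sprintf("%s[%s] <- %s", x_name, paste(idx, collapse = ", "), y_name)
}

x <- array(0L, dim = c(2, 3, 4))
idx <- list(1:2, 2L, 4L)
y <- c(10L, 20L)

## build the command for a three-dimensional array and run it
cmd <- build_replace_call("x", "idx", length(idx), "y")
cmd
eval(parse(text = cmd))
x[, 2, 4]
```

Building the call as text avoids writing a separate code path for each possible number of dimensions.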

Decompress a chunk in memory

Description

R has internal decompression tools for zlib, bz2 and lzma compression. We use external libraries bundled with the package for blosc and lz4 decompression.

Usage

.decompress_chunk(compressed_chunk, metadata)

Arguments

`compressed_chunk`	Raw vector holding the compressed bytes for this chunk.
`metadata`	List produced by `read_array_metadata()` with the contents of the `.zarray` file.

Value

An array with the number of dimensions specified in the Zarr metadata. In most cases it will have the same size as the Zarr chunk. However, for edge chunks, which overlap the extent of the array, the returned array will be smaller.
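
As a rough illustration of the built-in tools mentioned in the description, the sketch below round-trips some bytes through base R's `memCompress()` and `memDecompress()`. It assumes the three internal codecs correspond to the types "gzip", "bzip2" and "xz" of those functions; blosc and lz4 are not covered because base R does not provide them.

```r
## example payload: 100 doubles serialised as raw bytes
values <- seq(0, 1, length.out = 100)
raw_values <- writeBin(values, raw())

## compress and decompress in memory with each of R's built-in codecs
codecs <- c("gzip", "bzip2", "xz")
round_trips <- vapply(codecs, function(codec) {
  compressed <- memCompress(raw_values, type = codec)
  identical(memDecompress(compressed, type = codec), raw_values)
}, logical(1))
round_trips
```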

Format the decompressed chunk as an array of the correct type

Description

When a chunk is decompressed it is returned as a vector of raw bytes. This function uses the array metadata to select how to convert the bytes into the final datatype and then converts the resulting output into an array of the appropriate dimensions, including re-ordering if the original data is in row-major order.

Usage

.format_chunk(decompressed_chunk, metadata, alt_chunk_dim)

Arguments

`decompressed_chunk`	Raw vector holding the decompressed bytes for this chunk.
`metadata`	List produced by `read_array_metadata()` holding the contents of the `.zarray` file.
`alt_chunk_dim`	The dimensions of the array that should be created from this chunk. Normally this will be the same as the chunk shape in `metadata`, but when dealing with edge chunks, which may overlap the true extent of the array, the returned array should be smaller than the chunk shape.

Value

A list of length 2. The first element is the formatted chunk data. The second is an integer of length 1, indicating whether warnings were encountered when converting types.

If "chunk_data" is larger than the space remaining in the destination array, i.e. it contains overflowing elements, these will be trimmed when the chunk is returned to read_data().
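
The two steps in the description (bytes to values, then values to an array with the right ordering) might look like the following in base R. This is a sketch under simple assumptions (little-endian 32-bit integers and a 2 x 3 chunk), not the package's implementation.

```r
## six little-endian 32-bit integers, as they might appear after decompression
decompressed <- writeBin(1:6, raw(), size = 4L, endian = "little")

## step 1: convert the raw bytes to the target R type
values <- readBin(decompressed, what = "integer", n = 6L, size = 4L, endian = "little")

## step 2: shape the values into a 2 x 3 array
chunk_dim <- c(2L, 3L)

## column-major ("F") data fills an R array directly
col_major <- array(values, dim = chunk_dim)

## row-major ("C") data: fill with reversed dimensions, then reverse the axes
row_major <- aperm(array(values, dim = rev(chunk_dim)))

col_major
row_major
```

In the row-major case the first row holds the first three stored values, whereas in the column-major case they run down the columns.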

This is a modified version of paws.storage:::get_credentials(). It is included to prevent using the `:::` operator. Look at that function if things stop working.

Description

This is a modified version of paws.storage:::get_credentials(). It is included to prevent using the ::: operator. Look at that function if things stop working.

Usage

.get_credentials(credentials)

Arguments

`credentials`	Content stored at .internal$config$credentials in an object created by paws.storage::s3().

Value

A credentials list to be reinserted into a paws.storage s3 object. If no valid credentials are found this function will error, which is expected and is caught by .check_credentials.

Normalize a Zarr array path

Description

Taken from https://zarr.readthedocs.io/en/stable/spec/v2.html#logical-storage-paths

Usage

.normalize_array_path(path)

Arguments

`path`	Character vector of length 1 giving the path to be normalised.

Value

A character vector of length 1 containing the normalised path.
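
The linked section of the Zarr v2 specification describes normalising a logical path by converting backslashes to forward slashes, stripping leading and trailing slashes, and collapsing repeated slashes. The sketch below follows those rules only. `normalize_logical_path()` is a hypothetical helper, and the package's own function may handle further cases (for example file system paths or URLs) that are not shown here.

```r
## Hypothetical helper following the Zarr v2 logical path rules;
## not the package's internal function.
normalize_logical_path <- function(path) {
  stopifnot(is.character(path), length(path) == 1L)
  path <- gsub("\\", "/", path, fixed = TRUE)  # backslashes to forward slashes
  path <- gsub("/+", "/", path)                # collapse repeated slashes
  path <- gsub("^/|/$", "", path)              # strip leading and trailing slash
  path
}

normalize_logical_path("\\foo//bar\\baz/")
```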

Parse the data type encoding string

Description

Parse the data type encoding string

Usage

.parse_datatype(typestr)

Arguments

`typestr`	The datatype encoding string. This is in the NumPy array typestr format.

Value

A list of length 4 containing the details of the data type.
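
A NumPy typestr such as "<i4" packs a byte-order character, a type character and a byte count into one string. The following sketch shows one way to split it apart. `parse_typestr()` and its field names are hypothetical and return three elements for simplicity; the list returned by the package function has four elements whose names are not documented here.

```r
## Hypothetical parser for simple typestr values such as "<i4" or ">f8".
parse_typestr <- function(typestr) {
  parts <- regmatches(
    typestr,
    regexec("^([<>|])([A-Za-z])([0-9]+)$", typestr)
  )[[1]]
  if (length(parts) != 4L) {
    stop("unrecognised typestr: ", typestr)
  }
  list(
    ## "<" is little-endian, ">" is big-endian, "|" means not applicable
    endian = switch(parts[2], "<" = "little", ">" = "big", "|" = NA_character_),
    kind = parts[3],
    nbytes = as.integer(parts[4])
  )
}

str(parse_typestr("<i4"))
str(parse_typestr("<f8"))
```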

Define compression tool and settings

Description

These functions select a compression tool and its settings when writing a Zarr file.

Usage

use_blosc(cname = "lz4")

use_zlib(level = 6L)

use_gzip(level = 6L)

use_bz2(level = 6L)

use_lzma(level = 9L)

use_lz4()

Arguments

`cname`	Blosc is a 'meta-compressor' providing access to several compression algorithms. This argument defines which compression tool should be used. Valid options are: 'lz4', 'lz4hc', 'blosclz', 'zstd', 'zlib', 'snappy'.
`level`	Specify the compression level to use.

Value

A list containing the details of the selected compression tool. This will be written to the .zarray metadata when the Zarr array is created.

Examples


## define 2 compression filters for blosc (using snappy) and bzip2 (level 5)
blosc_with_snappy_compression <- use_blosc(cname = "snappy")
bzip2_compression <- use_bz2(level = 5)

## create an example array to write to a file
x <- array(runif(n = 1000, min = -10, max = 10), dim = c(10, 20, 5))

## write the array to two files using each compression filter
blosc_path <- tempfile()
bzip2_path <- tempfile()
write_zarr_array(
  x = x, zarr_array_path = blosc_path, chunk_dim = c(2, 5, 1), 
  compressor = blosc_with_snappy_compression
)
write_zarr_array(
  x = x, zarr_array_path = bzip2_path, chunk_dim = c(2, 5, 1), 
  compressor = bzip2_compression
)

## the contents of the two arrays should be the same
identical(read_zarr_array(blosc_path), read_zarr_array(bzip2_path))

## the total sizes of the files on disk are not the same
sum(file.size(list.files(blosc_path, full.names = TRUE)))
sum(file.size(list.files(bzip2_path, full.names = TRUE)))


Create an (empty) Zarr array

Description

Create an (empty) Zarr array

Usage

create_empty_zarr_array(
  zarr_array_path,
  dim,
  chunk_dim,
  data_type,
  order = "F",
  compressor = use_zlib(),
  fill_value,
  nchar,
  dimension_separator = "."
)

Arguments

`zarr_array_path`	Character vector of length 1 giving the path to the new Zarr array.
`dim`	Dimensions of the new array. Should be a numeric vector with the same length as the number of dimensions.
`chunk_dim`	Dimensions of the array chunks. Should be a numeric vector with the same length as the `dim` argument.
`data_type`	Character vector giving the data type of the new array. Currently this is limited to standard R data types. Valid options are: "integer", "double", "character". You can also use the analogous NumPy formats: "<i4", "<f8", "|S". If this argument isn't provided the `fill_value` will be used to determine the datatype.
`order`	Define the layout of the bytes within each chunk. Valid options are 'column', 'row', 'F' & 'C'. 'column' or 'F' will specify "column-major" ordering, which is how R arrays are arranged in memory. 'row' or 'C' will specify "row-major" order.
`compressor`	What (if any) compression tool should be applied to the array chunks. The default is to use `zlib` compression. Supplying `NULL` will disable chunk compression. See compressors for more details.
`fill_value`	The default value for uninitialized portions of the array. Does not have to be provided, in which case the default for the specified data type will be used.
`nchar`	For `data_type = "character"` this parameter gives the maximum length of the stored strings. It is an error not to specify this for a character array, but it is ignored for other data types.
`dimension_separator`	The character used to separate the dimensions in the names of the chunk files. Valid options are limited to "." and "/".

Value

If successful returns (invisibly) TRUE. However this function is primarily called for the side effect of initialising a Zarr array location and creating the .zarray metadata.
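
To illustrate how `dim`, `chunk_dim` and `dimension_separator` interact, the sketch below lists the chunk names that a 10 x 20 array with 2 x 5 chunks would use. It assumes the Zarr v2 convention of zero-based chunk indices joined by the separator, and it does not call any package function.

```r
array_dim <- c(10, 20)
chunk_dim <- c(2, 5)

## number of chunks along each dimension (partial chunks count as one)
n_chunks <- ceiling(array_dim / chunk_dim)

## every zero-based chunk position in the chunk grid
grid <- expand.grid(lapply(n_chunks, function(n) seq_len(n) - 1L))

## chunk names for each of the two valid separators
keys_dot <- do.call(paste, c(grid, sep = "."))
keys_slash <- do.call(paste, c(grid, sep = "/"))

length(keys_dot)
head(keys_dot)
head(keys_slash)
```

With "/" as the separator the chunks end up in nested directories rather than in a single flat directory.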

Examples


new_zarr_array <- file.path(tempdir(), "temp.zarr")
create_empty_zarr_array(new_zarr_array,
  dim = c(10, 20), chunk_dim = c(2, 5),
  data_type = "integer"
)


Determine the size of chunk in bytes

Description

Determine the size of chunk in bytes

Usage

get_chunk_size(datatype, dimensions)

Arguments

`datatype`	A list of details for the array datatype. Expected to be produced by `.parse_datatype()`.
`dimensions`	A list containing the dimensions of the chunk. Expected to be found in a list produced by `read_array_metadata()`.

Value

An integer giving the size of the chunk in bytes
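
Assuming the size in question is the uncompressed size, the calculation amounts to the number of elements in the chunk multiplied by the number of bytes per element. `chunk_size_bytes()` below is a hypothetical stand-in that takes the byte count directly rather than the parsed datatype list.

```r
## Hypothetical stand-in: elements per chunk times bytes per element.
chunk_size_bytes <- function(nbytes, dimensions) {
  prod(unlist(dimensions)) * nbytes
}

## a 2 x 5 x 1 chunk of 4-byte integers
chunk_size_bytes(nbytes = 4L, dimensions = list(2, 5, 1))

## a 10 x 5 chunk of 8-byte doubles
chunk_size_bytes(nbytes = 8L, dimensions = list(10, 5))
```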

Read the .zarray metadata file associated with a Zarr array

Description

Read the .zarray metadata file associated with a Zarr array

Usage

read_array_metadata(path, s3_client = NULL)

Arguments

`path`	A character vector of length 1. This provides the path to a Zarr array or group of arrays. This can either be on a local file system or on S3 storage.
`s3_client`	A list representing an S3 client. This should be produced by `paws.storage::s3()`.

Value

A list containing the array metadata

Read a single Zarr chunk

Description

Read a single Zarr chunk

Usage

read_chunk(
  zarr_array_path,
  chunk_id,
  metadata,
  s3_client = NULL,
  alt_chunk_dim = NULL
)

Arguments

`zarr_array_path`	A character vector of length 1, giving the path to the Zarr array
`chunk_id`	A numeric vector or single data.frame row with length equal to the number of dimensions of a chunk.
`metadata`	List produced by `read_array_metadata()` holding the contents of the `.zarray` file. If missing this function will be called automatically, but it is probably preferable to pass the metadata rather than read it repeatedly for every chunk.
`s3_client`	Object created by `paws.storage::s3()`. Only required for a file on S3. Leave as `NULL` for a file on local storage.
`alt_chunk_dim`	The dimensions of the array that should be created from this chunk. Normally this will be the same as the chunk shape in `metadata`, but when dealing with edge chunks, which may overlap the true extent of the array, the returned array should be smaller than the chunk shape.

Value

A list of length 2. The entries are named "chunk_data" and "warning". The first is an array containing the decompressed chunk values, the second is an integer indicating whether any overflow warnings were generated while reading the chunk into an R datatype.
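
The edge-chunk behaviour described for `alt_chunk_dim` can be sketched as follows. `edge_chunk_dim()` is a hypothetical helper, and it assumes chunk positions are zero-based, as in the chunk file names. It returns the part of a chunk that lies inside the array.

```r
## Hypothetical helper: the extent actually covered by a chunk, given the
## array shape, the chunk shape and the zero-based chunk position.
edge_chunk_dim <- function(array_dim, chunk_dim, chunk_id) {
  start <- chunk_id * chunk_dim        # zero-based offset of the chunk
  pmin(chunk_dim, array_dim - start)   # clip to the extent of the array
}

## a 10 x 7 array stored in 4 x 4 chunks
edge_chunk_dim(c(10, 7), c(4, 4), chunk_id = c(0, 0))  # interior chunk
edge_chunk_dim(c(10, 7), c(4, 4), chunk_id = c(2, 1))  # chunk in the far corner
```

The interior chunk keeps the full 4 x 4 shape, while the corner chunk covers only 2 x 3 elements of the array.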

Read a Zarr array

Description

Read a Zarr array

Usage

read_zarr_array(zarr_array_path, index, s3_client)

Arguments

`zarr_array_path`	Path to a Zarr array. A character vector of length 1. This can either be a location on a local file system or the URI to an array in S3 storage.
`index`	A list of the same length as the number of dimensions in the Zarr array. Each entry in the list provides the indices in that dimension that should be read from the array. Setting a list entry to `NULL` will read everything in the associated dimension. If this argument is missing the entirety of the Zarr array will be read.
`s3_client`	Object created by `paws.storage::s3()`. Only required for a file on S3. Leave as `NULL` for a file on local storage.

Value

An array with the same number of dimensions as the input array. The extent of each dimension will correspond to the length of the values provided to the index argument.

Examples


## Using a local file provided with the package
## This array has 3 dimensions
z1 <- system.file("extdata", "zarr_examples", "row-first",
  "int32.zarr",
  package = "Rarr"
)

## read the entire array
read_zarr_array(zarr_array_path = z1)

## extract values for first 10 rows, all columns, first slice
read_zarr_array(zarr_array_path = z1, index = list(1:10, NULL, 1))


## using a Zarr file hosted on Amazon S3
## This array has a single dimension with length 576
z2 <- "https://power-analysis-ready-datastore.s3.amazonaws.com/power_901_constants.zarr/lon/"

## read the entire array
read_zarr_array(zarr_array_path = z2)

## read alternating elements
read_zarr_array(zarr_array_path = z2, index = list(seq(1, 576, 2)))



Convert special fill values from strings to numbers

Description

Special case fill values (NaN, Inf, -Inf) are encoded as strings in the Zarr metadata. R will create arrays of type character if these are defined and the chunk isn't present on disk. This function updates the fill value to be R's representation of these special values, so numeric arrays are created. A "null" fill value implies no missing values. We set this to NA as you can't create an array of type NULL in R. It should have no impact if there are really no missing values.

Usage

update_fill_value(metadata)

Arguments

`metadata`	A list containing the array metadata. This should normally be generated by running read_json() on the .zarray file.

Value

Returns a list with the same structure as the input. The returned list will be identical to the input, unless the fill_value entry was one of: NULL, "NaN", "Infinity" or "-Infinity".
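
The conversion described above might be sketched like this. `convert_fill_value()` is a hypothetical re-implementation for illustration, operating on a plain list rather than on metadata read from a real `.zarray` file.

```r
## Hypothetical re-implementation: map the special string encodings
## (and a missing value) onto R's own representations.
convert_fill_value <- function(metadata) {
  fv <- metadata$fill_value
  if (is.null(fv)) {
    metadata$fill_value <- NA
  } else if (identical(fv, "NaN")) {
    metadata$fill_value <- NaN
  } else if (identical(fv, "Infinity")) {
    metadata$fill_value <- Inf
  } else if (identical(fv, "-Infinity")) {
    metadata$fill_value <- -Inf
  }
  metadata
}

## special values are converted, ordinary values are left untouched
convert_fill_value(list(dtype = "<f8", fill_value = "-Infinity"))$fill_value
convert_fill_value(list(dtype = "<i4", fill_value = 0L))$fill_value
```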

Update (a subset of) an existing Zarr array

Description

Update (a subset of) an existing Zarr array

Usage

update_zarr_array(zarr_array_path, x, index)

Arguments

`zarr_array_path`	Character vector of length 1 giving the path to the Zarr array that is to be modified.
`x`	The R array (or object that can be coerced to an array) that will be written to the Zarr array.
`index`	A list with the same length as the number of dimensions of the target array. This argument indicates which elements in the target array should be updated.

Value

The function is primarily called for the side effect of writing to disk. Returns (invisibly) TRUE if the array is successfully updated.

Examples


## first create a new, empty, Zarr array
new_zarry_array <- file.path(tempdir(), "new_array.zarr")
create_empty_zarr_array(
  zarr_array_path = new_zarry_array, dim = c(20, 10),
  chunk_dim = c(10, 5), data_type = "double"
)

## create a matrix smaller than our Zarr array
small_matrix <- matrix(runif(6), nrow = 3)

## insert the matrix into the first 3 rows, 2 columns of the Zarr array
update_zarr_array(new_zarry_array, x = small_matrix, index = list(1:3, 1:2))

## reading back a slightly larger subset,
## we can see only the top left corner has been changed
read_zarr_array(new_zarry_array, index = list(1:5, 1:5))


Write an R array to Zarr

Description

Write an R array to Zarr

Usage

write_zarr_array(
  x,
  zarr_array_path,
  chunk_dim,
  order = "F",
  compressor = use_zlib(),
  fill_value,
  nchar,
  dimension_separator = "."
)

Arguments

`x`	The R array (or object that can be coerced to an array) that will be written to the Zarr array.
`zarr_array_path`	Character vector of length 1 giving the path to the new Zarr array.
`chunk_dim`	Dimensions of the array chunks. Should be a numeric vector with the same length as the `dim` argument.
`order`	Define the layout of the bytes within each chunk. Valid options are 'column', 'row', 'F' & 'C'. 'column' or 'F' will specify "column-major" ordering, which is how R arrays are arranged in memory. 'row' or 'C' will specify "row-major" order.
`compressor`	What (if any) compression tool should be applied to the array chunks. The default is to use `zlib` compression. Supplying `NULL` will disable chunk compression. See compressors for more details.
`fill_value`	The default value for uninitialized portions of the array. Does not have to be provided, in which case the default for the specified data type will be used.
`nchar`	For character arrays this parameter gives the maximum length of the stored strings. If this argument is not specified the array provided to `x` will be checked and the length of the longest string found will be used so no data are truncated. However this may be slow and providing a value to `nchar` can provide a modest performance improvement.
`dimension_separator`	The character used to separate the dimensions in the names of the chunk files. Valid options are limited to "." and "/".

Value

The function is primarily called for the side effect of writing to disk. Returns (invisibly) TRUE if the array is successfully written.

Examples


new_zarr_array <- file.path(tempdir(), "integer.zarr")
x <- array(1:50, dim = c(10, 5))
write_zarr_array(
  x = x, zarr_array_path = new_zarr_array,
  chunk_dim = c(2, 5)
)


Print a summary of a Zarr array

Description

When reading a Zarr array using read_zarr_array() it is necessary to know its shape and size. zarr_overview() can be used to get a quick overview of the array shape and contents, based on the .zarray metadata file each array contains.

Usage

zarr_overview(zarr_array_path, s3_client, as_data_frame = FALSE)

Arguments

`zarr_array_path`	A character vector of length 1. This provides the path to a Zarr array or group of arrays. This can either be on a local file system or on S3 storage.
`s3_client`	A list representing an S3 client. This should be produced by `paws.storage::s3()`.
`as_data_frame`	Logical determining whether the Zarr array details should be printed to screen (`FALSE`) or returned as a `data.frame` (`TRUE`) so they can be used computationally.

Details

The function currently prints the following information to the R console:

array path
array shape and size
chunk shape and size
the number of chunks
the datatype of the array
codec used for data compression (if any)

If given the path to a group of arrays the function will attempt to print the details of all sub-arrays in the group.

Value

If as_data_frame = FALSE the function invisibly returns TRUE if successful. However it is primarily called for the side effect of printing details of the Zarr array(s) to the screen. If as_data_frame = TRUE then a data.frame containing details of the array is returned.

Examples


## Using a local file provided with the package
z1 <- system.file("extdata", "zarr_examples", "row-first",
  "int32.zarr",
  package = "Rarr"
)

## print an overview of the array
zarr_overview(zarr_array_path = z1)

## using a file on S3 storage
## don't run this on the BioC Linux build - it's very slow there
is_BBS_linux <- nzchar(Sys.getenv("IS_BIOC_BUILD_MACHINE")) && Sys.info()["sysname"] == "Linux"
if(!is_BBS_linux) {
  z2 <- "https://uk1s3.embassy.ebi.ac.uk/idr/zarr/v0.4/idr0101A/13457539.zarr/1"
  zarr_overview(z2)
}
