Package 'matter'

Title: Out-of-core statistical computing and signal processing
Description: Toolbox for larger-than-memory scientific computing and visualization, providing efficient out-of-core data structures using files or shared memory, for dense and sparse vectors, matrices, and arrays, with applications to nonuniformly sampled signals and images.
Authors: Kylie A. Bemis <[email protected]>
Maintainer: Kylie A. Bemis <[email protected]>
License: Artistic-2.0 | file LICENSE
Version: 2.9.1
Built: 2025-01-04 06:44:34 UTC
Source: https://github.com/bioc/matter

Help Index


Resampling in 1D with Interpolation

Description

Resample the given data at specified points. Interpolation can be performed within a tolerance using several interpolation methods.

Usage

approx1(x, y, xout, interp = "linear", n = length(x),
	tol = NA_real_, tol.ref = "abs", extrap = NA_real_)

Arguments

x, y

The data to be interpolated.

xout

A vector of values where the resampling should take place.

interp

Interpolation method. One of 'none', 'sum', 'mean', 'max', 'min', 'area', 'linear', 'cubic', 'gaussian', or 'lanczos'.

n

If xout is not given, then interpolation is performed at n equally spaced data points along the range of x.

tol

The tolerance for the data points used for interpolation. Must be nonnegative. If NA, then the tolerance is estimated from the maximum differences in x.

tol.ref

If 'abs', then comparison is done by taking the absolute difference. If 'x', then relative differences are used.

extrap

The value to be returned when performing extrapolation, i.e., in the case when there is no data within tol.

Details

The algorithm is implemented in C and provides several fast interpolation methods. Note that interpolation is limited to using data within the given tolerance. This is also used to specify the width for kernel-based interpolation methods such as interp = "gaussian". The use of a tolerance also means that interpolating within the range of the data but where no data points are within the tolerance window is considered extrapolation. This can be useful when resampling sparse signals with large empty regions, by setting extrap = 0, and setting an appropriate tolerance.

Value

A vector of the same length as xout, giving the resampled data.

Author(s)

Kylie A. Bemis

See Also

asearch, approx approx2

Examples

x <- c(1.11, 2.22, 3.33, 5.0, 5.1)
y <- x^1.11

approx1(x, y, 2.22) # 2.42359
approx1(x, y, 3.0) # NA
approx1(x, y, 3.0, tol=0.2, tol.ref="x") # 3.801133

Resampling in 2D with Interpolation

Description

Resample the (potentially scattered) 2D data to a rectilinear grid at the specified coordinates. Interpolation can be performed within a tolerance using several interpolation methods.

Usage

approx2(x, y, z, xout, yout,
	interp = "linear", nx = length(x), ny = length(y),
	tol = NA_real_, tol.ref = "abs", extrap = NA_real_)

Arguments

x, y, z

The data to be interpolated. Alternatively, x can be a matrix, in which case the matrix elements are used for z and x and y are generated from the matrix's dimensions.

xout, yout

The coordinate (grid lines) where the resampling should take place. These are expanded into a rectilinear grid using expand.grid().

interp

Interpolation method. One of 'none', 'sum', 'mean', 'max', 'min', 'area', 'linear', 'cubic', 'gaussian', or 'lanczos'.

nx, ny

If xout is not given, then interpolation is performed at nx * ny equally spaced data points along the range of x and y.

tol

The tolerance for the data points used for interpolation. Must be nonnegative. May be length-2 to have different tolerances for x and y. If NA, then the tolerance is estimated from the maximum differences in x and y.

tol.ref

If 'abs', then comparison is done by taking the absolute difference. If 'x', then relative differences are used.

extrap

The value to be returned when performing extrapolation, i.e., in the case when there is no data within tol.

Details

See approx1 for details of the 1D implementation. The 2D implementation is mostly the same, except it uses a kd-tree to quickly find neighboring points.

Note that interp = "linear" and interp = "cubic" use a kernel-based approximation. Traditionally, bilinear and bicubic interpolation use 4 and 16 neighboring points, respectively. However, to support scattered data, approx2 will use as many points as are found within the given tolerance, and scale the kernels accordingly. If the input data falls on a regular grid already, then the tolerance should be specified accordingly. Set tol equal to the sampling rate for interp = "linear" and twice the sampling rate for interp = "cubic".

Value

A vector of the same length as xout, giving the resampled data.

Author(s)

Kylie A. Bemis

See Also

expand.grid, asearch, approx, approx1

Examples

x <- matrix(1:25, nrow=5, ncol=5)

approx2(x, nx=10, ny=10, interp="cubic") # upsampling

Central Tendency

Description

Calculate the mean, median, or mode of a vector.

Usage

avg(x, center = mean)

Arguments

x

A vector to summarize.

center

A function to use to calculate the central tendency for numeric vectors. Defaults to mean.

Details

Missing values are always removed before calculating the central tendency. The center funtion will be used to calculate the central tendency for any x for which either is.numeric or is.complex is TRUE. Otherwise, the mode is calculated.

Value

A single value summarizing x.

Author(s)

Kylie A. Bemis

See Also

mean, median

Examples

set.seed(1)
x <- sample(LETTERS, 50, replace=TRUE)
y <- runif(50)

avg(x)
avg(y)
avg(y, median)

Peak Processing

Description

Combine peaks from multiple signals.

Usage

# Bin a list of peaks
binpeaks(peaklist, domain = NULL, xlist = peaklist,
    tol = NA_real_, tol.ref = "abs", merge = FALSE,
    na.drop = TRUE)

# Merge peaks
mergepeaks(peaks, n = nobs(peaks), x = peaks,
    tol = NA_real_, tol.ref = "abs",
    na.drop = TRUE)

Arguments

peaklist, xlist

A list of vectors of peak indices (or domain values), and the values to be binned according to the peak locations.

peaks, x

The indices (or domain values) of peaks which should be merged, or for which the corresponding values should be averaged. If n is not provided, this should be a numeric stream_stat vector produced by binpeaks().

domain

The domain variable of the signal.

tol, tol.ref

A tolerance specifying the maximum allowed distance between binned or merged peaks. See bsearch for details. For binpeaks, this is used to determine whether a peak should be binned to a domain value. For mergepeaks, peaks closer than this are merged unless a local minimum in their counts (n) indicates that they should be separate peaks. If missing, binpeaks estimates it as one half the minimum gap between same-signal peaks, and mergepeaks estimates it as one hundredth of the average gap beteen peaks.

merge

Should the binned peaks be merged?

na.drop

Should missing values be dropped from the result?

n

The count of times each peak was observed. This is used to weight the averaging. Local minima in counts are also used to separate distinct peaks that are closer together than tol.

Details

binpeaks() is used to bin a list of peaks from multiple signals to a set of common peaks. The peaks (or their corresponding values) are binned to the given domain values and are averaged within each bin. If domain is not given, then the bins are created from the range of the peak locations and the specified tol.

mergepeaks() is used to merge any peaks with gaps smaller than the given tolerance and whose counts (n) do not indicate that they should be considered separate peaks. The merged peaks are averaged together.

Value

A numeric stream_stat vector, giving the average locations of each peak.

Author(s)

Kylie A. Bemis

Examples

x <- c(0, 1, 1, 2, 3, 2, 1, 4, 5, 1, 1, 0)
y <- c(0, 1, 1, 3, 2, 2, 1, 5, 4, 1, 1, 0)

p1 <- findpeaks(x)
p2 <- findpeaks(y)
binpeaks(list(p1, p2), merge=FALSE)

Binned Summaries of a Vector

Description

Summarize a vector in the bins at the specified indices.

Usage

binvec(x, lower, upper, stat = "sum", prob = 0.5)

Arguments

x

A numeric vector.

lower, upper

The (inclusive) lower and upper indices of the bins

stat

The statistic used to summarize the values in each bin. Must be one of "sum", "mean", "max", "min", "sd", "var", "mad", or "quantile".

prob

The quantile for stat = "quantile".

Value

An numeric vector of the summarized (binned) values.

Author(s)

Kylie A. Bemis

Examples

set.seed(1)

x <- sort(runif(20))

binvec(x, c(1,6,11,16), c(5,10,15,20))

binvec(x, seq(from=1, to=16, by=5), stat="mean")

Binary Search with Approximate Matching

Description

Use a binary search to find approximate matches for the elements of its first argument among those in its second. This implementation allows for returning the index of the nearest match if there are no exact matches. It also allows specifying a tolerance for the comparison.

Usage

bsearch(x, table, tol = 0, tol.ref = "abs",
		nomatch = NA_integer_, nearest = FALSE)

reldiff(x, y, ref = "y")

Arguments

x

A vector of values to be matched. Only integer, numeric, and character vectors are supported.

y, table

A sorted (non-decreasing) vector of values to be matched against. Only integer, numeric, and character vectors are supported.

tol

The tolerance for matching doubles. Must be >= 0.

ref, tol.ref

One of 'abs', 'x', or 'y'. If 'abs', then comparison is done by taking the absolute difference. If either 'x' or 'y', then relative differences are used, and this specifies which to use as the reference (target) value. For strings, this uses the Hamming distance (number of errors), normalized by the length of the reference string for relative differences.

nomatch

The value to be returned in the case when no match is found, coerced to an integer. (Ignored if nearest = TRUE.)

nearest

Should the index of the closest match be returned if no exact matches are found?

Details

The algorithm is implemented in C and currently only works for 'integer', 'numeric', and 'character' vectors. If there are multiple matches, then the first match that is found will be returned, with no guarantees. If a nonzero tolerance is provided, the closest match will be returned.

The "nearest" match for strings when there are no exact matches is decided by the match with the most initial matching characters. Tolerance is ignored for strings and integers. Behavior is undefined and results may be unexpected if values includes NAs.

Value

A vector of the same length as x, giving the indexes of the matches in table.

Author(s)

Kylie A. Bemis

See Also

asearch, match, pmatch, findInterval

Examples

a <- c(1.11, 2.22, 3.33, 5.0, 5.1)

bsearch(2.22, a) # 2
bsearch(3.0, a) # NA
bsearch(3.0, a, nearest=TRUE) # 3
bsearch(3.0, a, tol=0.1, tol.ref="values") # 3

b <- c("hello", "world!")
bsearch("world!", b) # 2
bsearch("worl", b) # NA
bsearch("worl", b, nearest=TRUE) # 2

Calculate Checksums and Cryptographic Hashes

Description

This is a generic function for applying cryptographic hash functions and calculating checksums for externally-stored R objects.

Usage

checksum(x, ...)

## S4 method for signature 'character'
checksum(x, algo = "sha1", ...)

## S4 method for signature 'matter_'
checksum(x, algo = "sha1", ...)

Arguments

x

A file path or an object to be hashed.

algo

The hash function to use.

...

Additional arguments to be passed to the hash function.

Details

The method for matter objects calculates checksums of each of the files in the object's paths.

Value

A character vector giving the hash or hashes of the object.

Author(s)

Kylie A. Bemis

See Also

digest

Examples

x <- matter(1:10)
y <- matter(1:10)

checksum(x)
checksum(y) # should be the same

Apply Functions Over Chunks of a List, Vector, or Matrix

Description

Perform equivalents of apply, lapply, and mapply, but over parallelized chunks of data. This is most useful if accessing the data is potentially time-consuming, such as for file-based matter objects. Operating on chunks reduces the number of I/O operations.

Usage

## Operate on elements/rows/columns
chunkApply(X, MARGIN, FUN, ...,
    simplify = FALSE, outpath = NULL,
    verbose = NA, BPPARAM = bpparam())

chunkLapply(X, FUN, ...,
    simplify = FALSE, outpath = NULL,
    verbose = NA, BPPARAM = bpparam())

chunkMapply(FUN, ...,
    simplify = FALSE, outpath = NULL,
    verbose = NA, BPPARAM = bpparam())


## Operate on complete chunks
chunk_rowapply(X, FUN, ...,
    simplify = "c", depends = NULL, permute = FALSE,
    RNG = FALSE, verbose = NA, chunkopts = list(),
    BPPARAM = bpparam())

chunk_colapply(X, FUN, ...,
    simplify = "c", depends = NULL, permute = FALSE,
    RNG = FALSE, verbose = NA, chunkopts = list(),
    BPPARAM = bpparam())

chunk_lapply(X, FUN, ...,
    simplify = "c", depends = NULL, permute = FALSE,
    RNG = FALSE, verbose = NA, chunkopts = list(),
    BPPARAM = bpparam())

chunk_mapply(FUN, ..., MoreArgs = NULL,
    simplify = "c", depends = NULL, permute = FALSE,
    RNG = FALSE, verbose = NA, chunkopts = list(),
    BPPARAM = bpparam())

Arguments

X

A matrix for chunkApply(), a list or vector for chunkLapply(), or lists for chunkMapply(). These may be any class that implements suitable methods for [, [[, dim, and length().

MARGIN

If the object is matrix-like, which dimension to iterate over. Must be 1 or 2, where 1 indicates rows and 2 indicates columns. The dimension names can also be used if X has dimnames set.

FUN

The function to be applied.

MoreArgs

A list of other arguments to FUN.

...

Additional arguments to be passed to FUN.

simplify

Should the result be simplified into a vector, matrix, or higher dimensional array?

outpath

If non-NULL, a file path where the results should be written as they are processed. If specified, FUN must return a 'raw', 'logical', 'integer', or 'numeric' vector. The result will be returned as a matter object.

verbose

Should user messages be printed with the current chunk being processed? If NA (the default), this is taken from getOption("matter.default.verbose").

chunkopts

An (optional) list of chunk options including nchunks, chunksize, and serialize. See "Details".

depends

A list with length equal to the extent of X. Each element of depends should give a vector of indices which correspond to other elements of X on which each computation depends. These elements are passed to FUN. For time efficiency, no attempt is made to verify these indices are valid.

permute

Should the order of items be randomized? This may be useful for iterating over random subsets. No attempt is made to re-order the results.

RNG

Should the local random seed (as set by set.seed) be forwarded to the worker processes? If RNGkind is set to "L'Ecuyer-CMRG", then the random seed will be set to appropriate substreams for each chunk or for each element/row/column. Note that forwarding the local random seed incurs additional overhead.

BPPARAM

An optional instance of BiocParallelParam. See documentation for bplapply.

Details

For chunkApply(), chunkLapply(), and chunkMapply():

For vectors and lists, the vector is broken into some number of chunks according to chunks. The individual elements of the chunk are then passed to FUN.

For matrices, the matrix is chunked along rows or columns, based on the number of chunks. The individual rows or columns of the chunk are then passed to FUN.

In this way, the first argument of FUN is analogous to using the base apply, lapply, and mapply functions.

For chunk_rowapply(), chunk_colapply(), chunk_lapply(), and chunk_mapply():

In this situation, the entire chunk is passed to FUN, and FUN is responsible for knowing how to handle a sub-vector or sub-matrix of the original object. This may be useful if FUN is already a function that could be applied to the whole object such as rowSums or colSums.

When this is the case, it may be useful to provide a custom simplify function.

For convenience to the programmer, several attributes are made available when operating on a chunk.

  • "chunkid": The index of the chunk currently being processed by FUN.

  • "chunklen": The number of elements in the chunk that should be processed.

  • "index": The indices of the elements of the chunk, as elements/rows/columns in the original matrix/vector.

  • "depends" (optional): If depends is given, then this is a list of indices within the chunk. The length of the list is equal to the number of elements/rows/columns in the chunk. Each list element is either NULL or a vector of indices giving the elements/rows/columns of the chunk that should be processed for that index. The indices that should be processed will be non-NULL, and indices that should be ignored will be NULL.

The depends argument can be used to iterate over dependent elements of a vector, or dependent rows/columns of a matrix. This can be useful if the calculation for a particular row/column/element depends on the values of others.

When depends is provided, multiple rows/columns/elements will be passed to FUN. Each element of the depends list should be a vector giving the indices that should be passed to FUN.

For example, this can be used to implement a rolling apply function.

Several options are supported by chunkopts to override the global options:

  • nchunks: The number of chunks to use. If omitted, this is taken from getOption("matter.default.nchunks"). For IO-bound operations, using fewer chunks will often be faster, but use more memory.

  • chunksize: The approximate chunk size in bytes. If omitted, this is taken from getOption("matter.default.chunksize"). For IO-bound operations, using larger chunks will often be faster, but use more memory. If set to NA_real_, then the chunk size is determined by the number of chunks.

  • serialize: Whether data in virtual memory should be realized on the manager and serialized to the workers (TRUE), passed to the workers in virtual memory as-is (FALSE), or if matter should decide the behavior based on the cluster configuration (NA). If omitted, this is taken from getOption("matter.default.serialize"). If all workers have access to the same virtual memory resources (whether file storage or shared memory), then it can be significantly faster to avoid serializing the data.

Value

Typically, a list if simplify=FALSE. Otherwise, the results may be coerced to a vector or array.

Author(s)

Kylie A. Bemis

See Also

apply, lapply, mapply, RNGkind, RNGStreams, SnowfastParam

Examples

register(SerialParam())

set.seed(1)
x <- matrix(rnorm(1000^2), nrow=1000, ncol=1000)

out <- chunkApply(x, 1L, mean, chunkopts=list(nchunks=10))
head(out)

Chunked Vectors, Arrays, and Lists

Description

The chunked class implements chunked wrappers for vectors, arrays, and lists for parallel iteration.

Usage

## Instance creation
chunked_vec(x, nchunks = NA, chunksize = NA,
    verbose = FALSE, permute = FALSE, depends = NULL, drop = FALSE)

chunked_mat(x, margin, nchunks = NA, chunksize = NA,
    verbose = FALSE, permute = FALSE, depends = NULL, drop = FALSE)

chunked_list(..., nchunks = NA, chunksize = NA,
    verbose = FALSE, permute = FALSE, depends = NULL, drop = FALSE)

## Additional methods documented below

Arguments

x, ...

The data to be chunked. If multiple objects are passed via ..., then they are all recycled to the same length.

nchunks

The number of chunks to use. If NA (the default), this is taken from getOption("matter.default.nchunks"). For IO-bound operations, using fewer chunks will often be faster, but use more memory. If both nchunks and chunksize are specified, then nchunks takes priority.

chunksize

The approximate chunk size in bytes. If NA (the default), this is taken from getOption("matter.default.chunksize"). For IO-bound operations, using larger chunks will often be faster, but use more memory. If both nchunks and chunksize are specified, then nchunks takes priority.

verbose

Should messages be printed whenever a chunk is extracted?

permute

Should the order of items be randomized? Alternatively, an integer vector or a list of integer vectors can be specified. If an integer vector is provided, then x is chunked in the exact order of the provided indices. If a list of indices is provided, then these are taken as strata (i.e., subpopulations). Each stratum will be chunked separately and then merged (without randomization), so that each chunk will contain examples from every stratum.

depends

A list with length equal to the extent of X. Each element of depends should give a vector of indices which correspond to other elements of X on which each computation depends. These elements are passed to FUN. For time efficiency, no attempt is made to verify these indices are valid.

margin

Which array margin should be chunked.

drop

The value passed to drop when subsetting the chunks.

Value

An object derived from class chunked.

Slots

data:

The data.

index:

The chunk indices.

verbose:

Print messages on chunk extraction?

drop:

The value passed to drop when subsetting the chunks.

margin:

The array margin for the chunks.

Creating Objects

chunked_vec, chunked_mat and chunked_list instances can be created through chunked_vec(), chunked_mat(), and chunked_list(), respectively.

Methods

Standard generic methods:

length(x):

Get number of chunks.

lengths(x):

Get chunk sizes for each chunk.

x[i, ...]:

Get chunks.

x[[i]]:

Get a single chunk.

Author(s)

Kylie A. Bemis

See Also

chunkApply

Examples

x <- matrix(runif(200), nrow=10, ncol=20)

y <- chunked_mat(x, margin=2, nchunks=5)
print(y)

Scaling and Centering by Row or Column Based on Grouping

Description

Apply the equivalent of scale to either columns or rows of a matrix, using a grouping variable.

Usage

## S4 method for signature 'ANY'
colscale(x, center=TRUE, scale=TRUE,
    group = NULL, ..., BPPARAM = bpparam())

## S4 method for signature 'ANY'
rowscale(x, center=TRUE, scale=TRUE,
    group = NULL, ..., BPPARAM = bpparam())

Arguments

x

A matrix-like object.

center

Either a logical value or a numeric vector of length equal to the number of columns of 'x' (for colscale()) or the number of the rows of 'x' (for rowscale()). If a grouping variable is given, then this must be a matrix with number of columns equal to the number of groups.

scale

Either a logical value or a numeric vector of length equal to the number of columns of 'x' (for colscale()) or the number of the rows of 'x' (for rowscale()). If a grouping variable is given, then this must be a matrix with number of columns equal to the number of groups.

group

A vector or factor giving the groupings with length equal to the number of rows of 'x' (for colscale()) or the number of the columns of 'x' (for rowscale()).

...

Arguments passed to rowStats() or colStats() respectively, if center or scale must be calculated.

BPPARAM

An optional instance of BiocParallelParam. See documentation for bplapply.

Details

See scale for details.

Value

A matrix-like object (usually of the same class as x) with either ‘col-scaled:center’ and ‘col-scaled:scale’ attributes or ‘row-scaled:center’ and ‘row-scaled:scale’ attributes.

Author(s)

Kylie A. Bemis

See Also

scale

Examples

x <- matter(1:100, nrow=10, ncol=10)

colscale(x)

Row and Column Summary Statistics Based on Grouping

Description

These functions perform calculation of summary statistics over matrix rows and columns for each level of a grouping variable.

Usage

## S4 method for signature 'ANY'
rowStats(x, stat, ..., BPPARAM = bpparam())

## S4 method for signature 'ANY'
colStats(x, stat, ..., BPPARAM = bpparam())

## S4 method for signature 'matter_mat'
rowStats(x, stat, ..., BPPARAM = bpparam())

## S4 method for signature 'matter_mat'
colStats(x, stat, ..., BPPARAM = bpparam())

## S4 method for signature 'sparse_mat'
rowStats(x, stat, ..., BPPARAM = bpparam())

## S4 method for signature 'sparse_mat'
colStats(x, stat, ..., BPPARAM = bpparam())

.rowStats(x, stat, group = NULL,
    na.rm = FALSE, simplify = TRUE, drop = TRUE,
    iter.dim = 1L, BPPARAM = bpparam(), ...)

.colStats(x, stat, group = NULL,
    na.rm = FALSE, simplify = TRUE, drop = TRUE,
    iter.dim = 2L, BPPARAM = bpparam(), ...)

Arguments

x

A matrix on which to calculate summary statistics.

stat

The name of summary statistics to compute over the rows or columns of a matrix. Allowable values include: "min", "max", "prod", "sum", "mean", "var", "sd", "any", "all", and "nnzero".

group

A factor or vector giving the grouping. If not provided, no grouping will be used.

na.rm

If TRUE, remove NA values before summarizing.

simplify

Simplify the results from a list to a vector or array. This also drops any additional attributes (besides names).

drop

If only a single summary statistic is calculated, return the results as a vector (or matrix) rather than a list.

iter.dim

The dimension to iterate over. Must be 1 or 2, where 1 indicates rows and 2 indicates columns.

BPPARAM

An optional instance of BiocParallelParam. See documentation for bplapply.

...

Additional arguments passed to chunk_rowapply() or chunk_colapply(), such as the number of chunks.

Details

The summary statistics methods are calculated over chunks of the matrix using s_colstats and s_rowstats. For matter objects, the iteration is performed over the major dimension for IO efficiency.

Value

A list for each stat requested, where each element is either a vector (if no grouping variable is provided) or a matrix where each column corresponds to a different level of group.

If drop=TRUE, and only a single statistic is requested, then the result will be unlisted and returned as a vector or matrix.

Author(s)

Kylie A. Bemis

See Also

colSums

Examples

register(SerialParam())

set.seed(1)

x <- matrix(runif(100^2), nrow=100, ncol=100)

g <- as.factor(rep(letters[1:5], each=20))

colStats(x, "mean", group=g)

Sweep out Matrix Summaries Based on Grouping

Description

Apply the equivalent of sweep to either columns or rows of a matrix, using a grouping variable.

Usage

## S4 method for signature 'ANY'
colsweep(x, STATS, FUN = "-", group = NULL, ...)

## S4 method for signature 'matter_mat'
colsweep(x, STATS, FUN = "-", group = NULL, ...)

## S4 method for signature 'sparse_mat'
colsweep(x, STATS, FUN = "-", group = NULL, ...)

## S4 method for signature 'ANY'
rowsweep(x, STATS, FUN = "-", group = NULL, ...)

## S4 method for signature 'matter_mat'
rowsweep(x, STATS, FUN = "-", group = NULL, ...)

## S4 method for signature 'sparse_mat'
rowsweep(x, STATS, FUN = "-", group = NULL, ...)

Arguments

x

A matrix-like object.

STATS

The summary statistic to be swept out, with length equal to the number of columns of 'x' (for colsweep()) or the number of the rows of 'x' (for rowsweep()). If a grouping variable is given, then this must be a matrix with number of columns equal to the number of groups.

FUN

The function to be used to carry out the sweep.

group

A vector or factor giving the groupings with length equal to the number of rows of 'x' (for colsweep()) or the number of the columns of 'x' (for rowsweep()).

...

Ignored.

Details

See sweep for details.

Value

A matrix-like object (usually of the same class as x) with the statistics swept out.

Author(s)

Kylie A. Bemis

See Also

sweep

Examples

set.seed(1)

x <- matrix(1:100, nrow=10, ncol=10)

colsweep(x, colStats(x, "mean"))

Convolution at Arbitrary Indices

Description

Convolve a signal with weights at arbitrary indices.

Usage

convolve_at(x, index, weights, margin = NULL, na.rm = FALSE)

Arguments

x

A numeric vector or matrix.

index

A list or matrix of numeric vectors giving the indices to convolve. If index is a list, then it must have the same length as x and element lengths that match weights. If x is a matrix, then its rows must correspond to x and its columns must correspond to weights.

weights

A list giving the weights of the kernels to convolve for each element of x. Lengths must match index.

margin

If x is a matrix, then the margin to convolve over.

na.rm

Should missing values (including NaN) be removed?

Details

This is essentially just a weighted sum defined by x[i] = sum(weights[[i]] * x[index[[i]]]).

Value

A numeric vector the same length as x with the smoothed result.

Author(s)

Kylie A. Bemis

Examples

set.seed(1)
t <- seq(from=0, to=6 * pi, length.out=5000)
y <- sin(t) + 0.6 * sin(2.6 * t)
x <- y + runif(length(y))

i <- roll(seq_along(x), width=15)
wt <- dnorm((-7):7, sd=7/2)
wt <- wt / sum(wt)

xs <- convolve_at(x, i, wt)

plot(x, type="l")
lines(xs, col="red")

Colocalization Coefficients

Description

Compute Manders overlap coefficient (MOC), and Manders colocalization coefficients (M1 and M2), and Dice similarity coefficient.

Usage

coscore(x, y, threshold = NA)

Arguments

x, y

Images to be compared. These can be numeric or logical. If numeric, then the "overlap" is defined where both images are nonzero.

threshold

The intensity threshold to use when comparing the images. If NA, this will be determined automatically from the technique described in Costes et al. (2004). Alternatively, this can be a function such as median that can be applied to return a suitable threshold.

Details

The Dice coefficient and Manders overlap coefficient are symmetric between images, while M1 and M2 measure the overlap relative to x and y respectively.

Value

A numeric vector with elements named "MOC", "M1", "M2", and "Dice", and an attribute named "threshold" giving the numeric thresholds (if applicable) for converting each image to a logical mask.

Author(s)

Kylie A. Bemis

References

K. W. Dunn, M. M. Kamocka, and J. H. McDonald. “A practical guide to evaluating colocalization in biological microscop.” American Journal of Physiology: Cell Physiology, vol. 300, no. 4, pp. C732-C742, 2011.

S. V. Costes, D. Daelemans, E. H. Cho, Z. Dobbin, G. Pavlakis, and S. Lockett. “Automatic and Quantitative Measurement of Protein-Protein Colocalization in Live Cells.” Biophysical Journal, vol. 86, no. 6, pp. 3993-4003, 2004.

K. H. Zou, S. K. Warfield, A. Bharatha, C. M. C. Tempany, M. R. Kaus, S. J. Haker, W. M. Wells, III, F. A. Jolesz, and R. Kikinis. “Statistical Validation of Image Segmentation Quality Based on a Spatial Overlap Index.” Academic Radiology, vol. 11, issue 2, pp. 178-189, 2004.

Examples

set.seed(1)
y <- x <- matrix(0, nrow=32, ncol=32)
x[5:16,5:16] <- 1
x[17:28,17:28] <- 1
x <- x + runif(length(x))
y[4:15,4:15] <- 1
y[18:29,18:29] <- 1
y <- y + runif(length(y))
xl <- x > median(x)
yl <- y > median(y)

coscore(x, x)
coscore(x, y)
coscore(x, y, threshold=median)
coscore(xl, yl)

Color Palettes

Description

These functions provide simple color palettes.

Usage

## Continuous color palettes
cpal(palette = "Viridis")

## Discrete color palettes
dpal(palette = "Tableau 10")

## List palettes
cpals()
dpals()

# Add transparency to colors
add_alpha(colors, alpha = 1, pow = 1)

Arguments

palette

The name of a color palette. See palette.pals and hcl.pals.

colors

A character vector of colors to add transparency to.

alpha

A numeric vector giving the level of transparency in the range [0, 1] where 0 is fully transparent and 1 is fully opaque.

pow

The power scaling of the alpha channel. A linear alpha scale often results in poor interpretability for superposed images, so raising the alpha channel (already in range [0, 1]) to a power > 1 can improve interpretability in these cases. Conversely, for highly skewed data, using a power < 1 can reduce the impact of extreme values.

Value

A character vector of colors or a function for generating n colors.

Author(s)

Kylie A. Bemis

See Also

vizi, image

Examples

f <- cpal("viridis")
cols <- f(10)
add_alpha(cols, 1:10/10)

Perform Cross Validation

Description

Perform k-fold cross-validation with an arbitrary modeling function.

Usage

cv_do(fit., x, y, folds, ...,
	mi = !is.null(bags), bags = NULL, pos = 1L,
	predict. = predict, transpose = FALSE, keep.models = TRUE,
	trainProcess = NULL, trainArgs = list(),
	testProcess = NULL, testArgs = list(),
	verbose = NA, chunkopts = list(),
	BPPARAM = bpparam())

## S3 method for class 'cv'
fitted(object, type = c("response", "class"),
	simplify = TRUE, ...)

Arguments

fit.

The function used to fit the model.

x, y

The data and response variable.

folds

A vector coercible to a factor giving the fold for each row or column of x.

mi

Should mi_learn be called with fit. for multiple instance learning?

bags

If provided, subsetted and passed to fit. or mi_learn if mi=TRUE.

pos

The positive class for multiple instance learning. Only used if mi=TRUE.

...

Additional arguments passed to fit. and predict..

predict.

The function used to predict on new data from the fitted model. The fitted model is passed as the 1st argument and the test data is passed as the 2nd argument.

transpose

A logical value indicating whether x should be considered transposed or not. This can be useful if the input matrix is (P x N) instead of (N x P) and storing the transpose is expensive. This is not necessary for matter_mat and sparse_mat objects, but can be useful for large in-memory (P x N) matrices.

keep.models

Should the models be kept and returned?

trainProcess, trainArgs

A function and arguments used for processing the training sets. The training set is passed as the 1st argument to trainProcess.

testProcess, testArgs

A function and arguments used for processing the test sets. The test set is passed as the 1st argument to trainProcess, and the processed training set is passed as the 2nd argument.

verbose

Should progress be printed for each iteration?

chunkopts

Passed to fit., predict., trainProcess and testProcess. See chunkApply for details.

BPPARAM

An optional instance of BiocParallelParam. See documentation for bplapply. Passed to fit., predict., trainProcess and testProcess.

object

An object inheriting from cv.

type

The type of prediction, where "response" means the fitted response matrix and "class" will be the vector of class predictions (only for classification).

simplify

Should the predictions be simplified (from a list) to an array (type="response") or data frame (type="class")?

Details

The cross-validation is not performed in parallel, because it is assumed the pre-processing functions, modeling function, and prediction function may make use of parallelization. Therefore, these functions need to be able to handle (or ignore) the arguments nchunks and BPPARAM, which will be passed to them.

If bags is specified, then multiple instance learning is assumed, where observations from the same bag are all assumed to have the same label. The labels for bags are automatically pooled (from y) so that if any observation in a bag is pos, then the entire bag is labeled pos. If mi=TRUE then mi_learn will be called by cv_do; otherwise it is assumed fn will handle the multiple instance learning. The accuracy metrics are calculated with the original y labels.

Value

An object of class cv, with the following components:

  • average: The average accuracy metrics.

  • scores: The fold-specific accuracy metrics.

  • folds: The fold memberships.

  • fitted.values: The fold-specific predictions.

  • models: (Optional) The fitted models.

Author(s)

Kylie A. Bemis

See Also

predscore

Examples

register(SerialParam())

set.seed(1)
n <- 100
p <- 5
nfolds <- 3
y <- rep(c(rep.int("yes", 60), rep.int("no", 40)), nfolds)
x <- matrix(rnorm(nfolds * n * p), nrow=nfolds * n, ncol=p)
x[,1L] <- x[,1L] + 2 * ifelse(y == "yes", runif(n), -runif(n))
x[,2L] <- x[,2L] + 2 * ifelse(y == "no", runif(n), -runif(n))
folds <- rep(paste0("set", seq_len(nfolds)), each=n)

cv_do(pls_nipals, x, y, k=1:5, folds=folds)

Deferred Operations on “matter” Objects

Description

Some arithmetic, comparison, and logical operations are available as delayed operations on matter_arr and sparse_arr objects.

Details

Currently the following delayed operations are supported:

‘Arith’: ‘+’, ‘-’, ‘*’, ‘/’, ‘^’, '

‘Math’: ‘exp’, ‘log’, ‘log2’, ‘log10’

Arithmetic operations are applied in C++ layer immediately after the elements are read from virtual memory. This means that operations that are implemented in C and/or C++ for efficiency (such as summary statistics) will also reflect the execution of the deferred arithmetic operations.

Value

A new matter object with the registered deferred operation. Data in storage is not modified; only object metadata is changed.

Author(s)

Kylie A. Bemis

See Also

Arith, Compare, Logic, Ops, Math

Examples

x <- matter(1:100)
y <- 2 * x + 1

x[1:10]
y[1:10]

mean(x)
mean(y)

Deprecated and defunct objects in matter

Description

These functions are provided for compatibility with older versions of matter, and will be removed in the future.


Downsample a Signal

Description

Downsamples a signal for the purposes of visualization. A subset of the original samples are used to represent the signal. The downsampled signal is intended to resemble the original when visualized and should not typically be used for downstream processing.

Usage

downsample(x, n = length(x) / 10L, domain = NULL,
	method = c("lttb", "ltob", "dynamic"))

Arguments

x

A numeric vector.

n

The length of the downsampled signal.

domain

The domain variable of the signal.

method

The downsampling method to be used. Must be one of "lttb", "ltob", or "dynamic".

Details

This function implements the downsampling methods from Sveinn Steinarsson's 2013 MSc thesis Downsampling Time Series for Visual Representation, including largest-triangle-three-buckets (LTTB), largest-triangle-one-bucket (LTOB), and dynamic binning.

Value

A vector of length n, giving the downsampled signal.

Author(s)

Kylie A. Bemis

References

S. Steinarsson. “Downsampling Time Series for Visual Representation.” MSc, School of Engineering and Natural Sciences, University of Iceland, Reykjavik, Iceland, June 2013.

See Also

approx1

Examples

set.seed(1)
t <- seq(from=0, to=6 * pi, length.out=2000)
x <- sin(t) + 0.6 * sin(2.6 * t)
x <- x + runif(length(x))
xs <- downsample(x, n=200)
s <- attr(xs, "sample")

plot(x, type="l")
points(s, xs, col="red", type="b")

Delta Run Length Encoding

Description

The drle class stores delta-run-length-encoded vectors. These differ from other run-length-encoded vectors provided by other packages in that they allow for runs of values that each differ by a common difference (delta).

Usage

## Instance creation
drle(x, type = "drle", cr_threshold = 0)

is.drle(x)
## Additional methods documented below

Arguments

x

An integer or numeric vector to convert to delta run length encoding for drle(); an object to test if it is of class drle for is.drle().

type

The type of compression to use. Must be "drle", "rle", or "seq". The default ("drle") allows arbitrary deltas. Using type "rle" means that runs must consist of a single value (i.e., deltas must be 0). Using type "seq" means that deltas must be 1, -1, or 0.

cr_threshold

The compression ratio threshold to use when converting a vector to delta run length encoding. The default (0) always converts the object to drle. Values of cr_threshold < 1 correspond to compressing even when the output will be larger than the input (by a certain ratio). For values > 1, compression will only take place when the output is (approximately) at least cr_threshold times smaller.

Value

An object of class drle.

Slots

values:

The values that begin each run.

lengths:

The length of each run.

deltas:

The difference between the values of each run.

Creating Objects

drle instances can be created through drle().

Methods

Standard generic methods:

x[i]:

Get the elements of the uncompressed vector.

length(x):

Get the length of the uncompressed vector.

c(x, ...):

Combine vectors.

Author(s)

Kylie A. Bemis

See Also

rle

Examples

## Create a drle vector
x <- c(1,1,1,1,1,6,7,8,9,10,21,32,33,34,15)
y <- drle(x)

# Check that their elements are equal
x == y[]

Contrast Enhancement

Description

Enhance the contrast in a 2D signal.

Usage

# Adjust by extreme values
enhance_adj(x, frac = 0.01)

# Histogram equalization
enhance_hist(x, nbins = 256L)

# Contrast-limited adaptive histogram equalization (CLAHE)
enhance_adapt(x, width = sqrt(nrow(x) * ncol(x)) %/% 5L,
    clip = 0.1, nbins = 256L)

Arguments

x

A numeric matrix.

frac

The fraction of the highest and lowest pixel values to adjust by clamping the overall intensity range.

nbins

The number of gray levels in the output image.

clip

The normalized clip limit, expressed as a fraction of the neighborhood size. This is used to limit the maximum value of any bin in the adaptive histograms, in order avoid amplifying local noise.

width

The width of the sliding window used when calculating the local adaptive histograms.

Details

enhance_adj() performs a simple adjustment of the overall image intensity range by clamping a fraction of the highest and lowest pixel values. This is useful for suppressing very bright hotspots, but may not be sufficient for images with globally poor contrast.

enhance_hist() performs histogram equalization. Histogram equalization transforms the pixel values so that the histogram of the image is approximately flat. This is done by replacing the original pixel values with their associated probability in the image's empirical cumulative distribution.

enhance_adapt() performs contrast-limited adaptive histogram equalization (CLAHE) from Zuiderveld (1994). While ordinary histogram equalization performs a global transformation on the image, adaptive histogram equalization calculates a histogram in a local neighborhood around each pixel to perform the transformation, thereby enhancing the local contrast across the image. However, this can amplify local noise, so to avoid this, the histogram is clipped to a maximum allowed bin value before transforming the pixel values. To speed up the computation, it is implemented here using a sliding window technique as described by Wang and Tao (2006).

These methods rescale the output image so that its median equals the median of the original image and it has equal interquartile range (IQR).

Value

A numeric matrix the same dimensions as x with the smoothed result.

Author(s)

Kylie A. Bemis

References

K. Zuiderveld. “Contrast Limited Adaptive Histogram Equalization.” Graphics Gems IV, Academic Press, pp. 474-485, 1994.

Z. Wang and J. Tao. “A Fast Implementation of Adaptive Histogram Equalization.” IEEE 8th international Conference on Signal Processing, Nov 2006.

Examples

set.seed(1)
x <- matrix(0, nrow=32, ncol=32)
x[9:24,9:24] <- 10
x <- x + runif(length(x))
y <- x + rlnorm(length(x))
z <- enhance_hist(y)

par(mfcol=c(1,3))
image(x, col=hcl.colors(256), main="original")
image(y, col=hcl.colors(256), main="multiplicative noise")
image(z, col=hcl.colors(256), main="histogram equalization")

Continuum Estimation

Description

Estimate the continuum (baseline) of a signal.

Usage

# Continuum based on local extrema
estbase_loc(x,
    smooth = c("none", "loess", "spline"),
    span = 1/10, spar = NULL, upper = FALSE)

# Convex hull
estbase_hull(x, upper = FALSE)

# Sensitive nonlinear iterative peak clipping (SNIP)
estbase_snip(x, width = 100L, decreasing = TRUE)

# Running medians
estbase_med(x, width = 100L)

Arguments

x

A numeric vector.

smooth

A smoothing method to be applied after linearly interpolating the continuum.

span, spar

Smoothing parameters for loess and spline smoothing, respectively.

upper

Should the upper continuum be estimated instead of the lower continuum?

width

The width of the smoothing window in number of samples.

decreasing

Use a decreasing clipping window instead of an increasing window.

Details

estbase_loc() uses a simple method based on linearly interpolating from local extrema. It typically performs well enough for most situations. Signals with strong noise or wide peaks may require stronger smoothing after the interpolation step.

estbase_hull() estimates the continuum by finding the lower or upper convex hull using the monotonic chain algorithm of A. M. Andrew (1979).

estbase_snip() performs sensitive nonlinear iterative peak (SNIP) clipping using the adaptive clipping window from M. Morhac (2009).

estbase_med() estimates the continuum from running medians.

Value

A numeric vector the same length as x with the estimated continuum.

Author(s)

Kylie A. Bemis

References

A. M. Andrew. “Another efficient algorithm for convex hulls in two dimensions.” Information Processing Letters, vol. 9, issue 5, pp. 216-219, Dec. 1979.

M. Morhac. “An algorithm for determination of peak regions and baseline elimination in spectroscopic data.” Nuclear Instruments and Methods in Physics Research A, vol. 600, issue 2, pp. 478-487, Mar. 2009.

Examples

set.seed(1)
t <- seq(from=0, to=6 * pi, length.out=2000)
x <- sin(t) + 0.6 * sin(2.6 * t)
lo <- estbase_hull(x)
hi <- estbase_hull(x, upper=TRUE)

plot(x, type="l")
lines(lo, col="red")
lines(hi, col="blue")

Estimate Raster Dimensions

Description

Estimate the raster dimensions of a scattered 2D signal based on its pixel coordinates.

Usage

estdim(x, tol = 1e-6)

Arguments

x

A numeric matrix or data frame where each column gives the pixel coordinates for a different dimension. Only 2 or 3 dimensions are supported if the coordinates are irregular. Otherwise, any number of dimensions are supported.

tol

The tolerance allowed when estimating the resolution (i.e., pixel sizes) using estres(). If estimating the resolution this way fails, then it is estimated from the coordinate ranges instead.

Value

A numeric vector giving the estimated raster dimensions.

Author(s)

Kylie A. Bemis

Examples

co <- expand.grid(x=1:12, y=1:9)
co$x <- jitter(co$x)
co$y <- jitter(co$y)

estdim(co)

Local Noise Estimation

Description

Estimate the noise across a signal.

Usage

# Difference-based noise estimation
estnoise_diff(x, nbins = 1L, overlap = 0.5,
    index = NULL)

# Dynamic noise level filtering
estnoise_filt(x, nbins = 1L, overlap = 0.5,
    msnr = 2, threshold = 0.5, isPeaks = FALSE, index = NULL)

# SD-based noise estimation
estnoise_sd(x, nbins = 1, overlap = 0.5,
    k = 9, index = NULL)

# MAD-based noise estimation
estnoise_mad(x, nbins = 1, overlap = 0.5,
    k = 9, index = NULL)

# Quantile-based noise estimation
estnoise_quant(x, nbins = 1, overlap = 0.5,
    k = 9, prob = 0.95, index = NULL)

Arguments

x

A numeric vector.

nbins

The number of bins to use if estimating a variable noise level.

overlap

The fraction of overlap between bins if estimating a variable noise level.

msnr

The minimum signal-to-noise ratio for distinguishing signal peaks from noise peaks.

threshold

The required signal-to-noise difference for the first non-noise peak.

isPeaks

Does x represent a profile signal (FALSE) or its peaks (TRUE)?

index

A matrix or list of domain vectors giving the sampling locations of x if it is a multidimensional signal.

k

The size of the smoothing window when calculating the difference between the raw signal and a smoothed version of the signal.

prob

The quantile used when estimating the noise.

Details

estnoise_diff() estimates the local noise from the mean absolute differences between the signal and the average of its derivative. For noisy signals, the derivative is dominated by the noise, making it a useful estimator of the noise.

estnoise_sd(), estnoise_mad(), and estnoise_quant() all estimate the local noise by first smoothing the signal with a Gaussian filter, and then subtracting the raw signal from the smoothed signal to isolate the noise component. The noise is then summarized with the corresponding statistic.

estnoise_filt() uses the dynamic noise level filtering algorithm of Xu and Freitas (2010) based on the local signal in an approach similar to Gallia et al. (2013). The peaks in the signal are sorted, and the smallest peak is assumed to be noise and is used to estimate the noise level. Each peak is then compared to the previous peak. A peak is labeled a signal peak only if it exceeds a minimum signal-to-noise ratio. Otherwise, the peak is labeled noise, and the noise level is re-estimated. This process continues until a signal peak is found, and the noise level is estimated from the noise peaks.

Value

A numeric vector the same length as x with the estimated local noise level.

Author(s)

Kylie A. Bemis

References

H. Xu and M. A. Freitas. “A Dynamic Noise Level Algorithm for Spectral Screening of Peptide MS/MS Spectra.” BMC Bioinformatics, vol. 11, no. 436, Aug. 2010.

J. Gallia, K. Lavrich, A. Tan-Wilson, and P. H. Madden. “Filtering of MS/MS data for peptide identification.” BMC Genomics, vol. 14, suppl. 7, Nov. 2013.

Examples

# simple signal
set.seed(1)
n <- 500
x <- rnorm(n)
x <- x + 90 * dnorm(seq_along(x), mean=n/4)
x <- x + 80 * dnorm(seq_along(x), mean=n/2)
x <- x + 70 * dnorm(seq_along(x), mean=3*n/4)

ns <- estnoise_quant(x)
plot(x, type="l")
lines(ns, col="blue")

# simulated spectrum
set.seed(1)
x <- simspec(size=5000)

ns <- estnoise_quant(x, nbins=20)
plot(x, type="l")
lines(ns, col="blue")

Estimate Signal Resolution

Description

Estimate the resolution (approximate sampling rate) of a signal based on its domain values.

Usage

estres(x, tol = NA, ref = NA_character_)

Arguments

x

A numeric vector giving the domain values of the signal.

tol

The tolerance allowed when determining if the estimated resolution is valid (i.e., actually matches the given domain values). Noise in the sampling rate will be allowed up to this amount. If NA (the default), then the resolution is simply calculated as the smallest difference between sorted domain values.

ref

If 'abs', then comparison is done by taking the absolute difference. If 'x', then relative differences are used. If missing, then the funciton will try to determine which gives a better fit to the domain values.

Value

A single number named "absolute" or "relative" giving the approximate constant sampling rate matching the given domain values. NA if a sampling rate could not be determined.

Author(s)

Kylie A. Bemis

Examples

x <- seq_rel(501, 600, by=1e-3)

estres(x)

FastMap Projection

Description

The FastMap algorithm performs approximate multidimensional scaling (MDS) based on any distance function. It is faster and more efficient than traditional MDS algorithms, scaling as O(n) rather than O(n^2). FastMap accomplishes this by finding two distant pivot objects on some hyperplane for each projected dimension, and then projecting all other objects onto the line between these pivots.

Usage

# FastMap projection
fastmap(x, k = 3L, group = NULL, distfun = NULL,
	transpose = FALSE, pivots = 10L, niter = 10L, verbose = NA, ...)

## S3 method for class 'fastmap'
predict(object, newdata, ...)

Arguments

x

A numeric matrix-like object.

k

The number of FastMap components to project.

group

Grouping variable if pivots should be guaranteed to come from different groups.

distfun

A distance function with the same usage (i.e., supports the same arguments and return values) as rowDists or colDists.

transpose

A logical value indicating whether x should be considered transposed or not. This only used internally to indicate whether the input matrix is (P x N) or (N x P), and therefore extract the number of objects and their names.

pivots

The number of pivot candidates to attempt each iteration. Using more pivot candidates can help improve the quality of the pivot selection. Using fewer can help speed up computation.

niter

The maximum number of iterations for selecting the pivots.

...

Additional options passed to distfun.

object

An object inheriting from fastmap.

newdata

An optional data matrix to use for the prediction.

verbose

Should progress be printed for each pivot iteration and FastMap component projection?

Details

The pivots are initialized randomly for each new dimension, so the selection of pivots (and therefore the resulting projection) can be sensitive to the random seed for some datasets.

A custom distance function can be passed via distfun. If not provided, then this defaults to rowDists if transpose=FALSE or colDists if transpose=TRUE.

If a custom function is passed, it must support the same arguments and return values as rowDists and colDists.

Value

An object of class fastmap, with the following components:

  • x: The projected variable matrix.

  • sdev: The standard deviations of each column of the projected matrix x.

  • pivots: A matrix giving the indices of the pivots and the distances between them.

  • pivot.array: A subset of the original data matrix containing only the pivots.

  • distfun: The function used to generate the distance function.

Author(s)

Kylie A. Bemis

References

C. Faloutsos, and D. Lin. “FastMap: A Fast Algorithm for Indexing, Data-Mining and Visualization of Traditional and Multimedia Datasets.” Proceedings of the 1995 ACM SIGMOD international conference on Management of data, pp. 163 - 174, June 1995.

See Also

rowDists, colDists, cmdscale, prcomp

Examples

register(SerialParam())
set.seed(1)

a <- matrix(sort(runif(500)), nrow=50, ncol=10)
b <- matrix(rev(sort(runif(500))), nrow=50, ncol=10)
x <- cbind(a, b)

fm <- fastmap(x, k=2)

Move Data Between Shared Memory and File Storage

Description

These are generic functions for moving data in R objects between shared memory and file storage for sharing with other R processes.

Usage

fetch(object, ...)

flash(object, ...)

Arguments

object

An object with data to move.

...

Additional arguments to the matter constructor, chunk_colapply, chunk_rowapply, or chunk_lapply.

Details

The fetch methods for matter objects return new matter objects that use shared memory storage.

The flash methods for matter objects return new matter objects that use file storage.

Value

A new object (typically of the same class but not necessarily) with data in the specified storage format.

Author(s)

Kylie A. Bemis

See Also

matter

Examples

set.seed(1)
x <- as.matter(runif(10))
path(x)

# copy x into shared memory
y <- fetch(x)
path(y)

# copy y into file storage
z <- flash(y)
path(z)

Smoothing Filters in 1D

Description

Smooth a uniformly sampled 1D signal.

Usage

# Moving average filter
filt1_ma(x, width = 5L)

# Linear convolution filter
filt1_conv(x, weights)

# Gaussian filter
filt1_gauss(x, width = 5L, sd = (width %/% 2) / 2)

# Bilateral filter
filt1_bi(x, width = 5L, sddist = (width %/% 2) / 2,
    sdrange = mad(x, na.rm = TRUE))

# Bilateral filter with adaptive parameters
filt1_adapt(x, width = 5L, spar = 1)

# Nonlinear diffusion
filt1_diff(x, niter = 3L, kappa = 50,
    rate = 0.25, method = 1L)

# Guided filter
filt1_guide(x, width = 5L, guide = x,
    sdreg = mad(x, na.rm = TRUE))

# Peak-aware guided filter
filt1_pag(x, width = 5L, guide = NULL,
    sdreg = mad(x, na.rm = TRUE), ftol = 1/10)

# Savitzky-Golay filter
filt1_sg(x, width = 5L, order = min(3L, width - 2L),
    deriv = 0, delta = 1)

Arguments

x

A numeric vector.

width

The width of the smoothing window in number of samples. Must be positive. Must be odd.

weights

The weights of the linear convolution kernel. Length must be odd.

sd, sddist

The spatial parameter for kernel-based filters. This controls the strength of smoothing for samples farther from the center of the smoothing window.

sdrange

The range parameter for kernel-based filters. This controls the strength of the smoothing for samples with signal values very different from the center of the smoothing window.

spar

The strength of the smoothing when calculating the adaptive bilateral filtering parameters. The larger the number, the stronger the smoothing. Must be positive.

kappa

The constant for the conduction coefficient for nonlinear diffusion. Must be positive.

rate

The rate of diffusion. Must be between 0 and 0.25 for stability.

method

The diffusivity method, where 1 and 2 correspond to the two diffusivity functions proposed by Perona and Malik (1990). For 1, this is exp(-(|grad x|/K)^2), and 2 is 1/(1+(|grad x|/K)^2). An additional method 3 implements the peak-aware weighting 1/(1+(|x|/K)^2), which does not use the gradient, but is used to create the guidance signal for peak-aware guided filtering.

niter

The number of iterations for nonlinear diffusion. Must be positive.

guide

The guide signal for guided filtering. This is the signal used to determine the degree of filtering for different regions of the sample. By default, it is the same as the signal to be smoothed.

sdreg

The regularization parameter for guided filtering. This is analagous to the range parameter for kernel-based filters. Signal regions with variance much smaller than this value are smoothed, while signal regions with varaince much larger than this value are preserved.

ftol

Specifies how large the signal value must be before it is considered a peak, expressed as a fraction of the maximum value in the signal.

order

The polynomial order for the Savitzky-Golay filter coefficients.

deriv

The order of the derivative for the Savitzky-Golay filter coefficients.

delta

The sample spacing for the Savitzky-Golay filter. Only used if deriv > 0.

Details

filt1_ma() performs mean filtering in O(n) time. This is fast and especially useful for calculating other filters that can be constructed as a combination of mean filters.

filt1_gauss() performs Gaussian filtering.

filt1_bi() and filt1_adapt() perform edge-preserving bilateral filtering. The latter calculates the kernel parameters adapatively based on the local signal, using a strategy adapted from Joseph and Periyasamy (2018).

filt1_diff() performs the nonlinear diffusion filtering of Perona and Malik (1990). Rather than relying on a filter width, it progressively diffuses (smooths) the signal over multiple iterations. More iterations will result in a smoother image.

filt1_guide() performs edge-preserving guided filtering. Guided filtering uses a local linear model based on the structure of a so-called "guidance signal". By default, the guidance signal is often the same as the input signal. Guided filtering performs similarly to bilateral filtering, but is often faster (though with more memory use), as it is implemented as a combination of mean filters.

filt1_pag() performs peak-aware guided filtering using a regularization parameter that focuses on preserving peaks rather than edges, using a strategy adapted from Liu and He (2022). By default, the guidance signal is generated by smoothing the input signal with nonlinear diffusion.

filt1_sg() performs traditional Savitzky-Golay filtering, which uses a local least-squares polynomial approximation to perform the smoothing. It reduces noise while attempting to retain the peak shape and height.

Value

A numeric vector the same length as x with the smoothed result.

Author(s)

Kylie A. Bemis

References

J. Joseph and R. Perisamy. “An image driven bilateral filter with adaptive range and spatial parameters for denoising Magnetic Resonance Images.” Computers and Electrical Engineering, vol. 69, pp. 782-795, July 2018.

P. Perona and J. Malik. “Scale-space and edge detection using anisotropic diffusion.” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 12, issue 7, pp. 629-639, July 1990.

K. He, J. Sun, and X. Tang. “Guided Image Filtering.” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 6, pp. 1397-1409, June 2013.

D. Liu and C. He. “Peak-aware guided filtering for spectrum signal denoising.” Chemometrics and Intelligent Laboratory Systems, vol. 222, March 2022.

A. Savitzky and M. J. E. Golay. “Smoothing and Differentiation of Data by Simplified Least Squares Procedures.” Analytical Chemistry, vol. 36, no. 8, pp. 1627-1639, July 1964.

Examples

set.seed(1)
t <- seq(from=0, to=6 * pi, length.out=5000)
y <- sin(t) + 0.6 * sin(2.6 * t)
x <- y + runif(length(y))
xs <- filt1_gauss(x, width=25)

plot(x, type="l")
lines(xs, col="red")

Smoothing Filters in 2D

Description

Smooth a uniformly sampled 2D signal.

Usage

# Moving average filter
filt2_ma(x, width = 5L)

# Linear convolution filter
filt2_conv(x, weights)

# Gaussian filter
filt2_gauss(x, width = 5L, sd = (width %/% 2) / 2)

# Bilateral filter
filt2_bi(x, width = 5L, sddist = (width %/% 2) / 2,
    sdrange = mad(x, na.rm = TRUE))

# Bilateral filter with adaptive parameters
filt2_adapt(x, width = 5L, spar = 1)

# Nonlinear diffusion
filt2_diff(x, niter = 3L, kappa = 50,
    rate = 0.25, method = 1L)

# Guided filter
filt2_guide(x, width = 5L, guide = x,
    sdreg = mad(x, na.rm = TRUE))

Arguments

x

A numeric matrix.

width

The width of the smoothing window in number of samples. Must be positive. Must be odd.

weights

A matrix of weights for the linear convolution kernel. Dimensions must be odd.

sd, sddist

The spatial parameter for kernel-based filters. This controls the strength of smoothing for samples farther from the center of the smoothing window.

sdrange

The range parameter for kernel-based filters. This controls the strength of the smoothing for samples with signal values very different from the center of the smoothing window.

spar

The strength of the smoothing when calculating the adaptive bilateral filtering parameters. The larger the number, the stronger the smoothing. Must be positive.

kappa

The constant for the conduction coefficient for nonlinear diffusion. Must be positive.

rate

The rate of diffusion. Must be between 0 and 0.25 for stability.

method

The diffusivity method, where 1 and 2 correspond to the two diffusivity functions proposed by Perona and Malik (1990). For 1, this is exp(-(|grad x|/K)^2), and 2 is 1/(1+(|grad x|/K)^2).

niter

The number of iterations for nonlinear diffusion. Must be positive.

guide

The guide signal for guided filtering. This is the signal used to determine the degree of filtering for different regions of the sample. By default, it is the same as the signal to be smoothed.

sdreg

The regularization parameter for guided filtering. This is analagous to the range parameter for kernel-based filters. Signal regions with variance much smaller than this value are smoothed, while signal regions with varaince much larger than this value are preserved.

Details

filt2_ma() performs mean filtering in O(n) time. This is fast and especially useful for calculating other filters that can be constructed as a combination of mean filters.

filt2_gauss() performs Gaussian filtering.

filt2_bi() and filt2_adapt() perform edge-preserving bilateral filtering. The latter calculates the kernel parameters adapatively based on the local signal, using a strategy adapted from Joseph and Periyasamy (2018).

filt2_diff() performs the nonlinear diffusion filtering of Perona and Malik (1990). Rather than relying on a filter width, it progressively diffuses (smooths) the signal over multiple iterations. More iterations will result in a smoother image.

filt2_guide() performs edge-preserving guided filtering. Guided filtering uses a local linear model based on the structure of a so-called "guidance signal". By default, the guidance signal is often the same as the input signal. Guided filtering performs similarly to bilateral filtering, but is often faster (though with more memory use), as it is implemented as a combination of mean filters.

Value

A numeric matrix the same dimensions as x with the smoothed result.

Author(s)

Kylie A. Bemis

References

J. Joseph and R. Perisamy. “An image driven bilateral filter with adaptive range and spatial parameters for denoising Magnetic Resonance Images.” Computers and Electrical Engineering, vol. 69, pp. 782-795, July 2018.

P. Perona and J. Malik. “Scale-space and edge detection using anisotropic diffusion.” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 12, issue 7, pp. 629-639, July 1990.

K. He, J. Sun, and X. Tang. “Guided Image Filtering.” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 6, pp. 1397-1409, June 2013.

Examples

set.seed(1)
i <- seq(-4, 4, length.out=12)
j <- seq(1, 3, length.out=9)
co <- expand.grid(i=i, j=j)
x <- matrix(atan(co$i / co$j), nrow=12, ncol=9)
x <- 10 * (x - min(x)) / diff(range(x))
x <- x + 2.5 * runif(length(x))
xs <- filt2_gauss(x)

par(mfcol=c(1,2))
image(x, col=hcl.colors(256), main="original")
image(xs, col=hcl.colors(256), main="smoothed")

Smoothing Filters using K-nearest neighbors

Description

Smooth a nonuniformly sampled signal in K-nearest neighbors.

Usage

# Moving average filter
filtn_ma(x, index, k = 5L, metric = "euclidean", p = 2)

# Linear convolution filter
filtn_conv(x, index, weights, metric = "euclidean", p = 2)

# Gaussian filter
filtn_gauss(x, index, k = 5L, sd = median(r) / 2,
    metric = "euclidean", p = 2)

# Bilateral filter
filtn_bi(x, index, k = 5L, sddist = median(r) / 2,
    sdrange = mad(x, na.rm=TRUE), metric = "euclidean", p = 2)

# Bilateral filter with adaptive parameters
filtn_adapt(x, index, k = 5L, spar = 1,
    metric = "euclidean", p = 2)

Arguments

x

A numeric vector.

index

A matrix or list of domain vectors giving the sampling locations of x.

k

The number of K-nearest neighbors to use in the smoothing (including the sample being smoothed). Must be positive.

weights

A vector of weights for the linear convolution kernel, in order from nearest to farthest neighbors. The first weight applies to the sample being smoothed.

sd, sddist

The spatial parameter for kernel-based filters. This controls the strength of smoothing for samples farther from the center of the smoothing window. The default is calculated as half the median smoothing radius (i.e., the distance to the Kth nearest neighbor).

sdrange

The range parameter for kernel-based filters. This controls the strength of the smoothing for samples with signal values very different from the center of the smoothing window.

spar

The strength of the smoothing when calculating the adaptive bilateral filtering parameters. The larger the number, the stronger the smoothing. Must be positive.

metric

Distance metric to use when finding the nearest neighbors. Supported metrics include "euclidean", "maximum", "manhattan", and "minkowski".

p

The power for the Minkowski distance.

Details

These functions must first perform a K-nearest neighbors search on the sample locations (index) before smoothing the signal (x), but they can be applied to nonuniformly sampled signals in an arbitrary number of dimensions. The nearest neighbor search is performed using a kd-tree. It is efficient for low dimensional signals, but performance will degrade in high dimensions. In general, the complexity of these filters is O(k * n log(n)).

Value

A numeric vector the same dimensions as x with the smoothed result.

Author(s)

Kylie A. Bemis

References

J. Joseph and R. Perisamy. “An image driven bilateral filter with adaptive range and spatial parameters for denoising Magnetic Resonance Images.” Computers and Electrical Engineering, vol. 69, pp. 782-795, July 2018.

Examples

set.seed(1)

# signal intensities
x <- rlnorm(100)

# 3D sampling locations
index <- replicate(3, runif(100))

filtn_ma(x, index, k=5)

Peak Detection

Description

Find peaks in a signal based on its local maxima, as determined by a sliding window.

Usage

# Find peaks
findpeaks(x, width = 5L, prominence = NULL,
    snr = NULL, noise = "quant", bounds = TRUE,
    relheight = 0.005, ...)

# Local maxima
locmax(x, width = 5L)

# Local minima
locmin(x, width = 5L)

Arguments

x

A numeric vector.

width

The number of signal elements to consider when determining if the center of the sliding window is a local extremum.

prominence

The minimum peak prominence used for filtering the peaks. The prominence of a peak is the height that the peak rises above the higher of its bases (i.e., its lowest contour line). A peak's bases are found as the local minima between the peak and the next higher peaks on either side.

snr

The minimum signal-to-noise ratio used for filtering the peaks.

noise

The method used to estimate the noise. See Details.

bounds

Whether the boundaries of each peak should be calculated and returned. A peak's boundaries are found as the nearest local minima on either side.

relheight

The minimum relative height (proportion of the maximum peak value) used for filtering the peaks.

...

Arguments passed to the noise estimation function.

Details

For locmax() and locmin(), a local extremum is defined as an element greater (or less) than all of the elements within width / 2 elements to the left of it, and greater (or less) than or equal to all of the elements within width / 2 elements to the right of it. That is, ties are resolved such that only the leftmost sample is considered the local extremum.

For findpeaks(), the peaks are simply the local maxima of the signal. The peak boundaries are found by descending a local maximum until a local minimum is found on either side, using the same criteria as above. The peaks are optionally filtered based on their prominences.

Optionally, the signal-to-noise ratio (SNR) can be estimated and used for filtering the peaks. These use the functions estnoise_quant, estnoise_diff, estnoise_filt, etc., to estimate the noise in the signal.

Value

For locmax() and locmin(), an logical vector indicating whether each element is a local extremum.

For findpeaks(), an integer vector giving the indices of the peaks, with attributes 'left_bounds' and 'right_bounds' giving the left and right boundaries of the peak as determined using the rule above.

Author(s)

Kylie A. Bemis

See Also

findpeaks_cwt, findpeaks_knn, estnoise_quant, estnoise_sd, estnoise_mad, estnoise_diff, estnoise_filt, peakwidths, peakareas, peakheights, binpeaks, mergepeaks

Examples

# simple signal
x <- c(0, 1, 1, 2, 3, 2, 1, 4, 5, 1, 1, 0)
locmax(x)
findpeaks(x)

# simulated spectrum
set.seed(1)
x <- simspec(size=5000)

# find peaks with snr >= 3
p <- findpeaks(x, snr=3, noise="quant")
plot(x, type="l")
points(p, x[p], col="red")

# find peaks with derivative-based noise
p <- findpeaks(x, snr=3, noise="diff")
plot(x, type="l")
points(p, x[p], col="red")

CWT-based Peak Detection

Description

Find peaks in a signal using continuous wavelet transform (CWT).

Usage

# Find peaks with CWT
findpeaks_cwt(x, snr = 2, wavelet = ricker, scales = NULL,
    maxdists = scales, ngaps = 3L, ridgelen = length(scales) %/% 4L,
    qnoise = 0.95, width = length(x) %/% 20L, bounds = TRUE)

# Find ridges lines in a matrix
findridges(x, maxdists, ngaps)

# Continuous Wavelet Transform
cwt(x, wavelet = ricker, scales = NULL)

Arguments

x

A numeric vector for findpeaks_cwt() and cwt(). A matrix of CWT coefficients for findridges().

snr

The minimum signal-to-noise ratio used for filtering the peaks.

wavelet

The wavelet to be convolved with the signal. Must be a function that takes two arguments: the number of points in the wavelet n as the first argument and the scale a of the wavelet as the second argument. The default ricker() function satisfies this.

scales

The scales at which to perform CWT. A reasonable sequence is generated automatically if not provided.

maxdists

The maximum allowed shift distance between local maxima allowed when connecting maxima into ridge lines. Should be a vector the same length as scales.

ngaps

The number of gaps allowed in a ridge line before it is removed from the search space.

ridgelen

The minimum ridge length allowed when filtering peaks.

qnoise

The quantile of the CWT coefficients at the smallest scale used to estimate the noise.

width

The width of the rolling estimation of noise quantile.

bounds

Whether the boundaries of each peak should be calculated and returned. A peak's boundaries are found as the nearest local minima on either side.

Details

findpeaks_cwt() uses the peak detection method based on continuous wavelet transform (CWT) proposed by Du, Kibbe, and Lin (2006).

The raw signal is convolved with a wavelet (by default, a Ricker wavelet is used) at a range of different scales. This produces a matrix of CWT coefficients with a number of rows equal to the length of the original signal and each column representing a different scale of convolution.

The convolution at the smallest scales represent a good estimate of noise and peak location. The larger scales represent a smoother signal where larger peaks are prominent and smaller peaks are removed.

The method proceeds by identifying ridge lines in the CWT coefficient matrix using findridges(). Local maxima are identified at each scale and connected across each scale, forming the ridge lines.

Finally, the local noise is estimated from the CWT coefficients at the smallest scale. The peaks are filtered based on signal-to-noise ratio and the length of their ridge lines.

Value

For findpeaks_cwt(), an integer vector giving the indices of the peaks, with attributes 'left_bounds' and 'right_bounds' giving the left and right boundaries of the peak as determined using the rule above.

For findridges(), a list of matrices giving the row and column indices of the entries of each detected ridge line.

Author(s)

Kylie A. Bemis

See Also

findpeaks, peakwidths, peakareas, peakheights, binpeaks, mergepeaks

Examples

# simple signal
x <- c(0, 1, 1, 2, 3, 2, 1, 4, 5, 1, 1, 0)
locmax(x)
findpeaks(x)

# simulated spectrum
set.seed(1)
x <- simspec(size=5000)

# find peaks with snr >= 3
p <- findpeaks_cwt(x, snr=3)
plot(x, type="l")
points(p, x[p], col="red")

# plot ridges
ridges <- attr(p, "ridges")
plot(c(0, length(x)), c(0, 25), type="n")
for ( ri in ridges )
    lines(ri, type="o", pch=20, cex=0.5)

Peak Detection using K-nearest neighbors

Description

Find peaks in a signal based on its local maxima, as determined by K-nearest neighbors.

Usage

# Find peaks with KNN
findpeaks_knn(x, index, k = 5L,
    snr = NULL, noise = "quant", relheight = 0.005,
    arr.ind = is.array(x), useNames = TRUE, ...)

# Local maxima
knnmax(x, index, k = 5L)

# Local minima
knnmin(x, index, k = 5L)

Arguments

x

A numeric vector or array.

index

A matrix or list of domain vectors giving the sampling locations of x if it is a multidimensional signal. May be omitted if x is an array.

k

The number of nearest neighbors to consider when determining whether a sample is a local extremum.

snr

The minimum signal-to-noise ratio used for filtering the peaks.

noise

The method used to estimate the noise. See Details.

relheight

The minimum relative height (proportion of the maximum peak value) used for filtering the peaks.

arr.ind

Should array indices be returned? Otherwise, return linear indices.

useNames

Passed to arrayInd.

...

Arguments passed to the noise estimation function.

Details

For knnmax() and knnmin(), a local extremum is defined as an element greater (or less) than all of its nearest neighbors with a lesser index, and greater (or less) than or equal to all of its nearest neighbors with a greater or equal index. That is, ties are resolved such that the sample with the lowest index is considered the local extremum.

For findpeaks_knn(), the peaks are simply the local maxima of the signal. The peaks are optionally filtered based on their relative heights.

Optionally, the signal-to-noise ratio (SNR) can be estimated and used for filtering the peaks. These use the functions estnoise_quant, estnoise_diff, estnoise_filt, etc., to estimate the noise in the signal.

Value

For knnmax() and knnmin(), an logical vector or array indicating whether each element is a local extremum.

For findpeaks_knn(), an integer vector (or matrix if arr.ind=TRUE) giving the indices of the peaks.

Author(s)

Kylie A. Bemis

See Also

findpeaks, estnoise_quant, estnoise_sd, estnoise_mad, estnoise_diff, estnoise_filt

Examples

# simple 2D signal
x <- c(
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0,
    0, 0, 0, 2, 0, 0, 1, 4, 2, 0, 1, 1, 0, 0, 0,
    0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 3, 2, 1, 0, 0,
    0, 0, 0, 0, 1, 3, 3, 0, 0, 1, 4, 4, 3, 1, 0,
    0, 0, 0, 0, 0, 3, 2, 0, 1, 0, 3, 2, 3, 0, 0,
    0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 2, 2, 3, 0, 0)
x <- matrix(x, nrow=7, ncol=15, byrow=TRUE)

# find peaks in 2D using KNN
findpeaks_knn(x)

Point in polygon

Description

Check if a series of x-y points are contained in a closed 2D polygon.

Usage

inpoly(points, poly)

Arguments

points

A 2-column numeric matrix with the points to check.

poly

A 2-column numeric matrix with the vertices of the polygon.

Details

This function works by extending a horizontal ray from each point and counting the number of times it crosses an edge of the polygon.

Value

A logical vector that is TRUE for points that are fully inside the polygon, a vertex, or on an edge, and FALSE for points fully outside the polygon.

Note

There are various public implementations of this function with no clear original source. The version implemented here is loosely based on code by W. Randolph Franklin with modifications so that vertices and points on edges are considered inside the polygon.

Author(s)

W. R. Franklin and Kylie A. Bemis

References

W. R. Franklin. “PNPOLY - Point Inclusion in Polygon Test.” https://wrfranklin.org/Research/Short_Notes/pnpoly.html, 1970.

M. Shimrat, "Algorithm 112, Position of Point Relative to Polygon", Comm. ACM 5(8), pp. 434, Aug 1962.

E. Haines. “Point in Polygon Strategies.” http://www.acm.org/pubs/tog/editors/erich/ptinpoly/, 1994.

See Also

kdsearch

Examples

poly <- data.frame(
        x=c(3,5,5,3),
        y=c(3,3,5,5))
xy <- data.frame(
    x=c(4,6,4,2,3,5,5,3,4,5,4,4,
        3.5,4.5,4.0,3.5),
    y=c(2,4,6,4,3,3,5,5,3,4,5,3,
        4.0,4.0,4.5,4.0),
    ref=c(
        rep("out", 4),
        rep("vertex", 4),
        rep("edge", 4),
        rep("in", 4)))

xy$test <- inpoly(xy[,1:2], poly)
xy

Quote Identifiers

Description

Quote strings representing variable identifiers with backticks for use in formulas.

Usage

iQuote(x, q = "`")

Arguments

x

An object coercible to a character vector.

q

The kind of quotes to be used.

Value

A character vector of the same length as x.

Author(s)

Kylie A. Bemis

See Also

sQuote, dQuote

Examples

x1 <- "This is a non-syntactic variable name"
x2 <- "This is another variable name"
fm <- paste0(iQuote(x1), "~", iQuote(x2))
as.formula(fm)

Isolate Closures for Serialization

Description

These functions modify the environment of a function to isolate a closure from its original environment.

Usage

isofun(fun, envir = baseenv())

isoclos(fun, envir = baseenv())

Arguments

fun

A function to isolate.

envir

The new parent environment for the closure.

Details

A common challenge with parallel programming in R is unintentionally serializing large environments that have been captured by closures. The functions isofun and isoclos provide straightforward ways to isolate functions and closures from their original parent environments which may contain large objects.

isofun simply replaces the environment of fun with envir, which defaults to the base environment. This is appropriate for simple functions that do not need to enclose any variables, instead taking all of their inputs as formal arguments.

isoclos creates a new closure that is isolated from fun's original environment. All objects in environment(fun) are first copied into a new environment with parent envir, and then this new environment is assigned to fun.

Note that the default envir=baseenv() means that any functions not in the base environment must be fully qualified (e.g., stats::sd versus sd).

Value

A new function or closure, isolated from environment(fun).

Author(s)

Kylie A. Bemis

See Also

environment

Examples

register(SerialParam())

bigfun <- function(x)
{
    # isolate 'smallfun' from 'x'
    smallfun <- isofun(function(xi) {
        (xi - mean(xi)) / stats::sd(xi)
    })
    chunkApply(x, 2L, smallfun)
}

set.seed(1)
x <- matrix(runif(100), nrow=10, ncol=10)
bigfun(x)

K-Dimensional Nearest Neighbor Search

Description

Search a matrix of K-dimensional data points and return the indices of the nearest neighbors or of all data points that are within a specified tolerance in each dimension.

Usage

# Nearest neighbor search
knnsearch(x, data, k = 1L, metric = "euclidean", p = 2)

# Range search
kdsearch(x, data, tol = 0, tol.ref = "abs")

# K-D tree
kdtree(data)

Arguments

x

A numeric matrix of coordinates to be matched. Each column should represent a dimension. Each row should be a query point.

data

Either a kdtree object returned by kdtree(), or a numeric matrix of coordinates to search, where each column is a different dimension. If this is missing, then the query x will be used as the data.

k

The number of nearest neighbors to find for each point (row) in x.

metric

Distance metric to use when finding the nearest neighbors. Supported metrics include "euclidean", "maximum", "manhattan", and "minkowski".

p

The power for the Minkowski distance.

tol

The tolerance for finding neighboring points in each dimension. May be a vector with the same length as the number of dimensions. Must be positive.

tol.ref

One of 'abs', 'x', or 'y'. If 'abs', then comparison is done by taking the absolute difference. If either 'x' or 'y', then relative differences are used, and this specifies which to use as the reference (target) value.

Details

knnsearch() performs k-nearest neighbor searches. kdsearch() performs range searches for points within a given tolerance of the query points.

The algorithms are implemented in C and work by building a kd-tree to perform the search. If multiple calls to kdsearch() or knnsearch() are expected on the same data, it can be much faster to build the tree once with kdtree().

A kd-tree is essentially a multidimensional generalization of a binary search tree. Building the search tree is O(n * log n) and searching for a single data point is O(log n).

For knnsearch(), ties are broken based on the original ordering of the rows in data.

Value

For knnsearch(), a matrix with rows equal to the number of rows of x and columns equal to k giving the indices of the k-nearest neighbors.

For kdsearch(), a list with length equal to the number of rows of x, where each list element is a vector of indexes of the matches in data.

Author(s)

Kylie A. Bemis

See Also

bsearch, approx2,

Examples

d <- expand.grid(x=1:10, y=1:10)
x <- rbind(c(1.11, 2.22), c(3.33, 4.44))

knnsearch(x, d, k=3)

Out-of-Core Arrays

Description

The matter_arr class implements out-of-core arrays.

Usage

## Instance creation
matter_arr(data, type = "double", path = NULL,
    dim = NA_integer_, dimnames = NULL, offset = 0, extent = NA_real_,
    readonly = NA, append = FALSE, rowMaj = FALSE, ...)

matter_mat(data, type = "double", path = NULL,
    nrow = NA_integer_, ncol = NA_integer_, dimnames = NULL,
    offset = 0, extent = NA_real_, readonly = NA,
    append = FALSE, rowMaj = FALSE, ...)

matter_vec(data, type = "double", path = NULL,
    length = NA_integer_, names = NULL, offset = 0, extent = NA_real_,
    readonly = NA, append = FALSE, rowMaj = FALSE, ...)

## Additional methods documented below

Arguments

data

An optional data vector which will be initially written to virtual memory if provided.

type

A 'character' vector giving the storage mode of the data in virtual memory. See ?"matter-types" for possible values.

path

A 'character' vector of the path(s) to the file(s) where the data are stored. If NULL, then a temporary file is created using tempfile, which will be managed according the getOption("matter.temp.gc").

dim, nrow, ncol, length

The dimensions of the array, or the number of rows and columns, or the length.

dimnames, names

The names of the matrix dimensions or vector elements.

offset

A vector giving the offsets in number of bytes from the beginning of each file in 'path', specifying the start of the data to be accessed for each file.

extent

A vector giving the length of the data for each file in 'path', specifying the number of elements of size 'type' to be accessed from each file.

readonly

Whether the data and file(s) should be treated as read-only or read/write.

append

If TRUE, then all offsets will be adjusted to be from the end-of-file (for all files in path), and readonly will be set to FALSE.

rowMaj

Whether the data is stored in row-major or column-major order. The default is to use column-major order, which is the same as native R matrices.

...

Additional arguments to be passed to constructor.

Value

An object of class matter_arr.

Slots

data:

This slot stores any information necessary to access the data for the object (which may include the data itself and/or paths to file locations, etc.).

type:

The storage mode of the accessed data when read into R. This is a 'factor' with levels 'raw', 'logical', 'integer', 'numeric', or 'character'.

dim:

Either 'NULL' for vectors, or an integer vector of length one of more giving the maximal indices in each dimension for matrices and arrays.

names:

The names of the data elements for vectors.

dimnames:

Either 'NULL' or the names for the dimensions. If not 'NULL', then this should be a list of character vectors of the length given by 'dim' for each dimension. This is always 'NULL' for vectors.

ops:

Deferred arithmetic operations.

transpose:

Indicates whether the data is stored in row-major order (TRUE) or column-major order (FALSE). For a matrix, switching the order that the data is read is equivalent to transposing the matrix (without changing any data).

indexed:

For matter_mat only. Indicates whether the pointers to rows or columns are indexed for quick access or not.

Extends

matter

Creating Objects

matter_arr instances can be created through matter_arr() or matter(). Matrices and vectors can also be created through matter_mat() and matter_vec().

Methods

Class-specific methods:

path(x), path(x) <- value:

Get or set the data source names, i.e., file path(s).

fetch(x, ...):

Pull data into shared memory.

flash(x, ...):

Push data to a temporary file.

Standard generic methods:

length(x), length(x) <- value:

Get or set length.

dim(x), dim(x) <- value:

Get or set 'dim'.

names(x), names(x) <- value:

Get or set 'names'.

dimnames(x), dimnames(x) <- value:

Get or set 'dimnames'.

x[...], x[...] <- value:

Get or set the elements of the array.

as.vector(x):

Coerce to a vector.

as.array(x):

Coerce to an array.

cbind(x, ...), rbind(x, ...):

Combine matrices by row or column.

t(x):

Transpose a matrix. This is a quick operation which only changes metadata and does not touch the data representation.

rowMaj(x):

Check the data orientation.

Author(s)

Kylie A. Bemis

See Also

matter

Examples

x <- matter_arr(1:1000, dim=c(10,10,10))
x

Out-of-Core Factors

Description

The matter_fct class implements out-of-core factors.

Usage

## Instance creation
matter_fct(data, levels, type = typeof(levels), path = NULL,
    length = NA_integer_, names = NULL, offset = 0, extent = NA_real_,
    readonly = NA, append = FALSE, labels = as.character(levels), ...)

## Additional methods documented below

Arguments

data

An optional data vector which will be initially written to the data in virtual memory if provided.

levels

The levels of the factor. These should be of the same type as the data. (Use labels for the string representation of the levels.)

type

A 'character' vector giving the storage mode of the data in virtual memory. See ?"matter-types" for possible values.

path

A 'character' vector of the path(s) to the file(s) where the data are stored. If NULL, then a temporary file is created using tempfile, which will be managed according the getOption("matter.temp.gc").

length

The length of the factor.

names

The names of the data elements.

offset

A vector giving the offsets in number of bytes from the beginning of each file in 'path', specifying the start of the data to be accessed for each file.

extent

A vector giving the length of the data for each file in 'path', specifying the number of elements of size 'type' to be accessed from each file.

readonly

Whether the data and file(s) should be treated as read-only or read/write.

append

If TRUE, then all offsets will be adjusted to be from the end-of-file (for all files in path), and readonly will be set to FALSE.

labels

An optional character vector of labels for the factor levels.

...

Additional arguments to be passed to constructor.

Value

An object of class matter_fct.

Slots

data:

This slot stores any information necessary to access the data for the object (which may include the data itself and/or paths to file locations, etc.).

type:

The storage mode of the accessed data when read into R. This is a 'factor' with levels 'raw', 'logical', 'integer', 'numeric', or 'character'.

dim:

Either 'NULL' for vectors, or an integer vector of length one of more giving the maximal indices in each dimension for matrices and arrays.

names:

The names of the data elements for vectors.

dimnames:

Either 'NULL' or the names for the dimensions. If not 'NULL', then this should be a list of character vectors of the length given by 'dim' for each dimension. This is always 'NULL' for vectors.

levels:

The levels of the factor.

labels:

The labels for the levels.

Extends

matter, matter_vec

Creating Objects

matter_fct instances can be created through matter_fct() or matter().

Methods

Standard generic methods:

length(x), length(x) <- value:

Get or set length.

names(x), names(x) <- value:

Get or set 'names'.

x[i], x[i] <- value:

Get or set the elements of the factor.

levels(x), levels(x) <- value:

Get or set the levels of the factor.

Author(s)

Kylie A. Bemis

See Also

matter, matter_vec

Examples

x <- matter_fct(rep(c("a", "a", "b"), 5), levels=c("a", "b", "c"))
x

Out-of-Core Lists of Vectors

Description

The matter_list class implements out-of-core lists.

Usage

## Instance creation
matter_list(data, type = "double", path = NULL,
    lengths = NA_integer_, names = NULL, offset = 0, extent = NA_real_,
    readonly = NA, append = FALSE, ...)

## Additional methods documented below

Arguments

data

An optional data vector which will be initially written to virtual memory if provided.

type

A 'character' vector giving the storage mode of the data in virtual memory. See ?"matter-types" for possible values.

path

A 'character' vector of the path(s) to the file(s) where the data are stored. If NULL, then a temporary file is created using tempfile, which will be managed according the getOption("matter.temp.gc").

lengths

The lengths of the list elements.

names

The names of the list elements.

offset

A vector giving the offsets in number of bytes from the beginning of each file in 'path', specifying the start of the data to be accessed for each file.

extent

A vector giving the length of the data for each file in 'path', specifying the number of elements of size 'type' to be accessed from each file.

readonly

Whether the data and file(s) should be treated as read-only or read/write.

append

If TRUE, then all offsets will be adjusted to be from the end-of-file (for all files in path), and readonly will be set to FALSE.

...

Additional arguments to be passed to constructor.

Value

An object of class matter_list.

Slots

data:

This slot stores any information necessary to access the data for the object (which may include the data itself and/or paths to file locations, etc.).

type:

The storage mode of the accessed data when read into R. This is a 'factor' with levels 'raw', 'logical', 'integer', 'numeric', or 'character'.

dim:

Either 'NULL' for vectors, or an integer vector of length one of more giving the maximal indices in each dimension for matrices and arrays.

names:

The names of the data elements for vectors.

dimnames:

Either 'NULL' or the names for the dimensions. If not 'NULL', then this should be a list of character vectors of the length given by 'dim' for each dimension. This is always 'NULL' for vectors.

Extends

matter

Creating Objects

matter_list instances can be created through matter_list() or matter().

Methods

Class-specific methods:

path(x), path(x) <- value:

Get or set the data source names, i.e., file path(s).

fetch(x, ...):

Pull data into shared memory.

flash(x, ...):

Push data to a temporary file.

Standard generic methods:

x[[i]], x[[i]] <- value:

Get or set a single element of the list.

x[[i, j]]:

Get the jth sub-elements of the ith element of the list.

x[i], x[i] <- value:

Get or set the ith elements of the list.

lengths(x):

Get the lengths of all elements in the list.

Author(s)

Kylie A. Bemis

See Also

matter

Examples

x <- matter_list(list(c(TRUE,FALSE), 1:5, c(1.11, 2.22, 3.33)), lengths=c(2,5,3))
x[]
x[1]
x[[1]]

x[[3,1]]
x[[2,1:3]]

Out-of-Core Strings

Description

The matter_str class implements out-of-core strings.

Usage

## Instance creation
matter_str(data, encoding, type = "character", path = NULL,
    nchar = NA_integer_, names = NULL, offset = 0, extent = NA_real_,
    readonly = NA, append = FALSE, ...)

## Additional methods documented below

Arguments

data

An optional data vector which will be initially written to virtual memory if provided.

encoding

The character encoding to use (if known).

type

A 'character' vector giving the storage mode of the data in virtual memory. See ?"matter-types" for possible values.

path

A 'character' vector of the path(s) to the file(s) where the data are stored. If NULL, then a temporary file is created using tempfile, which will be managed according the getOption("matter.temp.gc").

nchar

A vector giving the length of each element of the character vector.

names

The names of the data elements.

offset

A vector giving the offsets in number of bytes from the beginning of each file in 'path', specifying the start of the data to be accessed for each file.

extent

A vector giving the length of the data for each file in 'path', specifying the number of elements of size 'type' to be accessed from each file.

readonly

Whether the data and file(s) should be treated as read-only or read/write.

append

If TRUE, then all offsets will be adjusted to be from the end-of-file (for all files in path), and readonly will be set to FALSE.

...

Additional arguments to be passed to constructor.

Value

An object of class matter_str.

Slots

data:

This slot stores any information necessary to access the data for the object (which may include the data itself and/or paths to file locations, etc.).

type:

The storage mode of the accessed data when read into R. This is a 'factor' with levels 'raw', 'logical', 'integer', 'numeric', or 'character'.

dim:

Either 'NULL' for vectors, or an integer vector of length one of more giving the maximal indices in each dimension for matrices and arrays.

names:

The names of the data elements for vectors.

dimnames:

Either 'NULL' or the names for the dimensions. If not 'NULL', then this should be a list of character vectors of the length given by 'dim' for each dimension. This is always 'NULL' for vectors.

encoding:

The string encoding used.

Extends

matter, matter_list

Creating Objects

matter_str instances can be created through matter_str() or matter().

Methods

Standard generic methods:

x[i], x[i] <- value:

Get or set the string elements of the vector.

lengths(x):

Get the number of characters (in bytes) of all string elements in the vector.

Author(s)

Kylie A. Bemis

See Also

matter

Examples

x <- matter_str(rep(c("hello", "world!"), 50))
x

Vectors, Matrices, and Arrays Stored in Virtual Memory

Description

The matter class and its subclasses are designed for easy on-demand read/write access to virtual memory data structures stored in files and/or shared memory and allow working with them as vectors, matrices, arrays, and lists.

Usage

## Instance creation
matter(data, ...)

# Check if an object is a matter object
is.matter(x)

# Coerce an object to a matter object
as.matter(x)

# Check if an object uses shared memory
is.shared(x)

# Coerce an object to use shared memory
as.shared(x)

## Additional methods documented below

Arguments

data

Data passed to the subclasse constructor.

...

Arguments passed to subclasses.

x

An object to check if it is a matter object or coerce to a matter object.

Value

An object of class matter.

Slots

data:

This slot stores any information necessary to access the data for the object (which may include the data itself and/or paths to file locations, etc.).

type:

The storage mode of the accessed data when read into R. This is a 'factor' with levels 'raw', 'logical', 'integer', 'numeric', or 'character'.

dim:

Either 'NULL' for vectors, or an integer vector of length one of more giving the maximal indices in each dimension for matrices and arrays.

names:

The names of the data elements for vectors.

dimnames:

Either 'NULL' or the names for the dimensions. If not 'NULL', then this should be a list of character vectors of the length given by 'dim' for each dimension. This is always 'NULL' for vectors.

Creating Objects

matter is a virtual class and cannot be instantiated directly, but instances of its subclasses can be created through matter().

Methods

Class-specific methods:

atomdata(x):

Access the 'data' slot.

adata(x):

An alias for atomdata(x).

type(x), type(x) <- value:

Get or set data 'type'.

Standard generic methods:

length(x), length(x) <- value:

Get or set length.

dim(x), dim(x) <- value:

Get or set 'dim'.

names(x), names(x) <- value:

Get or set 'names'.

dimnames(x), dimnames(x) <- value:

Get or set 'dimnames'.

Author(s)

Kylie A. Bemis

See Also

matter_arr, matter_mat, matter_vec, matter_fct, matter_list, matter_str

Examples

## Create a matter_vec vector
x <- matter(1:100, length=100)
x

## Create a matter_mat matrix
y <- matter(1:100, nrow=10, ncol=10)
y

Options for “matter” Objects

Description

Set global parameters for matter.

Usage

## Set defaults for common arguments
matter_defaults(nchunks = 20L, chunksize = NA_real_,
    serialize = NA, verbose = FALSE)

Arguments

nchunks

The number of chunks to use for chunk processing (e.g., in chunkApply. This sets getOption("matter.default.nchunks"). For IO-bound operations, using fewer chunks will often be faster, but use more memory.

chunksize

The approximate chunk size in bytes for chunk processing (e.g., in chunkApply. This sets getOption("matter.default.chunksize"). For IO-bound operations, using larger chunks will often be faster, but use more memory. If set to NA_real_, then the chunk size is determined by the number of chunks.

serialize

Whether data in virtual memory should be realized on the manager and serialized to the workers (TRUE), passed to the workers in virtual memory as-is (FALSE), or if matter should decide the behavior based on the cluster configuration (NA). This sets getOption("matter.default.serialize"). If all workers have access to the same virtual memory resources (whether file storage or shared memory), then it can be significantly faster to avoid serializing the data.

verbose

Whether progress messages should be printed. This sets getOption("matter.default.verbose").

Details

The matter package provides the following options:

  • options(matter.compress.atoms=3): The compression ratio threshold to be used to determine when to compress atoms in a matter object. Setting to 0 or FALSE means that atoms are never compressed.

  • options(matter.default.nchunks=20L): The default number of chunks to use when iterating over matter objects. For IO-bound operations, using fewer chunks will often be faster, but use more memory.

  • options(matter.default.chunksize=NA_real_): The default chunk size in bytes to use when iterating over matter objects. For IO-bound operations, using larger chunks will often be faster, but use more memory. If set to NA_real_, then the chunk size is determined by the number of chunks.

  • options(matter.default.serialize=NA): Whether virtual memory chunks should be realized on the manager and serialized to the workers (TRUE), passed to the workers as-is FALSE, or if matter should decide based on the cluster configuration (NA). If all workers have access to the same virtual memory resources (whether file storage or shared memory), then it can be significantly faster to avoid serializing the data.

  • options(matter.default.verbose=FALSE): The default verbosity for printing progress messages.

  • options(matter.matmul.bpparam=NULL): An optional BiocParallelParam passed to bplapply when performing matrix multiplication with matter_mat and sparse_mat objects.

  • options(matter.show.head=TRUE): Should a preview of the beginning of the data be displayed when the object is printed?

  • options(matter.show.head.n=6): The number of elements, rows, and/or columns to be displayed by the object preview.

  • options(matter.coerce.altrep=FALSE): When coercing matter objects to native R objects (such as matrix), should a matter-backed ALTREP object be returned instead? The initial coercion will be cheap, and the result will look like a native R object. This does not guarantee that the full data is never read into memory. Not all functions are ALTREP-aware at the C-level, so some operations may still trigger the full data to be read into memory. This should only ever happen once, as long as the object is not duplicated, though.

  • options(matter.wrap.altrep=FALSE): When coercing to a matter-backed ALTREP object, should the object be wrapped in an ALTREP wrapper? (This is always done in cases where the coercion preserves existing attributes.) This allows setting of attributes without triggering a (potentially expensive) duplication of the object when safe to do so.

  • options(matter.temp.dir=tempdir()): Temporary directory where anonymous matter object files (i.e., those created with path=NULL) should be created.

  • options(matter.temp.gc=TRUE): If TRUE, then anonymous matter object files (i.e., those created with path=NULL) are automatically cleaned up when all R objects referencing them have been garbage collected. If FALSE, then they are only removed at the end of the R session (and only if they are in R's default temporary directory).


Data Types for “matter” Objects

Description

The matter package defines a number of data types for translating between data elements stored in virtual memory and data elements loaded into R. These are typically set and stored via the datamode argument and slot.

At the R level, matter objects may be any of the following data modes:

  • raw: matter objects of this mode are typically vectors of raw bytes.

  • logical: matter object represented logical vector in R.

  • integer: matter objects represented as integer vectors in R.

  • numeric: matter objects represented as double vectors in R.

  • character: matter objects representated as character vectors in R.

  • list: Not used. This type exists for inferring conversions between R types and C types, but matter_list objects instead report the types of their elements.

In virtual memory, matter objects may be composed of atomic units of the following data types:

  • char: 8-bit signed integer; defined as char.

  • uchar: 8-bit unsigned integer; used for ‘Rbyte’ or ‘raw’; defined as unsigned char.

  • int16: 16-bit signed integer; defined as int16_t. May be aliased as ‘short’ and ‘16-bit integer’.

  • uint16: 16-bit unsigned integer; defined as uint16_t. May be aliased as ‘ushort’ and ‘16-bit unsigned integer’.

  • int32: 32-bit signed integer; defined as int32_t. May be aliased as ‘int’ and ‘32-bit integer’.

  • uint32: 32-bit unsigned integer; defined as uint32_t. May be aliased as ‘uint’ and ‘32-bit unsigned integer’.

  • int64: 64-bit signed integer; defined as int64_t. May be aliased as ‘long’ and ‘64-bit integer’.

  • uint64: 64-bit unsigned integer; defined as uint64_t. May be aliased as ‘ulong’ and ‘64-bit unsigned integer’.

  • float32: 32-bit float; defined as float. May be aliased as ‘float’ and ‘32-bit float’.

  • float64: 64-bit float; defined as double. May be aliased as ‘double’ and ‘64-bit float’.

While a substantial effort is made to coerce data elements properly between data types, sometimes this cannot be done losslessly. Loss of precision is silent, while values outside of the representable range will generate a warning (sometimes many such warnings) and will be set to NA if available or 0 otherwise.

Note that the unsigned data types do not support NA; coercion between signed integer types attempts to preserve missingness. The special values NaN, Inf, and -Inf are only supported by the floating-point types, and will be set to NA for signed integral types, and to 0 for unsigned integral types.


Monitor Memory Use

Description

These are utility functions for checking memory used by R and matter in the current session and/or during the timed execution of an expression.

Usage

mem(x, reset = FALSE)

memcl(cl = bpparam(), reset = FALSE)

memtime(expr, verbose = NA, BPPARAM = NULL)

Arguments

x

An object, to summarize how much memory it is using.

reset

Should the maximum memory values be reset?

expr

An expression to be evaluated.

verbose

Should timing messages be printed?

cl, BPPARAM

A cluster or a SnowParam object with a cluster backend. Used to collect memory usage from the cluster during timing.

Details

These functions summarize the memory used by both traditional R objects and out-of-memory matter objects. "Real" memory managed by R is summarized using gc. "Virtual" memory managed by matter includes shared memory allocated by matter and temporary files created by matter in getOption("matter.temp.dir").

For timing parallel code, it is useful to use memtime in combination with its BPPARAM argument to monitor the amount of memory used by the cluster.

Value

For mem called with an x argument, a named vector summarizing the memory and storage used by the object. The named elements include:

  • "real": The amount of R memory used.

  • "shared": The amount of shared memory used.

  • "virtual": The amount of virtual memory used, including both file storage and shared memory.

For mem called without an x argument, a named vector summarizing the memory and storage used by the current R session. The named elements include:

  • "real": The amount of R memory used.

  • "shared": The amount of shared memory used.

  • "max real": The maximum amount of R memory used since the last reset.

  • "max shared": The maximum amount of shared memory used since the last reset.

  • "temp": The total size of temporary files managed by matter in getOption("matter.temp.dir").

For memcl, a data frame with columns corresponding to the elements described above and rows corresponding to cluster nodes.

For memtime, a list including:

  • "start": Memory use at the start of timing.

  • "end": Memory use at the end of timing.

  • "cluser": (Optional.) A summary of the memory used by the cluster if BPPARAM is specified.

  • "overhead": The amount of "real" memory used during the execution of expr that is freed by the end of timing.

  • "change": The difference in "real" memory used before and after timing.

  • "total": For the current R session, the sum of "max real" and "max shared" memory used and total cluster memory used. For a cluster, the sum of "max real" memory used by all workers (not including shared memory or the managing R process).

  • "time": The execution time.

Author(s)

Kylie A. Bemis

See Also

gc

Examples

x <- 1:100

mem(x)

memtime(mean(x + 1))

Multiple Instance Learning

Description

Multiple instance learning is a strategy for training classifiers when the class labels are observed at a coarser level than the individual data points. For example, if an entire image is classified as "positive" or "negative" but the classifier is trained and predicts at the pixel level.

Usage

mi_learn(fn, x, y, bags, pos = 1L, ...,
	score = fitted, threshold = 0.01, verbose = NA)

Arguments

fn

The function used to train the classifier.

x

The data matrix.

y

The response. This can be the same length as the data, or the same length as the number of bags. Must have exactly two levels when coerced to a factor.

bags

The bags to which the data points belong. The class labels are observed per-bag rather than per data point.

pos

The positive class label, as a string matching one of the levels of y, or the index of the level.

...

Additional options passed to fn.

score

The function used to extract the scores for prediction.

threshold

The stopping criterion. The learning stops when the proportion of updated labels between iterations is less than this value.

verbose

Should progress be printed for each iteration? Not passed to fn.

Details

This is a generic wrapper for applying a multiple instance learning strategy for any classifier that satisfies certain criteria. The labels must be binary (positive and negative).

The multiple instance learning algorithm here assumes that if a single data point is positive, then the entire bag to which it belongs is labeled as positive. For example, if a single pixel in an image indicates the presence of disease, then the entire image is labeled as disease.

The model returned by fn must support returning either a vector of probabilities or a 2-column score matrix when passed to score.

Value

A model object returned by fn.

Author(s)

Kylie A. Bemis

References

D. Guo, M. C. Foell, V. Volkmann, K. Enderle-Ammour, P. Bronsert, O. Shilling, and O. Vitek. “Deep multiple instance learning classifies subtissue locations in mass spectrometry images from tissue-level annotations.” Bioinformatics, vol. 36, issue Supplement_1, pp. i300-i308, 2020.

See Also

nscentroids

Examples

register(SerialParam())
set.seed(1)
n <- 100
p <- 5
g <- 5
bags <- rep(paste0("s", seq_len(g)), each=n %/% g)
bags <- factor(rep_len(bags, n))
x <- matrix(rnorm(n * p), nrow=n, ncol=p)
colnames(x) <- paste0("x", seq_len(p))

# create bagged labels
y <- ifelse(bags %in% c("s1", "s2", "s3"), "pos", "neg")
y <- factor(y, levels=c("pos", "neg"))
ipos <- which(y == "pos")
ineg <- which(y == "neg")
z <- y

# create "true" labels (with some within-bag noise)
z[ipos] <- sample(c("pos", "neg"), length(ipos), replace=TRUE)
jpos <- which(z == "pos")
jneg <- which(z == "neg")

# create data
x[jpos,] <- x[jpos,] + rnorm(p * length(jpos), mean=1)
x[jneg,] <- x[jneg,] - rnorm(p * length(jneg), mean=1)

# fit ordinary NSC and mi-NSC
fit0 <- nscentroids(x=x, y=y)
fit1 <- mi_learn(nscentroids, x=x, y=y, bags=bags, priors=1)

# improved performance on "true" labels
mean(fitted(fit0, "class") == z)
mean(fitted(fit1, "class") == z)

Nonnegative Matrix Factorization

Description

Nonnegative matrix factorization (NMF) decomposes a nonnegative data matrix into a matrix of basis variables and a matrix of activations (or coefficients). The factorization is approximate and may be less accurate than alternative methods such as PCA, but can greatly improve the interpretability of the reduced dimensions.

Usage

# Alternating least squares
nnmf_als(x, k = 3L, niter = 100L, transpose = FALSE,
	eps = 1e-9, tol = 1e-5, verbose = NA, ...)

# Multiplicative updates
nnmf_mult(x, k = 3L, niter = 100L, cost = c("euclidean", "KL", "IS"),
	transpose = FALSE, eps = 1e-9, tol = 1e-5, verbose = NA, ...)

## S3 method for class 'nnmf'
predict(object, newdata, ...)

# Nonnegative double SVD
nndsvd(x, k = 3L, ...)

Arguments

x

A nonnegative matrix.

k

The number of NMF components to extract.

niter

The maximum number of iterations.

transpose

A logical value indicating whether x should be considered transposed or not. This can be useful if the input matrix is (P x N) instead of (N x P) and storing the transpose is expensive. This is not necessary for matter_mat and sparse_mat objects, but can be useful for large in-memory (P x N) matrices.

eps

A small regularization parameter to prevent singularities.

tol

The tolerance for convergence, as measured by the Frobenius norm of the differences between the W and H matrices in successive iterations.

verbose

Should progress be printed for each iteration?

cost

The cost function (i.e., error measure between the reconstructed matrix and original x) to optimize, where 'euclidean' is the Frobenius norm, 'KL' is the Kullback-Leibler divergence, and 'IS' is the Itakura-Saito divergence. See Details.

...

Additional options passed to irlba.

object

An object inheriting from nmf.

newdata

An optional data matrix to use for the prediction.

Details

These functions implement nonnegative matrix factorization (NMF) using either alternating least squares as described by Berry at al. (2007) or multiplicative updates from Lee and Seung (2000) and further described by Burred (2014). The algorithms are initialized using nonnegative double singular value decomposition (NNDSVD) from Boutsidis and Gallopoulos (2008).

The algorithm using multiplicative updates (nnmf_mult()) tends to be more stable but converges more slowly. The alternative least squares algorithm (nnmf_als()) tends to converge faster to more accurate results, but can be less numerically stable than the multiplicative updates algorithm.

Note for nnmf_mult() that method = "euclidean" is the only method that can handle out-of-memory matter_mat and sparse_mat matrices. x will be coerced to an in-memory matrix for other methods.

Value

An object of class nnmf, with the following components:

  • activation: The (transposed) coefficient matrix (H).

  • x: The basis variable matrix (W).

  • iter: The number of iterations performed.

Author(s)

Kylie A. Bemis

References

M. W. Berry, M. Browne, A. N. Langville, V. P. Pauca, R. J. Plemmons. “Algorithms and applications for approximate nonnegative matrix factorization.” Computational Statistics and Data Analysis, vol. 52, issue 1, pp. 155-173, Sept. 2007.

D. D. Lee and H. S. Seung. “Algorithms for non-negative matrix factorization.” Proceedings of the 13th International Conference on Neural Information Processing Systems (NIPS), pp. 535-541, Jan. 2000.

C. Boutsidis and E. Gallopoulos. “SVD based initialization: A head start for nonnegative matrix factorization.” Pattern Recognition, vol. 41, issue 4, pp. 1350-1362, Apr. 2008.

J. J. Burred. “Detailed derivation of multiplicative update rules for NMF.” Techical report, Paris, March 2014.

See Also

svd, prcomp

Examples

set.seed(1)

a <- matrix(sort(runif(500)), nrow=50, ncol=10)
b <- matrix(rev(sort(runif(500))), nrow=50, ncol=10)
x <- cbind(a, b)

mf <- nnmf_als(x, k=3)

Nearest Shrunken Centroids

Description

Nearest shrunken centroids performs regularized classification of high-dimensional data. Originally developed for classification of microarrays, it calculates test statistics for each feature/dimension based on the deviation between the class centroids and the global centroid. It applies regularization (via soft thresholding) to these test statistics to produce shrunken centroids for each class.

Usage

# Nearest shrunken centroids
nscentroids(x, y, s = 0, distfun = NULL,
	priors = table(y), center = NULL, transpose = FALSE,
	verbose = NA, chunkopts=list(),
	BPPARAM = bpparam(), ...)

## S3 method for class 'nscentroids'
fitted(object, type = c("response", "class"), ...)

## S3 method for class 'nscentroids'
predict(object, newdata,
	type = c("response", "class"), priors = NULL, ...)

## S3 method for class 'nscentroids'
logLik(object, ...)

Arguments

x

The data matrix.

y

The response. (Coerced to a factor.)

s

The sparsity (soft thresholding) parameter used to shrink the test statistics. May be a vector.

distfun

A distance function with the same usage (i.e., supports the same arguments and return values) as rowDists or colDists. In particular, it must support an argument called weights that takes a vector of feature weights used to scale the feature-wise distance components.

priors

The prior probabilities or sample sizes for each class. (Will be normalized.)

center

An optional vector giving the pre-calculated global centroid.

transpose

A logical value indicating whether x should be considered transposed or not. This can be useful if the input matrix is (P x N) instead of (N x P) and storing the transpose is expensive. This is not necessary for matter_mat and sparse_mat objects, but can be useful for large in-memory (P x N) matrices.

verbose

Should progress be printed for the initial centroid calculations and for each fitted model (i.e., each value of s)?

chunkopts

An (optional) list of chunk options including nchunks, chunksize, and serialize. See chunkApply.

BPPARAM

An optional instance of BiocParallelParam. See documentation for bplapply. Passed to distfun.

...

Additional options passed to distfun.

object

An object inheriting from nscentroids.

newdata

An optional data matrix to use for the prediction.

type

The type of prediction, where "response" means the posterior probability matrix and "class" will be the vector of class predictions.

Details

This functions implements nearest shrunken centroids based on the original algorithm by Tibshirani et al. (2002). It provides a sparse strategy for classification based on regularized class centroids. The class centroids are shrunken toward the global centroid. The shrunken test statistics used to perform the regularization can then be interpreted to determine which features are relevant to the classification. (Important features will have nonzero test statitistics after soft thresholding.)

A custom distance function can be passed via distfun. If not provided, then this defaults to rowDists if transpose=FALSE or colDists if transpose=TRUE.

If a custom function is passed, it must support the same arguments and return values as rowDists and colDists.

Value

An object of class nscentroids, with the following components:

  • class: The predicted classes.

  • probability: A matrix of posterior class probabilities.

  • centers: The shrunken class centroids used for classification.

  • statistic: The shrunken test statistics.

  • sd: The pooled within-class standard deviations for each feature.

  • priors: The prior class probabilities.

  • s: The regularization (soft thresholding) parameter.

  • distfun: The function used to generate the dissimilarity function.

Author(s)

Kylie A. Bemis

References

R. Tibshirani, T. Hastie, B. Narasimhan, and G. Chu. “Diagnosis of multiple cancer types by shrunken centroids of gene expression.” Proceedings of the National Academy of Sciences of the USA, vol. 99, no. 10, pp. 6567-6572, 2002.

R. Tibshirani, T. Hastie, B. Narasimhan, and G. Chu. “Class prediction by nearest shrunken with applications to DNA microarrays.” Statistical Science, vol. 18, no. 1, pp. 104-117, 2003.

See Also

rowDists, colDists

Examples

register(SerialParam())

set.seed(1)
n <- 100
p <- 5
x <- matrix(rnorm(n * p), nrow=n, ncol=p)
colnames(x) <- paste0("x", seq_len(p))
y <- ifelse(x[,1L] > 0 | x[,2L] < 0, "a", "b")

nscentroids(x, y, s=1.5)

Peak Summarization

Description

Summarize peaks based on their shapes and properties.

Usage

# Get peak widths
peakwidths(x, peaks, domain = NULL,
    fmax = 0.5, ref = c("height", "prominence"))

# Get peak areas
peakareas(x, peaks, domain = NULL)

# Get peak heights
peakheights(x, peaks)

Arguments

x

A numeric vector.

peaks

The indices (or domain values) of peaks for which the widths or areas should be calculated.

domain

The domain variable of the signal.

fmax

The fraction of the peak's height used for determining the peak's width.

ref

The reference value of the peak for determining the peak width: either the peak height of the peak prominence.

Value

A numeric vector giving the widths, areas, or heights of the peaks with attributes 'left_bounds' and 'right_bounds' giving the left and right boundaries of the peak.

Author(s)

Kylie A. Bemis

See Also

findpeaks, findpeaks_cwt, binpeaks, mergepeaks

Examples

x <- c(0, 1, 1, 2, 3, 2, 1, 4, 5, 1, 1, 0)

p <- findpeaks(x)

peakareas(x, p)
peakheights(x, p)

Pseudoinverse of a Matrix

Description

Calculate the Moore-Penrose pseudoinverse of a matrix.

Usage

pinv(x, tol = sqrt(.Machine$double.eps))

Arguments

x

A matrix.

tol

The relative tolerance for detecting non-zero singular values.

Value

A matrix giving the pseudoinverse of x.

Examples

set.seed(1)
x <- diag(10) + rnorm(100)

pinv(x)

Plot a Signal or Image

Description

Plot a list of superposed or faceted signals or images (in 2D or 3D).

Usage

plot_signal(x, y, z, by, group = NULL, byrow = FALSE,
    xlim = NULL, ylim = NULL, col = NULL, alphapow = 1,
    xlab = NULL, ylab = NULL, layout = NULL, free = "",
    n = Inf, downsampler = "lttb", key = TRUE, grid = TRUE,
    isPeaks = FALSE, annPeaks = 0, engine = NULL, ...)

plot_image(x, y, z, vals, by, group = NULL, byrow = FALSE,
    zlim = NULL, xlim = NULL, ylim = NULL, col = NULL, alphapow = 1,
    zlab = NULL, xlab = NULL, ylab = NULL, layout = NULL, free = "",
    enhance = NULL, smooth = NULL, scale = NULL, key = TRUE,
    rasterImages = NULL, rasterParams = NULL, useRaster = !is3d,
    grid = TRUE, asp = 1, engine = NULL, ...)

Arguments

x, y, z, vals

Lists of vectors to plot such that x[[i]], y[[i]], etc. indicate the plot values for the ith signal or image. Attempts are made to flexibly coerce these into the expected format. If only x is provided, it is interpreted as the signal or image, and the indices or coordinates are inferred from the length or dimensions, respectively. Specifying z will plot a 2D signal. For 2D images, only one of z or vals should be provided. For 3D images, both should be provided.

by

A vector of labels indicating facets (i.e., which values should be plotted as separate sub-plots).

group

A vector of labels indicating groups (i.e., which values should be indicated by color as belonging to the same group).

byrow

If y (for plot_signal) or vals (for plot_image) is a matrix, should its rows or columns be plotted?

xlim, ylim, zlim

The plot limits. See plot.window

xlab, ylab, zlab

Plotting labels.

col

A vector giving the color map for encoding the image, or a function that returns a vector of n colors.

alphapow

The power scaling of the alpha channel (if used).

layout

A vector of the form c(nrow, ncol) specifying the number of rows and columns in the facet grid.

free

A string specifying the free spatial dimensions during faceting. E.g., "", "x", "y", or "xy".

n, downsampler

See downsample for details.

key

Should a color key be generated for the image?

grid

Should a rectangular grid be included in the plot?

isPeaks

Whether the signal should be plotted as peaks or as a continuous signal.

annPeaks

If isPeaks is TRUE, either an integer giving the number of peaks to annotate (i.e., label with their x-value), or a plotting symbol (e.g., "circle", "cross", etc.) to indicate the peak locations.

engine

The plotting engine. Default is to use base graphics. Using "plotly" requires the plotly package to be installed.

...

Additional graphical parameters (as in par) or arguments to the vizi plotting method.

enhance

The name of a contrast enhancement method, such as "hist" or "adapt" for enhance_hist() and enhance_adapt(), etc. See enhance for details.

smooth

The name of a smoothing method, such as "gauss" or "bi" for filt2_gauss() and filt2_bi(), etc. See filt2 for details.

scale

If TRUE, then all image values will be scaled to the range [0, 100]. This is useful for comparing images with differing intensity levels across facets or layers.

asp

The aspect ratio. See plot.window.

rasterImages, rasterParams

A list of rasters and raster parameters (e.g., xmin, xmax, etc.) to plot before plotting vals. These should be numeric arrays of 3 or 4 color channels with values from 0 to 1. If the raster parameters are omitted, then the raster limits are taken from the range of x and y. If the raster list has names, then these are matched against the levels of by and plotted accordingly.

useRaster

Should a bitmap raster be used for plotting? For 2D images, this is typically faster on supported devices. A fallback to polygon-based plotting is used if raster plotting is not supported. For 3D images, TRUE means to plot raster surfaces, and FALSE means to plot individual voxels as points.

Value

An object of class vizi_plot.

Author(s)

Kylie A. Bemis

See Also

vizi, vizi_pixels

Examples

require(datasets)

# plot signals
set.seed(1)
s <- simspec(6)
plot_signal(domain(s), s, group=colnames(s))

# volcano image
pos <- expand.grid(x=1:nrow(volcano), y=1:ncol(volcano))
plot_image(pos$x, pos$y, volcano, col=cpal("plasma"))

# plot original and transformed images
volcano2 <- trans2d(volcano, rotate=15, translate=c(-5, 5))
plot_image(list(original=volcano, transformed=volcano2))

Plotting Graphical Marks

Description

These functions provide plotting methods for various graphical marks. They are not intended to be called directly.

Usage

## S3 method for class 'vizi_points'
plot(x, plot = NULL, ...,
    n = Inf, downsampler = "lttb", jitter = "",
    sort = is.finite(n))

## S3 method for class 'vizi_lines'
plot(x, plot = NULL, ...,
    n = Inf, downsampler = "lttb", jitter = "",
    sort = is.finite(n))

## S3 method for class 'vizi_peaks'
plot(x, plot = NULL, ...,
    n = Inf, downsampler = "lttb", jitter = "",
    sort = is.finite(n))

## S3 method for class 'vizi_text'
plot(x, plot = NULL, ...,
    adj = NULL, pos = NULL, offset = 0.5)

## S3 method for class 'vizi_intervals'
plot(x, plot = NULL, ...,
    length = 0.25, angle = 90)

## S3 method for class 'vizi_rules'
plot(x, plot = NULL, ...)

## S3 method for class 'vizi_bars'
plot(x, plot = NULL, ...,
    width = 1, stack = FALSE)

## S3 method for class 'vizi_boxplot'
plot(x, plot = NULL, ...,
    range = 1.5, notch = FALSE, width = 0.8)

## S3 method for class 'vizi_image'
plot(x, plot = NULL, ...,
    alpha = NA, interpolate = TRUE, maxColorValue = 1)

## S3 method for class 'vizi_pixels'
plot(x, plot = NULL, ...,
    enhance = FALSE, smooth = FALSE, scale = FALSE,
    useRaster = TRUE)

## S3 method for class 'vizi_voxels'
plot(x, plot = NULL, ...,
    xslice = NULL, yslice = NULL, zslice = NULL)

Arguments

x

A graphical mark.

plot

A vizi_plot object.

...

Additional graphical parameters passed to the underlying base graphics plotting function.

n

Transformation. Maximum number of points to plot. This is useful for downsampling series with far more data points than are useful to plot. See downsample for details.

downsampler

Transformation. If n is less than the number of points, then this is the downsampling method to use. See downsample for details.

jitter

Transformation. Should jitter be applied to one or more position channels? One of "", "x", "y", or "xy".

sort

Transformation. Should the data be sorted (along the x-axis) before plotting? Mostly useful for line charts.

width

The width of the bars or boxplots.

stack

Should bars be stacked versus grouped side-by-side?

adj, pos, offset

See text.

length, angle

See arrows.

range, notch

See boxplot.

alpha

Opacity level from 0 to 1.

interpolate

See rasterImage.

maxColorValue

See col2rgb.

enhance

Transformation. The name of a contrast enhancement method, such as "hist" or "adapt" for enhance_hist() and enhance_adapt(), etc. See enhance for details.

smooth

Transformation. The name of a smoothing method, such as "gauss" or "bi" for filt2_gauss() and filt2_bi(), etc. See filt2 for details.

scale

Transformation. If TRUE, then all image values will be scaled to the range [0, 100]. This is useful for comparing images with differing intensity levels across facets or layers.

useRaster

Should a bitmap raster be used for plotting? This is typically faster on supported devices. A fallback to polygon-based plotting is used if raster plotting is not supported.

xslice, yslice, zslice

Numeric vectors giving the x, y, and/or z coordinates of the volumetric slices to plot. If none are provided, defaults to plotting all z-slices.

Details

These methods are not intended to be called directly. They are presented here to document the transformations and parameters they accept. These should be passed a list to the trans and params arguments in add_mark.

See add_mark for supported encodings.

Author(s)

Kylie A. Bemis

See Also

vizi, add_mark


Partial Least Squares

Description

Partial least squares (PLS), also called projection to latent structures, performs multivariate regression between a data matrix and a response matrix by decomposing both matrixes in a way that explains the maximum amount of covariation between them. It is especially useful when the number of predictors is greater than the number of observations, or when the predictors are highly correlated. Orthogonal partial least squares (OPLS) is also provided.

Usage

# NIPALS algorithm
pls_nipals(x, y, k = 3L, center = TRUE, scale. = FALSE,
	transpose = FALSE, niter = 100L, tol = 1e-5,
	verbose = NA, BPPARAM = bpparam(), ...)

# SIMPLS algorithm
pls_simpls(x, y, k = 3L, center = TRUE, scale. = FALSE,
	transpose = FALSE, method = 1L, retscores = TRUE,
	verbose = NA, BPPARAM = bpparam(), ...)

# Kernel algorithm
pls_kernel(x, y, k = 3L, center = TRUE, scale. = FALSE,
	transpose = FALSE, method = 1L, retscores = TRUE,
	verbose = NA, BPPARAM = bpparam(), ...)

## S3 method for class 'pls'
fitted(object, type = c("response", "class"), ...)

## S3 method for class 'pls'
predict(object, newdata, k,
	type = c("response", "class"), simplify = TRUE, ...)

# O-PLS algorithm
opls_nipals(x, y, k = 3L, center = TRUE, scale. = FALSE,
	transpose = FALSE, niter = 100L, tol = 1e-9, regression = TRUE,
	verbose = NA, BPPARAM = bpparam(), ...)

## S3 method for class 'opls'
coef(object, ...)

## S3 method for class 'opls'
residuals(object, ...)

## S3 method for class 'opls'
fitted(object, type = c("response", "class", "x"), ...)

## S3 method for class 'opls'
predict(object, newdata, k,
	type = c("response", "class", "x"), simplify = TRUE, ...)

# Variable importance in projection
vip(object, type = c("projection", "weights"))

Arguments

x

The data matrix of predictors.

y

The response matrix. (Can also be a factor.)

k

The number of PLS components to use. (Can be a vector for the predict method.)

center

A logical value indicating whether the variables should be shifted to be zero-centered, or a centering vector of length equal to the number of columns of x. The centering is performed implicitly and does not change the out-of-memory data in x.

scale.

A logical value indicating whether the variables should be scaled to have unit variance, or a scaling vector of length equal to the number of columns of x. The scaling is performed implicitly and does not change the out-of-memory data in x.

transpose

A logical value indicating whether x should be considered transposed or not. This can be useful if the input matrix is (P x N) instead of (N x P) and storing the transpose is expensive. This is not necessary for matter_mat and sparse_mat objects, but can be useful for large in-memory (P x N) matrices.

niter

The maximum number of iterations (per component).

tol

The tolerance for convergence (per component).

verbose

Should progress be printed for each iteration?

method

The kernel algorithm to use, where 1 and 2 correspond to the two kernel algorithms described by Dayal and MacGregor (1997). For 1, only of the covariance matrix t(X) %*% Y is computed. For 2, the variance matrix t(X) %*% X is also computed. Typically 1 will be faster if the number of predictors is large. For a smaller number of predictors, 2 will be more efficient.

retscores

Should the scores be computed and returned? This also computes the amount of explained covariance for each component. This is done automatically for NIPALS, but requires additional computation for the kernel algorithms.

regression

For O-PLS, should a 1-component PLS regression be fit to the processed data (for each orthogonal component removed).

...

Not currently used.

BPPARAM

An optional instance of BiocParallelParam. See documentation for bplapply. Currently only used for centering and scaling. Use options(matter.matmul.bpparam=TRUE) to enable parallel matrix multiplication for matter_mat and sparse_mat matrices.

object

An object inheriting from pls or opls.

newdata

An optional data matrix to use for the prediction.

type

The type of prediction, where "response" means the fitted response matrix and "class" will be the vector of class predictions (only valid for discriminant analyses).

simplify

Should the predictions be simplified (from a list) to an array (type="response") or data frame (type="class") when k is a vector?

Details

These functions implement partial least squares (PLS) using the original NIPALS algorithm by Wold et al. (1983), the SIMPLS algorithm by de Jong (1993), or the kernel algorithms by Dayal and MacGregor (1997). A function for calculating orthogonal partial least squares (OPLS) processing using the NIPALS algorithm by Trygg and Wold (2002) is also provided.

Both regression and classification can be performed. If passed a factor, then partial least squares discriminant analysis (PLS-DA) will be performed as described by M. Barker and W. Rayens (2003).

The SIMPLS algorithm (pls_simpls()) is relatively fast as it does not require the deflation of the data matrix. However, the results will differ slightly from the NIPALS and kernel algorithms for multivariate responses. In these cases, only the first component will be identical. The differences are not meaningful in most cases, but it is worth noting.

The kernel algorithms (pls_kernel()) tend to be faster than NIPALS for larger data matrices. The original NIPALS algorithm (pls_nipals()) is the reference implementation. The results from these algorithms are proven to be equivalent for both univariate and multivariate responses.

Note that the NIPALS algorithms cannot handle out-of-memory matter_mat and sparse_mat matrices due to the need to deflate the data matrix for each component. x will be coerced to an in-memory matrix.

Variable importance in projection (VIP) scores proposed by Wold et al. (1993) measure of the influence each variable has on the PLS model. They can be calculated with vip(). Note that non-NIPALS models must have retscores = TRUE for VIP to be calculated. In practice, a VIP score greater than ~1 is a useful criterion for variable selection, although there is no statistical basis for this rule.

Value

An object of class pls, with the following components:

  • coefficients: The regression coefficients.

  • projection: The projection weights of the regression used to calculate the coefficients from the y-loadings or to project the data to the scores.

  • residuals: The residuals from regression.

  • fitted.values: The fitted y matrix.

  • weights: (Optional) The x-weights of the regression.

  • loadings: The x-loadings of the latent variables.

  • scores: (Optional) The x-scores of the latent variables.

  • y.loadings: The y-loadings of the latent variables.

  • y.scores: (Optional) The y-scores of the latent variables.

  • cvar: (Optional) The covariance explained by each component.

Or, an object of class opls, with the following components:

  • weights: The orthogonal x-weights.

  • loadings: The orthogonal x-loadings.

  • scores: The orthogonal x-scores.

  • ratio: The ratio of the orthogonal weights to the PLS loadings for each component. This provides a measure of how much orthogonal variation is being removed by each component and can be interpreted as a scree plot similar to PCA.

  • x: The processed data matrix with orthogonal variation removed.

  • regressions: (Optional.) The PLS 1-component regressions on the processed data.

Author(s)

Kylie A. Bemis

References

S. Wold, H. Martens, and H. Wold. “The multivariate calibration method in chemistry solved by the PLS method.” Proceedings on the Conference on Matrix Pencils, Lecture Notes in Mathematics, Heidelberg, Springer-Verlag, pp. 286 - 293, 1983.

S. de Jong. “SIMPLS: An alternative approach to partial least squares regression.” Chemometrics and Intelligent Laboratory Systems, vol. 18, issue 3, pp. 251 - 263, 1993.

B. S. Dayal and J. F. MacGregor. “Improved PLS algorithms.” Journal of Chemometrics, vol. 11, pp. 73 - 85, 1997.

M. Barker and W. Rayens. “Partial least squares for discrimination.” Journal of Chemometrics, vol. 17, pp. 166-173, 2003.

J. Trygg and S. Wold. “Orthogonal projections to latent structures.” Journal of Chemometrics, vol. 16, issue 3, pp. 119 - 128, 2002.

S. Wold, A. Johansson, and M. Cocchi. “PLS: Partial least squares projections to latent structures.” 3D QSAR in Drug Design: Theory, Methods and Applications, ESCOM Science Publishers: Leiden, pp. 523 - 550, 1993.

See Also

prcomp

Examples

register(SerialParam())

x <- cbind(
		c(-2.18, 1.84, -0.48, 0.83),
		c(-2.18, -0.16, 1.52, 0.83))
y <- as.matrix(c(2, 2, 0, -4))

pls_nipals(x, y, k=2)

Principal Components Analysis for “matter” Matrices

Description

This method allows computation of a truncated principal components analysis of matter_mat and sparse_mat matrices using the implicitly restarted Lanczos method from the “irlba” package.

Usage

## S4 method for signature 'matter_mat'
prcomp(x, k = 3L, retx = TRUE, center = TRUE, scale. = FALSE, ...)

## S4 method for signature 'sparse_mat'
prcomp(x, k = 3L, retx = TRUE, center = TRUE, scale. = FALSE, ...)

prcomp_lanczos(x, k = 3L, retx = TRUE,
    center = TRUE, scale. = FALSE, transpose = FALSE,
    verbose = NA, BPPARAM = bpparam(), ...)

Arguments

x

A matter matrix, or any matrix-like object for prcomp_lanczos.

k

The number of principal components to return, must be less than min(dim(x)).

retx

A logical value indicating whether the rotated variables should be returned.

center

A logical value indicating whether the variables should be shifted to be zero-centered, or a centering vector of length equal to the number of columns of x. The centering is performed implicitly and does not change the out-of-memory data in x.

scale.

A logical value indicating whether the variables should be scaled to have unit variance, or a scaling vector of length equal to the number of columns of x. The scaling is performed implicitly and does not change the out-of-memory data in x.

transpose

A logical value indicating whether x should be considered transposed or not. This can be useful if the input matrix is (P x N) instead of (N x P) and storing the transpose is expensive. This is not necessary for matter_mat and sparse_mat objects, but can be useful for large in-memory (P x N) matrices.

verbose

Should progress messages be printed?

...

Additional options passed to rowStats or colStats.

BPPARAM

An optional instance of BiocParallelParam. See documentation for bplapply. Currently only used for centering and scaling. Use options(matter.matmul.bpparam=TRUE) to enable parallel matrix multiplication for matter_mat and sparse_mat matrices.

Value

An object of class ‘prcomp’. See ?prcomp for details.

Note

The built-in predict() method (from the stats package) is not compatible with the argument transpose=TRUE.

Author(s)

Kylie A. Bemis

See Also

irlba prcomp_irlba

Examples

register(SerialParam())
set.seed(1)

x <- matter_mat(rnorm(1000), nrow=100, ncol=10)

prcomp(x)

Score predictive performance

Description

Calculate performance metrics for predictions from classification or regression.

Usage

predscore(x, ref)

Arguments

x

The predicted values.

ref

The reference (observed) values.

Value

For classification, a numeric matrix with a row for each class and columns called "Recall" and "Precision".

For regression, a numeric vector with elements named "RMSE", "MAE", and "MAPE".

Author(s)

Kylie A. Bemis

Examples

set.seed(1)
n <- 250
s <- c("a", "b", "c")
x <- sample(s, n, replace=TRUE)
pred <- ifelse(runif(n) > 0.1, x, sample(s, n, replace=TRUE))
predscore(pred, x)

Signal Normalization

Description

Normalize or rescale a signal.

Usage

# Rescale the root-mean-squared of the signal
rescale_rms(x, scale = 1)

# Rescale the sum of (absolute values of) the signal
rescale_sum(x, scale = length(x))

# Rescale according to a specific sample
rescale_ref(x, ref = 1L, scale = 1, domain = NULL)

# Rescale the lower and upper limits of the signal
rescale_range(x, limits = c(0, 1))

# Rescale the interquartile range
rescale_iqr(x, width = 1, center = 0)

Arguments

x

A numeric vector.

scale

The scaling value.

ref

The sample (index or domain value) to use.

domain

The domain variable of the signal.

limits

The lower and upper limits to use.

width

The interquartile range to use.

center

The center to use.

Details

rescale_rms() and rescale_sum() simply divide the signal by its root-mean-square or absolute sum, respectively, before multiplying by the given scaling factor.

rescale_ref() finds the closest matching sample to ref in domain if given (it is treated as an index if domain is NULL), and then scales the entire signal to make that sample equal to scale.

rescale_range() simply rescales the signal to match the given upper and lower limits.

rescale_iqr() attempts to rescale the signal so that its interquartile range is approximately width.

Value

A rescaled numeric vector the same length as x.

Author(s)

Kylie A. Bemis

Examples

set.seed(1)
x <- rnorm(100)

sqrt(mean(x^2))
y <- rescale_rms(x, 1)
sqrt(mean(y^2))

range(x)
z <- rescale_range(x, c(-1, 1))
range(z)

Parallel RNG Streams

Description

These functions provide utilities for working with multiple streams of pseudo-random numbers.

Usage

RNGStreams(n = length(size), size = 1L)

getRNGStream()

setRNGStream(seed = NULL, kind = NULL)

Arguments

n

The number of RNG streams to create.

size

If RNGkind is set to "L'Ecuyer-CMRG", then iterate this number of RNG substreams between each returned stream.

seed

A valid RNG stream to assign to .Random.seed, or a list with elements named seed and kind.

kind

The RNGkind to use when setting the RNG stream.

Value

For RNGStreams, a list of length n with RNG streams (with elements named seed and kind) for use with setRNGStream.

Author(s)

Kylie A. Bemis

See Also

RNGkind, nextRNGStream

Examples

# create parallel RNG streams
register(SerialParam())
RNGkind("L'Ecuyer-CMRG")
set.seed(1)
seeds <- RNGStreams(4)
seeds

Compute area under ROC curve

Description

Calculate the area under the receiver operating characteristic curve (ROC AUC).

Usage

rocscore(x, ref, n = 32L)

Arguments

x

The prediction scores.

ref

The (logical) binary response.

n

The number of points in the curve.

Value

A single number between 0 and 1 giving the ROC AUC, with an attribute called ROC which is a data frame giving the full ROC curve.

Author(s)

Kylie A. Bemis

Examples

set.seed(1)
x <- runif(100)
y <- ifelse(x > 0.5 & runif(100) > 0.2, TRUE, FALSE)
rocscore(x, y)

Rolling Summaries of a Vector

Description

Summarize a vector in rolling windows.

Usage

rollvec(x, width, stat = "sum", prob = 0.5)

Arguments

x

A numeric vector.

width

The width of the rolling window. Must be odd.

stat

The statistic used to summarize the values in each bin. Must be one of "sum", "mean", "max", "min", "sd", "var", "mad", or "quantile".

prob

The quantile for stat = "quantile".

Value

An numeric vector with the same length as x with the summarized values from each rolling window.

Author(s)

Kylie A. Bemis

Examples

set.seed(1)

x <- sort(runif(20))

rollvec(x, 5L, "mean")

Compute Distances between Rows or Columns of a Matrix

Description

Compute and return the distances between specific rows or columns of matrices according to a specific distance metric.

Usage

## S4 method for signature 'matrix,matrix'
rowDists(x, y, ..., BPPARAM = bpparam())

## S4 method for signature 'matrix,matrix'
colDists(x, y, ..., BPPARAM = bpparam())

## S4 method for signature 'matter_mat,matrix'
rowDists(x, y, ..., BPPARAM = bpparam())

## S4 method for signature 'matrix,matter_mat'
colDists(x, y, ..., BPPARAM = bpparam())

## S4 method for signature 'sparse_mat,matrix'
rowDists(x, y, ..., BPPARAM = bpparam())

## S4 method for signature 'matrix,sparse_mat'
colDists(x, y, ..., BPPARAM = bpparam())

## S4 method for signature 'ANY,missing'
rowDists(x, at, ..., simplify = TRUE, BPPARAM = bpparam())

## S4 method for signature 'ANY,missing'
colDists(x, at, ..., simplify = TRUE, BPPARAM = bpparam())

.rowDists(x, y, metric = "euclidean", p = 2,
	weights = NULL, iter.dim = 1L, BPPARAM = bpparam(), ...)

.colDists(x, y, metric = "euclidean", p = 2,
	weights = NULL, iter.dim = 1L, BPPARAM = bpparam(), ...)

# Low-level row/col distance functions
rowdist(x, y = x, metric = "euclidean", p = 2, weights = NULL, ...)
coldist(x, y = x, metric = "euclidean", p = 2, weights = NULL, ...)

rowdist_at(x, ix, y = x, iy = list(1L:nrow(y)),
	metric = "euclidean", p = 2, weights = NULL, ...)

coldist_at(x, ix, y = x, iy = list(1L:ncol(y)),
	metric = "euclidean", p = 2, weights = NULL, ...)

Arguments

x, y

Numeric matrices for which distances should be calculated according to rows or columns. If a parallel backend is provided, then the calculation is parallelized over x, and y is passed to each worker.

at

A list, matrix, or vector of specific row or column indices for which to calculate the distances. Each row or column of x will be compared to the rows or columns indicated by the corresponding element of at.

simplify

Should the result be simplified into a matrix if possible?

metric

Distance metric to compute. Supported metrics include "euclidean", "maximum", "manhattan", and "minkowski".

p

The power for the Minkowski distance.

weights

A numeric vector of weights for the distance components if calculating weighted distances. For example, the weighted Euclidean distance is sqrt(sum(w * (x - y)^2)).

iter.dim

The dimension to iterate over. Must be 1 or 2, where 1 indicates rows and 2 indicates columns.

ix, iy

A list of specific row or column indices for which to calculate the pairwise distances. Numeric vectors will be coerced to lists. Each list element should give a vector of indices to use for a distance computation. Elements of length 1 will be recycled to an appropriate length.

BPPARAM

An optional instance of BiocParallelParam. See documentation for bplapply.

...

Additional arguments passed to rowdist() or coldist().

Details

rowdist() and coldist() calculate straightforward distances between each row or column, respectively in x and y. If y = x (the default), then the output of rowdist() will match the output of dist (except it will be an ordinary matrix).

rowdist_at() and coldist_at() allow passing a list of specific row or column indices for which to calculate the distances.

rowDists() and colDists() are S4 generics. The current methods provide (optionally parallelized) versions of rowdist() and coldist() for matter_mat and sparse_mat matrices.

Value

For rowdist() and coldist(), a matrix with rows equal to the number of observations in x and columns equal to the number of observations in y.

For rowdist_at() and coldist_at(), a list where each element gives the pairwise distances corresponding to the indices given by ix and iy.

rowDists() and colDists() have corresponding return values depending on whether at has been specified (and the value of simplify).

Author(s)

Kylie A. Bemis

See Also

dist

Examples

register(SerialParam())

set.seed(1)

x <- matrix(runif(25), nrow=5, ncol=5)
y <- matrix(runif(25), nrow=3, ncol=5)

rowDists(x) # same as as.matrix(dist(x))
rowDists(x, y)

# distances between:
# x[1,] vs x[,]
# x[5,] vs x[,]
rowdist_at(x, c(1,5))

# distances between:
# x[1,] vs x[1:3,]
# x[5,] vs x[3:5,]
rowdist_at(x, ix=c(1,5), iy=list(1:3, 3:5))

# distances between:
# x[i,] vs x[(i-1):(i+1),]
rowDists(x, at=roll(1:5, width=3))

# distances between:
# x[,] vs x[1,]
rowDists(x, at=1)

Relative Sequence Generation

Description

Generate a sequence with spacing based on relative differences.

Usage

seq_rel(from, to, by)

Arguments

from

The start of the sequence.

to

The end of the sequence.

by

The relative difference between consecutive elements.

Details

Because the relative differences depend on whether we treat the lesser or greater term as the "reference", the function actually treats each element as the center of a bin of width by (relative to that element).

Value

A numeric vector of some (algorithmically-determined) length where consecutive elements are separated by consistent relative differences.

Author(s)

Kyle Dahlin and Kylie A. Bemis

See Also

seq

Examples

# create a sequence w/ 1e-4 relative diff spacing
x <- seq_rel(500, 505, by=1e-4)
x

# relative spacing is averaged between lesser/greater element
head(diff(x) / x[-1])
head(diff(x) / x[-length(x)])
dx <- 0.5 * ((diff(x) / x[-1]) + (diff(x) / x[-length(x)]))
dx

Spatial Gaussian Mixture Model

Description

Spatially segment a single-channel image using a Dirichlet Gaussian mixture model (DGMM).

Usage

# Spatial Gaussian mixture model
sgmix(x, y, vals, r = 1, k = 2, group = NULL,
	weights = c("gaussian", "bilateral", "adaptive"),
	metric = "maximum", p = 2, neighbors = NULL,
	annealing = TRUE, niter = 10L, tol = 1e-3,
	compress = FALSE, byrow = FALSE,
	verbose = NA, chunkopts=list(),
	BPPARAM = bpparam(), ...)

## S3 method for class 'sgmix'
fitted(object,
	type = c("mu", "sigma", "class"), channel = NULL, ...)

## S3 method for class 'sgmix'
logLik(object, ...)

Arguments

x, y, vals

Pixel coordinates (x, y) and their intensity values (vals). If multiple image channels should be segmented, then vals can be a list of images or a matrix of flattened image vectors. Alternatively, x can be an array of images, in which case the x and y coordinates are generated from the first 2 dimensions.

r

The spatial smoothing radius.

k

The number of segments (per group, if applicable).

group

A vector of pixel groups. Pixels belonging to each group will be segmented independently, and will be assigned to different segments.

weights

The type of spatial weights to use.

metric

Distance metric to use when finding neighboring pixels. Supported metrics include "euclidean", "maximum", "manhattan", and "minkowski".

p

The power for the Minkowski distance.

neighbors

An optional list giving the neighboring pixel indices for each pixel.

annealing

Should simulated annealing be attempted every iteration? (If FALSE, simulated annealing will still be attempted if the log-likelihood decreases instead of increases during an iteration.)

niter

The maximum number of iterations.

tol

The tolerance for convergence, as measured by the change in log-likelihood in successive iterations.

compress

Should the results be compressed? The resulting sgmix object will be larger than the original image, so compression can be useful. If TRUE, then the class component is compressed using drle, and the probability component is not returned.

byrow

If vals is a matrix of flattened image vectors, should its rows or columns be plotted?

verbose

Should progress be printed for each iteration?

chunkopts

An (optional) list of chunk options including nchunks, chunksize, and serialize. See chunkApply.

BPPARAM

An optional instance of BiocParallelParam. See documentation for bplapply.

...

Ignored.

object

An object inheriting from sgmix.

type

The type of fitted values to extract.

channel

The channel of fitted values to extract.

Details

Spatial segmentation is performed using a Gaussian mixture model from Guo et al. (2019) that uses Dirichlet priors to incorporate spatial dependence. The strength of the spatial smoothing depends on the smoothing radius (r) and the type of spatial weights. The "bilateral" and "adaptive" weights can preserve edges better than the standard "gaussian" weights at the expense of a (potentially) noisier segmentation.

The segmentation is initialized using k-means clustering. An expectation-maximization (E-M) algorithm with gradient descent is then used to estimate the model parameters based on log-likelihood. Optionally, simulated annealing can be used to prevent the model from getting stuck in local maxima.

Value

An object of class sgmix, with the following components:

  • class: A list of class assignments for each channel.

  • probability: (Optional) A matrix or array of posterior class probabilities.

  • mu: The fitted class means.

  • sigma: The fitted class standard deviations.

  • alpha: The fitted Dirichlet priors.

  • beta: The estimated strength of the spatial dependence.

  • group: (Optional) The pixel groups.

Author(s)

Kylie A. Bemis

References

D. Guo, K. Bemis, C. Rawlins, J. Agar, and O. Vitek. “Unsupervised segmentation of mass spectrometric ion images characterizes morphology of tissues” Bioinformatics, vol. 35, issue 14, pp. i208-i217, 2019.

See Also

kmeans

Examples

require(datasets)

set.seed(1)
seg <- sgmix(volcano, k=3)

image(fitted(seg, "class", channel=1L))

Cleveland-Style Shingles

Description

Shingles are a generalization of factors to continuous variables. Cleveland-style shingles distribute the range of a continuous variable into overlapping discrete intervals, which can be more useful for visualization than mutually-exclusive bins.

Usage

shingles(x, breaks, overlap = 0.5, labels = NULL)

Arguments

x

A numeric vector.

breaks

Either the number of intervals or a matrix of breaks (as returned by co.intervals).

overlap

The fraction of overlap between intervals.

labels

The names of the intervals.

Value

A list giving the indices of x in each shingle with the following attributes:

  • breaks: A matrix where each row gives the lower and upper limits for each shingle.

  • counts: The number of observations in each shingle.

  • mids: The center of each shingle.

Author(s)

Kylie A. Bemis

References

W. S. Cleveland. Visualizing Data. New Jersey: Summit Press. 1993.

See Also

co.intervals

Examples

set.seed(1)
x <- rnorm(100)

shingles(x, 6)

Simple Logging

Description

A simple logger that uses R's built-in signaling conditions.

Usage

## Instance creation
simple_logger(file = NULL, bufferlimit = 50L, domain = NULL)

## Get logger used by matter
matter_logger()

## Additional methods documented below

Arguments

file

The name of the log file. If NULL, then no log file is used, and the log is kept in memory.

bufferlimit

The maximum number of buffered log entries before they are flushed to the log file.

domain

See gettext for details. If NA, log entries will not be translated.

Value

An object of reference class simple_logger.

Fields

id:

An identifier for file synchronization.

buffer:

A character vector of buffered log entries that have not yet been flushed to the log file.

bufferlimit:

The maximum number of buffered log entries before they are flushed to the log file.

logfile:

The path to the log file.

domain:

See gettext for details. If NA, log entries will not be translated.

Methods

Class-specific methods:

$flush():

Flush buffered log entries to the log file.

$append(entry):

Append a plain text entry to the log.

$append_session():

Append the sessionInfo() to the log.

$append_trace():

Append the traceback() to the log.

$history(print = TRUE):

Print the complete log to the console (TRUE) or return it as a character vector (FALSE).

$log(..., signal = FALSE, call = NULL):

Record a log entry with a timestamp, optionally signaling a "message", "warning", or "error" condition. For "warning" and "error" conditions (only), the call may also be passed to be deparsed and recorded in the log entry.

$message(...):

Record a log entry with a timestamp and signal a message condition.

$warning(...):

Record a log entry with a timestamp and signal a warning condition.

$stop(...):

Record a log entry with a timestamp and signal an error condition.

$move(file):

Move the log file to a new location. Moving the log file automatically appends the current sessionInfo().

$copy(file):

Copy the log file to a different location.

$close():

Flush buffered log entries (including sessionInfo()) and detach the log file. This is called automatically at the end of an R session. Note that log files in the temporary directory will be deleted after the R session ends if they are not $move()ed to a persistent directory before quitting.

Standard generic methods:

path(x), path(x) <- value:

Get or set the path to the log file. The replacement method is the same as $move().

Author(s)

Kylie A. Bemis

Examples

sl <- simple_logger(tempfile(fileext=".log"))

sl$log("This is a silent log entry that doesn't signal a condition.")
sl$log("This log entry signals a message condition.", signal=TRUE)
sl$log("This log entry signals a message condition.", signal="message")
sl$message("This log entry also signals a message condition")
sl$flush()

readLines(path(sl))

Simulate Spectra

Description

Simulate spectra from noise and a list of peaks.

Usage

simspec(n = 1L, npeaks = 50L,
	x = rlnorm(npeaks, 7, 0.3), y = rlnorm(npeaks, 1, 0.9),
	domain = c(0.9 * min(x), 1.1 * max(x)), size = 10000,
	sdx = 1e-5, sdy = sdymult * log1p(y), sdymult = 0.2,
	sdnoise = 0.1, resolution = 1000, fmax = 0.5,
	baseline = 0, decay = 10, units = "relative")

simspec1(x, y, xout, peakwidths = NA_real_,
	sdnoise = 0, resolution = 1000, fmax = 0.5)

Arguments

n

The number of spectra to simulate.

npeaks

The number of peaks to simulate. Not used if x and y are provided.

x, y

The locations and values of the spectral peaks.

xout, domain

The output domain variable or its range.

size

The number of samples in each spectrum.

sdx

The standard deviation of the error in the observed locations of peaks, in units indicated by units.

sdy

The standard deviation(s) for the distributions of observed peak values on the log scale.

sdymult

A multiplier used to calculate sdy based on the mean values of the peaks; used to simulate multiplicative variance. Not used if sdy is provided.

sdnoise

The standard deviation of the random noise in the spectra on the log scale.

resolution

The resolution as defined by x / dx, where x is the observed peak location and dx is the width of the peak at a proportion of its maximum height defined by fmax (defaults to full-width-at-half-maximum – FWHM – definition).

fmax

The fraction of the maximum peak height to use when defining the resolution.

peakwidths

The peak widths at fmax. Typically, these are calculated automatically from resolution.

baseline

The maximum intensity of the baseline. Note that baseline=0 means there is no baseline.

decay

A constant used to calculate the exponential decay of the baseline. Larger values mean the baseline decays more sharply.

units

The units for sdx. Either "absolute" or "relative".

Value

Either a numeric vector of the same length as size, giving the simulated spectrum, or a size x n matrix of simulated spectra.

Author(s)

Kylie A. Bemis

Examples

set.seed(1)
y <- simspec(2)
x <- attr(y, "domain")

plot(x, y[,1], type="l", ylim=c(-max(y), max(y)))
lines(x, -y[,2], col="red")

Fast Simple Network of Workstations (SNOW)-style Parallel Backend

Description

This class provides a enhanced version of the SnowParam parallel backend from BiocParallel based on the newer PSOCK cluster implementation from the parallel package.

Usage

## Instance creation
SnowfastParam(workers = snowWorkers(),
    tasks = 0L, stop.on.error = TRUE, progressbar = FALSE,
    RNGseed = NULL, timeout = WORKER_TIMEOUT,
    exportglobals = TRUE, exportvariables = TRUE,
    resultdir = NA_character_, jobname = "BPJOB",
    force.GC = FALSE, fallback = TRUE, useXDR = FALSE,
    manager.hostname = NA_character_, manager.port = NA_character_, ...)

## Additional methods documented below

Arguments

workers

Either the number of workers or the names of cluster nodes.

tasks

The number of tasks per job. See SnowParam for details.

stop.on.error

Enable stop on error. See SnowParam for details.

progressbar

Enable text progress bar. See SnowParam for details.

RNGseed

The seed for random number generation. See SnowParam for details.

timeout

Time (in seconds) allowed for workers to complete a task. See SnowParam for details.

exportglobals

Export base::options() from manager to workers? See SnowParam for details.

exportvariables

Automatically export variables defined in the global environment and used by the function? See SnowParam for details.

resultdir

Job results directory. See SnowParam for details.

jobname

The name of the job for logging and results. See SnowParam for details.

force.GC

Whether to invoke the garbage collector after each call to FUN. See SnowParam for details.

fallback

Fall back to using SerialParam if the cluster is not started and the number of workers is no greater than 1? See SnowParam for details.

useXDR

Should data be converted to network byte order when serializing from the manager to the workers? The default (FALSE) assumes all nodes in the cluster are little endian. Passed to makePSOCKcluster.

manager.hostname

Host name of the manager node.

manager.port

Port on the manager with which workers communicate.

...

Additional arguments passed to makePSOCKcluster.

Details

SnowfastParam is a faster but somewhat more limited version of SnowParam. Like SnowParam, it uses simple network of workstations (SNOW)-style parallelism so that it is available on all operating systems.

The workers are initialized in parallel, so cluster startup can often be significantly faster than SnowParam if utilizing a large number of workers.

Because the workers are started using makePSOCKcluster from the parallel package rather than using BiocParallel's startup script, some features of BiocParallel-managed backends such as logging are unsupported or only partially available.

The default parameter useXDR=TRUE assumes that all nodes are little endian so that data can be sent to workers without the overhead of conversion to network byte order. This should result in overall faster performance for large datasets on compatible clusters.

SnowfastParam is intended to be used as a drop-in replacement for SnowParam in most situations and as a more stable alternative to MulticoreParam for long-running jobs in graphical environments like RStudio where forking the R process should be avoided.

Value

An parallel backend derived from class BiocParallelParam.

Methods

BiocParallelParam methods:

bpstart(x):

Start the cluster.

bpstop(x):

Stop the cluster.

Additional methods documented in BiocParallelParam.

Author(s)

Kylie A. Bemis

See Also

SnowParam, MulticoreParam

Examples

x <- replicate(1000, runif(200), simplify=FALSE)

bp <- SnowfastParam(workers=4, tasks=20, progressbar=TRUE)

ans <- chunkLapply(x, sum, BPPARAM=bp)

Sparse Vectors and Matrices

Description

The sparse_mat class implements sparse matrices, potentially stored out-of-memory. Both compressed-sparse-column (CSC) and compressed-sparse-row (CSR) formats are supported. Sparse vectors are also supported through the sparse_vec class.

Usage

## Instance creation
sparse_mat(data, index, type = "double",
    nrow = NA_integer_, ncol = NA_integer_, dimnames = NULL,
    pointers = NULL, domain = NULL, offset = 0L, rowMaj = FALSE,
    tolerance = c(abs=0), sampler = "none", ...)

sparse_vec(data, index, type = "double",
    length = NA_integer_, names = NULL,
    domain = NULL, offset = 0L, rowMaj = FALSE,
    tolerance = c(abs=0), sampler = "none", ...)

# Check if an object is a sparse matrix
is.sparse(x)

# Coerce an object to a sparse matrix
as.sparse(x, ...)

## Additional methods documented below

Arguments

data

Either the non-zero values of the sparse array, or (if index is missing) a numeric vector or matrix from which to create the sparse array. For a sparse_vec, these should be a numeric vector. For a sparse_mat these can be a numeric vector if pointers is supplied, or a list of numeric vectors if pointers is NULL.

index

For sparse_vec, the indices of the non-zero items. For sparse_mat, either the row-indices or column-indices of the non-zero items, depending on the value of rowMaj.

type

A 'character' vector giving the storage mode of the data in virtual memory such. See ?"matter-types" for possible values.

nrow, ncol, length

The number of rows and columns, or the length of the array.

dimnames

The names of the sparse matrix dimensions.

names

The names of the sparse vector elements.

pointers

The (zero-indexed) pointers to the start of either the rows or columns (depending on the value of rowMaj) in data and index when they are numeric vectors rather than lists.

domain

Either NULL or a vector with length equal to the number of rows (for CSC matrices) or the number of columns (for CSR matrices). If NULL, then index is assumed to be row or column indices. If a vector, then they define the how the non-zero elements are matched to rows or columns. The index value of each non-zero element is matched against this domain using binary search. Must be numeric.

offset

If domain is NULL (i.e., index represents the actual row/column indices), then this is the index of the first row/column. The default of 0 means that index is indexed from 0.

rowMaj

Whether the data should be stored using compressed-sparse-row (CSR) representation (as opposed to compressed-sparse-column (CSC) representation). Defaults to 'FALSE', for efficient access to columns. Set to 'TRUE' for more efficient access to rows instead.

tolerance

For non-NULL domain, the tolerance used for floating-point equality when matching index to the domain. The vector should be named. Use 'absolute' to use absolute differences, and 'relative' to use relative differences.

sampler

For non-zero tolerances, how the data values should be combined when there are multiple index values within the tolerance. Must be of 'none', 'sum', 'mean', 'max', 'min', 'area', 'linear', 'cubic', 'gaussian', or 'lanczos'. Note that 'none' means nearest-neighbor interpolation.

x

An object to check if it is a sparse matrix or coerce to a sparse matrix.

...

Additional arguments to be passed to constructor.

Value

An object of class sparse_mat.

Slots

data:

The non-zero data values. Can be a numeric vector or a list of numeric vectors.

type:

The storage mode of the accessed data when read into R. This is a 'factor' with levels 'raw', 'logical', 'integer', 'numeric', or 'character'.

dim:

Either NULL for vectors, or an integer vector of length one of more giving the maximal indices in each dimension for matrices and arrays.

names:

The names of the data elements for vectors.

dimnames:

Either NULL or the names for the dimensions. If not NULL, then this should be a list of character vectors of the length given by 'dim' for each dimension. This is always NULL for vectors.

index:

The indices of the non-zero items. Can be a numeric vector or a list of numeric vectors.

pointers:

The pointers to the beginning of the rows or columns if index and data use vector storage rather than list storage.

domain:

Either NULL or a vector with length equal to the number of rows (for CSC matrices) or the number of columns (for CSR matrices). If NULL, then index is assumed to be row or column indices. If a vector, then they define the how the non-zero elements are matched to rows or columns. The index value of each non-zero element is matched against this domain using binary search. Must be numeric.

offset:

If domain is NULL (i.e., index represents the actual row/column indices), then this is the index of the first row/column. The default of 0 means that index is indexed from 0.

tolerance:

For non-NULL domain, the tolerance used for floating-point equality when matching index to the domain. The vector should be named. Use 'absolute' to use absolute differences, and 'relative' to use relative differences.

sampler:

The type of summarization or interpolation performed when there are multiple index values within the tolerance of the requested domain value(s).

ops:

Deferred arithmetic operations.

transpose

Indicates whether the data is stored in row-major order (TRUE) or column-major order (FALSE). For a matrix, switching the order that the data is read is equivalent to transposing the matrix (without changing any data).

Extends

matter

Creating Objects

sparse_mat and sparse_vec instances can be created through sparse_mat() and sparse_vec(), respectively.

Methods

Class-specific methods:

atomdata(x):

Access the 'data' slot.

adata(x):

An alias for atomdata(x).

atomindex(x):

Access the 'index' slot.

aindex(x):

An alias for atomindex(x).

pointers(x):

Access the 'pointers' slot.

domain(x):

Access the 'domain' slot.

tolerance(x), tolerance(x) <- value:

Get or set resampling 'tolerance'.

sampler(x), sampler(x) <- value:

Get or set the 'sampler' method.

fetch(x, ...):

Pull data into shared memory.

flash(x, ...):

Push data to a temporary file.

Standard generic methods:

dim(x):

Get 'dim'.

dimnames(x), dimnames(x) <- value:

Get or set 'dimnames'.

x[i, j, ..., drop], x[i, j] <- value:

Get or set the elements of the sparse matrix. Use drop = NULL to return a subset of the same class as the object.

cbind(x, ...), rbind(x, ...):

Combine sparse matrices by row or column.

t(x):

Transpose a matrix. This is a quick operation which only changes metadata and does not touch the data representation.

rowMaj(x):

Check the data orientation.

Author(s)

Kylie A. Bemis

See Also

matter

Examples

x <- matrix(rbinom(100, 1, 0.2), nrow=10, ncol=10)

y <- sparse_mat(x)
y[]

Streaming Summary Statistics

Description

These functions allow calculation of streaming statistics. They are useful, for example, for calculating summary statistics on small chunks of a larger dataset, and then combining them to calculate the summary statistics for the whole dataset.

This is not particularly interesting for simpler, commutative statistics like sum(), but it is useful for calculating non-commutative statistics like running sd() or var() on pieces of a larger dataset.

Usage

# calculate streaming univariate statistics
s_stat(x, stat, group, na.rm = FALSE, ...)

s_range(x, ..., na.rm = FALSE)

s_min(x, ..., na.rm = FALSE)

s_max(x, ..., na.rm = FALSE)

s_prod(x, ..., na.rm = FALSE)

s_sum(x, ..., na.rm = FALSE)

s_mean(x, ..., na.rm = FALSE)

s_var(x, ..., na.rm = FALSE)

s_sd(x, ..., na.rm = FALSE)

s_any(x, ..., na.rm = FALSE)

s_all(x, ..., na.rm = FALSE)

s_nnzero(x, ..., na.rm = FALSE)

# calculate streaming matrix statistics
s_rowstats(x, stat, group, na.rm = FALSE, ...)

s_colstats(x, stat, group, na.rm = FALSE, ...)

# calculate combined summary statistics
stat_c(x, y, ...)

Arguments

x, y, ...

Object(s) on which to calculate a summary statistic, or a summary statistic to combine.

stat

The name of a summary statistic to compute over the rows or columns of a matrix. Allowable values include: "range", "min", "max", "prod", "sum", "mean", "var", "sd", "any", "all", and "nnzero".

group

A factor or vector giving the grouping. If not provided, no grouping will be used.

na.rm

If TRUE, remove NA values before summarizing.

Details

These summary statistics methods are intended to be applied to chunks of a larger dataset. They can then be combined either with the individual summary statistic functions, or with stat_c(), to produce the combined summary statistic for the full dataset. This is most useful for calculating running variances and standard deviations iteratively, which would be difficult or impossible to calculate on the full dataset.

The variances and standard deviations are calculated using running sum of squares formulas which can be calculated iteratively and are accurate for large floating-point datasets (see reference).

Value

For all univariate functions except s_range(), a single number giving the summary statistic. For s_range(), two numbers giving the minimum and the maximum values.

For s_rowstats() and s_colstats(), a vector of summary statistics.

Author(s)

Kylie A. Bemis

References

B. P. Welford. “Note on a Method for Calculating Corrected Sums of Squares and Products.” Technometrics, vol. 4, no. 3, pp. 1-3, Aug. 1962.

B. O'Neill. “Some Useful Moment Results in Sampling Problems.” The American Statistician, vol. 68, no. 4, pp. 282-296, Sep. 2014.

See Also

Summary

Examples

set.seed(1)
x <- sample(1:100, size=10)
y <- sample(1:100, size=10)

sx <- s_var(x)
sy <- s_var(y)

var(c(x, y))
stat_c(sx, sy) # should be the same

sxy <- stat_c(sx, sy)

# calculate with 1 new observation
var(c(x, y, 99))
stat_c(sxy, 99)

# calculate over rows of a matrix
set.seed(2)
A <- matrix(rnorm(100), nrow=10)
B <- matrix(rnorm(100), nrow=10)

sx <- s_rowstats(A, "var")
sy <- s_rowstats(B, "var")

apply(cbind(A, B), 1, var)
stat_c(sx, sy) # should be the same

C-Style Structs Stored in Virtual Memory

Description

This is a convenience function for creating and reading C-style structs in a single file stored in virtual memory.

Usage

struct(..., path = NULL, readonly = FALSE, offset = 0, filename)

Arguments

...

Named integers giving the members of the struct. They should be of the form name=c(type=length).

path, filename

A single string giving the name of the file.

readonly

Should the file be treated as read-only?

offset

A scalar integer giving the offset from the beginning of the file.

Details

This is simply a convenient wrapper around the wrapper around matter_list that allows easy specification of C-style structs in a file.

Value

A object of class matter_list.

Author(s)

Kylie A. Bemis

See Also

matter_list

Examples

x <- struct(first=c(int=1), second=c(double=1))

x$first <- 2L
x$second <- 3.33

x$first
x$second

Summary Statistics for “matter” Objects

Description

These functions efficiently calculate summary statistics for matter_arr objects. For matrices, they operate efficiently on both rows and columns.

Usage

## S4 method for signature 'matter_arr'
range(x, ..., na.rm)
## S4 method for signature 'matter_arr'
min(x, ..., na.rm)
## S4 method for signature 'matter_arr'
max(x, ..., na.rm)
## S4 method for signature 'matter_arr'
prod(x, ..., na.rm)
## S4 method for signature 'matter_arr'
mean(x, ..., na.rm)
## S4 method for signature 'matter_arr'
sum(x, ..., na.rm)
## S4 method for signature 'matter_arr'
sd(x, na.rm)
## S4 method for signature 'matter_arr'
var(x, na.rm)
## S4 method for signature 'matter_arr'
any(x, ..., na.rm)
## S4 method for signature 'matter_arr'
all(x, ..., na.rm)

## S4 method for signature 'matter_mat'
colMeans(x, na.rm, dims = 1, ...)
## S4 method for signature 'matter_mat'
colSums(x, na.rm, dims = 1, ...)

## S4 method for signature 'matter_mat'
rowMeans(x, na.rm, dims = 1, ...)
## S4 method for signature 'matter_mat'
rowSums(x, na.rm, dims = 1, ...)

Arguments

x

A matter_arr object.

...

Arguments passed to chunk_lapply(), chunk_rowapply(), or chunk_colapply().

na.rm

If TRUE, remove NA values before summarizing.

dims

Not used.

Details

These summary statistics methods operate on chunks of data which are loaded into memory and then freed before reading the next chunk.

For row and column summaries on matrices, the iteration scheme is dependent on the layout of the data. Column-major matrices will always be iterated over by column, and row-major matrices will always be iterated over by row. Row statistics on column-major matrices and column statistics on row-major matrices are calculated iteratively.

Variance and standard deviation are calculated using a running sum of squares formula which can be calculated iteratively and is accurate for large floating-point datasets (see reference).

Value

For mean, sd, and var, a single number. For the column summaries, a vector of length equal to the number of columns of the matrix. For the row summaries, a vector of length equal to the number of rows of the matrix.

Author(s)

Kylie A. Bemis

References

B. P. Welford, “Note on a Method for Calculating Corrected Sums of Squares and Products,” Technometrics, vol. 4, no. 3, pp. 1-3, Aug. 1962.

See Also

stream_stat

Examples

register(SerialParam())

x <- matter(1:100, nrow=10, ncol=10)

sum(x)
mean(x)
var(x)
sd(x)

colSums(x)
colMeans(x)

rowSums(x)
rowMeans(x)

Rasterize a Scattered 2D or 3D Signal

Description

Estimate the raster dimensions of a scattered 2D or 3D signal based on its pixel coordinates.

Usage

# Rasterize a 2D signal
to_raster(x, y, vals)

# Rasterize a 3D signal
to_raster3(x, y, z, vals)

# Check if coordinates are gridded
is_gridded(x, tol = sqrt(.Machine$double.eps))

Arguments

x, y, z

The coordinates of the data to be rasterized. For is_gridded(), a numeric matrix or data frame where each column gives the pixel coordinates for a different dimension.

vals

The data values to be rasterized.

tol

The tolerance allowed when estimating the resolution. Noise in the sampling rate will be allowed up to this amount when determining if the data is approximately gridded or not.

Details

This is meant to be a more efficient version of approx2() when the data is already (approximately) gridded. Otherwise, approx2() is used.

Value

A numeric vector giving the estimated raster dimensions.

Author(s)

Kylie A. Bemis

See Also

approx2

Examples

# create an image
set.seed(1)
i <- seq(-4, 4, length.out=12)
j <- seq(1, 3, length.out=9)
co <- expand.grid(x=i, y=j)
z <- matrix(atan(co$x / co$y), nrow=12, ncol=9)
vals <- 10 * (z - min(z)) / diff(range(z))

# scatter coordinates and flatten image
d <- expand.grid(x=jitter(1:12), y=jitter(1:9))
d$vals <- as.vector(z)

# rasterize
to_raster(d$x, d$y, d$vals)

2D Spatial Transformation

Description

Perform linear spatial transformations on a matrix, including rigid, similarity, and affine transformations.

Usage

trans2d(x, y, z, pmat,
	rotate = 0, translate = c(0, 0), scale = c(1, 1),
	interp = "linear", dimout = dim(z), ...)

Arguments

x, y, z

The data to be interpolated. Alternatively, x can be a matrix, in which case the matrix elements are used for z and x and y are generated from the matrix's dimensions.

pmat

A 3 x 2 transformation matrix for performing an affine transformation. Automatically generated from rotate, translate, and scale if not provided.

rotate

Rotation in degrees.

translate

Translation vector, in the same units as x and y, if given.

scale

Scaling factors.

interp

Interpolation method. See approx2.

dimout

The dimensions of the returned matrix.

...

Additional arguments passed to approx2.

Value

If x is a matrix or z is provided, returns a transformed matrix with the dimensions of dimout.

Otherwise, only the transformed coordinates are returned in a data.frame.

Author(s)

Kylie A. Bemis

See Also

approx2

Examples

set.seed(1)
x <- matrix(0, nrow=32, ncol=32)
x[9:24,9:24] <- 10
x <- x + runif(length(x))
xt <- trans2d(x, rotate=15, translate=c(-5, 5))

par(mfcol=c(1,2))
image(x, col=hcl.colors(256), main="original")
image(xt, col=hcl.colors(256), main="transformed")

Universally Unique Identifiers

Description

Generate a UUIDv4.

Usage

uuid(uppercase = FALSE)

hex2raw(x)

raw2hex(x, uppercase = FALSE)

Arguments

x

A vector to convert between raw bytes and hexadecimal strings.

uppercase

Should the result be in uppercase?

Details

uuid generates a random (version 4) universally unique identifier. A private RNG stream is used to avoid collisions due to set.seed and to avoid altering the global .Random.seed.

hex2raw converts a hexadecimal string to a raw vector.

raw2hex converts a raw vector to a hexadecimal string.

Value

For uuid, a list of length 2:

  • string: A character vector giving the UUID.

  • bytes: The raw bytes of the UUID.

For hex2raw, a raw vector.

For raw2hex, a character vector of length 1.

Author(s)

Kylie A. Bemis

Examples

id <- uuid()
id
hex2raw(id$string)
raw2hex(id$bytes)

A Simple Grammar of Base Graphics

Description

These functions provide a simple grammar of graphics approach to programming with R's base graphics system.

Usage

## Initialize a plot
vizi(data, ..., encoding = NULL, mark = NULL, params = NULL)

## Add plot components
add_mark(plot, mark, ..., encoding = NULL, data = NULL,
    trans = NULL, params = NULL)

add_facets(plot, by = NULL, data = NULL,
    nrow = NA, ncol = NA, labels = NULL,
    drop = TRUE, free = "")

## Set plot attributes
set_title(plot, title)

set_channel(plot, channel, label = NULL,
    limits = NULL, scheme = NULL, key = TRUE)

set_coord(plot, xlim = NULL, ylim = NULL, zlim = NULL,
    rev = "", log = "", asp = NA, grid = TRUE)

set_engine(plot, engine = c("base", "plotly"))

set_par(plot, ..., style = NULL)

## Combine plots
as_layers(plotlist, ...)

as_facets(plotlist, ..., nrow = NA, ncol = NA,
    labels = NULL, drop = TRUE, free = "")

Arguments

data

A data.frame.

...

For vizi and add_mark, these should be named arguments specifying the encoding for the plot. The argument names should specify channels, using either base R-style (e.g., pch, cex) or ggplot-style names (e.g., shape, size). One-sided formula arguments will be evaluated in the environment of data. Non-formula arguments will be used as-is. For set_par, these specify additional graphical parameters (as in par) or arguments to persp for 3D plots. For as_facets, these should be additional subplots.

encoding

Encodings specified as a list rather than using ....

mark

The name of a supported mark, such as "points", "lines", etc.

params

Additional graphical parameters that are not mapped to the data, using either base R-style (e.g., pch, cex) or ggplot-style names (e.g., shape, size)

plot, plotlist

A vizi_plot object, or a list of such objects, respectively.

title

The title of the plot.

channel

The channel to modify, using ggplot-style names (e.g., shape, size).

label, labels

Plotting labels.

limits

The limits for the channel, specified as c(min, max) for continuous variables or a character vector of possible levels for discrete variables. The data will be constrained to these limits before plotting.

scheme

A function or vector giving the scheme for encoding the channel. For example, a vector of colors, or a function that returns a vector of n colors.

key

Should a key be generated for the channel?

xlim, ylim, zlim

The plot limits. These only affect the plotting window, not the data. See plot.window

rev

A string specifying spatial dimensions that should be reversed. E.g., "", "x", "y", or "xy".

log, asp

See plot.window.

grid

Should a rectangular grid be included in the plot?

engine

The plotting engine. Default is to use base graphics. Using "plotly" requires the plotly package to be installed.

style

The visual style to use for plotting. Currently supported styles are "light", "dark", and "classic".

trans

A list providing parameters for any transformations the mark supports.

by

A vector or formula giving the facet specification.

nrow, ncol

The number of rows and columns in the facet grid.

drop

Should empty facets be dropped?

free

A string specifying the free spatial dimensions during faceting. E.g., "", "x", "y", or "xy".

Details

Currently supported marks include:

  • points: Points (i.e., scatterplots).

  • lines: Lines (i.e., line charts).

  • peaks: Height (histogram) lines.

  • text: Text labels.

  • rules: Reference lines.

  • bars: Bars for bar charts or histograms.

  • intervals: Line intervals for representing error bars or confidence intervals.

  • boxplot: Box-and-whisker plots.

  • image: Raster graphics.

  • pixels: 2D image from pixels.

  • voxels: 3D image from voxels.

Currently supported encodings include:

  • x, y, z: Positions.

  • xmin, xmax, ymin, ymax: Position limits for intervals and image.

  • image: Rasters for image.

  • shape: Shape of points (pch).

  • size: Size of points (cex).

  • color, colour: Color (col, fg).

  • fill: Fill color (bg).

  • alpha: Opacity.

  • linetype, linewidth, lineend, linejoin, linemetre: Line properties (lty, lwd, lend, ljoin, lmitre).

Value

An object of class vizi_plot.

Author(s)

Kylie A. Bemis

See Also

vizi_points, vizi_lines, vizi_peaks, vizi_text, vizi_intervals, vizi_rules, vizi_bars, vizi_boxplot, vizi_image, vizi_pixels, vizi_voxels

Examples

require(datasets)

mtcars <- transform(mtcars,
    am=factor(am, labels=c("auto", "manual")))

# faceted scatter plot
vizi(mtcars, x=~disp, y=~mpg) |>
    add_mark("points") |>
    add_facets(~mtcars$am)

# faceted scatter plot with color
vizi(mtcars, x=~disp, y=~mpg, color=~am) |>
    add_mark("points",
        params=list(shape=20, size=2, alpha=0.8)) |>
    add_facets(~mtcars$am)

coords <- expand.grid(x=1:nrow(volcano), y=1:ncol(volcano))

# volcano image
vizi(coords, x=~x, y=~y, color=volcano) |>
    add_mark("pixels") |>
    set_coord(grid=FALSE) |>
    set_par(xaxs="i", yaxs="i")

Set Graphical Parameters

Description

Set global parameters for plotting vizi graphics.

Usage

## Set style and palettes
vizi_style(style = "light", dpal = "Tableau 10", cpal = "Viridis")

## Set plotting engine
vizi_engine(engine = c("base", "plotly"))

## Set graphical parameters
vizi_par(..., style = getOption("matter.vizi.style"))

Arguments

style

The visual style to use for plotting. Currently supported styles are "light", "dark", "voidlight", "voiddark", and "transparent".

dpal, cpal

The name of discrete and continous color palettes. See palette.pals and hcl.pals.

engine

The plotting engine. Default is to use base graphics. Using "plotly" requires the plotly package to be installed.

...

These specify additional graphical parameters (as in par).

Value

A character vector or list with the current parameters.

Author(s)

Kylie A. Bemis

See Also

vizi


Warping to Align 1D Signals

Description

Two signals often need to be aligned in order to compare them. The signals may contain a similar pattern, but shifted or stretched by some amount. These functions warp the first signal to align it as closely as possible to the second signal.

Usage

# Signal warping based on local events
warp1_loc(x, y, tx = seq_along(x), ty = seq_along(y),
    events = c("maxmin", "max", "min"), n = length(y),
    interp = c("linear", "loess", "spline"),
    tol = NA_real_, tol.ref = "abs")

# Dynamic time warping
warp1_dtw(x, y, tx = seq_along(x), ty = seq_along(y),
    n = length(y), tol = NA_real_, tol.ref = "abs")

# Correlation optimized warping
warp1_cow(x, y, tx = seq_along(x), ty = seq_along(y),
    nbins = NA_integer_, n = length(y),
    tol = NA_real_, tol.ref = "abs")

Arguments

x, y

Signals to be aligned by warping x to match y.

tx, ty

The domain variable of the signals. If ty is specified but y is missing, then ty are interpreted as locations of reference peaks, and a dummy signal will be created for the warping (using simspec1).

events

The type of events to use for calculating the alignment.

n

The number of samples in the warped x to be returned. By default, it matches the length of y.

interp

The interpolation method used when warping the signal.

tol, tol.ref

A tolerance specifying the maximum allowed distance between aligned samples. See bsearch for details. If missing, the tolerance is estimated as 5% of the signal's domain range.

nbins

The number of signal segments used for warping. The correlation is maximized for each segment.

Details

warp1_loc() uses a simple event-based alignment. Events are defined as local extrema. The events are matched between each signal based on proximity. The shift between the events in x and y are calculated and interpolated to find shifts for each sample. The warped signal is then calculated from these shifts. This is a simple heuristic method, but it is relatively fast and typically good enough for aligning peak locations.

warp1_dtw() performs dynamic time warping. In dynamic time warping, each sample in x is matched to a corresponding sample in y using dynamic programming to find the optimal matches. The version implemented here is constrained by the given tolerance. This both reduces the necessary memory, and in practice tends to give more realistic (and therefore accurate) results than an unconstrained alignment. An unconstrained alignment can still be obtained by setting a high tolerance, but this may use a lot of memory.

warp1_cow() performs correlation optimized. In correlation optimized warping, each signal is divided into some number of segments. Dynamic programming is then used to find the placement of the segment boundaries that maximizes the correlation of all the segments.

Value

A numeric vector the same length as y with the warped x.

Author(s)

Kylie A. Bemis

References

G. Tomasi, F. van den Berg, and C. Andersson. “Correlation optimized warping and dynamic time warping as preprocessing methods for chromatographic data.” Journal of Chemometrics, vol. 18, pp. 231-241, July 2004.

N. V. Nielsen, J. M. Carstensen, and J. Smedsgaard. “Aligning of single and multiple wavelength chromatographic profiles for chemometric data analysis using correlation optimised warping.” Journal of Chromatography A, vol. 805, pp. 17-35, Jan. 1998.

Examples

set.seed(1)
t <- seq(from=0, to=6 * pi, length.out=2000)
dt <- 0.3 * (sin(t) + 0.6 * sin(2.6 * t))
x <- sin(t + dt) + 0.6 * sin(2.6 * (t + dt))
y <- sin(t) + 0.6 * sin(2.6 * t)
xw <- warp1_dtw(x, y)

plot(y, type="l")
lines(x, col="blue")
lines(xw, col="red", lty=2)

Warping to Align 2D Signals

Description

Register two images by warping a "moving" image to align with the "fixed" image.

Usage

# Transformation-based registration
warp2_trans(x, y, control = list(),
    trans = c("rigid", "similarity", "affine"),
    metric = c("cor", "mse", "mi"), nbins = 64L,
    scale = TRUE, dimout = dim(y))

# Mutual information
mi(x, y, n = 64L)

Arguments

x, y

Images to be aligned by warping x (which is "moving") to match y (which is "fixed").

control

A list of optimization control parameters. Passed to optim().

trans

The type of transformation allowed: "rigid" means translation and rotation, "similarity" means translation, rotation, and scaling, and "affine" means a general affine transformation.

metric

The metric to optimize.

nbins, n

The number of histogram bins to use when calculating mutual information.

scale

Should the images be normalized to have the same intensity range?

dimout

The dimensions of the returned matrix.

Details

warp2_trans() performs a simple transformation-based registration using optim for optimization.

Value

A numeric vector the same length as y with the warped x.

Author(s)

Kylie A. Bemis

See Also

trans2d, optim

Examples

set.seed(1)
x <- matrix(0, nrow=32, ncol=32)
x[9:24,9:24] <- 10
x <- x + runif(length(x))
xt <- trans2d(x, rotate=15, translate=c(-5, 5))
xw <- warp2_trans(xt, x)

par(mfcol=c(1,3))
image(x, col=hcl.colors(256), main="original")
image(xt, col=hcl.colors(256), main="transformed")
image(xw, col=hcl.colors(256), main="registered")