| Title: | Nearest Neighbor Detection for Bioconductor Packages |
|---|---|
| Description: | Implements exact and approximate methods for nearest neighbor detection, in a framework that allows them to be easily switched within Bioconductor packages or workflows. Exact searches can be performed using the k-means for k-nearest neighbors algorithm, vantage point trees, or an exhaustive search. Approximate searches can be performed using the Annoy or HNSW libraries. Each search can be performed with a variety of different distance metrics, parallelization, and variable numbers of neighbors. Range-based searches (to find all neighbors within a certain distance) are also supported. |
| Authors: | Aaron Lun [aut, cre, cph] |
| Maintainer: | Aaron Lun <[email protected]> |
| License: | GPL-3 |
| Version: | 2.7.2 |
| Built: | 2026-05-10 10:15:21 UTC |
| Source: | https://github.com/bioc/BiocNeighbors |
Implements exact and approximate methods for nearest neighbor detection, in a framework that allows them to be easily switched within Bioconductor packages or workflows. Exact searches can be performed using the k-means for k-nearest neighbors algorithm, vantage point trees, or an exhaustive search. Approximate searches can be performed using the Annoy or HNSW libraries. Each search can be performed with a variety of different distance metrics, parallelization, and variable numbers of neighbors. Range-based searches (to find all neighbors within a certain distance) are also supported.
Maintainer: Aaron Lun [email protected] [copyright holder]
Authors:
Aaron Lun [email protected] [copyright holder]
Useful links:
Report bugs at https://github.com/knncolle/BiocNeighbors/issues
A class to hold parameters for the Annoy algorithm for approximate nearest neighbor identification.
AnnoyParam( ntrees = 50, search.mult = ntrees, distance = c("Euclidean", "Manhattan", "Cosine") ) ## S4 method for signature 'AnnoyParam' defineBuilder(BNPARAM)AnnoyParam( ntrees = 50, search.mult = ntrees, distance = c("Euclidean", "Manhattan", "Cosine") ) ## S4 method for signature 'AnnoyParam' defineBuilder(BNPARAM)
ntrees |
Integer scalar, number of trees to use for index generation. |
search.mult |
Numeric scalar, multiplier for the number of points to search. |
distance |
String specifying the distance metric to use. Cosine distances are implemented as Euclidean distances on L2-normalized coordinates. |
BNPARAM |
An AnnoyParam instance. |
The Approximate nearest neighbors Oh Yeah (Annoy) algorithm is based on recursive hyperplane partitions. Briefly, a tree is constructed where a random hyperplane splits the points into two subsets at each internal node. Leaf nodes are defined when the number of points in a subset falls below a threshold (close to twice the number of dimensions for the settings used here). Multiple trees are constructed in this manner, each of which is different due to the random choice of hyperplanes. For a given query point, each tree is searched to identify the subset of all points in the same leaf node as the query point. The union of these subsets across all trees is exhaustively searched to identify the actual nearest neighbors to the query.
The ntrees parameter controls the trade-off between accuracy and computational work.
More trees provide greater accuracy at the cost of more computational work (both in terms of the indexing time and search speed in downstream functions).
The search.mult controls the parameter known as search_k in the original Annoy documentation.
Specifically, search_k is defined as k * search.mult where k is the number of nearest neighbors to identify in downstream functions.
This represents the number of points to search exhaustively and determines the run-time balance between speed and accuracy.
The default search.mult=ntrees is based on the Annoy library defaults.
Note that this parameter is not actually used in the index construction itself, and is only included here so that the output index fully parametrizes the search.
Technically, the index construction algorithm is stochastic but, for various logistical reasons, the seed is hard-coded into the C++ code. This means that the results of the Annoy neighbor searches will be fully deterministic for the same inputs, even though the theory provides no such guarantees.
The AnnoyParam constructor returns an instance of the AnnoyParam class.
The defineBuilder method returns a list that can be used in buildIndex to construct an Annoy index.
Aaron Lun
BiocNeighborParam, for the parent class and its available methods.
https://github.com/spotify/annoy, for details on the underlying algorithm.
(out <- AnnoyParam()) out[['ntrees']] out[['ntrees']] <- 20L out(out <- AnnoyParam()) out[['ntrees']] out[['ntrees']] <- 20L out
A virtual class for indexing structures of different nearest-neighbor search algorithms.
Developers should define subclasses for their own buildIndex and/or defineBuilder methods.
In general, the internal structure of a BiocNeighborIndex class is arbitrary and left to the discretion of the developer.
If an arbitrary structure is used, the associated methods should be written for all downstream generics like findKNN, etc.
Alternatively, developers may choose to derive from the BiocNeighborGenericIndex class. This expects:
A ptr slot containing an external pointer that refers to a BiocNeighbors::Prebuilt object
(see definition in system.file("include", "BiocNeighbors.h", package="BiocNeighbors")).
A names slot containing a character vector with the names of the observations, or NULL if no names are available.
This is used by subset= in the various find* generics.
In this case, no additional methods are required for the downstream generics.
Aaron Lun
A virtual class for specifying the type of nearest-neighbor search algorithm and associated parameters.
The BiocNeighborParam class is a virtual base class on which other parameter objects are built. There are currently 5 concrete subclasses in BiocNeighbors:
KmknnParam:Exact nearest-neighbor search with the KMKNN algorithm.
VptreeParam:Exact nearest-neighbor search with the tree algorithm.
ExhaustiveParam:Exact nearest-neighbor search via brute-force.
AnnoyParam:Approximate nearest-neighbor search with the Annoy algorithm.
HnswParam:Approximate nearest-neighbor search with the HNSW algorithm.
These objects hold parameters specifying how each algorithm should be run on an arbitrary data set. See the associated documentation pages for more details.
In the following code snippets, x and object are BiocNeighborParam objects.
show(object):Display the class and arguments of object.
bndistance(object):Return a string specifying the distance metric to be used for searching.
This should be one of "Euclidean", "Manhattan" or "Cosine".
x[[i]]:Return the value of slot i, as used in the constructor for x.
x[[i]] <- value:Set slot i to the specified value.
Aaron Lun
KmknnParam,
VptreeParam,
AnnoyParam,
and HnswParam for constructors.
buildIndex, findKNN and queryKNN for dispatch.
Build indices for nearest-neighbor searching with different algorithms.
buildIndex(X, BNPARAM, transposed = FALSE, ...) ## S4 method for signature 'missing' buildIndex(X, BNPARAM, transposed = FALSE, ...) ## S4 method for signature 'NULL' buildIndex(X, BNPARAM, transposed = FALSE, ...) ## S4 method for signature 'BiocNeighborParam' buildIndex(X, BNPARAM, transposed = FALSE, ...) ## S4 method for signature 'list' buildIndex(X, BNPARAM, transposed = FALSE, ..., .check.nonfinite = TRUE)buildIndex(X, BNPARAM, transposed = FALSE, ...) ## S4 method for signature 'missing' buildIndex(X, BNPARAM, transposed = FALSE, ...) ## S4 method for signature 'NULL' buildIndex(X, BNPARAM, transposed = FALSE, ...) ## S4 method for signature 'BiocNeighborParam' buildIndex(X, BNPARAM, transposed = FALSE, ...) ## S4 method for signature 'list' buildIndex(X, BNPARAM, transposed = FALSE, ..., .check.nonfinite = TRUE)
X |
A numeric matrix or matrix-like object where rows correspond to data points and columns correspond to variables (i.e., dimensions). |
BNPARAM |
A BiocNeighborParam object specifying the type of index to be constructed.
If |
transposed |
Logical scalar indicating whether |
... |
Further arguments to be passed to individual methods.
If a method accepts arguments here, it should prefix these arguments with the algorithm name to avoid conflicts, e.g., |
.check.nonfinite |
Boolean indicating whether to check for non-finite values in |
Each buildIndex method is expected to return an instance of a BiocNeighborIndex subclass.
The structure of this subclass is arbitrary and left to the discretion of the method developer.
If a buildIndex method returns a BiocNeighborGenericIndex subclass,
the index can be used with the existing methods for findKnnFromIndex, etc. without further effort.
Otherwise, developers are responsible for defining methods for their subclass in each of the relevant generics.
Users should assume that the index is not serializable (i.e., saved or transferred between processes) via R's usual mechanisms.
If the index must be saved to disk, consider using saveIndex instead.
A BiocNeighborIndex object can be used in findKNN and related functions as the X= argument,
or in findKnnFromIndex and related generics as the BNINDEX= argument.
Aaron Lun
Y <- matrix(rnorm(100000), ncol=20) (k.out <- buildIndex(Y)) (a.out <- buildIndex(Y, BNPARAM=AnnoyParam()))Y <- matrix(rnorm(100000), ncol=20) (k.out <- buildIndex(Y)) (a.out <- buildIndex(Y, BNPARAM=AnnoyParam()))
Define a builder object that can construct C++ indices for neighbor searches.
defineBuilder(BNPARAM) ## S4 method for signature 'NULL' defineBuilder(BNPARAM) ## S4 method for signature 'missing' defineBuilder(BNPARAM)defineBuilder(BNPARAM) ## S4 method for signature 'NULL' defineBuilder(BNPARAM) ## S4 method for signature 'missing' defineBuilder(BNPARAM)
BNPARAM |
A BiocNeighborParam object specifying the type of index to be constructed.
If |
The external pointer returned in builder should refer to a BiocNeighbors::Builder object,
see the definition in system.file("include", "BiocNeighbors.h", package="BiocNeighbors") for details.
If a developer defines a defineBuilder method for a search algorithm, they do not have to define a new buildIndex method.
The existing buildIndex methods will automatically create an instance of the appropriate BiocNeighborGenericIndex subclass based on class,
which can be immediately used in all generics (e.g., findKNN, queryNeighbors) without further effort.
Note that the pointer returned by defineBuilder should not be used as the ptr in the BiocNeighborIndex subclasses.
The ptr slot is expected to contain a pointer referring to a BiocNeighbors::Prebuilt object, as returned by the default buildIndex.
Using the pointer from builder will probably crash the R session.
Needless to say, users should not attempt to serialize the external pointer returned by this generic.
Attempting to use a deserialized pointer in buildIndex will cause the R session to crash.
List containing:
builder, a pointer to a builder instance that can be used to construct a prebuilt index in buildIndex.
class, the constructor for a BiocNeighborGenericIndex subclass that accepts ptr and names arguments.
Aaron Lun
defineBuilder,KmknnParam-method,
defineBuilder,VptreeParam-method,
defineBuilder,AnnoyParam-method
and defineBuilder,HnswParam-method for specific methods.
(out <- defineBuilder()) (out2 <- defineBuilder(AnnoyParam()))(out <- defineBuilder()) (out2 <- defineBuilder(AnnoyParam()))
A class to hold parameters for the exhaustive algorithm for exact nearest neighbor identification.
ExhaustiveParam(distance = c("Euclidean", "Manhattan", "Cosine")) ## S4 method for signature 'ExhaustiveParam' defineBuilder(BNPARAM)ExhaustiveParam(distance = c("Euclidean", "Manhattan", "Cosine")) ## S4 method for signature 'ExhaustiveParam' defineBuilder(BNPARAM)
distance |
String specifying the distance metric to use. Cosine distances are implemented as Euclidean distances on L2-normalized coordinates. |
BNPARAM |
An ExhaustiveParam instance. |
The exhaustive search computes all pairwise distances between data and query points to identify nearest neighbors of the latter. It has quadratic complexity and is theoretically the worst-performing method; however, it has effectively no overhead from constructing or querying indexing structures, making it faster for in situations where indexing provides little benefit. This includes queries against datasets with few data points or very high dimensionality.
All that said, this algorithm is largely provided as a baseline for comparing against the other algorithms.
The ExhaustiveParam constructor returns an instance of the ExhaustiveParam class.
The defineBuilder method returns an external pointer that can be used in buildIndex to construct an exhaustive index.
Allison Vuong
BiocNeighborParam, for the parent class and its available methods.
(out <- ExhaustiveParam())(out <- ExhaustiveParam())
Find the distance to the k-th nearest neighbor for each point in a dataset.
findDistanceFromIndex(BNINDEX, k, num.threads = 1, subset = NULL, ...) ## S4 method for signature 'BiocNeighborGenericIndex' findDistanceFromIndex(BNINDEX, k, num.threads = 1, subset = NULL, ...) findDistance(X, k, num.threads = 1, subset = NULL, ..., BNPARAM = NULL)findDistanceFromIndex(BNINDEX, k, num.threads = 1, subset = NULL, ...) ## S4 method for signature 'BiocNeighborGenericIndex' findDistanceFromIndex(BNINDEX, k, num.threads = 1, subset = NULL, ...) findDistance(X, k, num.threads = 1, subset = NULL, ..., BNPARAM = NULL)
BNINDEX |
A BiocNeighborIndex object, typically created by |
k |
A positive integer scalar specifying the number of nearest neighbors to retrieve. Alternatively, an integer vector of length equal to the number of points in All |
num.threads |
Integer scalar specifying the number of threads to use for the search. |
subset |
An integer, logical or character vector specifying the indices of points in |
... |
For For |
X |
A numeric matrix or matrix-like object where rows correspond to data points and columns correspond to variables (i.e., dimensions).
Alternatively, a prebuilt BiocNeighborIndex object from |
BNPARAM |
A BiocNeighborParam object specifying how the index should be constructed.
If |
If multiple queries are to be performed to the same X, it may be beneficial to build the index from X with buildIndex.
The resulting pointer object can be supplied as X to multiple findDistance calls, avoiding the need to repeat index construction in each call.
Numeric vector of length equal to the number of points in X (or subset, if provided),
containing the distance from each point to its k-th nearest neighbor.
This is equivalent to but more memory efficient than using findKNN and subsetting to the last distance.
Aaron Lun
buildIndex, to build an index ahead of time.
Y <- matrix(rnorm(100000), ncol=20) out <- findDistance(Y, k=8) summary(out)Y <- matrix(rnorm(100000), ncol=20) out <- findDistance(Y, k=8) summary(out)
Find the k-nearest neighbors of each point in a dataset.
findKnnFromIndex( BNINDEX, k, get.index = TRUE, get.distance = TRUE, num.threads = 1, subset = NULL, ... ) ## S4 method for signature 'BiocNeighborGenericIndex' findKnnFromIndex( BNINDEX, k, get.index = TRUE, get.distance = TRUE, num.threads = 1, subset = NULL, ... ) findKNN( X, k, get.index = TRUE, get.distance = TRUE, num.threads = 1, subset = NULL, ..., BPPARAM = NULL, BNPARAM = NULL )findKnnFromIndex( BNINDEX, k, get.index = TRUE, get.distance = TRUE, num.threads = 1, subset = NULL, ... ) ## S4 method for signature 'BiocNeighborGenericIndex' findKnnFromIndex( BNINDEX, k, get.index = TRUE, get.distance = TRUE, num.threads = 1, subset = NULL, ... ) findKNN( X, k, get.index = TRUE, get.distance = TRUE, num.threads = 1, subset = NULL, ..., BPPARAM = NULL, BNPARAM = NULL )
BNINDEX |
A BiocNeighborIndex object, typically created by |
k |
A positive integer scalar specifying the number of nearest neighbors to retrieve. Alternatively, an integer vector of length equal to the number of points in All |
get.index |
A logical scalar indicating whether the indices of the nearest neighbors should be recorded.
Setting this to Alternatively, if |
get.distance |
A logical scalar indicating whether distances to the nearest neighbors should be recorded.
Setting this to Alternatively, if |
num.threads |
Integer scalar specifying the number of threads to use for the search. |
subset |
An integer, logical or character vector specifying the indices of points in |
... |
For For |
X |
A numeric matrix or matrix-like object where rows correspond to data points and columns correspond to variables (i.e., dimensions).
Alternatively, a prebuilt BiocNeighborIndex object from |
BPPARAM |
Soft-deprecated, use |
BNPARAM |
A BiocNeighborParam object specifying how the index should be constructed.
If |
List containing index (if get.index is not FALSE) and distance (if get.distance is not FALSE).
If get.index=TRUE or "normal" and k is an integer scalar,
index is an integer matrix with k columns where each row corresponds to a point (denoted here as ) in X.
The -th row contains the indices of points in X that are the nearest neighbors to point , sorted by increasing distance from .
will not be included in its own set of nearest neighbors.
If get.index=FALSE or "transposed" and k is an integer scalar,
index is as described above but transposed, i.e., the i-th column contains the indices of neighboring points in X.
If get.distance=TRUE or "normal" and k is an integer scalar,
distance is a numeric matrix of the same dimensions as index.
The -th row contains the distances of neighboring points in X to the point , sorted in increasing order.
If get.distance=FALSE or "transposed" and k is an integer scalar,
distance is as described above but transposed, i.e., the i-th column contains the distances to neighboring points in X.
If get.index is not FALSE and k is an integer vector,
index is a list of integer vectors where each vector corresponds to a point (denoted here as ) in X.
The -th vector has length k[i] and contains the indices of points in X that are the nearest neighbors to point , sorted by increasing distance from .
If get.distance is not FALSE and k is an integer vector,
distance is a list of numeric vectors of the same lengths as those in index.
The -th vector contains the distances of neighboring points in X to the point , sorted in increasing order.
Aaron Lun
buildIndex, to build an index ahead of time.
findDistance, to efficiently obtain the distance to the k-th nearest neighbor.
Y <- matrix(rnorm(100000), ncol=20) out <- findKNN(Y, k=8) head(out$index) head(out$distance)Y <- matrix(rnorm(100000), ncol=20) out <- findKNN(Y, k=8) head(out$index) head(out$distance)
Find mutual nearest neighbors (MNN) across two data sets.
findMutualNN(data1, data2, k1, k2 = k1, BNINDEX1 = NULL, BNINDEX2 = NULL, ...)findMutualNN(data1, data2, k1, k2 = k1, BNINDEX1 = NULL, BNINDEX2 = NULL, ...)
data1 |
A numeric matrix containing points in the rows and variables/dimensions in the columns. |
data2 |
A numeric matrix like |
k1 |
Integer scalar specifying the number of neighbors to search for in |
k2 |
Integer scalar specifying the number of neighbors to search for in |
BNINDEX1 |
A pre-built index for |
BNINDEX2 |
A pre-built index for |
... |
Other arguments to be passed to the underlying |
For each point in dataset 1, the set of k2 nearest points in dataset 2 is identified.
For each point in dataset 2, the set of k1 nearest points in dataset 1 is similarly identified.
Two points in different datasets are considered to be part of an MNN pair if each point lies in the other's set of neighbors.
This concept allows us to identify matching points across datasets, which is useful for, e.g., batch correction.
Any values for the BNINDEX1 and BNINDEX2 arguments should be equal to the output of buildIndex for the respective matrices,
using the algorithm specified with BNPARAM.
These arguments are only provided to improve efficiency during repeated searches on the same datasets (e.g., for comparisons between all pairs).
The specification of these arguments should not, generally speaking, alter the output of the function.
A list containing the integer vectors first and second, containing row indices from data1 and data2 respectively.
Corresponding entries in first and second specify a MNN pair consisting of the specified rows from each matrix.
Aaron Lun
queryKNN for the underlying neighbor search code.
fastMNN and related functions from the batchelor package, from which this code was originally derived.
B1 <- matrix(rnorm(10000), ncol=50) # Batch 1 B2 <- matrix(rnorm(10000), ncol=50) # Batch 2 out <- findMutualNN(B1, B2, k1=20) head(out$first) head(out$second)B1 <- matrix(rnorm(10000), ncol=50) # Batch 1 B2 <- matrix(rnorm(10000), ncol=50) # Batch 2 out <- findMutualNN(B1, B2, k1=20) head(out$first) head(out$second)
Find all neighbors within a threshold distance of each point of a dataset.
findNeighborsFromIndex( BNINDEX, threshold, get.index = TRUE, get.distance = TRUE, num.threads = 1, subset = NULL, ... ) ## S4 method for signature 'BiocNeighborGenericIndex' findNeighborsFromIndex( BNINDEX, threshold, get.index = TRUE, get.distance = TRUE, num.threads = 1, subset = NULL, ... ) findNeighbors( X, threshold, get.index = TRUE, get.distance = TRUE, num.threads = 1, subset = NULL, ..., BPPARAM = NULL, BNPARAM = NULL )findNeighborsFromIndex( BNINDEX, threshold, get.index = TRUE, get.distance = TRUE, num.threads = 1, subset = NULL, ... ) ## S4 method for signature 'BiocNeighborGenericIndex' findNeighborsFromIndex( BNINDEX, threshold, get.index = TRUE, get.distance = TRUE, num.threads = 1, subset = NULL, ... ) findNeighbors( X, threshold, get.index = TRUE, get.distance = TRUE, num.threads = 1, subset = NULL, ..., BPPARAM = NULL, BNPARAM = NULL )
BNINDEX |
A BiocNeighborIndex object, typically created by |
threshold |
A positive numeric scalar specifying the maximum distance at which a point is considered a neighbor. Alternatively, a vector containing a different distance threshold for each point. |
get.index |
A logical scalar indicating whether the indices of the neighbors should be recorded. |
get.distance |
A logical scalar indicating whether distances to the neighbors should be recorded. |
num.threads |
Integer scalar specifying the number of threads to use for the search. |
subset |
An integer, logical or character vector specifying the indices of points in |
... |
For For |
X |
A numeric matrix or matrix-like object where rows correspond to data points and columns correspond to variables (i.e., dimensions).
Alternatively, a prebuilt BiocNeighborIndex object from |
BPPARAM |
Soft-deprecated, use |
BNPARAM |
A BiocNeighborParam object specifying how the index should be constructed.
If |
This function identifies all points in X that within threshold of each point in X.
For Euclidean distances, this is equivalent to identifying all points in a hypersphere centered around the point of interest.
Not all implementations support this search mode, but we can use KmknnParam and VptreeParam.
If threshold is a vector, each entry is assumed to specify a (possibly different) threshold for each point in X.
If subset is also specified, each entry is assumed to specify a threshold for each point in subset.
An error will be raised if threshold is a vector of incorrect length.
If multiple queries are to be performed to the same X, it may be beneficial to build the index from X with buildIndex.
The resulting pointer object can be supplied as X to multiple findNeighbors calls, avoiding the need to repeat index construction in each call.
A list is returned containing:
index, if get.index=TRUE.
This is a list of integer vectors where each entry corresponds to a point (denoted here as ) in X.
The vector for contains the set of row indices of all points in X that lie within threshold of point .
Neighbors for are sorted by increasing distance.
distance, if get.distance=TRUE.
This is a list of numeric vectors where each entry corresponds to a point (as above) and contains the distances of the neighbors from .
Elements of each vector in distance match to elements of the corresponding vector in index.
If both get.index=FALSE and get.distance=FALSE, an integer vector is returned of length equal to the number of observations.
The i-th entry contains the number of neighbors of within threshold.
If subset is not NULL, each entry of the above vector/lists corresponds to a point in the subset, in the same order as supplied in subset.
Aaron Lun
buildIndex, to build an index ahead of time.
Y <- matrix(runif(100000), ncol=20) out <- findNeighbors(Y, threshold=1) summary(lengths(out$index))Y <- matrix(runif(100000), ncol=20) out <- findNeighbors(Y, threshold=1) summary(lengths(out$index))
Convert the results of various neighbor search functions to alternative representations for more convenient downstream use.
findResultsToMatrix( results, num.pts = NULL, index.transposed = FALSE, distance.transposed = FALSE, query.rows = TRUE, repr = "C" ) findResultsToHits( results, num.pts = NULL, index.transposed = FALSE, distance.transposed = FALSE, query.from = TRUE ) queryResultsToMatrix( results, num.pts, index.transposed = FALSE, distance.transposed = FALSE, query.rows = TRUE, repr = "C" ) queryResultsToHits( results, num.pts, index.transposed = FALSE, distance.transposed = FALSE, query.from = TRUE )findResultsToMatrix( results, num.pts = NULL, index.transposed = FALSE, distance.transposed = FALSE, query.rows = TRUE, repr = "C" ) findResultsToHits( results, num.pts = NULL, index.transposed = FALSE, distance.transposed = FALSE, query.from = TRUE ) queryResultsToMatrix( results, num.pts, index.transposed = FALSE, distance.transposed = FALSE, query.rows = TRUE, repr = "C" ) queryResultsToHits( results, num.pts, index.transposed = FALSE, distance.transposed = FALSE, query.from = TRUE )
results |
For For In both cases, the list should contain an |
num.pts |
Integer specifying the number of data points in the For For |
index.transposed |
Boolean indicating whether |
distance.transposed |
Boolean indicating whether |
query.rows |
Boolean indicating whether query points should be represented as rows in the returned matrix. |
repr |
String specifing the type of sparse matrix, see |
query.from |
Boolean indicating whether query points should be stored as |
For findResultsToMatrix and queryResultsToMatrix, a sparse matrix from the Matrix package is returned.
If query.rows=TRUE, each row corresponds to a query point and the columns with non-zero entries represent the nearest neighbors.
Otherwise, each column corresponds to a query point.
If results contains distance, the value of each non-zero entry contains the distance to that neighbor.
Otherwise, a pattern matrix is returned indicating whether each entry is a nearest neighbor.
For findResultsToHits and queryResultsToHits, a Hits object is returned.
If query.from=TRUE, from contains the observations and to contains the identities of the neighbors.
Otherwise, the query points are stored in to.
If results contains distance, a distance field will be present in mcols,
containing the distance from each observation to each of its neighbors.
The output will be a SelfHits instance if num.obs=NULL for findResultsToHits.
The colPairs slot of the SingleCellExperiment class, from the SingleCellExperiment package.
X <- matrix(rnorm(10000), ncol=20) out <- findKNN(X, k=5) sm <- findResultsToMatrix(out) str(sm) sh <- findResultsToHits(out) sh Y <- matrix(rnorm(1000), ncol=20) out2 <- queryKNN(query=Y, X=X, k=5) sm2 <- queryResultsToMatrix(out2, num.pts = nrow(X)) str(sm2) h2 <- queryResultsToHits(out2, num.pts = nrow(X)) h2X <- matrix(rnorm(10000), ncol=20) out <- findKNN(X, k=5) sm <- findResultsToMatrix(out) str(sm) sh <- findResultsToHits(out) sh Y <- matrix(rnorm(1000), ncol=20) out2 <- queryKNN(query=Y, X=X, k=5) sm2 <- queryResultsToMatrix(out2, num.pts = nrow(X)) str(sm2) h2 <- queryResultsToHits(out2, num.pts = nrow(X)) h2
Extend loadIndex to work with algorithms from BiocNeighbors extension packages.
getLoadGenericIndexRegistry() registerLoadGenericIndexClass(name, class) registerLoadIndexFunction(name, fun)getLoadGenericIndexRegistry() registerLoadGenericIndexClass(name, class) registerLoadIndexFunction(name, fun)
name |
String containing the name of the neighbor search algorithm, e.g., |
class |
Function that accepts a single argument (the external pointer to a |
fun |
Function that accepts a directory path (see |
If the extension developer implements their search algorithm in the form of C++ subclasses of the various knncolle interfaces,
they can call getLoadGenericIndexRegistry in their package's .onLoad to access the registry and add a loading method for their algorithm.
They should also call registerLoadGenericIndexClass to specify a constructor function for the corresponding BiocNeighborGenericIndex subclass.
If the extension developer implements their search algorithm by subclassing BiocNeighbors generics at the R level,
they should call registerLoadIndexFunction in their package's .onLoad.
This function is responsible for reading the on-disk contents from the specified directory and using them to construct a BiocNeighborIndex instance.
For getLoadGenericIndexRegistry, an external pointer to the global knncolle::load_prebuilt_registry() object in C++.
For registerLoadGenericIndexClass, the name and class are registered with loadIndex.
For registerLoadIndexFunction, the name and function are registered with loadIndex.
Aaron Lun
getLoadGenericIndexRegistry() registerLoadGenericIndexClass("knncolle_annoy::Annoy", AnnoyIndex)getLoadGenericIndexRegistry() registerLoadGenericIndexClass("knncolle_annoy::Annoy", AnnoyIndex)
A class to hold parameters for the HNSW algorithm for approximate nearest neighbor identification.
HnswParam( nlinks = 16, ef.construction = 200, ef.search = 10, distance = c("Euclidean", "Manhattan", "Cosine") ) ## S4 method for signature 'HnswParam' defineBuilder(BNPARAM)HnswParam( nlinks = 16, ef.construction = 200, ef.search = 10, distance = c("Euclidean", "Manhattan", "Cosine") ) ## S4 method for signature 'HnswParam' defineBuilder(BNPARAM)
nlinks |
Integer scalar, number of bi-directional links per element for index generation. |
ef.construction |
Integer scalar, size of the dynamic list for index generation. |
ef.search |
Integer scalar, size of the dynamic list for neighbor searching. |
distance |
String specifying the distance metric to use. Cosine distances are implemented as Euclidean distances on L2-normalized coordinates. |
BNPARAM |
A HsnwParam instance. |
In the HNSW algorithm (Malkov and Yashunin, 2016), each point is a node in a “nagivable small world” graph. The nearest neighbor search proceeds by starting at a node and walking through the graph to obtain closer neighbors to a given query point. Nagivable small world graphs are used to maintain connectivity across the data set by creating links between distant points. This speeds up the search by ensuring that the algorithm does not need to take many small steps to move from one cluster to another. The HNSW algorithm extends this idea by using a hierarchy of such graphs containing links of different lengths, which avoids wasting time on small steps in the early stages of the search where the current node position is far from the query.
Larger values of nlinks improve accuracy at the expense of speed and memory usage.
Larger values of ef.construction improve index quality at the expense of indexing time.
The value of ef.search controls the accuracy of the neighbor search at run time,
where larger values improve accuracy at the expense of a slower search.
Technically, the index construction algorithm is stochastic but, for various logistical reasons, the seed is hard-coded into the C++ code. This means that the results of the HNSW neighbor searches will be fully deterministic for the same inputs, even though the theory provides no such guarantees.
The HnswParam constructor returns an instance of the HnswParam class.
The defineBuilder method returns an external pointer that can be used in buildIndex to construct a HNSW index.
Aaron Lun
BiocNeighborParam, for the parent class and its available methods.
https://github.com/nmslib/hnswlib, for details on the underlying algorithm.
(out <- HnswParam()) out[['nlinks']] out[['nlinks']] <- 20L out(out <- HnswParam()) out[['nlinks']] out[['nlinks']] <- 20L out
A class to hold parameters for the k-means k-nearest-neighbors (KMKNN) algorithm for exact nearest neighbor identification.
KmknnParam(..., distance = c("Euclidean", "Manhattan", "Cosine")) ## S4 method for signature 'KmknnParam' defineBuilder(BNPARAM)KmknnParam(..., distance = c("Euclidean", "Manhattan", "Cosine")) ## S4 method for signature 'KmknnParam' defineBuilder(BNPARAM)
... |
Further arguments, ignored. |
distance |
String specifying the distance metric to use. Cosine distances are implemented as Euclidean distances on L2-normalized coordinates. |
BNPARAM |
A KmknnParam instance. |
In the KMKNN algorithm (Wang, 2012), k-means clustering is first applied to the data points using the square root of the number of points as the number of cluster centers. The cluster assignment and distance to the assigned cluster center for each point represent the KMKNN indexing information. This speeds up the nearest neighbor search by exploiting the triangle inequality between cluster centers, the query point and each point in the cluster to narrow the search space. The advantage of the KMKNN approach is its simplicity and minimal overhead, resulting in performance improvements over conventional tree-based methods for high-dimensional data where most points need to be searched anyway. It is also trivially extended to find all neighbors within a threshold distance from a query point.
Note that KMKNN operates much more naturally with Euclidean distances.
Computational efficiency may not be optimal when using it with other choices of distance, though the results will still be exact.
The KmknnParam constructor returns an instance of the KmknnParam class.
The defineBuilder method returns a list that can be used in buildIndex to construct a KMKNN index.
Aaron Lun, using code from the cydar package.
Wang X (2012). A fast exact k-nearest neighbors algorithm for high dimensional search using k-means clustering and triangle inequality. Proc Int Jt Conf Neural Netw, 43, 6:2351-2358.
BiocNeighborParam, for the parent class and its available methods.
(out <- KmknnParam(iter.max=100))(out <- KmknnParam(iter.max=100))
Load a BiocNeighborIndex object from its on-disk representation back into the current R session.
loadIndex(dir, ...)loadIndex(dir, ...)
dir |
String containing the path to a directory in which the original index was saved.
This should be the same as the argument passed to |
... |
Additional arguments passed to specific methods. |
As discussed in saveIndex, it is expected that the on-disk representation is loaded in the same R environment that was used to save it.
Developers are directed to getLoadGenericIndexRegistry to add loading functions for their own search algorithms.
A BiocNeighborIndex object, created from files inside dir.
Aaron Lun
Y <- matrix(rnorm(100000), ncol=20) k.out <- buildIndex(Y) k.nn <- findKNN(k.out, k=5) tmp <- tempfile() dir.create(tmp) saveIndex(k.out, tmp) reloaded <- loadIndex(tmp) re.nn <- findKNN(reloaded, k=5) identical(k.nn, re.nn)Y <- matrix(rnorm(100000), ncol=20) k.out <- buildIndex(Y) k.nn <- findKNN(k.out, k=5) tmp <- tempfile() dir.create(tmp) saveIndex(k.out, tmp) reloaded <- loadIndex(tmp) re.nn <- findKNN(reloaded, k=5) identical(k.nn, re.nn)
Query a reference dataset to determine the distance to the k-th nearest neighbor of each point in a query dataset.
queryDistanceFromIndex( BNINDEX, query, k, num.threads = 1, subset = NULL, transposed = FALSE, ... ) ## S4 method for signature 'BiocNeighborGenericIndex' queryDistanceFromIndex( BNINDEX, query, k, num.threads = 1, subset = NULL, transposed = FALSE, ..., .check.nonfinite = TRUE ) queryDistance( X, query, k, num.threads = 1, ..., subset = NULL, transposed = FALSE, BNPARAM = NULL )queryDistanceFromIndex( BNINDEX, query, k, num.threads = 1, subset = NULL, transposed = FALSE, ... ) ## S4 method for signature 'BiocNeighborGenericIndex' queryDistanceFromIndex( BNINDEX, query, k, num.threads = 1, subset = NULL, transposed = FALSE, ..., .check.nonfinite = TRUE ) queryDistance( X, query, k, num.threads = 1, ..., subset = NULL, transposed = FALSE, BNPARAM = NULL )
BNINDEX |
A BiocNeighborIndex object, typically created by |
query |
A numeric matrix or matrix-like object of query points, containing the same number of columns as |
k |
A positive integer scalar specifying the number of nearest neighbors to retrieve. Alternatively, an integer vector of length equal to the number of points in All |
num.threads |
Integer scalar specifying the number of threads to use for the search. |
subset |
An integer, logical or character vector indicating the rows of |
transposed |
A logical scalar indicating whether |
... |
For For |
.check.nonfinite |
Boolean indicating whether to check for non-finite values in |
X |
The reference dataset to be queried.
This should be a numeric matrix or matrix-like object where rows correspond to reference points and columns correspond to variables (i.e., dimensions).
Alternatively, a prebuilt BiocNeighborIndex object from |
BNPARAM |
A BiocNeighborParam object specifying how the index should be constructed.
If |
If multiple queries are to be performed to the same X, it may be beneficial to build the index from X with buildIndex.
The resulting pointer object can be supplied as X to multiple queryKNN calls, avoiding the need to repeat index construction in each call.
Numeric vector of length equal to the number of points in query (or subset, if provided),
containing the distance from each point to its k-th nearest neighbor.
This is equivalent to but more memory efficient than using queryKNN and subsetting to the last distance.
Aaron Lun
buildIndex, to build an index ahead of time.
Y <- matrix(rnorm(100000), ncol=20) Z <- matrix(rnorm(20000), ncol=20) out <- queryDistance(Y, query=Z, k=5) head(out)Y <- matrix(rnorm(100000), ncol=20) Z <- matrix(rnorm(20000), ncol=20) out <- queryDistance(Y, query=Z, k=5) head(out)
Query a reference dataset for the k-nearest neighbors of each point in a query dataset.
queryKnnFromIndex( BNINDEX, query, k, get.index = TRUE, get.distance = TRUE, num.threads = 1, subset = NULL, transposed = FALSE, ... ) ## S4 method for signature 'BiocNeighborGenericIndex' queryKnnFromIndex( BNINDEX, query, k, get.index = TRUE, get.distance = TRUE, num.threads = 1, subset = NULL, transposed = FALSE, ..., .check.nonfinite = TRUE ) queryKNN( X, query, k, get.index = TRUE, get.distance = TRUE, num.threads = 1, subset = NULL, transposed = FALSE, ..., BPPARAM = NULL, BNPARAM = NULL )queryKnnFromIndex( BNINDEX, query, k, get.index = TRUE, get.distance = TRUE, num.threads = 1, subset = NULL, transposed = FALSE, ... ) ## S4 method for signature 'BiocNeighborGenericIndex' queryKnnFromIndex( BNINDEX, query, k, get.index = TRUE, get.distance = TRUE, num.threads = 1, subset = NULL, transposed = FALSE, ..., .check.nonfinite = TRUE ) queryKNN( X, query, k, get.index = TRUE, get.distance = TRUE, num.threads = 1, subset = NULL, transposed = FALSE, ..., BPPARAM = NULL, BNPARAM = NULL )
BNINDEX |
A BiocNeighborIndex object, typically created by |
query |
A numeric matrix or matrix-like object of query points, containing the same number of columns as |
k |
A positive integer scalar specifying the number of nearest neighbors to retrieve. Alternatively, an integer vector of length equal to the number of points in All |
get.index |
A logical scalar indicating whether the indices of the nearest neighbors should be recorded.
Setting this to Alternatively, if |
get.distance |
A logical scalar indicating whether distances to the nearest neighbors should be recorded.
Setting this to Alternatively, if |
num.threads |
Integer scalar specifying the number of threads to use for the search. |
subset |
An integer, logical or character vector indicating the rows of |
transposed |
A logical scalar indicating whether |
... |
For For |
.check.nonfinite |
Boolean indicating whether to check for non-finite values in |
X |
The reference dataset to be queried.
This should be a numeric matrix or matrix-like object where rows correspond to reference points and columns correspond to variables (i.e., dimensions).
Alternatively, a prebuilt BiocNeighborIndex object from |
BPPARAM |
Soft-deprecated, use |
BNPARAM |
A BiocNeighborParam object specifying how the index should be constructed.
If |
If multiple queries are to be performed to the same X, it may be beneficial to build the index from X with buildIndex.
The resulting pointer object can be supplied as X to multiple queryKNN calls, avoiding the need to repeat index construction in each call.
List containing index (if get.index is not FALSE) and distance (if get.distance is not FALSE).
If get.index=TRUE or "normal" and k is an integer scalar,
index is an integer matrix with k columns where each row corresponds to a point (denoted here as ) in query.
The -th row contains the indices of points in X that are the nearest neighbors to point , sorted by increasing distance from .
If get.index=FALSE or "transposed" and k is an integer scalar,
index is as described above but transposed, i.e., the i-th column contains the indices of neighboring points in X.
If get.distance=TRUE or "normal" and k is an integer scalar,
distance is a numeric matrix of the same dimensions as index.
The -th row contains the distances of neighboring points in X to the point , sorted in increasing order.
If get.distance=FALSE or "transposed" and k is an integer scalar,
distance is as described above but transposed, i.e., the i-th column contains the distances to neighboring points in X.
If get.index is not FALSE and k is an integer vector,
index is a list of integer vectors where each vector corresponds to a point (denoted here as ) in X.
The -th vector has length k[i] and contains the indices of points in X that are the nearest neighbors to point , sorted by increasing distance from .
If get.distance is not FALSE and k is an integer vector,
distance is a list of numeric vectors of the same lengths as those in index.
The -th vector contains the distances of neighboring points in X to the point , sorted in increasing order.
Aaron Lun
buildIndex, to build an index ahead of time.
queryDistance, to obtain the distance from each query point to its k-th nearest neighbor.
Y <- matrix(rnorm(100000), ncol=20) Z <- matrix(rnorm(20000), ncol=20) out <- queryKNN(Y, query=Z, k=5) head(out$index) head(out$distance)Y <- matrix(rnorm(100000), ncol=20) Z <- matrix(rnorm(20000), ncol=20) out <- queryKNN(Y, query=Z, k=5) head(out$index) head(out$distance)
Find all points in a reference dataset that lie within a threshold distance of each point in a query dataset.
queryNeighborsFromIndex( BNINDEX, query, threshold, get.index = TRUE, get.distance = TRUE, num.threads = 1, subset = NULL, transposed = FALSE, ... ) ## S4 method for signature 'BiocNeighborGenericIndex' queryNeighborsFromIndex( BNINDEX, query, threshold, get.index = TRUE, get.distance = TRUE, num.threads = 1, subset = NULL, transposed = FALSE, ..., .check.nonfinite = TRUE ) queryNeighbors( X, query, threshold, get.index = TRUE, get.distance = TRUE, num.threads = 1, subset = NULL, transposed = FALSE, ..., BPPARAM = NULL, BNPARAM = NULL )queryNeighborsFromIndex( BNINDEX, query, threshold, get.index = TRUE, get.distance = TRUE, num.threads = 1, subset = NULL, transposed = FALSE, ... ) ## S4 method for signature 'BiocNeighborGenericIndex' queryNeighborsFromIndex( BNINDEX, query, threshold, get.index = TRUE, get.distance = TRUE, num.threads = 1, subset = NULL, transposed = FALSE, ..., .check.nonfinite = TRUE ) queryNeighbors( X, query, threshold, get.index = TRUE, get.distance = TRUE, num.threads = 1, subset = NULL, transposed = FALSE, ..., BPPARAM = NULL, BNPARAM = NULL )
BNINDEX |
A BiocNeighborIndex object, typically created by |
query |
A numeric matrix or matrix-like object of query points, containing the same number of columns as |
threshold |
A positive numeric scalar specifying the maximum distance at which a point is considered a neighbor. Alternatively, a vector containing a different distance threshold for each query point. |
get.index |
A logical scalar indicating whether the indices of the neighbors should be recorded. |
get.distance |
A logical scalar indicating whether distances to the neighbors should be recorded. |
num.threads |
Integer scalar specifying the number of threads to use for the search. |
subset |
An integer, logical or character vector indicating the rows of |
transposed |
A logical scalar indicating whether |
... |
For For |
.check.nonfinite |
Boolean indicating whether to check for non-finite values in |
X |
The reference dataset to be queried.
This should be a numeric matrix or matrix-like object where rows correspond to reference points and columns correspond to variables (i.e., dimensions).
Alternatively, a prebuilt BiocNeighborIndex object from |
BPPARAM |
Soft-deprecated, use |
BNPARAM |
A BiocNeighborParam object specifying how the index should be constructed.
If |
This function identifies all points in X that within threshold of each point in query.
For Euclidean distances, this is equivalent to identifying all points in a hypersphere centered around the point of interest.
Not all implementations support this search mode, but we can use KmknnParam and VptreeParam.
If threshold is a vector, each entry is assumed to specify a (possibly different) threshold for each point in query.
If subset is also specified, each entry is assumed to specify a threshold for each point in subset.
An error will be raised if threshold is a vector of incorrect length.
If multiple queries are to be performed to the same X, it may be beneficial to build the index from X with buildIndex.
The resulting pointer object can be supplied as X to multiple queryKNN calls, avoiding the need to repeat index construction in each call.
A list is returned containing:
index, if get.index=TRUE.
This is a list of integer vectors where each entry corresponds to a point (denoted here as ) in query.
The vector for contains the set of row indices of all points in X that lie within threshold of point .
Neighbors for are sorted by increasing distance from .
distance, if get.distance=TRUE.
This is a list of numeric vectors where each entry corresponds to a point (as above) and contains the distances of the neighbors from .
Elements of each vector in distance match to elements of the corresponding vector in index.
If both get.index=FALSE and get.distance=FALSE, an integer vector is returned of length equal to the number of observations.
The i-th entry contains the number of neighbors of within threshold.
If subset is not NULL, each entry of the above vector/lists refers to a point in the subset, in the same order as supplied in subset.
Aaron Lun
buildIndex, to build an index ahead of time.
Y <- matrix(rnorm(100000), ncol=20) Z <- matrix(rnorm(20000), ncol=20) out <- queryNeighbors(Y, query=Z, threshold=3) summary(lengths(out$index))Y <- matrix(rnorm(100000), ncol=20) Z <- matrix(rnorm(20000), ncol=20) out <- queryNeighbors(Y, query=Z, threshold=3) summary(lengths(out$index))
Save an index for nearest-neighbor searching to disk.
saveIndex(BNINDEX, dir, ...) ## S4 method for signature 'BiocNeighborGenericIndex' saveIndex(BNINDEX, dir, ...)saveIndex(BNINDEX, dir, ...) ## S4 method for signature 'BiocNeighborGenericIndex' saveIndex(BNINDEX, dir, ...)
BNINDEX |
A BiocNeighborIndex object representing a pre-built index, typically from |
dir |
String containing the path to a directory in which to save the index. This directory should already exist. |
... |
Further arguments to pass to specific methods. |
Files generated by saveIndex are not guaranteed to be portable across architectures, compilers, or even versions of BiocNeighbors.
An index saved in this manner is only intended to be read back to the same R environment on the same machine.
One or more files are created on disk inside dir.
These can be used to reconstitute BNINDEX by calling loadIndex.
Aaron Lun
Y <- matrix(rnorm(100000), ncol=20) k.out <- buildIndex(Y) tmp <- tempfile() dir.create(tmp) saveIndex(k.out, tmp) list.files(tmp, recursive=TRUE)Y <- matrix(rnorm(100000), ncol=20) k.out <- buildIndex(Y) tmp <- tempfile() dir.create(tmp) saveIndex(k.out, tmp) list.files(tmp, recursive=TRUE)
A class to hold parameters for the vantage point (VP) tree algorithm for exact nearest neighbor identification.
VptreeParam(distance = c("Euclidean", "Manhattan", "Cosine")) ## S4 method for signature 'VptreeParam' defineBuilder(BNPARAM)VptreeParam(distance = c("Euclidean", "Manhattan", "Cosine")) ## S4 method for signature 'VptreeParam' defineBuilder(BNPARAM)
distance |
String specifying the distance metric to use. Cosine distances are implemented as Euclidean distances on L2-normalized coordinates. |
BNPARAM |
A VptreeParam instance. |
In a VP tree (Yianilos, 1993), each node contains a subset of points that is split into two further partitions. The split is determined by picking an arbitrary point inside that subset as the node center, computing the distance to all other points from the center, and taking the median as the “radius”. The left child of this node contains all points within the median distance from the radius, while the right child contains the remaining points. This is applied recursively until all points resolve to individual nodes. The nearest neighbor search traverses the tree and exploits the triangle inequality between query points, node centers and thresholds to narrow the search space.
VP trees are often faster than more conventional KD-trees or ball trees as the former uses the points themselves as the nodes of the tree, avoiding the need to create many intermediate nodes and reducing the total number of distance calculations. Like KMKNN, it is also trivially extended to find all neighbors within a threshold distance from a query point.
The VptreeParam constructor returns an instance of the VptreeParam class.
The defineBuilder method returns an external pointer that can be used in buildIndex to construct a VP tree index.
Aaron Lun
Yianilos PN (1993). Data structures and algorithms for nearest neighbor search in general metric spaces. Proceedings of the Fourth Annual ACM-SIAM Symposium on Discrete Algorithms, 311-321.
BiocNeighborParam, for the parent class and its available methods.
https://stevehanov.ca/blog/index.php?id=130, for a description of the algorithm.
(out <- VptreeParam())(out <- VptreeParam())