The BiocNeighbors package implements a variety of nearest-neighbor (NN) search algorithms with a consistent interface. This includes exact search methods like vantage point trees, k-means k-nearest neighbors, and an exhaustive brute-force search, as well as approximate methods like Approximate Nearest Neighbors Oh Yeah (Annoy) and hierarchical navigable small worlds (HNSW). We provides methods to search nearest neighbors within a dataset, query across datasets, and (for some algorithms) find all neighbors within a certain distance.
The findKNN()
function will find the
k
-nearest neighbors within a dataset, returning one matrix
of neighbor identities and another matrix of (sorted) distances to each
neighbor.
# Mocking up a matrix with 'nobs' points in the rows
# and 'ndim' dimensions in the columns.
nobs <- 1000
ndim <- 20
data <- matrix(runif(nobs*ndim), ncol=ndim)
library(BiocNeighbors)
fout <- findKNN(data, k=10)
str(fout)
## List of 2
## $ index : int [1:1000, 1:10] 10 501 106 881 721 486 775 62 461 1 ...
## $ distance: num [1:1000, 1:10] 1.06 1.199 0.992 1.155 1.028 ...
Each row of the index
matrix corresponds to a row in
data
and contains the row indices in data
that
are its nearest neighbors. For example, the 3rd point in
data
has the following nearest neighbors with their
(sorted) distances:
## [1] 106 116 791 493 68 840 939 736 388 400
## [1] 0.9923836 1.0118565 1.0627019 1.0715067 1.0875516 1.1378636 1.1385719
## [8] 1.1421896 1.1537396 1.1638757
Users can easily change the algorithm by setting the
BNPARAM=
argument. For example, the code below switches to
the Annoy algorithm for a faster, less accurate search.
## List of 2
## $ index : int [1:1000, 1:10] 10 501 106 881 721 486 775 62 461 1 ...
## $ distance: num [1:1000, 1:10] 1.06 1.199 0.992 1.155 1.028 ...
The same mechanism can be used to change the distance metric from the Euclidean default.
## List of 2
## $ index : int [1:1000, 1:10] 10 797 106 881 721 486 400 114 461 246 ...
## $ distance: num [1:1000, 1:10] 4.02 3.9 3.62 3.63 3.71 ...
If the number of neighbors differs for each point, we can supply an
integer vector to k=
instead. This yields a list of vectors
containing the neighbors and their distances to each point.
var.k <- sample(10, nrow(data), replace=TRUE)
# use I() to distinguish between scalar and length-1 vectors.
var.out <- findKNN(data, k=I(var.k))
head(var.out$index)
## [[1]]
## [1] 10 743 121 322 608 246 25 923 96 538
##
## [[2]]
## [1] 501 83
##
## [[3]]
## [1] 106 116 791
##
## [[4]]
## [1] 881 558 267 120 400 335 833
##
## [[5]]
## [1] 721
##
## [[6]]
## [1] 486 785 269 48 140
## [[1]]
## [1] 1.059995 1.195867 1.207465 1.208032 1.246542 1.246966 1.275415 1.285533
## [9] 1.287975 1.288305
##
## [[2]]
## [1] 1.198662 1.208656
##
## [[3]]
## [1] 0.9923836 1.0118565 1.0627019
##
## [[4]]
## [1] 1.155241 1.170470 1.184136 1.197051 1.197828 1.215869 1.221885
##
## [[5]]
## [1] 1.02773
##
## [[6]]
## [1] 1.030991 1.063953 1.082797 1.102278 1.103196
queryKNN()
is a related function that will find the
k
-nearest neighbors in one dataset based on query points in
another dataset. Here, the rows of the output matrices correspond to
rows of our query
matrix.
nquery <- 100
ndim <- 20
query <- matrix(runif(nquery*ndim), ncol=ndim)
qout <- queryKNN(data, query, k=5)
str(qout)
## List of 2
## $ index : int [1:100, 1:5] 349 540 520 686 443 783 493 440 94 71 ...
## $ distance: num [1:100, 1:5] 1.11 1.33 1.14 1.14 1.02 ...
When performing multiple calls to findKNN()
or
queryKNN()
on the same data, advanced users may prefer to
build the index first with buildIndex()
. This can be
efficiently reused without repeating the index construction in each
call.
In some applications, we need to find all neighboring points within a
certain distance threshold. This is achieved using the
findNeighbors()
function:
nobs <- 8000
ndim <- 20
data <- matrix(runif(nobs*ndim), ncol=ndim)
fout <- findNeighbors(data, threshold=1)
head(fout$index)
## [[1]]
## [1] 5845 5382
##
## [[2]]
## [1] 1461 1647
##
## [[3]]
## [1] 410
##
## [[4]]
## [1] 6666 3205 156 4552
##
## [[5]]
## [1] 2886 2443 3980 7484 3585 3148 5932
##
## [[6]]
## integer(0)
## [[1]]
## [1] 0.9209169 0.9849596
##
## [[2]]
## [1] 0.9741905 0.9803832
##
## [[3]]
## [1] 0.8953597
##
## [[4]]
## [1] 0.9568178 0.9717669 0.9795090 0.9995167
##
## [[5]]
## [1] 0.8810099 0.8948550 0.9425610 0.9534772 0.9583569 0.9727592 0.9996600
##
## [[6]]
## numeric(0)
Each entry of the index
list corresponds to a point in
data
and contains the row indices in data
that
are within threshold
. For example, the 3rd point in
data
has the following neighbors with the associated
distances:
## [1] 410
## [1] 0.8953597
Again, we can switch algorithms by specifying BNPARAM=
.
However, keep in mind that not all algorithms (particularly the
approximate methods) support this range-based search.
vparam <- VptreeParam(distance="Manhattan")
vout <- findNeighbors(data, threshold=1, BNPARAM=vparam)
queryNeighbors()
is a related function that will
identify all points within a certain distance of a query point. Each
entry of the returned index
and distance
corresponds to a row of query
and describes its neighbors
in data
.
nquery <- 100
ndim <- 20
query <- matrix(runif(nquery*ndim), ncol=ndim)
qout <- queryNeighbors(data, query, threshold=1)
head(qout$index)
## [[1]]
## [1] 2308 6812 1354 6347 3242 7625
##
## [[2]]
## [1] 6994 216 1693 5627
##
## [[3]]
## integer(0)
##
## [[4]]
## integer(0)
##
## [[5]]
## [1] 7
##
## [[6]]
## [1] 1784 3951 1147 2695 7216 4161 832 724 4625 6374 2597 265
As described above, advanced users can built the index first before
repeated calls to findNeighbors()
and
queryNeighbors()
.
prebuilt <- buildIndex(data)
out1 <- findNeighbors(prebuilt, threshold=1)
out2 <- queryNeighbors(prebuilt, query,threshold=1)
If only the number of neighbors is of interest, we can set
get.index=FALSE
and get.distance=FALSE
in the
findNeighbors()
/queryNeighbors()
call. This
will count the number of neighbors without storing their
identities/distances for greater memory efficiency.
## [1] 2 2 1 4 7 0
R/Bioconductor package developers can use BiocNeighbors
to perform nearest-neighbor searches in their own C++ code. This uses
the interfaces in the knncolle
library to abstract away the underlying search methods. First, we add
some linking instructions in the DESCRIPTION
to make the
header files available:
LinkingTo: Rcpp, assorthead, BiocNeighbors
This includes the BiocNeighbors.h
file that implements
the BiocNeigbors::Prebuilt
class:
When a buildIndex()
method returns an external pointer,
that pointer is expected to refer to an instance of a
BiocNeighbors::Prebuilt
. Thus, downstream code just needs
to accept a pointer and cast it to a
BiocNeighbors::Prebuilt
for use in nearest-neighbor
searches. (The same approach can be applied to obtain a
BiocNeighbors::Builder
for construction of new
indices.)
SEXP do_something(SEXP raw_ptr, int k) {
BiocNeighbors::PrebuiltPointer ptr(raw_ptr);
const auto& prebuilt = ptr->index;
auto searcher = prebuilt->initialize();
std::vector<int> indices;
std::vector<double> distances;
searcher->search(0, k, &indices, &distances);
return Rcpp::List::create(
Rcpp::IntegerVector(indices.begin(), indices.end()),
Rcpp::IntegerVector(distances.begin(), distances.end())
);
}
Some extra work is required for cosine-normalized indices, which is
indicated by ptr->cosine == true
. Any query data should
also be L2-normalized before use in Searcher::search()
or
Searcher::search_all()
. (This is not required for overloads
that accept an index for searching within a dataset, only for
queries between datasets.)
Developers can extend the BiocNeighbors
framework to more algorithms through the creation of new
BiocNeighborParam
classes. Users can then switch between
algorithms by simply changing the BNPARAM=
argument in
their calls to findKNN()
, etc. In general, we recommend
putting these extensions in a new R/Bioconductor package that adds more
methods to the BiocNeighbors
generics. Developers can choose to write pure R extensions or they can
write them partially in C++.
Each extension package should:
SomeNewParam
S4 class that inherits from
BiocNeighborParam
.SomeNewIndex
S4 class that inherits from
BiocNeighborIndex
and contains an arbitrary index
structure.buildIndex()
S4 method, dispatching on a
matrix
type for X=
and a
SomeNewParam
type for BNPARAM=
. This should
return an instance of SomeNewIndex
.findKNN()
and
queryKNN()
(and optionally findNeighbors()
queryNeighbors()
). One method should dispatch on a
matrix
type for X=
and a
SomeNewParam
type for BNPARAM=
. The other
method should dispatch on a prebuilt SomeNewIndex
type for
X=
and an ANY
type for BNPARAM=
(as the BNPARAM=
is ignored for prebuilt indices).Alternatively, each extension package should:
Builder
,
Prebuilt
and Searcher
interfaces in the knncolle
library. This requires LinkingTo: assorthead, BiocNeighbors
in the DESCRIPTION
.SomeNewParam
S4 class that inherits from
BiocNeighborParam
.SomeNewIndex
S4 class that inherits from
BiocNeighborGenericIndex
. This should contain a
ptr
slot for an external pointer and a names
slot for the observation names.defineBuilder()
method, dispatching on a
matrix
type for X=
and a
SomeNewParam
type for BNPARAM=
. This should
call into C++ and return a list containing builder
, an
external pointer to an instance of a BiocNeighbors::Builder
(see definition in BiocNeighbors.h
); and
class
, the constructor for the new
SomeNewIndex
class.No new methods are required for findKNN()
,
queryNeighbors()
, etc. as a
BiocNeighborGeneric
method is already available for each
generic. This approach also allows the new method to be used in C++ code
of downstream packages that accept a
BiocNeighbors::Prebuilt
instance.
## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] BiocNeighbors_2.1.0 knitr_1.48 BiocStyle_2.33.1
##
## loaded via a namespace (and not attached):
## [1] digest_0.6.37 R6_2.5.1 fastmap_1.2.0
## [4] xfun_0.48 maketools_1.3.1 cachem_1.1.0
## [7] htmltools_0.5.8.1 rmarkdown_2.28 buildtools_1.0.0
## [10] lifecycle_1.0.4 cli_3.6.3 sass_0.4.9
## [13] jquerylib_0.1.4 compiler_4.4.1 sys_3.4.3
## [16] tools_4.4.1 evaluate_1.0.1 bslib_0.8.0
## [19] Rcpp_1.0.13 yaml_2.3.10 BiocManager_1.30.25
## [22] jsonlite_1.8.9 rlang_1.1.4