Description

Approximate k-Nearest Neighbour Search for 'bigmemory' Matrices with Annoy.

Approximate Euclidean k-nearest neighbour search routines that operate on 'bigmemory::big.matrix' data through Annoy indexes created with 'RcppAnnoy'. The package builds persistent on-disk indexes plus sidecar metadata from streamed 'big.matrix' rows, supports Euclidean, angular, Manhattan, and dot-product Annoy metrics, and can either return in-memory results or stream neighbour indices and distances into destination 'bigmemory' matrices. Explicit index lifecycle helpers, metadata validation, descriptor-aware file-backed workflows, and benchmark helpers are also included.

bigANNOY

Approximate nearest-neighbour search for bigmemory matrices with Annoy

Frédéric Bertrand


The bigANNOY package provides approximate nearest-neighbour search specialised for bigmemory::big.matrix objects through persisted Annoy indexes. It keeps the reference data in bigmemory storage during build and query workflows, supports repeated-query sessions through explicit open/load helpers, and can stream neighbour indices and distances directly into destination big.matrix objects.

Current features include:

  • native C++ bigmemory-backed build and search paths, with an R backend kept as a debug-only fallback,
  • persisted Annoy indexes plus sidecar metadata for safe reopen and validation,
  • Euclidean, angular, Manhattan, and dot-product Annoy metrics,
  • self-search and external-query workflows on dense matrices, big.matrix objects, descriptors, descriptor paths, and external pointers,
  • streamed output into file-backed or in-memory big.matrix destinations,
  • explicit lifecycle helpers such as annoy_open_index(), annoy_load_bigmatrix(), annoy_is_loaded(), annoy_close_index(), and annoy_validate_index(), and
  • benchmark helpers that can compare approximate Euclidean search against the exact bigKNN baseline when bigKNN is available.

These workflows make bigANNOY useful both as a standalone approximate search package and as the ANN side of an exact-versus-approximate evaluation pipeline built around bigKNN.

Installation

The package is currently easiest to install from GitHub:

# install.packages("remotes")
remotes::install_github("fbertran/bigANNOY")

If you prefer a local source install, clone the repository and run:

R CMD build bigANNOY
R CMD INSTALL bigANNOY_0.3.0.tar.gz

Options

The package defines a small set of runtime options:

Option | Default value | Description
bigANNOY.block_size | 1024L | Default number of rows processed per build/search block.
bigANNOY.progress | FALSE | Emit simple progress messages during long-running builds, searches, and benchmarks.
bigANNOY.backend | "cpp" | Backend request. "cpp" uses the native compiled backend, "auto" falls back when compiled symbols are not loaded, and "r" forces the debug-only R backend.

All options can be changed with options() at runtime. For example, options(bigANNOY.block_size = 2048L) increases the default block size used by the build and search helpers.
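
A common pattern is to change an option for one piece of work and then restore the previous setting. The sketch below uses only base R's options() and getOption(); the option names are the ones listed in this section, and the values chosen are purely illustrative.

```r
# options() returns the previous values, so they can be restored later.
old <- options(
  bigANNOY.block_size = 2048L,  # illustrative value, not a recommendation
  bigANNOY.progress = TRUE
)

# Inspect the value that build and search helpers would now pick up.
current_block_size <- getOption("bigANNOY.block_size")
print(current_block_size)

# ... run builds or searches here ...

# Restore whatever was set before (including "unset").
options(old)
```

Inside a function, wrapping the restore step in on.exit(options(old), add = TRUE) keeps the change from leaking if an error occurs.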

Examples

The examples below use a small Euclidean reference matrix so the returned neighbours are easy to inspect.

Build and query an Annoy index

library(bigmemory)
library(bigANNOY)

reference <- as.big.matrix(matrix(
  c(0, 0,
    1, 0,
    0, 1,
    1, 1,
    2, 2),
  ncol = 2,
  byrow = TRUE
))

query <- matrix(
  c(0.1, 0.1,
    1.8, 1.9),
  ncol = 2,
  byrow = TRUE
)

index <- annoy_build_bigmatrix(
  reference,
  path = tempfile(fileext = ".ann"),
  metric = "euclidean",
  n_trees = 20L,
  seed = 123L,
  load_mode = "eager"
)

result <- annoy_search_bigmatrix(
  index,
  query = query,
  k = 2L,
  search_k = 100L
)

result$index
round(result$distance, 3)
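
Because the reference matrix is tiny, the approximate result can be checked against a brute-force Euclidean search. The following is a minimal base-R sketch and not part of bigANNOY; exact_knn() is a hypothetical helper, and the comparison assumes that result$index holds 1-based row numbers of the reference matrix with one row per query.

```r
# Same toy data as above, held as ordinary R matrices.
reference <- matrix(
  c(0, 0,
    1, 0,
    0, 1,
    1, 1,
    2, 2),
  ncol = 2,
  byrow = TRUE
)
query <- matrix(
  c(0.1, 0.1,
    1.8, 1.9),
  ncol = 2,
  byrow = TRUE
)

# Hypothetical helper: exact k nearest neighbours by exhaustive search.
exact_knn <- function(reference, query, k) {
  index <- matrix(NA_integer_, nrow(query), k)
  distance <- matrix(NA_real_, nrow(query), k)
  for (i in seq_len(nrow(query))) {
    # Subtract the query point from every reference row, column by column.
    diffs <- sweep(reference, 2, query[i, ])
    d <- sqrt(rowSums(diffs^2))
    nearest <- order(d)[seq_len(k)]
    index[i, ] <- nearest
    distance[i, ] <- d[nearest]
  }
  list(index = index, distance = distance)
}

exact <- exact_knn(reference, query, k = 2L)
exact$index
round(exact$distance, 3)
```

For this data the first query is closest to row 1 (distance 0.141) and the second to rows 5 and 4 (distances 0.224 and 1.204). Rows 2 and 3 are equidistant from the first query, so either may appear as its second neighbour.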

Reopen and validate a persisted index

reopened <- annoy_open_index(index$path, load_mode = "lazy")

annoy_is_loaded(reopened)

report <- annoy_validate_index(
  reopened,
  strict = TRUE,
  load = TRUE
)

report$valid
annoy_is_loaded(reopened)

Stream results into bigmemory outputs

index_store <- big.matrix(nrow(query), 2L, type = "integer")
distance_store <- big.matrix(nrow(query), 2L, type = "double")

annoy_search_bigmatrix(
  index,
  query = query,
  k = 2L,
  xpIndex = index_store,
  xpDistance = distance_store
)

bigmemory::as.matrix(index_store)
round(bigmemory::as.matrix(distance_store), 3)
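
The destinations can also be file-backed, which keeps results on disk and lets another session reattach them. The sketch below only uses bigmemory functions (filebacked.big.matrix() and attach.big.matrix()); the file names and the temporary directory are illustrative assumptions, and the search call is left as a comment because it reuses index and query from the examples above.

```r
library(bigmemory)

n_query <- 2L  # number of query rows in the example above
k <- 2L

# Illustrative output directory; use a persistent path in real workflows.
out_dir <- tempfile("bigannoy-out-")
dir.create(out_dir)

index_store <- filebacked.big.matrix(
  n_query, k,
  type = "integer",
  backingfile = "nn_index.bin",
  backingpath = out_dir,
  descriptorfile = "nn_index.desc"
)
distance_store <- filebacked.big.matrix(
  n_query, k,
  type = "double",
  backingfile = "nn_distance.bin",
  backingpath = out_dir,
  descriptorfile = "nn_distance.desc"
)

# Pass the destinations exactly as in the in-memory example:
# annoy_search_bigmatrix(index, query = query, k = k,
#                        xpIndex = index_store, xpDistance = distance_store)

# Later, possibly from another R session, reattach through the descriptor file.
reattached <- attach.big.matrix(file.path(out_dir, "nn_index.desc"))
dim(reattached)
```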

Benchmark approximate Euclidean search

benchmark_annoy_bigmatrix(
  n_ref = 2000L,
  n_query = 200L,
  n_dim = 20L,
  k = 10L,
  n_trees = 50L,
  search_k = 1000L,
  metric = "euclidean",
  exact = TRUE
)

If bigKNN is installed, the Euclidean benchmark helpers also report exact search timing and recall against the exact baseline.
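
Recall here means the share of true nearest neighbours that the approximate search also returned. The base-R sketch below shows one common definition, recall at k averaged over queries; recall_at_k() is a hypothetical helper and the benchmark helpers may compute recall differently in detail.

```r
# Hypothetical helper: fraction of exact neighbours recovered by the
# approximate search, ignoring the order within each row.
recall_at_k <- function(approx_index, exact_index) {
  stopifnot(identical(dim(approx_index), dim(exact_index)))
  hits <- vapply(
    seq_len(nrow(exact_index)),
    function(i) length(intersect(approx_index[i, ], exact_index[i, ])),
    integer(1)
  )
  sum(hits) / length(exact_index)
}

# Made-up neighbour indices for two queries with k = 3.
exact_index <- rbind(c(1L, 2L, 3L), c(4L, 5L, 6L))
approx_index <- rbind(c(1L, 3L, 7L), c(6L, 5L, 4L))

recall <- recall_at_k(approx_index, exact_index)
recall
```

In this made-up case the first query recovers two of three true neighbours and the second recovers all three, so the recall is 5/6.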

Installed Benchmark Runner

An installed command-line benchmark script is also available at:

system.file("benchmarks", "benchmark_annoy.R", package = "bigANNOY")

Example single-run command:

Rscript "$(Rscript -e 'cat(system.file("benchmarks", "benchmark_annoy.R", package = "bigANNOY"))')" \
  --mode=single \
  --n_ref=5000 \
  --n_query=500 \
  --n_dim=50 \
  --k=20 \
  --n_trees=100 \
  --search_k=5000 \
  --load_mode=eager

Vignettes

The package now ships with focused vignettes for the main workflows:

  • getting-started-bigannoy
  • persistent-indexes-and-lifecycle
  • file-backed-bigmemory-workflows
  • benchmarking-recall-and-latency
  • metrics-and-tuning
  • validation-and-sharing-indexes
  • bigannoy-vs-bigknn

Together they cover the basic ANN workflow, loaded-index lifecycle, file-backed bigmemory usage, benchmarking and recall evaluation, tuning, validation and sharing of persisted indexes, and the relationship between approximate bigANNOY search and exact bigKNN search.

Metadata

Version

0.3.0

License

Unknown

Platforms (80)

    Darwin
    FreeBSD
    Genode
    GHCJS
    Linux
    MMIXware
    NetBSD
    none
    OpenBSD
    Redox
    Solaris
    uefi
    WASI
    Windows
  • aarch64-darwin
  • aarch64-freebsd
  • aarch64-genode
  • aarch64-linux
  • aarch64-netbsd
  • aarch64-none
  • aarch64-uefi
  • aarch64-windows
  • aarch64_be-none
  • arc-linux
  • arm-none
  • armv5tel-linux
  • armv6l-linux
  • armv6l-netbsd
  • armv6l-none
  • armv7a-linux
  • armv7a-netbsd
  • armv7l-linux
  • armv7l-netbsd
  • avr-none
  • i686-cygwin
  • i686-freebsd
  • i686-genode
  • i686-linux
  • i686-netbsd
  • i686-none
  • i686-openbsd
  • i686-windows
  • javascript-ghcjs
  • loongarch64-linux
  • m68k-linux
  • m68k-netbsd
  • m68k-none
  • microblaze-linux
  • microblaze-none
  • microblazeel-linux
  • microblazeel-none
  • mips-linux
  • mips-none
  • mips64-linux
  • mips64-none
  • mips64el-linux
  • mipsel-linux
  • mipsel-netbsd
  • mmix-mmixware
  • msp430-none
  • or1k-none
  • powerpc-linux
  • powerpc-netbsd
  • powerpc-none
  • powerpc64-linux
  • powerpc64le-linux
  • powerpcle-none
  • riscv32-linux
  • riscv32-netbsd
  • riscv32-none
  • riscv64-linux
  • riscv64-netbsd
  • riscv64-none
  • rx-none
  • s390-linux
  • s390-none
  • s390x-linux
  • s390x-none
  • sh4-linux
  • vc4-none
  • wasm32-wasi
  • wasm64-wasi
  • x86_64-cygwin
  • x86_64-darwin
  • x86_64-freebsd
  • x86_64-genode
  • x86_64-linux
  • x86_64-netbsd
  • x86_64-none
  • x86_64-openbsd
  • x86_64-redox
  • x86_64-solaris
  • x86_64-uefi
  • x86_64-windows