Description

Lightweight Vector Database with Embedded Machine Learning Models.

Description

A lightweight vector database for text retrieval in R with embedded machine learning models and no external API (Application Programming Interface) keys. Supports dense and hybrid search, optional HNSW (Hierarchical Navigable Small World) approximate nearest-neighbor indexing, faceted filters with ACL (Access Control List) metadata, command-line tools, and a local dashboard built with 'shiny'. The HNSW method is described by Malkov and Yashunin (2018) <doi:10.1109/TPAMI.2018.2889473>.

README.md

cran.r-project.org

VectrixDB

Overview

Zero config. Text in, results out. Pure R - no Python required.

VectrixDB is a lightweight, pure R vector database with built-in text embeddings. No external dependencies, no API keys, no Python - just install and use.

Installation

# Install from GitHub
devtools::install_github("knowusuboaky/vectrixdb-r")

# Required dependencies
install.packages(c("R6", "text2vec", "digest", "Matrix"))

# Optional for better performance
install.packages(c("RcppAnnoy", "word2vec", "stopwords"))

Quick Start

library(VectrixDB)

# Create and add - ONE LINE
db <- Vectrix$new("my_docs")$add(c("Python is great", "Machine learning is fun", "R is awesome"))

# Search - ONE LINE
results <- db$search("programming")

# Get top result
print(results$top()$text)
#> "R is awesome"

Features

Pure R - No Python Required

Unlike other vector databases, VectrixDB is 100% R: - TF-IDF embeddings - Built-in, works out of the box - BM25 search - Using text2vec - Optional word vectors - Load GloVe or word2vec files - ANN indexing - Fast search with RcppAnnoy

Search Modes

# Dense (semantic) search using TF-IDF/word vectors
results <- db$search("query", mode = "dense")

# Sparse (keyword/BM25) search
results <- db$search("query", mode = "sparse")

# Hybrid (dense + sparse with RRF fusion)
results <- db$search("query", mode = "hybrid")

# Ultimate (hybrid + reranking)
results <- db$search("query", mode = "ultimate")

Embedding Options

# Default: TF-IDF (no external files needed)
db <- Vectrix$new("docs")

# With GloVe word vectors (download from Stanford NLP)
db <- Vectrix$new("docs", model = "glove", model_path = "glove.6B.100d.txt")

# With word2vec (download or train your own)
db <- Vectrix$new("docs", model = "word2vec", model_path = "GoogleNews-vectors.bin")

# Custom embedding function
my_embed <- function(texts) {
  # Your custom logic
  matrix(rnorm(length(texts) * 100), nrow = length(texts))
}
db <- Vectrix$new("docs", embed_fn = my_embed, dimension = 100)

Metadata Filtering

# Add with metadata
db$add(
  texts = c("Python guide", "ML tutorial", "R handbook"),
  metadata = list(
    list(category = "programming", year = 2024),
    list(category = "ai", year = 2024),
    list(category = "programming", year = 2023)
  )
)

# Search with filter
results <- db$search("guide", filter = list(category = "programming"))

Reranking

# MMR for diversity
results <- db$search("AI", rerank = "mmr", diversity = 0.7)

# Cross-encoder style reranking
results <- db$search("AI", rerank = "cross-encoder")

Results API

results <- db$search("query")

# Access results
results$top()          # Get top result
results$texts()        # All result texts as vector
results$ids()          # All result IDs
results$scores()       # All scores
results$length()       # Number of results
results$get(2)         # Get second result

# Iterate
results$foreach(function(r) {
  cat(sprintf("%s: %.3f\n", r$id, r$score))
})

Advanced Usage

VectrixDB Class

# Create database
vdb <- VectrixDB$new("./my_data")

# Create collection with custom dimension
collection <- vdb$create_collection("docs", dimension = 100)

# Add vectors directly
collection$add(
  ids = c("doc1", "doc2"),
  vectors = matrix(rnorm(200), nrow = 2),
  metadata = list(list(a = 1), list(a = 2)),
  texts = c("First doc", "Second doc")
)

# Search
results <- collection$search(query_vector, limit = 10)

# Hybrid search
results <- collection$hybrid_search(
  query = query_vector,
  query_text = "search terms",
  limit = 10
)

REST API Server

# Start server (requires plumber package)
vectrix_serve(path = "./my_data", port = 7377)

Performance Tips

Use RcppAnnoy for large collections - Automatically enabled for 100+ docs
Pre-train embeddings - Fit the embedder once on your corpus
Use word vectors - GloVe/word2vec provides better semantic search than TF-IDF

# Pre-train for better embeddings
db <- Vectrix$new("docs")
# Add your corpus - TF-IDF will fit automatically
db$add(large_corpus)
# Now searches will use the fitted vocabulary

Comparison with Python Version

Feature	R Version	Python Version
Dependencies	Pure R	Python + ONNX
Default embeddings	TF-IDF	MiniLM (neural)
Word vectors	GloVe, word2vec	Same + more
ANN indexing	RcppAnnoy	usearch
API server	plumber	FastAPI

The R version prioritizes simplicity and R-native tools. For neural embeddings, use the Python version or provide custom embed_fn.

Dependencies

Required: - R6 - OOP classes - text2vec - TF-IDF, BM25, vocabulary - digest - ID generation - Matrix - Sparse matrices

Optional: - RcppAnnoy - Fast ANN search - word2vec - Load word2vec models - stopwords - Remove stopwords - RSQLite, DBI - Persistent storage - plumber - REST API

License

Apache License 2.0

Author

Kwadwo Daddy Nyame Owusu Boakye

r-VectrixDB

VectrixDB

Overview

Installation

Quick Start

Features

Pure R - No Python Required

Search Modes

Embedding Options

Metadata Filtering

Reranking

Results API

Advanced Usage

VectrixDB Class

REST API Server

Performance Tips

Comparison with Python Version

Dependencies

License

Author

Links

Version

License

Status

Source

Homepage

Platforms (80)

VectrixDB

Overview

Installation

Quick Start

Features

Pure R - No Python Required

Search Modes

Embedding Options

Metadata Filtering

Reranking

Results API

Advanced Usage

VectrixDB Class

REST API Server

Performance Tips

Comparison with Python Version

Dependencies

License

Author

Links

Version

License

Status

Source

Homepage

Platforms80 (80)

Platforms (80)