Description

Columnar Query Engine for Larger-than-RAM Data.

Description

A minimal columnar query engine with lazy execution on datasets larger than RAM. Provides 'dplyr'-like verbs (filter(), select(), mutate(), group_by(), summarise(), joins, window functions) and common aggregations (n(), sum(), mean(), min(), max(), sd(), first(), last()) backed by a pure C11 pull-based execution engine and a custom on-disk format ('.vtr'). Reads and writes 'GeoTIFF' (including tiled and 'BigTIFF' layouts) and a tiled raster format ('.vec') with overview pyramids and time cubes for larger-than-RAM raster data.

README.md

cran.r-project.org

vectra

vectra is an R-native columnar query engine for datasets larger than RAM.

Write dplyr-style pipelines against multi-GB files on a laptop. Data streams through a C11 pull-based engine one row group at a time, so peak memory stays bounded regardless of file size.

Quick Start

Point vectra at any file and query it with dplyr verbs. Nothing runs until collect().

library(vectra)

# CSV: lazy scan with type inference
tbl_csv("measurements.csv") |>
  filter(temperature > 30, year >= 2020) |>
  group_by(station) |>
  summarise(avg_temp = mean(temperature), n = n()) |>
  collect()

# GeoTIFF: climate rasters as tidy data
tbl_tiff("worldclim_bio1.tif") |>
  filter(band1 > 0) |>
  mutate(temp_c = band1 / 10) |>
  collect()

# Point extraction: sample raster values at coordinates, no terra needed
tiff_extract_points("worldclim_bio1.tif",
                    x = c(10.5, 11.2), y = c(47.1, 47.3))

# SQLite: zero-dependency, no DBI required
tbl_sqlite("survey.db", "responses") |>
  filter(year == 2025) |>
  left_join(tbl_sqlite("survey.db", "sites"), by = "site_id") |>
  collect()

For repeated queries, convert to vectra's native .vtr format for faster reads:

write_vtr(big_df, "data.vtr", batch_size = 100000)

tbl("data.vtr") |>
  filter(x > 0, region == "EU") |>
  group_by(region) |>
  summarise(total = sum(value), n = n()) |>
  collect()

Append new data without rewriting the file, or do a key-based diff between two snapshots:

# Append new rows as a new row group; existing data untouched
append_vtr(new_rows_df, "data.vtr")

# Logical diff: what was added or deleted between two snapshots?
d <- diff_vtr("snapshot_old.vtr", "snapshot_new.vtr", key_col = "id")
collect(d$added)   # rows present in new but not old
d$deleted          # key values present in old but not new

Fuzzy string matching runs inside the C engine, no round-trip to R:

tbl("taxa.vtr") |>
  filter(levenshtein(species, "Quercus robur") <= 2) |>
  mutate(similarity = jaro_winkler(species, "Quercus robur")) |>
  arrange(desc(similarity)) |>
  collect()

s <- vtr_schema(
  fact    = tbl("observations.vtr"),
  species = link("sp_id", tbl("species.vtr")),
  site    = link("site_id", tbl("sites.vtr"))
)

# Pull columns from any dimension; joins are built automatically
lookup(s, count, species$name, site$habitat) |> collect()
#> species: all 500 keys matched
#> site: 3/500 unmatched keys (X1, X2, X3)

Use explain() to inspect the optimized plan:

tbl("data.vtr") |>
  filter(x > 0) |>
  select(id, x) |>
  explain()
#> vectra execution plan
#>
#> ProjectNode [streaming]
#>   FilterNode [streaming]
#>     ScanNode [streaming, 2/5 cols (pruned), predicate pushdown, v3 stats]
#>
#> Output columns (2):
#>   id <int64>
#>   x <double>

Why vectra

Querying large datasets in R usually means Arrow (requires compiled binaries matching your platform), DuckDB (links a 30 MB bundled library), or Spark (requires a JVM and cluster configuration).

vectra is a self-contained C11 engine compiled as a standard R extension. No external libraries, no JVM, no runtime configuration. It provides:

Streaming execution: data flows one row group at a time, never fully in memory
Zero-copy filtering: selection vectors avoid row duplication
Query optimizer: column pruning skips unneeded columns at scan; predicate pushdown uses per-rowgroup min/max statistics to skip entire row groups
Hash joins: build right, stream left; join a 50 GB fact table against a lookup without materializing both
External sort: 1 GB memory budget with automatic spill-to-disk
Window functions: row_number(), rank(), dense_rank(), lag(), lead(), cumsum(), cummean(), cummin(), cummax()
String expressions: nchar(), substr(), grepl() evaluated in the engine without round-tripping to R
Multiple data sources: .vtr, CSV, SQLite, GeoTIFF --- all produce the same lazy query nodes
Integer TIFF output: write rasters as int16/int32/uint8/uint16/float32 with embedded GDAL metadata for 5-10x smaller files

Features

Category	Verbs
Transform	`filter()`, `select()`, `mutate()`, `transmute()`, `rename()`, `relocate()`
Aggregate	`group_by()`, `summarise()` (`n`, `sum`, `mean`, `min`, `max`, `sd`, `var`, `first`, `last`, `any`, `all`, `median`, `n_distinct`), `count()`, `tally()`, `distinct()`
Join	`left_join()`, `inner_join()`, `right_join()`, `full_join()`, `semi_join()`, `anti_join()`, `cross_join()`, `lookup()`
Order	`arrange()`, `slice_head()`, `slice_tail()`, `slice_min()`, `slice_max()`, `slice()`
Window	`row_number()`, `rank()`, `dense_rank()`, `lag()`, `lead()`, `cumsum()`, `cummean()`, `cummin()`, `cummax()`, `ntile()`, `percent_rank()`, `cume_dist()`
Date/Time	`year()`, `month()`, `day()`, `hour()`, `minute()`, `second()`, `as.Date()` (in `filter()`/`mutate()`)
String	`nchar()`, `substr()`, `grepl()`, `tolower()`, `toupper()`, `trimws()`, `paste0()`, `gsub()`, `sub()`, `startsWith()`, `endsWith()` (in `filter()`/`mutate()`)
String similarity	`levenshtein()`, `levenshtein_norm()`, `dl_dist()`, `dl_dist_norm()`, `jaro_winkler()`: fuzzy matching in `filter()`/`mutate()`, with optional `max_dist` early termination
Expression	`abs()`, `sqrt()`, `log()`, `exp()`, `floor()`, `ceiling()`, `round()`, `log2()`, `log10()`, `sign()`, `trunc()`, `if_else()`, `between()`, `%in%`, `as.numeric()`, `pmin()`, `pmax()`, `resolve()`, `propagate()` (in `filter()`/`mutate()`)
Combine	`bind_rows()`, `bind_cols()`, `across()`
Schema	`vtr_schema()`, `link()`, `lookup()`: star schema definition and dimension lookup with match reporting
I/O	`tbl()`, `tbl_csv()`, `tbl_sqlite()`, `tbl_tiff()`, `write_vtr()`, `write_csv()`, `write_sqlite()`, `write_tiff()`, `tiff_extract_points()`, `tiff_metadata()`, `append_vtr()`, `delete_vtr()`, `diff_vtr()`
Inspect	`explain()`, `glimpse()`, `print()`, `pull()`

Full tidyselect support in select(), rename(), relocate(), and across(): starts_with(), ends_with(), contains(), matches(), where(), everything(), all_of(), any_of().

Installation

# CRAN
install.packages("vectra")

# Development version
pak::pak("gcol33/vectra")

Documentation

Getting Started: full walkthrough with runnable examples
Format Backends: CSV, SQLite, Excel, GeoTIFF, and streaming conversion pipelines
Joins: all join types, fuzzy joins, key coercion, and memory model
Star Schemas: dimension lookups, match reporting, and avoiding flat-table column creep
String Operations: pattern matching, fuzzy matching, and block lookups
Indexing and Optimization: hash indexes, zone-map pruning, column pruning, and reading explain() output
Working with Large Data: streaming pipelines, append/delete/diff, external sort, and memory budgeting
Engine Reference: execution model, types, coercion, .vtr format, and limitations
Function Reference

Support

"Software is like sex: it's better when it's free." — Linus Torvalds

If this package saved you some time, buying me a coffee is a nice way to say thanks.

License

MIT (see the LICENSE.md file)

Citation

@software{vectra,
  author = {Colling, Gilles},
  title = {vectra: Columnar Query Engine for Larger-Than-RAM Data},
  year = {2026},
  url = {https://github.com/gcol33/vectra}
}

r-vectra

vectra

Quick Start

Why vectra

Features

Installation

Documentation

Support

License

Citation

Version

License

Status

Source

Homepage

Platforms (80)

vectra

Quick Start

Why vectra

Features

Installation

Documentation

Support

License

Citation

Version

License

Status

Source

Homepage

Platforms80 (80)

Platforms (80)