MyNixOS website logo
Description

Columnar Query Engine for Larger-than-RAM Data.

A minimal columnar query engine with lazy execution on datasets larger than RAM. Provides 'dplyr'-like verbs (filter(), select(), mutate(), group_by(), summarise(), joins, window functions) and common aggregations (n(), sum(), mean(), min(), max(), sd(), first(), last()) backed by a pure C11 pull-based execution engine and a custom on-disk format ('.vtr'). Reads and writes 'GeoTIFF' (including tiled and 'BigTIFF' layouts) and a tiled raster format ('.vec') with overview pyramids and time cubes for larger-than-RAM raster data.

vectra

CRAN status R-CMD-check ASAN/UBSAN Codecov test coverage License: MIT

vectra is an R-native columnar query engine for datasets larger than RAM.

Write dplyr-style pipelines against multi-GB files on a laptop. Data streams through a C11 pull-based engine one row group at a time, so peak memory stays bounded regardless of file size.

Quick Start

Point vectra at any file and query it with dplyr verbs. Nothing runs until collect().

library(vectra)

# CSV: lazy scan with type inference
tbl_csv("measurements.csv") |>
  filter(temperature > 30, year >= 2020) |>
  group_by(station) |>
  summarise(avg_temp = mean(temperature), n = n()) |>
  collect()

# GeoTIFF: climate rasters as tidy data
tbl_tiff("worldclim_bio1.tif") |>
  filter(band1 > 0) |>
  mutate(temp_c = band1 / 10) |>
  collect()

# Point extraction: sample raster values at coordinates, no terra needed
tiff_extract_points("worldclim_bio1.tif",
                    x = c(10.5, 11.2), y = c(47.1, 47.3))

# SQLite: zero-dependency, no DBI required
tbl_sqlite("survey.db", "responses") |>
  filter(year == 2025) |>
  left_join(tbl_sqlite("survey.db", "sites"), by = "site_id") |>
  collect()

For repeated queries, convert to vectra's native .vtr format for faster reads:

write_vtr(big_df, "data.vtr", batch_size = 100000)

tbl("data.vtr") |>
  filter(x > 0, region == "EU") |>
  group_by(region) |>
  summarise(total = sum(value), n = n()) |>
  collect()

Append new data without rewriting the file, or do a key-based diff between two snapshots:

# Append new rows as a new row group; existing data untouched
append_vtr(new_rows_df, "data.vtr")

# Logical diff: what was added or deleted between two snapshots?
d <- diff_vtr("snapshot_old.vtr", "snapshot_new.vtr", key_col = "id")
collect(d$added)   # rows present in new but not old
d$deleted          # key values present in old but not new

Fuzzy string matching runs inside the C engine, no round-trip to R:

tbl("taxa.vtr") |>
  filter(levenshtein(species, "Quercus robur") <= 2) |>
  mutate(similarity = jaro_winkler(species, "Quercus robur")) |>
  arrange(desc(similarity)) |>
  collect()

Register a star schema to avoid flat-table column creep. Define the links once, then pull only what you need:

s <- vtr_schema(
  fact    = tbl("observations.vtr"),
  species = link("sp_id", tbl("species.vtr")),
  site    = link("site_id", tbl("sites.vtr"))
)

# Pull columns from any dimension; joins are built automatically
lookup(s, count, species$name, site$habitat) |> collect()
#> species: all 500 keys matched
#> site: 3/500 unmatched keys (X1, X2, X3)

Use explain() to inspect the optimized plan:

tbl("data.vtr") |>
  filter(x > 0) |>
  select(id, x) |>
  explain()
#> vectra execution plan
#>
#> ProjectNode [streaming]
#>   FilterNode [streaming]
#>     ScanNode [streaming, 2/5 cols (pruned), predicate pushdown, v3 stats]
#>
#> Output columns (2):
#>   id <int64>
#>   x <double>

Why vectra

Querying large datasets in R usually means Arrow (requires compiled binaries matching your platform), DuckDB (links a 30 MB bundled library), or Spark (requires a JVM and cluster configuration).

vectra is a self-contained C11 engine compiled as a standard R extension. No external libraries, no JVM, no runtime configuration. It provides:

  • Streaming execution: data flows one row group at a time, never fully in memory
  • Zero-copy filtering: selection vectors avoid row duplication
  • Query optimizer: column pruning skips unneeded columns at scan; predicate pushdown uses per-rowgroup min/max statistics to skip entire row groups
  • Hash joins: build right, stream left; join a 50 GB fact table against a lookup without materializing both
  • External sort: 1 GB memory budget with automatic spill-to-disk
  • Window functions: row_number(), rank(), dense_rank(), lag(), lead(), cumsum(), cummean(), cummin(), cummax()
  • String expressions: nchar(), substr(), grepl() evaluated in the engine without round-tripping to R
  • Multiple data sources: .vtr, CSV, SQLite, GeoTIFF --- all produce the same lazy query nodes
  • Integer TIFF output: write rasters as int16/int32/uint8/uint16/float32 with embedded GDAL metadata for 5-10x smaller files

Features

CategoryVerbs
Transformfilter(), select(), mutate(), transmute(), rename(), relocate()
Aggregategroup_by(), summarise() (n, sum, mean, min, max, sd, var, first, last, any, all, median, n_distinct), count(), tally(), distinct()
Joinleft_join(), inner_join(), right_join(), full_join(), semi_join(), anti_join(), cross_join(), lookup()
Orderarrange(), slice_head(), slice_tail(), slice_min(), slice_max(), slice()
Windowrow_number(), rank(), dense_rank(), lag(), lead(), cumsum(), cummean(), cummin(), cummax(), ntile(), percent_rank(), cume_dist()
Date/Timeyear(), month(), day(), hour(), minute(), second(), as.Date() (in filter()/mutate())
Stringnchar(), substr(), grepl(), tolower(), toupper(), trimws(), paste0(), gsub(), sub(), startsWith(), endsWith() (in filter()/mutate())
String similaritylevenshtein(), levenshtein_norm(), dl_dist(), dl_dist_norm(), jaro_winkler(): fuzzy matching in filter()/mutate(), with optional max_dist early termination
Expressionabs(), sqrt(), log(), exp(), floor(), ceiling(), round(), log2(), log10(), sign(), trunc(), if_else(), between(), %in%, as.numeric(), pmin(), pmax(), resolve(), propagate() (in filter()/mutate())
Combinebind_rows(), bind_cols(), across()
Schemavtr_schema(), link(), lookup(): star schema definition and dimension lookup with match reporting
I/Otbl(), tbl_csv(), tbl_sqlite(), tbl_tiff(), write_vtr(), write_csv(), write_sqlite(), write_tiff(), tiff_extract_points(), tiff_metadata(), append_vtr(), delete_vtr(), diff_vtr()
Inspectexplain(), glimpse(), print(), pull()

Full tidyselect support in select(), rename(), relocate(), and across(): starts_with(), ends_with(), contains(), matches(), where(), everything(), all_of(), any_of().

Installation

# CRAN
install.packages("vectra")

# Development version
pak::pak("gcol33/vectra")

Documentation

Support

"Software is like sex: it's better when it's free." — Linus Torvalds

If this package saved you some time, buying me a coffee is a nice way to say thanks.

Buy Me A Coffee

License

MIT (see the LICENSE.md file)

Citation

@software{vectra,
  author = {Colling, Gilles},
  title = {vectra: Columnar Query Engine for Larger-Than-RAM Data},
  year = {2026},
  url = {https://github.com/gcol33/vectra}
}
Metadata

Version

0.6.2

License

Unknown

Platforms (80)

    Darwin
    FreeBSD
    Genode
    GHCJS
    Linux
    MMIXware
    NetBSD
    none
    OpenBSD
    Redox
    Solaris
    uefi
    WASI
    Windows
Show all
  • aarch64-darwin
  • aarch64-freebsd
  • aarch64-genode
  • aarch64-linux
  • aarch64-netbsd
  • aarch64-none
  • aarch64-uefi
  • aarch64-windows
  • aarch64_be-none
  • arc-linux
  • arm-none
  • armv5tel-linux
  • armv6l-linux
  • armv6l-netbsd
  • armv6l-none
  • armv7a-linux
  • armv7a-netbsd
  • armv7l-linux
  • armv7l-netbsd
  • avr-none
  • i686-cygwin
  • i686-freebsd
  • i686-genode
  • i686-linux
  • i686-netbsd
  • i686-none
  • i686-openbsd
  • i686-windows
  • javascript-ghcjs
  • loongarch64-linux
  • m68k-linux
  • m68k-netbsd
  • m68k-none
  • microblaze-linux
  • microblaze-none
  • microblazeel-linux
  • microblazeel-none
  • mips-linux
  • mips-none
  • mips64-linux
  • mips64-none
  • mips64el-linux
  • mipsel-linux
  • mipsel-netbsd
  • mmix-mmixware
  • msp430-none
  • or1k-none
  • powerpc-linux
  • powerpc-netbsd
  • powerpc-none
  • powerpc64-linux
  • powerpc64le-linux
  • powerpcle-none
  • riscv32-linux
  • riscv32-netbsd
  • riscv32-none
  • riscv64-linux
  • riscv64-netbsd
  • riscv64-none
  • rx-none
  • s390-linux
  • s390-none
  • s390x-linux
  • s390x-none
  • sh4-linux
  • vc4-none
  • wasm32-wasi
  • wasm64-wasi
  • x86_64-cygwin
  • x86_64-darwin
  • x86_64-freebsd
  • x86_64-genode
  • x86_64-linux
  • x86_64-netbsd
  • x86_64-none
  • x86_64-openbsd
  • x86_64-redox
  • x86_64-solaris
  • x86_64-uefi
  • x86_64-windows