Description

Incremental Feature Engineering with Database Persistence.

Define feature logic, compute only new or unprocessed rows, and persist the resulting flat feature table in a database. The package provides an explicit incremental pipeline for fetching source rows, computing feature definitions, and writing computed features to a database table.

featdelta

featdelta is an R package for incremental feature engineering with database persistence. It is designed for workflows where raw observations live in a database, feature logic is easier to write and test in R, and the computed features should be stored back in a database table for modelling, reporting, monitoring, or downstream reuse.

Instead of rebuilding the same feature table from scratch whenever new rows arrive, featdelta lets you define feature expressions in R, fetch only rows that have not yet been processed, compute the features locally, and upsert the results into a persistent feature table.

Installation

install.packages("featdelta")

You can install the development version from GitHub with:

# install.packages("remotes")
remotes::install_github("LordRudolf/featdelta")

Core idea

The standard pipeline is:

  1. define reusable feature logic with fd_define();
  2. fetch raw database rows that are missing from the feature table;
  3. compute the requested features in R;
  4. create, extend, insert into, or update the database feature table.

The main orchestration function is fd_run(), which combines the fetch, compute, and upsert steps.
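
To make the four steps concrete, here is a minimal sketch of the same idea written by hand with DBI and RSQLite only. It is not featdelta's implementation: the table names raw_rows and features, the toy values, and the anti-join query are assumptions chosen for illustration.

```r
library(DBI)
library(RSQLite)

con <- dbConnect(SQLite(), ":memory:")

# Hypothetical raw table with five rows of toy data.
raw <- data.frame(
  id  = 1:5,
  hp  = c(110, 93, 175, 105, 245),
  cyl = c(6, 4, 8, 6, 8)
)
dbWriteTable(con, "raw_rows", raw)

# Pretend an earlier run already stored features for ids 1 to 3.
done <- data.frame(id = 1:3, hp_per_cyl = raw$hp[1:3] / raw$cyl[1:3])
dbWriteTable(con, "features", done)

# Fetch step: an anti-join returns only keys missing from the feature table.
new_rows <- dbGetQuery(con, "
  SELECT r.*
  FROM raw_rows AS r
  LEFT JOIN features AS f ON f.id = r.id
  WHERE f.id IS NULL
  ORDER BY r.id
")

# Compute step: derive the feature locally in R.
new_feats <- data.frame(
  id         = new_rows$id,
  hp_per_cyl = new_rows$hp / new_rows$cyl
)

# Write step: append the new feature rows to the persistent table.
dbAppendTable(con, "features", new_feats)

n_features <- dbGetQuery(con, "SELECT COUNT(*) AS n FROM features")$n
dbDisconnect(con)
```

Only the two unprocessed keys travel from the database to R, which is the saving that an incremental run aims for when the raw table is large.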

Small example

This example uses an in-memory SQLite database and mtcars, but the same pattern applies to database tables selected with ordinary SQL.

library(DBI)
library(RSQLite)
library(featdelta)

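# Add an explicit integer key; mtcars identifies rows only by row name.
cars <- mtcars
cars$id <- seq_len(nrow(cars))

# Start with the first 20 rows as the "day one" raw data.
con <- dbConnect(SQLite(), ":memory:")
dbWriteTable(con, "raw_cars", cars[1:20, ], overwrite = TRUE)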

defs <- fd_define(
  transmission = ifelse(am == 1, "manual", "automatic"),  # mtcars codes am as 0 = automatic, 1 = manual
  hp_per_cyl = hp / cyl,
  wt_per_hp = wt / hp
)

run_day_one <- fd_run(
  con = con,
  sql = "SELECT * FROM raw_cars ORDER BY id",
  defs = defs,
  key = "id",
  feat_table_name = "car_features"
)

dbGetQuery(con, "SELECT * FROM car_features ORDER BY id")

When more raw rows arrive, call the same pipeline again. With the default fetch_mode = "new_only", fd_run() processes only keys that are not already in the feature table.

dbAppendTable(con, "raw_cars", cars[21:30, ])

run_day_two <- fd_run(
  con = con,
  sql = "SELECT * FROM raw_cars ORDER BY id",
  defs = defs,
  key = "id",
  feat_table_name = "car_features"
)

dbGetQuery(con, "SELECT * FROM car_features ORDER BY id")

If feature definitions change and existing rows should be recomputed, use fetch_mode = "all" to refresh the rows returned by the SQL query.

defs_v2 <- fd_define(
  transmission = ifelse(am == 1, "manual", "automatic"),  # mtcars codes am as 0 = automatic, 1 = manual
  hp_per_cyl = hp / cyl,
  wt_per_hp = wt / hp,
  mpg_per_cyl = mpg / cyl
)

fd_run(
  con = con,
  sql = "SELECT * FROM raw_cars ORDER BY id",
  defs = defs_v2,
  key = "id",
  feat_table_name = "car_features",
  fetch_mode = "all"
)
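
Refreshing existing rows after a definition change amounts to extending the table and upserting by key. The following is a minimal sketch of that pattern in plain SQLite through DBI, not a description of how featdelta performs it. The table name features, the toy values, and the ON CONFLICT statement are assumptions, and the upsert syntax shown is specific to SQLite.

```r
library(DBI)
library(RSQLite)

con <- dbConnect(SQLite(), ":memory:")

# Existing feature table; the primary key lets SQLite detect conflicts on id.
dbExecute(con, "CREATE TABLE features (id INTEGER PRIMARY KEY, hp_per_cyl REAL)")
dbExecute(con, "INSERT INTO features (id, hp_per_cyl) VALUES (1, 10), (2, 20)")

# Extend: add a column for the newly defined feature.
dbExecute(con, "ALTER TABLE features ADD COLUMN mpg_per_cyl REAL")

# Toy recomputed values for ids 1 to 3 (ids 1 and 2 exist, id 3 is new).
recomputed <- data.frame(
  id          = 1:3,
  hp_per_cyl  = c(11, 21, 31),
  mpg_per_cyl = c(1.5, 2.5, 3.5)
)

# Upsert: insert new keys, overwrite feature values for existing keys.
dbExecute(
  con,
  "INSERT INTO features (id, hp_per_cyl, mpg_per_cyl) VALUES (?, ?, ?)
   ON CONFLICT(id) DO UPDATE SET
     hp_per_cyl  = excluded.hp_per_cyl,
     mpg_per_cyl = excluded.mpg_per_cyl",
  params = unname(as.list(recomputed))
)

refreshed <- dbGetQuery(con, "SELECT * FROM features ORDER BY id")
dbDisconnect(con)
```

Other database backends spell the upsert differently, so a hand-written version like this would need a separate statement for each backend.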

Multi-column feature blocks

For feature logic that naturally produces several columns at once, use fd_block().

defs <- fd_define(
  engine_ratios = fd_block({
    data.frame(
      hp_per_cyl = hp / cyl,
      disp_per_cyl = disp / cyl,
      wt_per_hp = wt / hp
    )
  })
)
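
One way to check a block expression before handing it to fd_block() is to evaluate the same data.frame() call against a local data frame with base R's with(). This sketch assumes the block sees the fetched columns as variables, as the example above suggests; it uses base R only and does not call featdelta.

```r
# Evaluate the block body against mtcars, treating its columns as variables.
engine_ratios <- with(mtcars, data.frame(
  hp_per_cyl   = hp / cyl,
  disp_per_cyl = disp / cyl,
  wt_per_hp    = wt / hp
))

# One output row per input row, one column per feature in the block.
str(engine_ratios)
```

A block that returns a different number of rows than it was given would be hard to align with the key column, so the row count is worth checking at this stage.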

Learn more

The package includes vignettes covering the getting-started workflow, feature definition patterns, database pipeline details, production patterns, and scheduled runs.

Metadata

Version

0.1.0

License

Unknown

Platforms (80)

    Darwin
    FreeBSD
    Genode
    GHCJS
    Linux
    MMIXware
    NetBSD
    none
    OpenBSD
    Redox
    Solaris
    uefi
    WASI
    Windows
  • aarch64-darwin
  • aarch64-freebsd
  • aarch64-genode
  • aarch64-linux
  • aarch64-netbsd
  • aarch64-none
  • aarch64-uefi
  • aarch64-windows
  • aarch64_be-none
  • arc-linux
  • arm-none
  • armv5tel-linux
  • armv6l-linux
  • armv6l-netbsd
  • armv6l-none
  • armv7a-linux
  • armv7a-netbsd
  • armv7l-linux
  • armv7l-netbsd
  • avr-none
  • i686-cygwin
  • i686-freebsd
  • i686-genode
  • i686-linux
  • i686-netbsd
  • i686-none
  • i686-openbsd
  • i686-windows
  • javascript-ghcjs
  • loongarch64-linux
  • m68k-linux
  • m68k-netbsd
  • m68k-none
  • microblaze-linux
  • microblaze-none
  • microblazeel-linux
  • microblazeel-none
  • mips-linux
  • mips-none
  • mips64-linux
  • mips64-none
  • mips64el-linux
  • mipsel-linux
  • mipsel-netbsd
  • mmix-mmixware
  • msp430-none
  • or1k-none
  • powerpc-linux
  • powerpc-netbsd
  • powerpc-none
  • powerpc64-linux
  • powerpc64le-linux
  • powerpcle-none
  • riscv32-linux
  • riscv32-netbsd
  • riscv32-none
  • riscv64-linux
  • riscv64-netbsd
  • riscv64-none
  • rx-none
  • s390-linux
  • s390-none
  • s390x-linux
  • s390x-none
  • sh4-linux
  • vc4-none
  • wasm32-wasi
  • wasm64-wasi
  • x86_64-cygwin
  • x86_64-darwin
  • x86_64-freebsd
  • x86_64-genode
  • x86_64-linux
  • x86_64-netbsd
  • x86_64-none
  • x86_64-openbsd
  • x86_64-redox
  • x86_64-solaris
  • x86_64-uefi
  • x86_64-windows