Description

Pipeline Audit Trails and Data Diagnostics for 'tidyverse' Workflows.

Provides pipeline audit trails and data diagnostics for 'tidyverse' workflows. The audit trail system captures lightweight metadata snapshots at each step of a pipeline, building a structured record without storing the data itself. Operation-aware taps enrich snapshots with join match rates and filter drop statistics. Trails can be serialized to 'JSON' or 'RDS' and exported as self-contained 'HTML' visualizations. Also includes diagnostic functions for interactive data analysis including frequency tables, string quality auditing, and data comparison.

tidyaudit

Lifecycle: experimental

Pipeline audit trails and data diagnostics for tidyverse workflows

Audit trails track what happens at every step of a dplyr pipeline by recording metadata-only snapshots: row counts, column changes, NA totals, numeric shifts, and custom functions. Build a trail by dropping taps into your pipe: transparent pass-throughs that record a snapshot and let the data flow on unchanged. Operation-aware taps, such as left_join_tap() and filter_tap(), go further, capturing match rates and drop counts. The result is a structured trail you can print, diff, export as HTML, or serialize to JSON. You can learn more in vignette("tidyaudit").
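To make the pass-through idea concrete, here is a minimal base-R sketch of a tap. It is not tidyaudit's implementation: new_trail() and sketch_tap() are hypothetical names made up for this illustration, and the snapshot holds only a label, the row count, the column names, and the NA total.

```r
# Illustrative sketch only; new_trail() and sketch_tap() are hypothetical
# helpers, not part of tidyaudit.

# A trail is an environment, so taps can append to it by reference
# from inside a pipe.
new_trail <- function(name) {
  trail <- new.env()
  trail$name <- name
  trail$snapshots <- list()
  trail
}

# Record metadata about .data, then return .data untouched.
sketch_tap <- function(.data, trail, label) {
  snapshot <- list(
    label = label,
    rows  = nrow(.data),
    cols  = names(.data),
    nas   = sum(is.na(.data))
  )
  trail$snapshots[[length(trail$snapshots) + 1L]] <- snapshot
  .data
}

demo_trail <- new_trail("demo")
df <- data.frame(x = c(1, NA, 3), y = c("a", "b", NA))

out <- df |>
  sketch_tap(demo_trail, "raw") |>
  subset(!is.na(x)) |>
  sketch_tap(demo_trail, "x_present")

# One row per snapshot: metadata only, never the data itself.
summary_table <- do.call(rbind, lapply(demo_trail$snapshots, function(s) {
  data.frame(label = s$label, rows = s$rows, ncols = length(s$cols), nas = s$nas)
}))
print(summary_table)
```

Because the trail is an environment, each tap mutates it in place, which is why the pipe can keep passing the data frame along without also threading the trail through every step.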

tidyaudit also includes a diagnostic toolkit for interactive data exploration — join validation, key checks, table comparison, and more — described in vignette("diagnostics").

Quick start

library(tidyaudit)
library(dplyr)
set.seed(123)

orders  <- data.frame(id = 1:100, amount = runif(100, 10, 500), region_id = sample(1:5, 100, TRUE))
regions <- data.frame(region_id = 1:4, name = c("North", "South", "East", "West"))

trail <- audit_trail("order_pipeline")

result <- orders |>
  audit_tap(trail, "raw") |>
  left_join_tap(regions, by = "region_id", .trail = trail, .label = "with_region") |>
  filter_tap(amount > 100, .trail = trail, .label = "high_value", .stat = amount)
#> i filter_tap: amount > 100
#> Dropped 18 of 100 rows (18.0%)
#> Stat amount: dropped 1,062.191 of 25,429.39

print(trail)
#> -- Audit Trail: "order_pipeline" -----------------------------------------------
#> Created: 2026-02-21 14:36:35
#> Snapshots: 3
#>
#>   #  Label        Rows  Cols  NAs  Type
#>   -  -----------  ----  ----  ---  ------------------------------------
#>   1  raw           100     3    0  tap
#>   2  with_region   100     4   23  left_join (many-to-one, 77% matched)
#>   3  high_value     82     4   20  filter (dropped 18 rows, 18%)
#>
#> Changes:
#>   raw -> with_region: = rows, +1 cols, +23 NAs
#>   with_region -> high_value: -18 rows, = cols, -3 NAs

audit_diff(trail, "raw", "high_value")
#> -- Audit Diff: "raw" -> "high_value" --
#>
#>   Metric  Before  After  Delta
#>   ------  ------  -----  -----
#>   Rows       100     82    -18
#>   Cols         3      4     +1
#>   NAs          0     20    +20
#>
#> Columns added: name
#>
#> Numeric shifts (common columns):
#>     Column     Mean before  Mean after   Shift
#>     ---------  -----------  ----------  ------
#>     id               50.50       49.66   -0.84
#>     amount          254.29      297.16  +42.87
#>     region_id         3.08        3.05   -0.03

Three taps. Three snapshots. A complete record of what the pipeline did to your data — and what it cost.
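The join and filter lines in the trail above summarise quantities that are easy to compute by hand. The base-R sketch below works through them on a six-row toy table (made-up values, not the Quick start data). It shows the arithmetic only and is not a description of how left_join_tap() or filter_tap() compute their statistics internally.

```r
# Toy data for illustration; values are invented and unrelated to the
# Quick start output.
orders <- data.frame(
  id        = 1:6,
  amount    = c(50, 150, 200, 80, 300, 120),
  region_id = c(1, 2, 5, 3, 5, 4)
)
regions <- data.frame(
  region_id = 1:4,
  name      = c("North", "South", "East", "West")
)

# Join match rate: share of left-hand rows whose key exists on the right.
matched    <- orders$region_id %in% regions$region_id
match_rate <- mean(matched)

# A left join keeps every order; unmatched rows get NA in the new column.
joined  <- merge(orders, regions, by = "region_id", all.x = TRUE)
new_nas <- sum(is.na(joined)) - sum(is.na(orders))

# Filter drop statistics: rows dropped and how much of `amount` went with them.
keep           <- joined$amount > 100
dropped_rows   <- sum(!keep)
dropped_amount <- sum(joined$amount[!keep])
total_amount   <- sum(joined$amount)

cat(sprintf("matched: %.1f%% of rows, +%d NAs\n", 100 * match_rate, new_nas))
cat(sprintf("dropped %d of %d rows (%.1f%%)\n",
            dropped_rows, nrow(joined), 100 * dropped_rows / nrow(joined)))
cat(sprintf("stat amount: dropped %g of %g\n", dropped_amount, total_amount))
```

In this toy table the two orders with region_id 5 have no matching region, so the join adds two NAs in name, and the filter removes the two orders at or below 100.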

Export as HTML

Share a trail as a self-contained HTML file — one file you can email, attach to a report, or drop into a compliance folder:

audit_export(trail, "order_pipeline.html")

The output is an interactive flow diagram with clickable nodes and edges, light/dark theme toggle, and embedded JSON export. No server or internet required.

Features

Audit trail system

Build a structured timeline of your pipeline's behavior. Drop taps into any dplyr pipe and get a traceable, diffable, exportable record of every step.

  • audit_trail() / audit_tap() — create a trail and record snapshots inside pipes
  • left_join_tap(), filter_tap(), and friends — operation-aware taps that capture match rates, drop counts, stat impact, and relationship types
  • tab_tap() — track frequency distributions across pipeline steps
  • audit_diff() — before/after comparison of any two snapshots
  • audit_report() — full pipeline report in one call
  • audit_export() — self-contained HTML visualization
  • write_trail() / read_trail() — serialize to RDS or JSON for CI pipelines and dashboards
  • Snapshot controls (.numeric_summary, .cols_include, .cols_exclude) — fine-tune what each tap captures
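As a rough illustration of why metadata-only snapshots are easy to diff and serialize, the base-R sketch below compares two snapshots and round-trips them through an RDS file. snapshot_of() and sketch_diff() are hypothetical helpers written for this example; they do not mirror the internals or signatures of audit_diff(), write_trail(), or read_trail().

```r
# Hypothetical helpers for illustration; not tidyaudit functions.
snapshot_of <- function(df, label) {
  list(label = label, rows = nrow(df), cols = names(df), nas = sum(is.na(df)))
}

# Compare two snapshots using only their recorded metadata.
sketch_diff <- function(before, after) {
  list(
    rows_delta   = after$rows - before$rows,
    cols_added   = setdiff(after$cols, before$cols),
    cols_removed = setdiff(before$cols, after$cols),
    nas_delta    = after$nas - before$nas
  )
}

raw   <- data.frame(id = 1:4, amount = c(10, NA, 30, 40))
clean <- raw[!is.na(raw$amount), ]
clean$amount_log <- log(clean$amount)

snaps <- list(snapshot_of(raw, "raw"), snapshot_of(clean, "clean"))
d <- sketch_diff(snaps[[1]], snaps[[2]])
str(d)

# Snapshots are small plain lists, so saving and restoring them is cheap.
path <- tempfile(fileext = ".rds")
saveRDS(snaps, path)
restored <- readRDS(path)
stopifnot(identical(restored, snaps))
```

The diff never touches raw or clean themselves, only the recorded counts and column names, which is the property that keeps a stored trail lightweight.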

Diagnostic toolkit

Standalone functions for interactive data exploration — the questions you ask in the console before, during, and after building a pipeline.

  • validate_join() — analyze a join before performing it (match rates, duplicates, unmatched keys)
  • validate_primary_keys() / validate_var_relationship() — key and relationship validation
  • compare_tables() — column, row, numeric, and categorical comparison between two data frames
  • filter_keep() / filter_drop() — filter with diagnostic output and configurable warning thresholds
  • diagnose_nas() / summarize_column() / get_summary_table() — data quality diagnostics
  • diagnose_strings() / audit_transform() — string quality auditing and type-aware transformation diagnostics
  • tab() — frequency tables and crosstabulations with sorting, cutoffs, and weighting
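The kind of question a pre-join check answers can be sketched in a few lines of base R. The function below is a hypothetical stand-in, not validate_join() itself, and its argument names and return shape are assumptions made for this example. It reports the relationship type, the share of left-hand rows that would match, and the keys that would go unmatched on either side.

```r
# Hypothetical sketch of a pre-join check on a single key column;
# not the tidyaudit implementation.
sketch_validate_join <- function(x, y, by) {
  x_keys <- x[[by]]
  y_keys <- y[[by]]

  # A side is "one" when its key values are unique, otherwise "many".
  x_unique <- anyDuplicated(x_keys) == 0L
  y_unique <- anyDuplicated(y_keys) == 0L
  relationship <- paste0(
    if (x_unique) "one" else "many",
    "-to-",
    if (y_unique) "one" else "many"
  )

  list(
    relationship     = relationship,
    x_match_rate     = mean(x_keys %in% y_keys),
    x_unmatched_keys = setdiff(unique(x_keys), y_keys),
    y_unmatched_keys = setdiff(unique(y_keys), x_keys)
  )
}

orders  <- data.frame(id = 1:6, region_id = c(1, 2, 5, 3, 5, 4))
regions <- data.frame(region_id = 1:4,
                      name = c("North", "South", "East", "West"))

check <- sketch_validate_join(orders, regions, by = "region_id")
str(check)
```

Running a check like this before the join surfaces the unmatched key (5 in the toy data) while it is still cheap to fix, rather than after NAs have propagated downstream.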

Installation

# Install from CRAN
install.packages("tidyaudit")

# Install development version
pak::pak("fpcordeiro/tidyaudit")

Relationship to dtaudit

tidyaudit is the tidyverse-native counterpart to dtaudit, a data.table-based package on CRAN. Same design vocabulary, independent implementations — choose the one that matches your stack.

License

LGPL (>= 3)

Metadata

Version

0.2.1

Platforms (80)

    Darwin
    FreeBSD
    Genode
    GHCJS
    Linux
    MMIXware
    NetBSD
    none
    OpenBSD
    Redox
    Solaris
    uefi
    WASI
    Windows
  • aarch64-darwin
  • aarch64-freebsd
  • aarch64-genode
  • aarch64-linux
  • aarch64-netbsd
  • aarch64-none
  • aarch64-uefi
  • aarch64-windows
  • aarch64_be-none
  • arc-linux
  • arm-none
  • armv5tel-linux
  • armv6l-linux
  • armv6l-netbsd
  • armv6l-none
  • armv7a-linux
  • armv7a-netbsd
  • armv7l-linux
  • armv7l-netbsd
  • avr-none
  • i686-cygwin
  • i686-freebsd
  • i686-genode
  • i686-linux
  • i686-netbsd
  • i686-none
  • i686-openbsd
  • i686-windows
  • javascript-ghcjs
  • loongarch64-linux
  • m68k-linux
  • m68k-netbsd
  • m68k-none
  • microblaze-linux
  • microblaze-none
  • microblazeel-linux
  • microblazeel-none
  • mips-linux
  • mips-none
  • mips64-linux
  • mips64-none
  • mips64el-linux
  • mipsel-linux
  • mipsel-netbsd
  • mmix-mmixware
  • msp430-none
  • or1k-none
  • powerpc-linux
  • powerpc-netbsd
  • powerpc-none
  • powerpc64-linux
  • powerpc64le-linux
  • powerpcle-none
  • riscv32-linux
  • riscv32-netbsd
  • riscv32-none
  • riscv64-linux
  • riscv64-netbsd
  • riscv64-none
  • rx-none
  • s390-linux
  • s390-none
  • s390x-linux
  • s390x-none
  • sh4-linux
  • vc4-none
  • wasm32-wasi
  • wasm64-wasi
  • x86_64-cygwin
  • x86_64-darwin
  • x86_64-freebsd
  • x86_64-genode
  • x86_64-linux
  • x86_64-netbsd
  • x86_64-none
  • x86_64-openbsd
  • x86_64-redox
  • x86_64-solaris
  • x86_64-uefi
  • x86_64-windows