Data Validation Based on YAML Rules.
datadiff
Data Validation with YAML Rules
Overview
datadiff is a comprehensive R package for data validation that lets you compare datasets using configurable validation rules defined in YAML files.
It supports exact matching, tolerance-based numeric comparisons, text normalization, and row count validation.
Installation
remotes::install_github("thinkr-open/datadiff")
Quick Start
library(datadiff)
# Create reference and candidate datasets
reference <- data.frame(
id = 1:3,
amount = c(100.00, 200.00, 300.00),
category = c("A", "B", "C")
)
candidate <- data.frame(
id = 1:3,
amount = c(100.01, 200.01, 300.001), # Small differences within tolerance
category = c("a", "b", "c") # Different case
)
# Generate a rules template
write_rules_template(reference, key = "id",
path = "validation_rules.yaml",
numeric_abs = 0.01,
character_case_insensitive = TRUE)
# Edit the rules file to configure validation (e.g., set tolerances, case sensitivity)
# Compare datasets
result <- compare_datasets_from_yaml(reference, candidate, path = "validation_rules.yaml")
print(result$reponse)
Key Features
- YAML Configuration: Define validation rules in human-readable YAML files
- Flexible Comparisons: Support for exact matching and tolerance-based comparisons
- Text Normalization: Case-insensitive comparison and whitespace trimming
- Type-Aware Validation: Different rules for different data types
- Comprehensive Reporting: Detailed validation reports powered by pointblank
- Row Count Validation: Ensure datasets have the expected number of rows
Dependencies
This package is built on top of the excellent pointblank package, which provides the robust data validation engine.
Pointblank handles all the heavy lifting for validation logic, error reporting, and test execution.
Configuration
write_rules_template() generates a starting YAML. You then edit by_name to override any rule for a specific column.
Full example with column-level overrides (by_name)
version: 1
defaults:
na_equal: yes
ignore_columns:
documentation
keys: id
label: reference vs candidate
row_validation:
check_count: yes
expected_count: ~ # null = use reference row count
tolerance: 0
by_type:
numeric:
abs: 1.0e-09 # near-exact by default for all numeric columns
rel: 0
integer:
abs: 0
character:
equal_mode: exact
case_insensitive: no
trim: no
date:
equal_mode: exact
datetime:
equal_mode: exact
logical:
equal_mode: exact
by_name:
id: [] # no override — inherits integer rule (abs: 0)
amount:
abs: 0.01 # override: accept differences up to 0.01 for this column
rel: 0
unit_price:
abs: 0.001
rel: 0.0001 # mixed mode: 0.001 + 0.01% of reference value
category:
case_insensitive: yes # override: ignore case for this column only
trim: yes
created_at:
equal_mode: exact # redundant here, but explicit for documentation
How by_name interacts with by_type: rules are merged, with by_name taking precedence. A column not listed in by_name uses its by_type defaults unchanged.
| Column | Effective rule | Source |
|---|---|---|
id | abs: 0 | by_type.integer |
amount | abs: 0.01, rel: 0 | by_name.amount overrides by_type.numeric |
unit_price | abs: 0.001, rel: 0.0001 | by_name.unit_price overrides by_type.numeric |
category | case_insensitive: yes, trim: yes | by_name.category overrides by_type.character |
created_at | equal_mode: exact | by_type.date (by_name is redundant here) |
Numeric tolerance: formula, edge effects, and best practices
Threshold formula
For numeric columns, validation relies on a single combined threshold:
threshold = abs + rel × |reference_value|
result = OK if |candidate - reference| ≤ threshold
The two parameters add up—they are not two independent guards.
| Parameter | Role | Default value |
|---|---|---|
abs | Absolute tolerance: fixed floor, independent of the magnitude of values | 1e-9 |
rel | Relative tolerance: fraction of the reference value added to the threshold | 0 |
Pure absolute mode (recommended by default)
With rel: 0, the threshold is constant regardless of the magnitude of the values being compared:
by_type:
numeric:
abs: 0.01 # fixed threshold: any difference > 0.01 is detected
rel: 0
| Reference | Candidate | Difference | Threshold | Result |
|---|---|---|---|---|
| 1.00 | 1.005 | 0.005 | 0.01 | OK |
| 1 000 000.00 | 1 000 000.005 | 0.005 | 0.01 | OK |
| 0.000001 | 0.000001 + 0.005 | 0.005 | 0.01 | OK |
| 1.00 | 1.02 | 0.02 | 0.01 | ERROR |
| 1 000 000.00 | 1 000 000.02 | 0.02 | 0.01 | ERROR |
The same difference of 0.02 is detected whether you compare thousandths or millions.
Pure relative mode
With abs: 0, the threshold is proportional to the reference value:
by_type:
numeric:
abs: 0 # no fixed floor
rel: 0.01 # tolerance of 1% of the reference value
| Reference | Candidate | Difference | Threshold (1%) | Result |
|---|---|---|---|---|
| 100.00 | 100.50 | 0.50 | 1.00 | OK |
| 1 000 000.00 | 1 005 000.00 | 5 000 | 10 000 | OK |
| 0.00 | 0.001 | 0.001 | 0 | ERROR (implicit division by zero) |
Warning: if the reference value is
0, the relative threshold is0—any difference, even tiny, will be detected as an error.
This is whyabsacts as a safety floor.
Mixed mode: when and how to use it
Combining both parameters is useful when values can be close to zero and very large, and you want a tolerance that adapts in both cases:
by_type:
numeric:
abs: 0.001 # floor: protects the ref ≈ 0 case
rel: 0.01 # +1% for large values
For ref = 1 000 000: threshold = 0.001 + 0.01 × 1 000 000 = 10 000.001
Pitfall: with
rel > 0and large reference values, the threshold can become much wider than you intuitively expect.
For example,rel: 1e-9on a value of12 000 000yields a threshold of≈ 0.012, so a difference of0.000565would pass undetected, even withabs: 1e-9.Rule of thumb: use
rel: 0(the default) unless you explicitly need a tolerance proportional to the magnitude of the data.
Character comparison options
For character columns, you can configure text normalization and comparison behavior:
- equal_mode: Comparison mode (
"exact"or"normalized") - case_insensitive: Whether to ignore case differences (
true/false) - trim: Whether to trim whitespace before comparison (
true/false)
Example:
by_type:
character:
equal_mode: normalized # Apply normalization before comparison
case_insensitive: true # Ignore case differences
trim: true # Remove leading/trailing whitespace
For reference value "Hello World":
- Candidate
"hello world"passes withcase_insensitive: true - Candidate
" Hello World "passes withtrim: true - Candidate
"HELLO WORLD"passes with both options enabled - Candidate
"Hello Universe"fails regardless of normalization
Row count validation
You can validate that datasets have an expected number of rows using the row_validation section:
- check_count: Whether to enable row count validation (
true/false) - expected_count: Expected number of rows (if
null, uses the reference dataset row count) - tolerance: Allowed deviation from the expected count
Example:
row_validation:
check_count: true
expected_count: 1000
tolerance: 50
This validates that the candidate dataset has between 950 and 1050 rows (1000 ± 50).
If expected_count is not specified, the reference dataset’s row count is used as the expected value.
Comparing Parquet files (large datasets)
datadiff can compare Parquet files that are too large to fit in RAM. The recommended approach uses arrow::open_dataset() — do not call arrow::to_duckdb() yourself before passing to datadiff; the package handles the Arrow → DuckDB conversion internally with a single connection.
Recommended strategy: Arrow Dataset (lazy, out-of-core)
library(datadiff)
library(arrow)
ds_ref <- arrow::open_dataset("path/to/reference/")
ds_cand <- arrow::open_dataset("path/to/candidate/")
# Generate a rules template from the schema (no data loaded)
write_rules_template(ds_ref, key = "ID", path = "rules.yaml")
# Compare — stays lazy until the final boolean slim table
result <- compare_datasets_from_yaml(
data_reference = ds_ref,
data_candidate = ds_cand,
key = "ID",
path = "rules.yaml"
)
Internally, compare_datasets_from_yaml():
- Opens a private DuckDB connection (
fresh_con). - Materialises each Parquet dataset as a DuckDB physical temp table via
read_parquet()— all memory is managed by DuckDB’s buffer pool, so disk spilling works correctly. - Runs the full join + 125 boolean expressions as a single lazy SQL query.
dplyr::compute()materialises only the slim boolean result table (~125 logical columns × N rows) — the wide source data is never loaded into R.dplyr::collect()brings the slim boolean table (~few GB) into R and passes a plaindata.frameto pointblank — no live DuckDB connection needed during interrogation.- Disconnects and destroys
fresh_conon exit.
Memory tuning: duckdb_memory_limit
DuckDB’s default memory cap (80 % of total RAM) can leave insufficient headroom when R, Arrow, and the OS are already using significant memory. The duckdb_memory_limit parameter controls how much RAM DuckDB may use before spilling intermediate results to tempdir():
result <- compare_datasets_from_yaml(
data_reference = ds_ref,
data_candidate = ds_cand,
key = "ID",
path = "rules.yaml",
duckdb_memory_limit = "8GB" # default — safe for machines with >= 16 GB
)
| Machine RAM | Recommended duckdb_memory_limit | Notes |
|---|---|---|
| 8 GB | "3GB" | Leaves room for R + OS |
| 16 GB | "6GB" | Balanced |
| 32 GB | "8GB" (default) | Spills when needed |
| 64 GB+ | "20GB" | Reduces spilling, faster |
Increasing the limit reduces disk I/O and speeds up the comparison; decreasing it protects against OOM on memory-constrained machines. The limit only applies when Arrow datasets are used — it has no effect for data.frame or tbl_lazy inputs.
Strategy comparison
| Strategy | Input type | RAM usage | Requires |
|---|---|---|---|
| Arrow Dataset ✅ recommended | arrow::open_dataset() | Slim boolean table only (~few GB) | arrow, duckdb |
| Arrow Table | arrow::read_parquet(as_data_frame=FALSE) | Same as Arrow Dataset | arrow, duckdb |
| Lazy table (dbplyr) | tbl(con, "table_name") | Slim boolean table only | DBI, dbplyr |
data.frame | read.csv(), readr::read_csv(), etc. | Full data in RAM | — |
What NOT to do
# ❌ DO NOT convert to DuckDB yourself before passing to datadiff
ds_ref_duckdb <- arrow::to_duckdb(ds_ref) # creates Arrow's internal connection
ds_cand_duckdb <- arrow::to_duckdb(ds_cand) # different connection!
result <- compare_datasets_from_yaml(ds_ref_duckdb, ds_cand_duckdb, ...)
# → cross-connection join fails in pointblank
# ❌ DO NOT collect before passing
ref_df <- dplyr::collect(ds_ref) # loads full 4 GB into R RAM
Disk spilling location
DuckDB spills to tempdir() when the memory limit is reached. On Windows this is typically C:\Users\<user>\AppData\Local\Temp. Ensure that directory has sufficient free disk space (up to ~2–3× the size of your Parquet files in the worst case).
Main Functions
write_rules_template(): Generate a YAML rules templateread_rules(): Load and validate rules from YAMLcompare_datasets_from_yaml(): Main validation functiondetect_column_types(): Infer column data typespreprocess_dataframe(): Apply text normalizationanalyze_columns(): Compare column structures
Testing
Run the test suite:
devtools::test()
Sponsor
The development of this package has been sponsored by:
License
MIT License
Dev part
This README has been compiled on the
Sys.time()
#> [1] "2026-03-11 18:07:04 CET"
Here are the test & coverage results:
devtools::check(quiet = TRUE)
#> ℹ Loading datadiff
#> ── R CMD check results ───────────────────────────────────── datadiff 0.4.2 ────
#> Duration: 4m 47.4s
#>
#> ❯ checking for future file timestamps ... NOTE
#> unable to verify current time
#>
#> 0 errors ✔ | 0 warnings ✔ | 1 note ✖
Sys.setenv("NOT_CRAN" = TRUE)
covr::package_coverage()
#> datadiff Coverage: 98.98%
#> R/preprocessing.R: 97.22%
#> R/compare_datasets_from_yaml.R: 98.58%
#> R/data_types.R: 100.00%
#> R/pointblank_setup.R: 100.00%
#> R/tolerance.R: 100.00%
#> R/utils.R: 100.00%
#> R/validation.R: 100.00%
