MyNixOS website logo
Description

Diagnostic Tools for Data Frame Joins.

Provides diagnostic tools for understanding and debugging data frame joins. Analyzes key columns before joining to detect duplicates, mismatches, encoding issues, and other common problems. Explains unexpected row count changes and provides safe join wrappers with cardinality enforcement. Concepts and diagnostics build on tidy data principles as described in 'Wickham' (2014) <doi:10.18637/jss.v059.i10>.

joinspy

CRAN status CRAN downloads Monthly downloads R-CMD-check Codecov test coverage License: MIT

Find out why your keys don't match.

You ran a left join and lost 40% of your rows. dplyr says "many-to-many relationship." joinspy says 12 keys have trailing spaces, 8 differ only by case, and 3 contain invisible Unicode characters. Then it fixes them.

Quick Start

library(joinspy)

join_spy(orders, customers, by = "customer_id")

repaired <- join_repair(orders, customers, by = "customer_id")

suggest_repairs(join_spy(orders, customers, by = "customer_id"))

The Problem

Most join failures come down to string-level problems in keys:

  • "Alice" vs "Alice " (trailing space, invisible)
  • "NYC" vs "nyc" (case)
  • Zero-width spaces, BOMs, non-breaking spaces that look like regular spaces but aren't
  • "Johansson" vs "Johannson" (one character off)
  • Empty strings matching each other but not NA

R won't warn you about any of these. join_spy() catches them before the join runs.

What joinspy Does

Diagnose

join_spy() examines keys before the join:

orders <- data.frame(
  id = c("A", "B ", "c", "D"),
  amount = c(100, 200, 300, 400),
  stringsAsFactors = FALSE
)

customers <- data.frame(
  id = c("A", "B", "C", "E"),
  name = c("Alice", "Bob", "Carol", "Eve"),
  stringsAsFactors = FALSE
)

join_spy(orders, customers, by = "id")
#> -- Join Diagnostic Report --
#> Match rate (left): 25%
#>
#> Issues Detected:
#>   ! "B " has trailing whitespace (would match "B")
#>   ! "c" vs "C" — case mismatch
#>   x "D" has no match in right table

Repair

join_repair() fixes the issues, or previews what it would change with dry_run = TRUE. suggest_repairs() prints the R code instead of running it.

join_repair(orders, customers, by = "id", dry_run = TRUE)

repaired <- join_repair(orders, customers, by = "id",
                        standardize_case = "upper")

suggest_repairs(join_spy(orders, customers, by = "id"))
#> x$id <- trimws(x$id)
#> x$id <- toupper(x$id)
#> y$id <- toupper(y$id)

Predict

join_spy() also estimates result size for each join type:

report <- join_spy(orders, customers, by = "id")
report$expected_rows
#> inner_join: 1
#> left_join:  4
#> right_join: 4
#> full_join:  7

Explain

join_explain() works after the join, on the result:

result <- merge(orders, customers, by = "id", all.x = TRUE)
join_explain(result, orders, customers, by = "id", type = "left")
#> Result has same row count as left table
#> ! 3 left key(s) have no match in right table

Also Includes

The package ships join wrappers (left_join_spy(), inner_join_spy(), etc.) that run diagnostics before joining and attach the report as an attribute. join_strict() enforces cardinality (1:1, 1:m, m:1, m:m) and errors on violation. check_cartesian() flags many-to-many keys that would multiply your row count. analyze_join_chain() handles multi-step A-B-C sequences.

Joins work with tibbles, data.tables, and plain data frames.

Installation

# Install from CRAN
install.packages("joinspy")

# Or install development version from GitHub
# install.packages("pak")
pak::pak("gcol33/joinspy")

Documentation

Related Work

PackageFocus
dplyr 1.1+Cardinality checks via relationship argument
powerjoin12-level configurable checks, key preprocessing
joynMatch-status reporting variable per row
tidylogLogs row count changes after joins

joinspy focuses on string-level key diagnostics: whitespace, case, encoding, typos, and type mismatches. It identifies which specific keys failed, why, and can fix them automatically.

Support

"Software is like sex: it's better when it's free." — Linus Torvalds

I'm a PhD student who builds R packages in my free time because I believe good tools should be free and open. I started these projects for my own work and figured others might find them useful too.

If this package saved you some time, buying me a coffee is a nice way to say thanks.

Buy Me A Coffee

License

MIT (see the LICENSE.md file)

Citation

@software{joinspy,
  author = {Colling, Gilles},
  title = {joinspy: Diagnostic Tools for Data Frame Joins},
  year = {2025},
  url = {https://github.com/gcol33/joinspy}
}
Metadata

Version

0.8.0

License

Unknown

Platforms (80)

    Darwin
    FreeBSD
    Genode
    GHCJS
    Linux
    MMIXware
    NetBSD
    none
    OpenBSD
    Redox
    Solaris
    uefi
    WASI
    Windows
Show all
  • aarch64-darwin
  • aarch64-freebsd
  • aarch64-genode
  • aarch64-linux
  • aarch64-netbsd
  • aarch64-none
  • aarch64-uefi
  • aarch64-windows
  • aarch64_be-none
  • arc-linux
  • arm-none
  • armv5tel-linux
  • armv6l-linux
  • armv6l-netbsd
  • armv6l-none
  • armv7a-linux
  • armv7a-netbsd
  • armv7l-linux
  • armv7l-netbsd
  • avr-none
  • i686-cygwin
  • i686-freebsd
  • i686-genode
  • i686-linux
  • i686-netbsd
  • i686-none
  • i686-openbsd
  • i686-windows
  • javascript-ghcjs
  • loongarch64-linux
  • m68k-linux
  • m68k-netbsd
  • m68k-none
  • microblaze-linux
  • microblaze-none
  • microblazeel-linux
  • microblazeel-none
  • mips-linux
  • mips-none
  • mips64-linux
  • mips64-none
  • mips64el-linux
  • mipsel-linux
  • mipsel-netbsd
  • mmix-mmixware
  • msp430-none
  • or1k-none
  • powerpc-linux
  • powerpc-netbsd
  • powerpc-none
  • powerpc64-linux
  • powerpc64le-linux
  • powerpcle-none
  • riscv32-linux
  • riscv32-netbsd
  • riscv32-none
  • riscv64-linux
  • riscv64-netbsd
  • riscv64-none
  • rx-none
  • s390-linux
  • s390-none
  • s390x-linux
  • s390x-none
  • sh4-linux
  • vc4-none
  • wasm32-wasi
  • wasm64-wasi
  • x86_64-cygwin
  • x86_64-darwin
  • x86_64-freebsd
  • x86_64-genode
  • x86_64-linux
  • x86_64-netbsd
  • x86_64-none
  • x86_64-openbsd
  • x86_64-redox
  • x86_64-solaris
  • x86_64-uefi
  • x86_64-windows