Description

Fast Fuzzy String Joins for Data Frames.

Perform fuzzy joins on data frames using approximate string matching. Implements inner, left, right, full, semi, and anti joins with string distance metrics from the 'stringdist' package, including Optimal String Alignment, Levenshtein, Damerau-Levenshtein, Jaro-Winkler, q-gram, cosine, Jaccard, and Soundex. Uses a 'data.table' backend plus compiled 'C++' result assembly to reduce overhead in large joins, while adaptive candidate planning avoids unnecessary distance evaluations in single-column string joins. Suitable for reconciling misspellings, inconsistent labels, and other near-match identifiers while optionally returning the computed distance for each match.

fuzzystring

Lifecycle: stable

fuzzystring provides fast, flexible fuzzy string joins for data.frame and data.table objects using approximate string matching. It combines stringdist-based matching with a data.table backend and compiled C++ result assembly to reduce overhead in large joins while preserving standard join semantics.

Why fuzzystring?

Real-world identifiers rarely line up exactly. fuzzystring is designed for workloads such as:

  • matching customer or company names with typos
  • reconciling product catalogs with inconsistent labels
  • linking survey responses to a controlled vocabulary
  • joining reference tables to messy user input

The package includes:

  • fuzzy inner, left, right, full, semi, and anti joins
  • multiple stringdist methods, including OSA, Levenshtein, Damerau-Levenshtein, Jaro-Winkler, q-gram, cosine, Jaccard, and Soundex
  • output that preserves the class of x (data.table, tibble, or base data.frame)
  • optional distance columns for matched pairs
  • case-insensitive matching
  • adaptive candidate planning for single-column joins
  • compiled C++ row expansion and result assembly across join modes

Installation

# Install from CRAN
install.packages("fuzzystring")

# Development version from GitHub
# pak::pak("PaulESantos/fuzzystring")
# remotes::install_github("PaulESantos/fuzzystring")

Quick start

library(fuzzystring)

x <- data.frame(
  name = c("Idea", "Premiom", "Very Good"),
  id = 1:3
)

y <- data.frame(
  approx_name = c("Ideal", "Premium", "VeryGood"),
  grp = c("A", "B", "C")
)

fuzzystring_inner_join(
  x, y,
  by = c(name = "approx_name"),
  max_dist = 2,
  distance_col = "distance"
)
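
To see why max_dist = 2 is enough for the three pairs above, the distances can be checked directly with stringdist, the package the metrics come from. This is a minimal sketch that assumes the join uses the OSA method when none is given (the text above does not state the default); the variable names are illustrative only.

```r
library(stringdist)

x_names <- c("Idea", "Premiom", "Very Good")
y_names <- c("Ideal", "Premium", "VeryGood")

# Element-wise OSA distance between each name and its intended counterpart.
d <- stringdist(x_names, y_names, method = "osa")

# Each pair differs by a single edit (one insertion, one substitution,
# one deletion), so all three fall within max_dist = 2.
print(data.frame(name = x_names, approx_name = y_names, distance = d))
```

A tighter threshold such as max_dist = 1 would keep the same three pairs while admitting fewer accidental near-matches.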

Join families

fuzzystring_inner_join(x, y, by = c(name = "approx_name"), max_dist = 2)
fuzzystring_left_join(x, y, by = c(name = "approx_name"), max_dist = 2)
fuzzystring_right_join(x, y, by = c(name = "approx_name"), max_dist = 2)
fuzzystring_full_join(x, y, by = c(name = "approx_name"), max_dist = 2)
fuzzystring_semi_join(x, y, by = c(name = "approx_name"), max_dist = 2)
fuzzystring_anti_join(x, y, by = c(name = "approx_name"), max_dist = 2)
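
The six joins differ only in which rows they keep once the matching pairs are known. The sketch below reproduces that logic for inner, semi, and anti joins with base R and stringdist::stringdistmatrix(). It illustrates standard join semantics and is not the package's implementation; the data (including the unmatched "Fair" row) is made up for the example.

```r
library(stringdist)

x <- data.frame(name = c("Idea", "Premiom", "Fair"), id = 1:3)
y <- data.frame(approx_name = c("Ideal", "Premium", "VeryGood"),
                grp = c("A", "B", "C"))

# Full distance matrix: rows index x, columns index y.
d <- stringdistmatrix(x$name, y$approx_name, method = "osa")

# Row/column index pairs whose distance is within the threshold.
hits <- which(d <= 2, arr.ind = TRUE)

# Inner join: one output row per matching (x, y) pair.
inner <- cbind(x[hits[, "row"], , drop = FALSE],
               y[hits[, "col"], , drop = FALSE])

# Semi join: rows of x with at least one match, columns of x only.
semi <- x[sort(unique(hits[, "row"])), , drop = FALSE]

# Anti join: rows of x with no match at all.
anti <- x[setdiff(seq_len(nrow(x)), hits[, "row"]), , drop = FALSE]
```

Left, right, and full joins follow the same pattern: they start from the inner result and add the unmatched rows of x, y, or both, filling the missing side with NA.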

Distance methods

fuzzystring_inner_join(x, y, by = c(name = "approx_name"), method = "osa")
fuzzystring_inner_join(x, y, by = c(name = "approx_name"), method = "dl")
fuzzystring_inner_join(x, y, by = c(name = "approx_name"), method = "jw")
fuzzystring_inner_join(x, y, by = c(name = "approx_name"), method = "soundex")
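
The methods do not share a scale, so a threshold that is sensible for one can be meaningless for another. The sketch below calls stringdist::stringdist() directly to show the ranges involved. It assumes that max_dist is compared against the raw distance returned by the chosen method, which the text above does not spell out.

```r
library(stringdist)

a <- "Premiom"
b <- "Premium"

# Edit-based metrics count operations, so thresholds are whole numbers.
osa_d <- stringdist(a, b, method = "osa")
dl_d  <- stringdist(a, b, method = "dl")

# Jaro-Winkler is a normalized distance between 0 (identical) and 1.
jw <- stringdist(a, b, method = "jw")

# Soundex distance is 0 when the phonetic codes agree and 1 otherwise.
sx_same <- stringdist("Smith", "Smyth", method = "soundex")
sx_diff <- stringdist("Smith", "Jones", method = "soundex")
```

Under that assumption, a threshold of 2 would accept every pair for "jw" and "soundex", so a value below 1 is the usual choice for Jaro-Winkler and 0 for an exact Soundex match.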

Case-insensitive matching

fuzzystring_inner_join(
  x, y,
  by = c(name = "approx_name"),
  ignore_case = TRUE,
  max_dist = 1
)
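
String distances are case-sensitive by default, which is why ignore_case matters for data with mixed capitalization. The sketch below shows the effect with stringdist directly. It assumes ignore_case = TRUE is equivalent to lowercasing both columns before comparison; the package may implement it differently.

```r
library(stringdist)

a <- "PREMIUM"
b <- "premium"

# Compared as-is, every character differs, so the OSA distance is 7.
raw <- stringdist(a, b, method = "osa")

# Lowercasing both sides first makes the strings identical.
folded <- stringdist(tolower(a), tolower(b), method = "osa")
```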

Included example data

The package ships with misspellings, a dataset of common misspellings adapted from Wikipedia, for use in examples and testing.

data(misspellings)
head(misspellings)

Performance

fuzzystring keeps more of the join execution on a compiled path than the original fuzzyjoin implementation. In practice, the package combines:

  • data.table grouping and candidate planning
  • adaptive blocking for single-column string joins
  • compiled row expansion, row binding, and final assembly
  • type-preserving handling of dates, datetimes, factors, and list-columns
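
The details of the package's candidate planning are not described here, but the general idea behind blocking can be sketched with a length filter. For edit-based metrics such as OSA, two strings whose lengths differ by more than max_dist cannot be within max_dist edits of each other, so those pairs can be dropped before any distance is computed. The helper length_filtered_pairs() below is hypothetical and not part of fuzzystring.

```r
library(stringdist)

# Hypothetical helper illustrating length-based blocking; not fuzzystring code.
length_filtered_pairs <- function(x, y, max_dist = 2, method = "osa") {
  # Enumerate index pairs, then keep only those with compatible lengths.
  cand <- expand.grid(i = seq_along(x), j = seq_along(y))
  keep <- abs(nchar(x[cand$i]) - nchar(y[cand$j])) <= max_dist
  cand <- cand[keep, , drop = FALSE]

  # Distances are evaluated only for the surviving candidates.
  cand$distance <- stringdist(x[cand$i], y[cand$j], method = method)
  cand[cand$distance <= max_dist, , drop = FALSE]
}

x <- c("Idea", "Premiom", "Very Good")
y <- c("Ideal", "Premium", "VeryGood")
res <- length_filtered_pairs(x, y)
```

On this toy input the filter removes three of the nine possible pairs before the distance call. A production implementation would also avoid materializing the full grid, which this sketch does not attempt.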

The benchmark article summarizes a precomputed comparison against fuzzyjoin::stringdist_join() using the same methods and sample sizes.

Multiple-column joins

fuzzystring_join() can match across more than one string column by applying the same distance method and threshold to each mapped column.

x_multi <- data.frame(
  first = c("Jon", "Maira"),
  last = c("Smyth", "Gonzales")
)

y_multi <- data.frame(
  first_ref = c("John", "Maria"),
  last_ref = c("Smith", "Gonzalez"),
  id = 1:2
)

fuzzystring_inner_join(
  x_multi, y_multi,
  by = c(first = "first_ref", last = "last_ref"),
  method = "osa",
  max_dist = 1
)
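
The column-wise distances behind this example can be checked with stringdist. The sketch assumes a pair is kept only when every mapped column is within max_dist, which is how the description above reads. It also shows why the choice of "osa" matters here: OSA counts an adjacent transposition as one edit, while plain Levenshtein ("lv") counts it as two.

```r
library(stringdist)

# Column-wise OSA distances for the intended pairs.
first_osa <- stringdist(c("Jon", "Maira"), c("John", "Maria"), method = "osa")
last_osa  <- stringdist(c("Smyth", "Gonzales"), c("Smith", "Gonzalez"),
                        method = "osa")

# A pair is kept only when every mapped column is within the threshold.
keep_osa <- first_osa <= 1 & last_osa <= 1

# Under Levenshtein, "Maira" -> "Maria" costs two substitutions
# instead of one transposition, so the second pair exceeds max_dist = 1.
first_lv <- stringdist(c("Jon", "Maira"), c("John", "Maria"), method = "lv")
keep_lv  <- first_lv <= 1 & last_osa <= 1
```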

Credits

fuzzystring builds on ideas popularized by fuzzyjoin, while reworking the join pipeline around data.table and compiled C++ result assembly.

Metadata

Version

0.0.5

License

Unknown

Platforms (80)

    Darwin
    FreeBSD
    Genode
    GHCJS
    Linux
    MMIXware
    NetBSD
    none
    OpenBSD
    Redox
    Solaris
    uefi
    WASI
    Windows
  • aarch64-darwin
  • aarch64-freebsd
  • aarch64-genode
  • aarch64-linux
  • aarch64-netbsd
  • aarch64-none
  • aarch64-uefi
  • aarch64-windows
  • aarch64_be-none
  • arc-linux
  • arm-none
  • armv5tel-linux
  • armv6l-linux
  • armv6l-netbsd
  • armv6l-none
  • armv7a-linux
  • armv7a-netbsd
  • armv7l-linux
  • armv7l-netbsd
  • avr-none
  • i686-cygwin
  • i686-freebsd
  • i686-genode
  • i686-linux
  • i686-netbsd
  • i686-none
  • i686-openbsd
  • i686-windows
  • javascript-ghcjs
  • loongarch64-linux
  • m68k-linux
  • m68k-netbsd
  • m68k-none
  • microblaze-linux
  • microblaze-none
  • microblazeel-linux
  • microblazeel-none
  • mips-linux
  • mips-none
  • mips64-linux
  • mips64-none
  • mips64el-linux
  • mipsel-linux
  • mipsel-netbsd
  • mmix-mmixware
  • msp430-none
  • or1k-none
  • powerpc-linux
  • powerpc-netbsd
  • powerpc-none
  • powerpc64-linux
  • powerpc64le-linux
  • powerpcle-none
  • riscv32-linux
  • riscv32-netbsd
  • riscv32-none
  • riscv64-linux
  • riscv64-netbsd
  • riscv64-none
  • rx-none
  • s390-linux
  • s390-none
  • s390x-linux
  • s390x-none
  • sh4-linux
  • vc4-none
  • wasm32-wasi
  • wasm64-wasi
  • x86_64-cygwin
  • x86_64-darwin
  • x86_64-freebsd
  • x86_64-genode
  • x86_64-linux
  • x86_64-netbsd
  • x86_64-none
  • x86_64-openbsd
  • x86_64-redox
  • x86_64-solaris
  • x86_64-uefi
  • x86_64-windows