Description

A Comprehensive Hit or Miss Probabilistic Entity Resolution Model.

Provides Bayesian probabilistic methods for record linkage and entity resolution across multiple datasets using the Comprehensive Hit Or Miss Probabilistic Entity Resolution (CHOMPER) model. The package implements three main inference approaches: (1) Evolutionary Variational Inference for record Linkage (EVIL), (2) Coordinate Ascent Variational Inference (CAVI), and (3) Markov Chain Monte Carlo (MCMC) with a split-and-merge process. The model supports both discrete and continuous fields, and it applies a locally varying hit mechanism to attributes with multiple truths. It also provides tools for performance evaluation based on either approximated variational factors or posterior samples. EVIL supports multi-threaded parallel computation, which speeds up estimation of the linkage structure.

chomper


Overview

chomper is an R package that provides Comprehensive Hit Or Miss Probabilistic Entity Resolution (CHOMPER) models.

Key Features

  • Multiple Inference Approaches: Implements three inference methods:

    • MCMC: Markov Chain Monte Carlo with a split-and-merge process
    • EVIL: Evolutionary Variational Inference for record Linkage
    • CAVI: Single Coordinate Ascent Variational Inference
  • Locally Varying Hit Mechanism: Accounts for attributes with multiple truths

  • Flexible Data Support: Handles both discrete and continuous fields

  • Parallel Computing: Multi-threading support for faster EVIL estimation
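
To give a feel for the "hit or miss" idea behind the model name, here is a minimal base R sketch of a generic hit-or-miss distortion on one discrete field. It is an illustration under stated assumptions, not the package's implementation: the helper `simulate_hit_or_miss()`, its arguments, and the choice of a uniform replacement distribution are all hypothetical.

```r
# Hypothetical helper (not part of chomper): distort one discrete field.
# truth:    latent true category labels, integers in 1..n_levels
# n_levels: number of categories of the field
# beta:     assumed probability that a recorded value is a "miss"
simulate_hit_or_miss <- function(truth, n_levels, beta) {
  is_miss <- runif(length(truth)) < beta
  observed <- truth
  # A miss replaces the value with a uniform draw over all levels,
  # so it can still coincide with the truth by chance.
  observed[is_miss] <- sample.int(n_levels, sum(is_miss), replace = TRUE)
  list(observed = observed, is_miss = is_miss)
}

set.seed(1)
truth <- sample.int(5, 1000, replace = TRUE)
sim <- simulate_hit_or_miss(truth, n_levels = 5, beta = 0.1)

# Share of values that visibly differ from the truth
mean(sim$observed != truth)
```

Records flagged as hits keep the latent value unchanged; only misses are redrawn, which is why the visible error rate cannot exceed the share of misses.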

Installation

You can install chomper from GitHub with:

# install.packages("devtools")
devtools::install_github("hjkim8987/chomper", dependencies = TRUE, build_vignettes = TRUE)

Quick Start

library(chomper)

# Generate sample data for testing
sample_data <- generate_sample_data(
  n_entities = 100,
  n_files = 3,
  overlap_ratio = 0.7,
  discrete_columns = c(1, 2),
  discrete_levels = c(5, 5),
  continuous_columns = c(3, 4),
  continuous_params = matrix(c(0, 0, 1, 1), ncol = 2),
  distortion_ratio = c(0.1, 0.1, 0.1, 0.1)
)

# Get file information and drop "id" column
n <- numeric(3)
x <- list()
for (i in 1:3) {
  n[i] <- nrow(sample_data[[i]])
  x[[i]] <- sample_data[[i]][, colnames(sample_data[[i]]) != "id", drop = FALSE]
}
N <- sum(n)

# Set Hyperparameters
hyper_beta <- matrix(
  rep(c(N * 0.1 * 0.01, N * 0.1), 4),
  ncol = 2, byrow = TRUE
)

hyper_sigma <- matrix(
  rep(c(0.01, 0.01), 2),
  ncol = 2, byrow = TRUE
)

# Perform record linkage using EVIL
result <- chomperEVIL(
  x = x,
  k = 3,  # number of datasets
  n = n,  # rows per dataset
  N = N,  # total number of records across all datasets
  p = 4,  # fields per dataset
  M = c(5, 5),  # categories for discrete fields
  discrete_fields = c(1, 2),
  continuous_fields = c(3, 4),
  hyper_beta = hyper_beta,   # hyperparameter for distortion rate
  hyper_sigma = hyper_sigma, # hyperparameter for continuous fields
  n_threads = 4
)

# Performance evaluation
psm_ <- psm_vi(result$nu) # Calculate a posterior similarity matrix

# install.packages("salso")
library(salso)

salso_estimate <- salso(psm_,
  loss = binder(),
  maxZealousAttempts = 0, probSequentialAllocation = 1
) # Find a Bayes estimate that minimizes Binder's loss

linkage_structure <- list()
for (ll in seq_along(salso_estimate)) {
  linkage_structure[[ll]] <- which(salso_estimate == salso_estimate[ll])
}
linkage_estimation <- matrix(linkage_structure)

# install.packages("blink")
library(blink)

key_temp <- c()
for (i in 1:3) {
  key_temp <- c(key_temp, sample_data[[i]][, "id"])
}

truth_binded <- matrix(key_temp, nrow = 1)
linkage_structure_true <- links(truth_binded, TRUE, TRUE)
linkage_truth <- matrix(linkage_structure_true)

perf <- performance(linkage_estimation, linkage_truth, N)
print(perf)
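
The evaluation step above compares an estimated linkage structure with the true one. As a rough illustration of what such a comparison can measure, the base R sketch below computes pairwise precision and recall from two vectors of cluster labels. The helpers `co_clustered()` and `pairwise_scores()` are hypothetical and are not the implementation behind `performance()` or `blink::links()`.

```r
# Hypothetical helper: for every unordered record pair, TRUE if both
# records carry the same cluster label.
co_clustered <- function(labels) {
  same <- outer(labels, labels, "==")
  same[upper.tri(same)]
}

# Hypothetical helper: pairwise precision and recall of an estimated
# partition against the true partition.
pairwise_scores <- function(estimated, truth) {
  est <- co_clustered(estimated)
  tru <- co_clustered(truth)
  tp <- sum(est & tru)  # pairs linked in both partitions
  c(
    precision = if (sum(est) > 0) tp / sum(est) else NA_real_,
    recall    = if (sum(tru) > 0) tp / sum(tru) else NA_real_
  )
}

# Toy example with five records: the estimate wrongly separates record 3.
truth     <- c(1, 1, 1, 2, 2)
estimated <- c(1, 1, 3, 2, 2)

# Record indices grouped by estimated cluster
clusters <- split(seq_along(estimated), estimated)

scores <- pairwise_scores(estimated, truth)
print(scores)
```

In this toy case both estimated links are correct, so precision is 1, while only two of the four true links are recovered, so recall is 0.5.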

Main Functions

Core Inference Functions

  • chomperMCMC(): Markov Chain Monte Carlo
  • chomperEVIL(): Evolutionary Variational Inference for record Linkage
  • chomperCAVI(): Coordinate Ascent Variational Inference

Data Generation and Utilities

  • generate_sample_data(): Create synthetic data for testing and validation
  • flatten_posterior_samples(): Flatten posterior samples for obtaining a posterior similarity matrix

Evaluation and Performance

  • psm_mcmc(): Posterior similarity matrix for MCMC results
  • psm_vi(): Posterior similarity matrix for variational inference
  • performance(): Evaluate performance of estimation
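
For readers unfamiliar with posterior similarity matrices, the sketch below shows the general idea in base R: entry (i, j) is the fraction of posterior draws in which records i and j share a cluster. The helper `psm_from_samples()` and the assumed input layout (one row per draw, one column per record) are hypothetical; the input formats expected by `psm_mcmc()` and `psm_vi()` are not described here and may differ.

```r
# Hypothetical helper: posterior similarity matrix from sampled labels.
# samples: matrix with one row per posterior draw, one column per record.
psm_from_samples <- function(samples) {
  n_records <- ncol(samples)
  psm <- matrix(0, n_records, n_records)
  for (s in seq_len(nrow(samples))) {
    # Add 1 wherever two records share a label in draw s
    psm <- psm + outer(samples[s, ], samples[s, ], "==")
  }
  psm / nrow(samples)
}

# Four toy draws over four records
samples <- rbind(
  c(1, 1, 2, 3),
  c(1, 1, 2, 2),
  c(1, 2, 2, 3),
  c(1, 1, 3, 3)
)
psm <- psm_from_samples(samples)
print(psm)
```

Records 1 and 2 are co-clustered in three of the four draws, so `psm[1, 2]` is 0.75. A matrix of this shape is the kind of object passed to `salso()` in the Quick Start.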

License

This package is licensed under the GNU General Public License v3.0 - see the LICENSE file for details.

Authors

  • Hyungjoon Kim - Maintainer - GitHub
  • Andee Kaplan - Contributor - GitHub
  • Matthew Koslovsky - Contributor - GitHub

Metadata

Version

0.1.3

License

Unknown

Platforms (78)

    Darwin
    FreeBSD
    Genode
    GHCJS
    Linux
    MMIXware
    NetBSD
    none
    OpenBSD
    Redox
    Solaris
    uefi
    WASI
    Windows
  • aarch64-darwin
  • aarch64-freebsd
  • aarch64-genode
  • aarch64-linux
  • aarch64-netbsd
  • aarch64-none
  • aarch64-uefi
  • aarch64-windows
  • aarch64_be-none
  • arm-none
  • armv5tel-linux
  • armv6l-linux
  • armv6l-netbsd
  • armv6l-none
  • armv7a-linux
  • armv7a-netbsd
  • armv7l-linux
  • armv7l-netbsd
  • avr-none
  • i686-cygwin
  • i686-freebsd
  • i686-genode
  • i686-linux
  • i686-netbsd
  • i686-none
  • i686-openbsd
  • i686-windows
  • javascript-ghcjs
  • loongarch64-linux
  • m68k-linux
  • m68k-netbsd
  • m68k-none
  • microblaze-linux
  • microblaze-none
  • microblazeel-linux
  • microblazeel-none
  • mips-linux
  • mips-none
  • mips64-linux
  • mips64-none
  • mips64el-linux
  • mipsel-linux
  • mipsel-netbsd
  • mmix-mmixware
  • msp430-none
  • or1k-none
  • powerpc-linux
  • powerpc-netbsd
  • powerpc-none
  • powerpc64-linux
  • powerpc64le-linux
  • powerpcle-none
  • riscv32-linux
  • riscv32-netbsd
  • riscv32-none
  • riscv64-linux
  • riscv64-netbsd
  • riscv64-none
  • rx-none
  • s390-linux
  • s390-none
  • s390x-linux
  • s390x-none
  • vc4-none
  • wasm32-wasi
  • wasm64-wasi
  • x86_64-cygwin
  • x86_64-darwin
  • x86_64-freebsd
  • x86_64-genode
  • x86_64-linux
  • x86_64-netbsd
  • x86_64-none
  • x86_64-openbsd
  • x86_64-redox
  • x86_64-solaris
  • x86_64-uefi
  • x86_64-windows