Description

Data Leakage Detection Tools for Machine Learning.

Provides utilities to detect common data leakage patterns, including train/test contamination, temporal leakage, and data duplication, to improve model reliability and reproducibility in machine learning workflows. Generates diagnostic reports and visual summaries to support data validation. Methods are based on best practices from Hastie, Tibshirani, and Friedman (2009, ISBN:978-0387848570).

leakR

Welcome to leakR, an R package designed to help researchers, data scientists, and machine learning practitioners rigorously detect and diagnose data leakage in their workflows.

Data leakage occurs when information that would not be available at prediction time reaches a model during training, for example when test records or future values end up in the training data. It is pervasive yet often overlooked, and it undermines the integrity and reproducibility of predictive models by inflating their apparent performance. leakR provides a modular, extensible toolkit for detecting the most common and impactful forms of leakage, starting with tabular data contamination, target leakage, and temporal misalignments, and is intended as the foundation for a broader leakage detection framework across other data domains.

Installation

From CRAN (Recommended)

install.packages("leakr")

From GitHub (Development Version)

For the latest features and bug fixes:

# Install devtools if you don't have it
install.packages("devtools")

# Install leakR from GitHub
devtools::install_github("cherylisabella/leakR")

Quick Start

library(leakr)

# Basic audit of your dataset
report <- leakr_audit(iris, target = "Species")

# View summary of issues found
leakr_summarise(report)

# Generate diagnostic visualizations
leakr_plot(report)

# Access detailed results
print(report)
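
The Quick Start above audits iris as-is, so there may be little for it to report. To make target leakage concrete, the following sketch plants a leaky column and scores each feature with a simple base R heuristic. It is only an illustration, not leakR's own detection method: the column name species_code, the helper eta_sq(), and the 0.99 cut-off are all assumptions chosen for the example. A high score signals a feature worth inspecting; it does not by itself prove leakage.

```r
# Plant a leaky column in iris: a numeric copy of the target itself.
leaky <- iris
leaky$species_code <- as.numeric(leaky$Species)  # hypothetical leaked feature

# Share of a feature's variance explained by the class label (eta squared).
eta_sq <- function(x, g) {
  group_means <- ave(x, g)  # per-class mean, repeated for each row
  sum((group_means - mean(x))^2) / sum((x - mean(x))^2)
}

features <- setdiff(names(leaky), "Species")
scores <- vapply(
  features,
  function(nm) eta_sq(leaky[[nm]], leaky$Species),
  numeric(1)
)

# Flag features that are almost fully determined by the target.
flagged <- names(scores)[scores > 0.99]  # 0.99 is an arbitrary cut-off
print(round(scores, 3))
print(flagged)

# The same data frame could then be passed to the audit call shown above:
# report <- leakr_audit(leaky, target = "Species")
```

Only the planted column should cross the cut-off here; the genuine iris measurements are strongly associated with species but not fully determined by it.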

Main Functions

Function                   Purpose
leakr_audit()              Main auditing function - detects leakage across your dataset
leakr_summarise()          Generate human-readable summaries of detected issues
leakr_plot()               Create diagnostic visualizations highlighting problems
leakr_from_caret()         Import and audit caret workflow objects
leakr_from_tidymodels()    Import and audit tidymodels workflow objects
leakr_from_mlr3()          Import and audit mlr3 workflow objects

Learn More

Get started with the comprehensive vignettes:

# Getting started guide
vignette("getting-started", package = "leakr")

# Advanced detection techniques
vignette("advanced-detection", package = "leakr")

# Framework integration examples
vignette("framework-integration", package = "leakr")

Why leakR?

  • Automates leakage detection, filling a key methodological gap
  • Designed for clarity, reproducibility, and transparent ML research
  • Modular architecture supports gradual expansion (time series, NLP, images)
  • Useful for both academic and industry workflows

What leakR Detects

  • Train/test contamination - Overlapping records between training and test sets
  • Target leakage - Features that contain information about the target variable that wouldn't be available at prediction time
  • Duplicate rows/records - Exact and near-duplicate observations that can inflate performance metrics
  • Temporal misalignments - Time-based leaks, such as training on observations dated after the period being predicted
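
The checks listed above can be illustrated on toy data with base R alone. This is a minimal sketch of the underlying ideas, not leakR's implementation: the data frames, the column names, and the row_key() helper are hypothetical, and only exact matches are handled (near-duplicates would need a distance-based comparison).

```r
# Toy data; all object and column names here are illustrative assumptions.
train <- data.frame(
  id   = 1:5,
  x    = c(0.1, 0.4, 0.4, 0.9, 1.3),
  date = as.Date("2024-01-01") + c(0, 10, 10, 40, 70)
)
test <- data.frame(
  id   = 6:8,
  x    = c(0.4, 2.0, 2.2),
  date = as.Date("2024-01-01") + c(10, 60, 90)
)

# Build one string key per row from the feature columns (ignoring the id).
row_key <- function(df, cols) do.call(paste, c(df[cols], sep = "|"))
cols <- c("x", "date")

# 1. Train/test contamination: test rows whose features also occur in train.
contaminated <- test[row_key(test, cols) %in% row_key(train, cols), ]

# 2. Exact duplicates within the training set (later copies only).
dup_rows <- train[duplicated(train[cols]), ]

# 3. Temporal misalignment: training rows dated after the earliest test row.
late_train <- train[train$date > min(test$date), ]

print(contaminated)
print(dup_rows)
print(late_train)
```

Comparing rows on feature columns rather than on an identifier matters here: the contaminated test row has a different id from its training twin, so an id-based join would miss it.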

Key Features

  • Visual summaries of suspicious patterns and leakage hotspots
  • Detailed leakage reports suitable for audits, peer review, or publications
  • Clean APIs for seamless integration into existing ML workflows
  • Example vignettes demonstrating real leakage phenomena with code illustrations
  • Framework integration with caret, tidymodels, and mlr3

Development Roadmap

  • Phase 1: Core tabular leakage detectors ✓
  • Phase 2: Time series leakage detection (in progress)
  • Phase 3: Domain-specific extensions (NLP, image pipelines)
  • Phase 4: Pipeline integration and multi-language support

Citation

If you use leakR in your research, please cite:

@Manual{leakr2025,
  title = {leakR: Data Leakage Detection Tools for Machine Learning},
  author = {Cheryl Isabella Lim},
  year = {2025},
  note = {R package version 0.1.0},
  url = {https://github.com/cherylisabella/leakR},
}

License

This project is licensed under the MIT License - see the LICENSE file for details.

leakR is currently under development. Feedback and contributions are welcome from the community!

Metadata

Version

0.1.0

License

Unknown

Platforms (76)

    Darwin
    FreeBSD
    Genode
    GHCJS
    Linux
    MMIXware
    NetBSD
    none
    OpenBSD
    Redox
    Solaris
    WASI
    Windows
  • aarch64-darwin
  • aarch64-freebsd
  • aarch64-genode
  • aarch64-linux
  • aarch64-netbsd
  • aarch64-none
  • aarch64-windows
  • aarch64_be-none
  • arm-none
  • armv5tel-linux
  • armv6l-linux
  • armv6l-netbsd
  • armv6l-none
  • armv7a-linux
  • armv7a-netbsd
  • armv7l-linux
  • armv7l-netbsd
  • avr-none
  • i686-cygwin
  • i686-freebsd
  • i686-genode
  • i686-linux
  • i686-netbsd
  • i686-none
  • i686-openbsd
  • i686-windows
  • javascript-ghcjs
  • loongarch64-linux
  • m68k-linux
  • m68k-netbsd
  • m68k-none
  • microblaze-linux
  • microblaze-none
  • microblazeel-linux
  • microblazeel-none
  • mips-linux
  • mips-none
  • mips64-linux
  • mips64-none
  • mips64el-linux
  • mipsel-linux
  • mipsel-netbsd
  • mmix-mmixware
  • msp430-none
  • or1k-none
  • powerpc-linux
  • powerpc-netbsd
  • powerpc-none
  • powerpc64-linux
  • powerpc64le-linux
  • powerpcle-none
  • riscv32-linux
  • riscv32-netbsd
  • riscv32-none
  • riscv64-linux
  • riscv64-netbsd
  • riscv64-none
  • rx-none
  • s390-linux
  • s390-none
  • s390x-linux
  • s390x-none
  • vc4-none
  • wasm32-wasi
  • wasm64-wasi
  • x86_64-cygwin
  • x86_64-darwin
  • x86_64-freebsd
  • x86_64-genode
  • x86_64-linux
  • x86_64-netbsd
  • x86_64-none
  • x86_64-openbsd
  • x86_64-redox
  • x86_64-solaris
  • x86_64-windows