Description

Differential Item Functioning for AI-Scored Assessments.

Detects and quantifies differential item functioning (DIF) in AI-scored educational and psychological assessments. Provides a fully self-contained robust DIF engine (M-estimation via iteratively re-weighted least squares with the bi-square loss) alongside the novel Differential AI Scoring Bias (DASB) test, which detects item-level scoring shifts that differ across subgroups when comparing human and AI scoring conditions. Includes simulation utilities, anchor weight diagnostics, and an AI-effect classification framework.

aiDIF: Differential Item Functioning for AI-Scored Assessments


Overview

aiDIF addresses a modern measurement fairness challenge: does AI scoring introduce subgroup-dependent item bias?

AI systems increasingly score essays, short answers, and structured responses in educational and psychological assessments. The concern is that the AI scoring engine may shift item difficulties differently for different demographic groups, even when human scoring shows no DIF.

aiDIF provides:

  • Robust DIF analysis under both human and AI scoring conditions, using a fully self-contained M-estimation engine (IRLS with bi-square loss)
  • Differential AI Scoring Bias (DASB) test — a novel Wald test for item-level scoring shifts that differ across groups
  • AI-effect summaries classifying items as: stable_clean, stable_dif, introduced, masked, or new_direction across scoring conditions
  • Anchor weight diagnostics under potential AI contamination
  • Simulation utilities for benchmarking DIF methods in AI-scored settings
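To make the phrase "IRLS with bi-square loss" concrete, here is a minimal base-R sketch of a Tukey bi-square location estimate. It is a generic illustration, not the package's engine: the function names, the tuning constant k = 4.685, the fixed MAD scale, and the example values are all assumptions made for this sketch.

```r
# Tukey bi-square weight: (1 - (u/k)^2)^2 for |u| < k, otherwise 0.
bisquare_weight <- function(u, k = 4.685) {
  ifelse(abs(u) < k, (1 - (u / k)^2)^2, 0)
}

# Iteratively re-weighted least squares for a robust location estimate.
# Assumption of this sketch: the scale is held fixed at the normalised MAD.
irls_location <- function(x, k = 4.685, tol = 1e-8, max_iter = 100) {
  scale <- mad(x)                       # stats::mad, includes the 1.4826 factor
  if (scale == 0) {
    return(list(estimate = median(x), weights = rep(1, length(x))))
  }
  mu <- median(x)                       # robust starting value
  for (iter in seq_len(max_iter)) {
    w <- bisquare_weight((x - mu) / scale, k)
    mu_new <- sum(w * x) / sum(w)       # weighted least-squares update
    converged <- abs(mu_new - mu) < tol
    mu <- mu_new
    if (converged) break
  }
  list(estimate = mu, weights = bisquare_weight((x - mu) / scale, k))
}

# Hypothetical item-level group differences: four near zero, one outlier.
diffs <- c(0.02, -0.05, 0.04, -0.01, 1.20)
fit <- irls_location(diffs)
fit$estimate   # stays near 0, whereas mean(diffs) is pulled upward
fit$weights    # the outlying fifth item is down-weighted to 0
```

In this toy setting the final weights play the role of anchor weights: items that behave like the majority keep weights near 1, while an item with a large group difference contributes nothing to the estimate.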

Installation

# Install from GitHub
devtools::install_github("causalfragility-lab/aiDIF")

# Or install from local source
devtools::install_local("path/to/aiDIF")

Quick Start

library(aiDIF)

# Generate synthetic data with known DIF and DASB
dat <- simulate_aidif_data(n_items = 6, seed = 1)

# Fit the model
mod <- fit_aidif(
  human_mle = dat$human,
  ai_mle    = dat$ai
)

# Compact summary
print(mod)

# Full report
summary(mod)

# Visualisations
plot(mod, type = "dif_forest")  # Forest plot: human vs AI DIF estimates
plot(mod, type = "dasb")        # Bar chart of DASB with error bars
plot(mod, type = "weights")     # Anchor weights in each scoring condition
plot(mod, type = "rho")         # Bi-square objective for human scoring

Core Concepts

Differential AI Scoring Bias (DASB)

For item i and group g, define the scoring shift:

delta_ig = d_ig^AI - d_ig^Human

where d_ig is the IRT intercept (difficulty) parameter. The DASB is:

DASB_i = delta_i2 - delta_i1

The null hypothesis H₀: DASB_i = 0 is tested with a Wald test, which compares the estimated DASB_i with its standard error. The asymptotic variance follows from the delta method, assuming that groups and scoring conditions are independent:

Var(DASB_i) = Var(d_i1^Human) + Var(d_i2^Human) + Var(d_i1^AI) + Var(d_i2^AI)

A significant result means the AI scoring engine does not merely re-scale all items uniformly: it disadvantages (or advantages) one group on specific items.
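The calculation above can be traced by hand for a single item. The sketch below uses base R only; the intercepts and squared standard errors are made-up numbers for illustration, not output from aiDIF, and the two-sided normal reference is the usual choice for a scalar Wald test.

```r
# Hypothetical intercept estimates for one item in groups 1 and 2.
d_human <- c(g1 = 0.10, g2 = 0.15)
d_ai    <- c(g1 = 0.20, g2 = 0.65)

# Hypothetical sampling variances (squared standard errors).
v_human <- c(g1 = 0.010, g2 = 0.012)
v_ai    <- c(g1 = 0.011, g2 = 0.013)

delta <- d_ai - d_human                      # scoring shift per group
dasb  <- unname(delta["g2"] - delta["g1"])   # 0.50 - 0.10 = 0.40
se    <- sqrt(sum(v_human) + sum(v_ai))      # sqrt(0.046), independence assumed
z     <- dasb / se                           # Wald z statistic
p     <- 2 * pnorm(-abs(z))                  # two-sided p-value

c(dasb = dasb, se = se, z = z, p = p)
```

Here both groups see a positive AI shift, but group 2's shift is 0.40 larger. Whether that difference is declared significant depends on the four variances, which is why precise item parameter estimates matter in both scoring conditions.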

AI-Effect Classification

ai_effect_summary() compares DIF flagging patterns between scoring conditions:

Status          Meaning
stable_clean    Not flagged in either condition
stable_dif      Flagged in both (same direction)
introduced      Flagged only under AI scoring
masked          Flagged only under human scoring
new_direction   Flagged in both, but bias reverses sign
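The decision rule in the table can be written out as a short base-R function. This is a sketch of the logic as described above, not the source of ai_effect_summary(): the helper name classify_ai_effect(), its arguments, and the example values are hypothetical.

```r
# flag_*: logical, TRUE if the item is flagged for DIF in that condition.
# est_*:  numeric, signed DIF estimate in that condition.
classify_ai_effect <- function(flag_human, flag_ai, est_human, est_ai) {
  same_sign <- sign(est_human) == sign(est_ai)
  ifelse(!flag_human & !flag_ai, "stable_clean",
  ifelse(!flag_human &  flag_ai, "introduced",
  ifelse( flag_human & !flag_ai, "masked",
  ifelse(same_sign, "stable_dif", "new_direction"))))
}

# Five hypothetical items, one per status.
status <- classify_ai_effect(
  flag_human = c(FALSE, TRUE, FALSE, TRUE, TRUE),
  flag_ai    = c(FALSE, TRUE, TRUE,  FALSE, TRUE),
  est_human  = c(0.02, 0.50, 0.05, 0.45,  0.40),
  est_ai     = c(0.01, 0.55, 0.48, 0.08, -0.42)
)
status
# "stable_clean" "stable_dif" "introduced" "masked" "new_direction"
```

The sign comparison is only consulted when an item is flagged in both conditions, so small estimates of opposite sign on unflagged items do not produce a new_direction label.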

From Existing IRT Fits

If you have fitted IRT models in mirt, use read_ai_scored() to bundle your parameter estimates into the format fit_aidif() expects:

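library(mirt)

# Fit a multigroup 2PL under human scoring.
# mirt() has no group argument; multipleGroup() fits multigroup models.
# `group` is a character or factor vector with one label per row of human_data.
human_fit <- multipleGroup(human_data, model = 1, group = group,
                           itemtype = "2PL", SE = TRUE)

# Extract parameters manually into human_mle and ai_mle, then bundle
# (see ?read_ai_scored for the required list structure)
dat <- read_ai_scored(human_mle, ai_mle)

# Fit aiDIF model
mod <- fit_aidif(dat$human, dat$ai)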

Simulation

# Generate synthetic data with known DIF and DASB
dat <- simulate_aidif_data(
  n_items    = 10,
  n_obs      = 500,
  impact     = 0.5,      # 0.5 SD group mean difference
  dif_items  = c(1, 2),  # items with human-scoring DIF
  dif_mag    = 0.5,
  dasb_items = 5,        # item with AI-induced differential bias
  dasb_mag   = 0.4,
  ai_drift   = 0.1       # uniform AI calibration offset
)

mod <- fit_aidif(dat$human, dat$ai)
summary(mod)

Package Architecture

aiDIF/
├── R/
│   ├── read_functions.R    # read_ai_scored()
│   ├── aidif_core.R        # fit_aidif() — main estimation wrapper
│   ├── robust_engine.R     # estimate_robust_scale(), Wald tests, IRLS engine
│   ├── scoring_bias.R      # scoring_bias_test(), ai_effect_summary(),
│   │                       #   anchor_weights()
│   ├── simulate.R          # simulate_aidif_data()
│   ├── validate_inputs.R   # Internal validation helpers
│   └── class_functions.R   # print/summary/plot S3 methods
├── tests/
│   └── testthat/
│       └── test-aidif.R
└── DESCRIPTION

Citation

If you use aiDIF in published research, please cite:

Hait, S. (2026). aiDIF: Differential Item Functioning for AI-Scored
Assessments. R package version 0.1.0.
https://github.com/causalfragility-lab/aiDIF

License

GPL (>= 3)
