Description

Stability and Robustness Evaluation for Machine Learning Models.

Provides tools for evaluating the trustworthiness of machine learning models in production and research settings. Computes a Stability Index that quantifies the consistency of model predictions across multiple runs or resamples, and a Robustness Score that measures model resilience under small input perturbations. Designed for data scientists, ML engineers, and researchers who need to monitor and ensure model reliability, reproducibility, and deployment readiness.

TrustworthyMLR

Stability and Robustness Evaluation for Machine Learning Models

Overview

TrustworthyMLR is an R package designed to help data scientists, machine learning engineers, and researchers evaluate the trustworthiness of their predictive models. In production environments and academic research alike, it is critical to understand not only how well a model performs, but how reliably it performs under varying conditions.

This package provides three core metrics, plus visual diagnostics:

| Metric | Purpose | Output |
|---|---|---|
| Stability Index | Measures consistency of predictions across multiple training runs or resamples | 0–1 (1 = perfectly stable) |
| Classification Stability | Consistency of predicted classes (labels), adjusted for chance | 0–1 (1 = perfect agreement) |
| Robustness Score | Measures resilience of predictions under small input perturbations | 0–1 (1 = perfectly robust) |
| Visualizations | Decay curves and stability plots for deep diagnostic insights | plots |
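
To make the "adjusted for chance" wording concrete, here is a minimal base-R sketch that averages Cohen's kappa over every pair of runs. This README does not document how the package computes Classification Stability, so the helpers `cohen_kappa()` and `class_stability_sketch()` are hypothetical names used for illustration only, and the choice of kappa is an assumption.

```r
# Hypothetical helper (not part of TrustworthyMLR): Cohen's kappa between
# two vectors of predicted labels.
cohen_kappa <- function(a, b) {
  lv <- union(unique(a), unique(b))
  tab <- table(factor(a, levels = lv), factor(b, levels = lv))
  p_obs <- sum(diag(tab)) / sum(tab)                       # observed agreement
  p_exp <- sum(rowSums(tab) * colSums(tab)) / sum(tab)^2   # agreement expected by chance
  if (p_exp == 1) return(1)  # both runs predict the same single class
  (p_obs - p_exp) / (1 - p_exp)
}

# Hypothetical helper: mean kappa over all pairs of runs (columns of `labels`).
class_stability_sketch <- function(labels) {
  pairs <- combn(ncol(labels), 2)
  kappas <- apply(pairs, 2, function(ij) {
    cohen_kappa(labels[, ij[1]], labels[, ij[2]])
  })
  max(0, mean(kappas))  # clamp so the result stays in the 0-1 range
}

# Simulated labels: four runs that mostly agree vs. four unrelated runs
set.seed(1)
truth <- sample(c("a", "b"), 200, replace = TRUE)
flip <- function(x, rate) {
  change <- runif(length(x)) < rate
  x[change] <- ifelse(x[change] == "a", "b", "a")
  x
}
consistent <- sapply(1:4, function(i) flip(truth, 0.02))
unrelated  <- sapply(1:4, function(i) sample(c("a", "b"), 200, replace = TRUE))

class_stability_sketch(consistent)  # close to 1
class_stability_sketch(unrelated)   # close to 0
```

Raw agreement between two unrelated runs on balanced binary labels would already be about 50%, which is why a chance correction matters here.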

Why Trustworthiness Matters

Modern ML pipelines often focus exclusively on accuracy metrics (RMSE, AUC, F1). However, a model that achieves high accuracy on one training run but produces substantially different predictions on another is not reliable for deployment. Similarly, a model whose predictions change dramatically with tiny input perturbations is not robust enough for real-world use.

TrustworthyMLR addresses this gap by providing principled, easy-to-use diagnostics that complement traditional performance metrics. These tools are essential for:

  • Production ML systems — ensuring model stability across retraining cycles
  • Regulatory compliance — demonstrating model reliability for audits (e.g., finance, healthcare)
  • Academic research — reporting reproducibility metrics alongside performance results
  • Model selection — choosing the most trustworthy model among equally accurate candidates

Installation

Install the development version from GitHub:

# install.packages("devtools")
devtools::install_github("your-username/TrustworthyMLR")

Usage

Stability Index

Evaluate how consistent a model's predictions are across multiple runs:

library(TrustworthyMLR)

# Simulate predictions from 5 independent model runs
set.seed(42)
base_predictions <- rnorm(100)
prediction_matrix <- matrix(
  rep(base_predictions, 5) + rnorm(500, sd = 0.1),
  ncol = 5
)

# Compute stability (1 = perfectly consistent)
stability_index(prediction_matrix)
#> [1] 0.9950...
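
For intuition about what a score near 1 means, the following base-R sketch compares run-to-run disagreement with the overall spread of the predictions. The formula behind `stability_index()` is not given in this README, so `stability_sketch()` is a hypothetical stand-in built on an assumed variance ratio, not the package's implementation.

```r
# Hypothetical helper (not part of TrustworthyMLR): one minus the ratio of
# average per-observation variance across runs to the total variance.
stability_sketch <- function(pred) {
  stopifnot(is.matrix(pred), ncol(pred) >= 2)
  within_var <- mean(apply(pred, 1, var))  # disagreement between runs, per row
  total_var  <- var(as.vector(pred))       # spread of all predictions pooled
  if (total_var == 0) return(1)            # constant predictions: treat as stable
  max(0, min(1, 1 - within_var / total_var))
}

set.seed(42)
base <- rnorm(100)
stable   <- matrix(rep(base, 5) + rnorm(500, sd = 0.1), ncol = 5)  # runs agree
unstable <- matrix(rnorm(500), ncol = 5)                           # runs unrelated

stability_sketch(stable)    # close to 1
stability_sketch(unstable)  # close to 0
```

Rows are observations and columns are runs, matching the layout of `prediction_matrix` above.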

Robustness Score

Evaluate how sensitive a model's predictions are to small input noise:

# Define a prediction function (e.g., wrapping a trained model)
predict_fn <- function(X) X %*% c(1.5, -0.8, 2.3)

# Generate sample input data
set.seed(42)
X <- matrix(rnorm(300), ncol = 3)

# Compute robustness under 5% Gaussian noise
robustness_score(predict_fn, X, noise_level = 0.05, n_rep = 20)
#> [1] 0.9975...
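
The idea of a perturbation-based score can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's code: `robustness_sketch()` is a hypothetical name, noise is assumed to be Gaussian and scaled by each column's standard deviation, and the score is assumed to be one minus the mean squared prediction change relative to the variance of the baseline predictions.

```r
# Hypothetical helper (not part of TrustworthyMLR).
robustness_sketch <- function(predict_fn, X, noise_level = 0.05, n_rep = 20) {
  base_pred <- as.numeric(predict_fn(X))
  denom <- var(base_pred)
  if (denom == 0) stop("baseline predictions have zero variance")
  col_sd <- apply(X, 2, sd)
  rel_change <- replicate(n_rep, {
    # Gaussian noise, scaled per column to noise_level * column sd (assumption)
    noise <- matrix(rnorm(length(X)), nrow = nrow(X))
    X_noisy <- X + sweep(noise, 2, noise_level * col_sd, `*`)
    noisy_pred <- as.numeric(predict_fn(X_noisy))
    mean((noisy_pred - base_pred)^2) / denom
  })
  max(0, 1 - mean(rel_change))  # clamp to the 0-1 range
}

predict_fn <- function(X) X %*% c(1.5, -0.8, 2.3)
set.seed(42)
X <- matrix(rnorm(300), ncol = 3)

low_noise  <- robustness_sketch(predict_fn, X, noise_level = 0.05)
high_noise <- robustness_sketch(predict_fn, X, noise_level = 1)
low_noise   # close to 1
high_noise  # much lower
```

Averaging over `n_rep` noise draws reduces the dependence of the score on any single random perturbation.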

Visual Diagnostics

Visualize how model performance decays as noise increases:

plot_robustness(predict_fn, X, main = "Robustness Decay Curve")


Visualize prediction variance across observations:

plot_stability(prediction_matrix, main = "Model Prediction Stability")

Real-World Workflow Example

library(TrustworthyMLR)

# Step 1: Simulate data and train multiple models (here via bootstrap resampling)
set.seed(1)
n <- 200
p <- 5
X <- matrix(rnorm(n * p), ncol = p)
y <- X %*% rnorm(p) + rnorm(n, sd = 0.5)

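# Collect predictions from 10 bootstrap resamples.
# Fitting on a data frame lets predict() score the original rows,
# so every column of `predictions` refers to the same observations.
dat <- data.frame(y = as.numeric(y), X)
predictions <- replicate(10, {
  idx <- sample(n, replace = TRUE)
  fit <- lm(y ~ ., data = dat[idx, ])
  predict(fit, newdata = dat)
})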

# Step 2: Assess stability
cat("Stability Index:", stability_index(predictions), "\n")

# Step 3: Assess robustness
model <- lm(y ~ X)
pred_fn <- function(newX) {
  as.numeric(cbind(1, newX) %*% coef(model))
}
cat("Robustness Score:", robustness_score(pred_fn, X, noise_level = 0.05), "\n")
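
The workflow above builds `pred_fn` by hand from the model coefficients, which only works for models with a simple linear form. For models that are scored through `predict()` on a data frame, a small adapter can convert the numeric matrix back into the expected columns. The sketch below uses only base R; `make_predict_fn()` and the column names `x1` and `x2` are hypothetical and chosen for illustration.

```r
# Hypothetical helper: wrap a fitted model so it accepts a numeric matrix
# and returns a plain numeric vector of predictions.
make_predict_fn <- function(model, feature_names) {
  function(newX) {
    newdata <- as.data.frame(newX)
    names(newdata) <- feature_names  # restore the names used at training time
    as.numeric(predict(model, newdata = newdata))
  }
}

set.seed(1)
train <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
train$y <- 2 * train$x1 - train$x2 + rnorm(200, sd = 0.5)
fit <- lm(y ~ x1 + x2, data = train)

pred_fn <- make_predict_fn(fit, c("x1", "x2"))
X <- as.matrix(train[, c("x1", "x2")])
head(pred_fn(X))

# With TrustworthyMLR installed, the wrapper can be passed on as in Step 3:
# robustness_score(pred_fn, X, noise_level = 0.05)
```

Keeping the matrix-in, vector-out contract in one place means the same wrapper can be reused for any model class that supports `predict()` with a `newdata` argument.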

Functions Reference

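| Function | Description |
|---|---|
| stability_index() | Compute the stability of predictions across multiple runs |
| robustness_score() | Compute robustness of a model under input perturbations |
| plot_robustness() | Plot how predictions decay as input noise increases |
| plot_stability() | Plot prediction variance across observations |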

Contributing

Contributions are welcome! Please open an issue or submit a pull request on GitHub.

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/new-metric)
  3. Commit your changes (git commit -m "Add new metric")
  4. Push to the branch (git push origin feature/new-metric)
  5. Open a Pull Request

License

MIT © Ali Hamza.

Metadata

Version

0.1.0

License

Unknown

Platforms (78)

    Darwin
    FreeBSD
    Genode
    GHCJS
    Linux
    MMIXware
    NetBSD
    none
    OpenBSD
    Redox
    Solaris
    uefi
    WASI
    Windows
  • aarch64-darwin
  • aarch64-freebsd
  • aarch64-genode
  • aarch64-linux
  • aarch64-netbsd
  • aarch64-none
  • aarch64-uefi
  • aarch64-windows
  • aarch64_be-none
  • arm-none
  • armv5tel-linux
  • armv6l-linux
  • armv6l-netbsd
  • armv6l-none
  • armv7a-linux
  • armv7a-netbsd
  • armv7l-linux
  • armv7l-netbsd
  • avr-none
  • i686-cygwin
  • i686-freebsd
  • i686-genode
  • i686-linux
  • i686-netbsd
  • i686-none
  • i686-openbsd
  • i686-windows
  • javascript-ghcjs
  • loongarch64-linux
  • m68k-linux
  • m68k-netbsd
  • m68k-none
  • microblaze-linux
  • microblaze-none
  • microblazeel-linux
  • microblazeel-none
  • mips-linux
  • mips-none
  • mips64-linux
  • mips64-none
  • mips64el-linux
  • mipsel-linux
  • mipsel-netbsd
  • mmix-mmixware
  • msp430-none
  • or1k-none
  • powerpc-linux
  • powerpc-netbsd
  • powerpc-none
  • powerpc64-linux
  • powerpc64le-linux
  • powerpcle-none
  • riscv32-linux
  • riscv32-netbsd
  • riscv32-none
  • riscv64-linux
  • riscv64-netbsd
  • riscv64-none
  • rx-none
  • s390-linux
  • s390-none
  • s390x-linux
  • s390x-none
  • vc4-none
  • wasm32-wasi
  • wasm64-wasi
  • x86_64-cygwin
  • x86_64-darwin
  • x86_64-freebsd
  • x86_64-genode
  • x86_64-linux
  • x86_64-netbsd
  • x86_64-none
  • x86_64-openbsd
  • x86_64-redox
  • x86_64-solaris
  • x86_64-uefi
  • x86_64-windows