Description

Dealing with Binary Replicates.

Description

Statistical methods for analyzing binary replicates, which are noisy binary measurements of latent binary states. Provides scoring functions (average, median, likelihood-based, and Bayesian) to estimate the probability that an individual is in the positive state. Includes maximum a posteriori estimation via the EM algorithm and full Bayesian inference via Stan. Supports classification with inconclusive decisions and prevalence estimation.

README.md

cran.r-project.org

BinaryReplicates

Statistical methods for analyzing binary replicates, i.e., noisy binary measurements of latent binary states. This package implements the methods described in:

Royer-Carenzi, M., Lorenzo, H., & Pudlo, P. (in press). Reconciling Binary Replicates: Beyond the Average. Statistics in Medicine.

Overview

The package provides scoring functions to estimate the probability that an individual is in the positive state, given noisy replicated measurements:

Method	Function	Requirements
Average-based	`average_scoring()`	None
Median-based	`median_scoring()`	None
MAP (EM algorithm)	`MAP_scoring()`	Fitted EM model
Likelihood-based	`likelihood_scoring()`	Known parameters
Bayesian	`bayesian_scoring()`	Fitted Bayesian model

Additional features:

Classification with inconclusive decisions (classify_with_scores())
Prevalence estimation from scores or Bayesian posterior
Credible intervals for model parameters
Cross-validation for model assessment (cvEM())

Statistical Model

For each individual $i$, we observe $n_i$ binary replicates. These are noisy measurements of a true latent state $T_i \in {0, 1}$:

$$ T_i \mid \theta \sim \text{Bernoulli}(\theta) $$

$$ S_i \mid T_i, p, q \sim \text{Binomial}\big(n_i,; T_i(1-q) + (1-T_i)p\big) $$

where:

$\theta \in (0, 1)$ is the prevalence (probability that $T_i = 1$)
$p \in (0, 1/2)$ is the false positive rate (probability of observing 1 when $T_i = 0$)
$q \in (0, 1/2)$ is the false negative rate (probability of observing 0 when $T_i = 1$)

The goal is to estimate the probability $\mathbb{P}(T_i = 1 \mid S_i = s_i)$ for each individual, which is given by:

$$ \mathbb{P}(T_i = 1 \mid S_i = s_i) = \frac{\theta \cdot (1-q)^{s_i} q^{n_i - s_i}}{\theta \cdot (1-q)^{s_i} q^{n_i - s_i} + (1-\theta) \cdot p^{s_i} (1-p)^{n_i - s_i}} $$

Installation

Prerequisites

The package depends on rstan for Bayesian inference. Install it first by following the guide at: https://mc-stan.org/install/

Install from GitHub

# install.packages("devtools")
devtools::install_github("pierrepudlo/BinaryReplicates")

Quick Start

library(BinaryReplicates)

# Load example data
data("periodontal")
ni <- periodontal$ni
si <- periodontal$si

# --- Fast approach: EM algorithm ---
fit_em <- EMFit(ni, si)
fit_em$parameters_hat
scores_MAP <- MAP_scoring(ni, si, fit_em)

# --- Full Bayesian approach ---
fit_bayes <- BayesianFit(ni, si, chains = 4, iter = 5000)
scores_Bayes <- bayesian_scoring(ni, si, fit_bayes)

# Classify individuals (0.5 = inconclusive)
classes_MAP <- classify_with_scores(scores_MAP, vL = 0.4, vU = 0.6)
classes_Bayes <- classify_with_scores(scores_Bayes, vL = 0.4, vU = 0.6)

# Compare classifications
table(MAP = classes_MAP, Bayesian = classes_Bayes)

Usage Examples

Generate synthetic data

theta <- 0.4
p <- q <- 0.22
n <- 50
ni <- sample(2:6, n, replace = TRUE)
ti <- rbinom(n, 1, theta)
si <- rbinom(n, ni, ti * (1 - q) + (1 - ti) * p)

Simple scoring methods

These methods require no model fitting:

# Average-based scores
Y_A <- average_scoring(ni, si)
theta_A <- prevalence_estimate(Y_A)

# Median-based scores
Y_M <- median_scoring(ni, si)
theta_M <- prevalence_estimate(Y_M)

MAP estimation with the EM algorithm

The EM algorithm estimates model parameters without full Bayesian inference:

fit_em <- EMFit(ni, si)
fit_em$parameters_hat

# MAP scores use the estimated parameters
Y_MAP <- MAP_scoring(ni, si, fit_em)
theta_MAP <- prevalence_estimate(Y_MAP)

# Classification with thresholds
T_MAP <- classify_with_scores(Y_MAP, vL = 0.4, vU = 0.6)

Full Bayesian inference

For full posterior inference, use Stan via BayesianFit():

# Fit the Bayesian model (uses Stan MCMC)
fit <- BayesianFit(ni, si, chains = 4, iter = 5000)
print(fit, pars = c("theta", "p", "q"))

# Credible intervals
credint(fit, level = 0.90)

# Bayesian scores and prevalence
Y_B <- bayesian_scoring(ni, si, fit)
theta_B <- bayesian_prevalence_estimate(fit)

# Classification
T_B <- classify_with_scores(Y_B, vL = 0.4, vU = 0.6)

To use multiple CPU cores for faster sampling:

options(mc.cores = parallel::detectCores())

Likelihood-based scores (known parameters)

When the true parameters are known (e.g., in simulations):

Y_L <- likelihood_scoring(ni, si, list(theta = theta, p = p, q = q))
T_L <- classify_with_scores(Y_L, vL = 0.4, vU = 0.6)

Prediction on new data

Compute predictive Bayesian scores for new observations:

new_ni <- rep(10, 11)
new_si <- 0:10
new_scores <- predict_scores(new_ni, new_si, fit)

Cross-validation

Assess model performance with cross-validation:

cv_result <- cvEM(ni, si, N_cv = 10)
cv_classified <- classify_with_scores_cvEM(cv_result, ti = ti, vL = 0.4)
cv_classified$risk  # Empirical risk

Documentation

For more details, see the package vignette:

vignette("introduction", package = "BinaryReplicates")

Citation

If you use this package, please cite:

Royer-Carenzi, M., Lorenzo, H., & Pudlo, P. (in press).
Reconciling Binary Replicates: Beyond the Average.
Statistics in Medicine.

License

GPL (>= 3)

r-BinaryReplicates

BinaryReplicates

Overview

Statistical Model

Installation

Prerequisites

Install from GitHub

Quick Start

Usage Examples

Generate synthetic data

Simple scoring methods

MAP estimation with the EM algorithm

Full Bayesian inference

Likelihood-based scores (known parameters)

Prediction on new data

Cross-validation

Documentation

Citation

License

Version

License

Status

Source

Homepage

Platforms (80)

BinaryReplicates

Overview

Statistical Model

Installation

Prerequisites

Install from GitHub

Quick Start

Usage Examples

Generate synthetic data

Simple scoring methods

MAP estimation with the EM algorithm

Full Bayesian inference

Likelihood-based scores (known parameters)

Prediction on new data

Cross-validation

Documentation

Citation

License

Version

License

Status

Source

Homepage

Platforms80 (80)

Platforms (80)