Description

Pairwise Comparison Tools for Large Language Model-Based Writing Evaluation.

Provides a unified framework for generating, submitting, and analyzing pairwise comparisons of writing quality using large language models (LLMs). The package supports live and/or batch evaluation workflows across multiple providers ('OpenAI', 'Anthropic', 'Google Gemini', 'Together AI', and locally-hosted 'Ollama' models), includes bias-tested prompt templates and a flexible template registry, and offers tools for constructing forward and reversed comparison sets to analyze consistency and positional bias. Results can be modeled using Bradley–Terry (1952) <doi:10.2307/2334029> or Elo rating methods to derive writing quality scores. For information on the method of pairwise comparisons, see Thurstone (1927) <doi:10.1037/h0070288> and Heldsinger & Humphry (2010) <doi:10.1007/BF03216919>. For information on Elo ratings, see Clark et al. (2018) <doi:10.1371/journal.pone.0190393>.


pairwiseLLM: Pairwise Comparison Tools for Large Language Model-Based Writing Evaluation

R-CMD-check | Codecov test coverage | CRAN status | License: MIT | Project Status: Active – The project has reached a stable, usable state and is being actively developed.

pairwiseLLM provides a unified, extensible framework for generating, submitting, and modeling pairwise comparisons of writing quality using large language models (LLMs).

It includes:

  • Unified live and batch APIs across OpenAI, Anthropic, and Gemini, with live comparisons also available via Together.ai and Ollama
  • A prompt template registry with tested templates designed to reduce positional bias
  • Positional-bias diagnostics (forward vs reverse design)
  • Bradley–Terry (BT) and Elo modeling
  • Consistent data structures for all providers

Vignettes

Several vignettes are available to demonstrate functionality.

For basic function usage, see:

For advanced batch processing workflows, see:

For information on prompt evaluation and positional-bias diagnostics, see: vignette("prompt-template-bias")


Supported Models

The following models are confirmed to work for pairwise comparisons:

Provider           Model                            Reasoning Mode?
OpenAI             gpt-5.2                          ✅ Yes
OpenAI             gpt-5.1                          ✅ Yes
OpenAI             gpt-4o                           ❌ No
OpenAI             gpt-4.1                          ❌ No
Anthropic          claude-sonnet-4-5                ✅ Yes
Anthropic          claude-haiku-4-5                 ✅ Yes
Anthropic          claude-opus-4-5                  ✅ Yes
Google/Gemini      gemini-3-pro-preview             ✅ Yes
DeepSeek-AI [1]    DeepSeek-R1                      ✅ Yes
DeepSeek-AI [1]    DeepSeek-V3                      ❌ No
Moonshot-AI [1]    Kimi-K2-Instruct-0905            ❌ No
Qwen [1]           Qwen3-235B-A22B-Instruct-2507    ❌ No
Qwen [2]           qwen3:32b                        ✅ Yes
Google [2]         gemma3:27b                       ❌ No
Mistral [2]        mistral-small3.2:24b             ❌ No

[1] via the together.ai API

[2] via Ollama on a local machine

Batch APIs are currently available for OpenAI, Anthropic, and Gemini only. Models accessed via Together.ai and Ollama are supported for live comparisons via submit_llm_pairs() / llm_compare_pair().

Backend      Live   Batch
openai       ✅     ✅
anthropic    ✅     ✅
gemini       ✅     ✅
together     ✅     ❌
ollama       ✅     ❌

Installation

Once the package is available on CRAN, install with:

install.packages("pairwiseLLM")

To install the development version from GitHub:

# install.packages("pak")
pak::pak("shmercer/pairwiseLLM")

Load the package:

library(pairwiseLLM)

Core Concepts

At a high level, pairwiseLLM workflows follow this structure:

  1. Writing samples – e.g., essays, constructed responses, short answers.
  2. Trait – a rating dimension such as “overall quality” or “organization”.
  3. Pairs – pairs of samples to be compared for that trait.
  4. Prompt template – instructions + placeholders for {TRAIT_NAME}, {TRAIT_DESCRIPTION}, {SAMPLE_1}, {SAMPLE_2}.
  5. Backend – which provider/model to use (OpenAI, Anthropic, Gemini, Together, Ollama).
  6. Modeling – convert pairwise results to latent scores via BT or Elo.

The package provides helpers for each step.
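
As a rough illustration of step 3, every unordered pair of samples can be enumerated in base R. This is only a sketch of the idea, not the implementation behind make_pairs(); the data frame `samples` and the column names `ID`, `text`, `ID1`, `text1`, `ID2`, and `text2` are assumptions made for this example.

```r
# Hypothetical samples; the column names ID and text are assumptions.
samples <- data.frame(
  ID   = c("S1", "S2", "S3", "S4"),
  text = c("First essay.", "Second essay.", "Third essay.", "Fourth essay."),
  stringsAsFactors = FALSE
)

# combn() returns a 2 x choose(n, 2) matrix of row-index pairs.
idx <- combn(nrow(samples), 2)

# One row per unordered pair, with the lower-indexed sample in position 1.
all_pairs <- data.frame(
  ID1   = samples$ID[idx[1, ]],
  text1 = samples$text[idx[1, ]],
  ID2   = samples$ID[idx[2, ]],
  text2 = samples$text[idx[2, ]],
  stringsAsFactors = FALSE
)

nrow(all_pairs)  # choose(4, 2) = 6 pairs
```

The number of pairs grows quadratically with the number of samples, which is why a workflow would typically sample a subset of pairs for larger collections.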


Live Comparisons

Use the unified API:

  • llm_compare_pair() — compare one pair
  • submit_llm_pairs() — compare many pairs at once

Example:

data("example_writing_samples")

pairs <- example_writing_samples |>
  make_pairs() |>
  sample_pairs(5, seed = 123) |>
  randomize_pair_order()

td <- trait_description("overall_quality")
tmpl <- get_prompt_template("default")

res <- submit_llm_pairs(
  pairs             = pairs,
  backend           = "openai",
  model             = "gpt-4o",
  trait_name        = td$name,
  trait_description = td$description,
  prompt_template   = tmpl
)

Batch Comparisons

Large-scale runs use:

  • llm_submit_pairs_batch()
  • llm_download_batch_results()

Example:

batch <- llm_submit_pairs_batch(
  backend           = "anthropic",
  model             = "claude-sonnet-4-5",
  pairs             = pairs,
  trait_name        = td$name,
  trait_description = td$description,
  prompt_template   = tmpl
)

results <- llm_download_batch_results(batch)

API Keys

pairwiseLLM reads keys only from environment variables.
Keys are never printed, never stored, and never written to disk.

You can verify which providers are available using:

check_llm_api_keys()

This returns a tibble showing whether R can see the required keys for:

  • OpenAI
  • Anthropic
  • Google Gemini
  • Together.ai
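
For readers who want to see what such a check involves, the following base R sketch tests whether each environment variable is set without ever printing its value. It is an illustration only; check_llm_api_keys() may report its results differently. The variable names are the ones used in this README, and the object names `key_vars` and `keys_visible` are made up for the example.

```r
# Environment variable names as used in this README.
key_vars <- c(
  openai    = "OPENAI_API_KEY",
  anthropic = "ANTHROPIC_API_KEY",
  gemini    = "GEMINI_API_KEY",
  together  = "TOGETHER_API_KEY"
)

# nzchar() is TRUE only for non-empty strings, so this reports whether a key
# is visible to R while the key values themselves are never shown.
keys_visible <- unname(nzchar(Sys.getenv(key_vars, unset = "")))

data.frame(
  backend     = names(key_vars),
  env_var     = unname(key_vars),
  key_visible = keys_visible
)
```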

Setting API Keys

You may set keys temporarily for the current R session:

Sys.setenv(OPENAI_API_KEY = "your-key-here")
Sys.setenv(ANTHROPIC_API_KEY = "your-key-here")
Sys.setenv(GEMINI_API_KEY = "your-key-here")
Sys.setenv(TOGETHER_API_KEY = "your-key-here")

For normal use and for reproducible analyses, however, it is strongly recommended to store them in your ~/.Renviron file.

Recommended method: Adding keys to ~/.Renviron

Open your .Renviron file:

usethis::edit_r_environ()

Add the following lines:

OPENAI_API_KEY="your-openai-key"
ANTHROPIC_API_KEY="your-anthropic-key"
GEMINI_API_KEY="your-gemini-key"
TOGETHER_API_KEY="your-together-key"

Save the file, then restart R.

You can confirm that R now sees the keys:

check_llm_api_keys()

Prompt Templates & Registry

pairwiseLLM includes:

  • A default template tested for positional bias
  • Support for multiple templates stored by name
  • User-defined templates via register_prompt_template()

View available templates

list_prompt_templates()
#> [1] "default" "test1"   "test2"   "test3"   "test4"   "test5"

Show the default template (truncated)

tmpl <- get_prompt_template("default")
cat(substr(tmpl, 1, 400), "...\n")
#> You are a debate adjudicator. Your task is to weigh the comparative strengths of two writing samples regarding a specific trait.
#> 
#> TRAIT: {TRAIT_NAME}
#> DEFINITION: {TRAIT_DESCRIPTION}
#> 
#> SAMPLES:
#> 
#> === SAMPLE_1 ===
#> {SAMPLE_1}
#> 
#> === SAMPLE_2 ===
#> {SAMPLE_2}
#> 
#> EVALUATION PROCESS (Mental Simulation):
#> 
#> 1.  **Advocate for SAMPLE_1**: Mentally list the single strongest point of evidence that makes SAMPLE_1 the  ...

Register your own template

register_prompt_template("my_template", "
Compare two essays for {TRAIT_NAME}…

{TRAIT_NAME} is defined as {TRAIT_DESCRIPTION}.

SAMPLE 1:
{SAMPLE_1}

SAMPLE 2:
{SAMPLE_2}

<BETTER_SAMPLE>SAMPLE_1</BETTER_SAMPLE> or
<BETTER_SAMPLE>SAMPLE_2</BETTER_SAMPLE>
")

Use it in a submission:

tmpl <- get_prompt_template("my_template")
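
The custom template above asks the model to answer with a <BETTER_SAMPLE> tag. As a sketch of how such a reply could be turned into a verdict, the base R helper below extracts the tag contents from a raw reply string. The function `extract_better_sample()` is hypothetical; pairwiseLLM handles response parsing itself, and its actual parsing rules may differ.

```r
# Hypothetical helper (not part of pairwiseLLM): pull the verdict out of a
# raw model reply that is expected to contain one <BETTER_SAMPLE> tag.
extract_better_sample <- function(reply) {
  pattern <- "<BETTER_SAMPLE>\\s*(SAMPLE_[12])\\s*</BETTER_SAMPLE>"
  m <- regmatches(reply, regexec(pattern, reply))[[1]]
  # With no match, m is character(0); report NA in that case.
  if (length(m) < 2) NA_character_ else m[2]
}

extract_better_sample("Reasoning... <BETTER_SAMPLE>SAMPLE_2</BETTER_SAMPLE>")
#> [1] "SAMPLE_2"
extract_better_sample("No verdict given.")
#> [1] NA
```

Returning NA for replies without a well-formed tag keeps malformed responses visible instead of silently assigning a winner.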

Trait Descriptions

Traits define what “quality” means.

trait_description("overall_quality")
#> $name
#> [1] "Overall Quality"
#> 
#> $description
#> [1] "Overall quality of the writing, considering how well ideas are expressed,\n      how clearly the writing is organized, and how effective the language and\n      conventions are."

You can also provide custom traits:

trait_description(
  custom_name        = "Clarity",
  custom_description = "How understandable, coherent, and well structured the ideas are."
)

Positional Bias Testing

LLMs often show a first-position or second-position bias.
pairwiseLLM includes explicit tools for testing this.

Typical workflow

pairs_fwd <- make_pairs(example_writing_samples)
pairs_rev <- sample_reverse_pairs(pairs_fwd, reverse_pct = 1.0)

Submit:

# "..." stands for the remaining arguments (trait_name, trait_description, prompt_template)
res_fwd <- submit_llm_pairs(pairs_fwd, model = "gpt-4o", backend = "openai", ...)
res_rev <- submit_llm_pairs(pairs_rev, model = "gpt-4o", backend = "openai", ...)

Compute bias:

cons <- compute_reverse_consistency(res_fwd, res_rev)
bias <- check_positional_bias(cons)

cons$summary
bias$summary
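
The logic behind a forward vs reverse design can be sketched in base R on toy data. This is an illustration of the general idea, not the computation performed by compute_reverse_consistency() or check_positional_bias(); the data frames and the column names `ID1`, `ID2`, and `better_id` are assumptions for this example.

```r
# Toy results; the column names ID1, ID2 and better_id are assumptions.
res_fwd_toy <- data.frame(
  ID1       = c("S1", "S1", "S2", "S3"),
  ID2       = c("S2", "S3", "S3", "S4"),
  better_id = c("S1", "S1", "S2", "S3"),
  stringsAsFactors = FALSE
)
# The same pairs with positions swapped, as in a fully reversed design.
res_rev_toy <- data.frame(
  ID1       = c("S2", "S3", "S3", "S4"),
  ID2       = c("S1", "S1", "S2", "S3"),
  better_id = c("S1", "S3", "S2", "S4"),
  stringsAsFactors = FALSE
)

# Order-independent key so forward and reversed rows can be matched up.
pair_key <- function(d) paste(pmin(d$ID1, d$ID2), pmax(d$ID1, d$ID2), sep = "|")
rev_winner <- res_rev_toy$better_id[match(pair_key(res_fwd_toy), pair_key(res_rev_toy))]

# Consistency: share of pairs where both presentation orders name the same winner.
consistency <- mean(res_fwd_toy$better_id == rev_winner)

# Positional preference: how often the sample shown first was chosen,
# tested against the 50% expected when position has no influence.
both <- rbind(res_fwd_toy, res_rev_toy)
first_wins <- sum(both$better_id == both$ID1)
bias_test <- binom.test(first_wins, nrow(both), p = 0.5)

consistency              # 0.5: two of four pairs agree across orders
first_wins / nrow(both)  # 0.75: the first position won 6 of 8 comparisons
bias_test$p.value
```

When every pair is shown in both orders and the model is perfectly consistent, the first position wins exactly half the time, so a departure from 50% in such a design points to position rather than content.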

Positional-bias tested templates

Five included templates have been tested across different backend providers. Complete details are presented in a vignette: vignette("prompt-template-bias")


Bradley–Terry & Elo Modeling

Bradley–Terry (BT)

bt_data <- build_bt_data(res)
bt_fit <- fit_bt_model(bt_data)
summarize_bt_fit(bt_fit)
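
Under the Bradley–Terry model, the probability that sample i is preferred over sample j is exp(θ_i) / (exp(θ_i) + exp(θ_j)), which is equivalent to a logistic regression on the difference θ_i − θ_j. The base R sketch below fits that model with glm() on toy data. It is meant to show the idea only; fit_bt_model() may rely on a different estimator, and the data frame `cmp` and its column names are assumptions.

```r
# Toy comparisons; column names are assumptions. win1 = 1 if ID1 was preferred.
cmp <- data.frame(
  ID1  = c("A", "A", "B", "A", "B", "C", "A", "B", "C"),
  ID2  = c("B", "C", "C", "B", "C", "A", "C", "A", "B"),
  win1 = c(1, 1, 1, 0, 1, 0, 1, 0, 1),
  stringsAsFactors = FALSE
)
items <- sort(unique(c(cmp$ID1, cmp$ID2)))

# Design matrix: +1 in the column of the first sample, -1 for the second.
X <- matrix(0, nrow(cmp), length(items), dimnames = list(NULL, items))
X[cbind(seq_len(nrow(cmp)), match(cmp$ID1, items))] <- 1
X[cbind(seq_len(nrow(cmp)), match(cmp$ID2, items))] <- -1

# Drop the first sample's column so it acts as the reference (theta fixed at 0);
# without this constraint the abilities are only identified up to a constant.
y  <- cmp$win1
Xr <- X[, -1, drop = FALSE]
fit <- glm(y ~ Xr - 1, family = binomial())

theta <- c(0, unname(coef(fit)))
names(theta) <- items
theta  # log-abilities relative to sample A
```

In this toy data A wins 5 of its 6 comparisons, B wins 3, and C wins 1, so the estimated abilities are ordered A > B > C.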

Elo Modeling

# res: output from submit_llm_pairs() / llm_submit_pairs_batch()
elo_data <- build_elo_data(res)
elo_fit  <- fit_elo_model(elo_data, runs = 5)

elo_fit$elo
elo_fit$reliability
elo_fit$reliability_weighted
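
For orientation, the standard Elo update for a single comparison can be written in a few lines of base R. This is a sketch of the textbook rule, not the code behind fit_elo_model(); the function `elo_update()`, the value k = 32, and the starting rating of 1000 are illustrative assumptions rather than package defaults.

```r
# Textbook Elo update for one comparison between samples A and B.
# k and the starting rating used below are illustrative choices.
elo_update <- function(r_a, r_b, a_won, k = 32) {
  # Expected score of A given the current rating difference.
  expected_a <- 1 / (1 + 10^((r_b - r_a) / 400))
  # A gains (or loses) in proportion to how surprising the outcome was;
  # B's rating moves by the same amount in the opposite direction.
  delta <- k * (as.numeric(a_won) - expected_a)
  c(a = r_a + delta, b = r_b - delta)
}

elo_update(1000, 1000, a_won = TRUE)
#>    a    b
#> 1016  984
```

Because each update depends on the ratings at that moment, Elo scores depend on the order in which comparisons are processed. Averaging over several random orderings is a common remedy, and the runs argument above may serve that purpose, although that reading is an assumption here.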

Live vs Batch Summary

Workflow   Use Case                     Functions
Live       small or interactive runs    submit_llm_pairs, llm_compare_pair
Batch      large jobs, cost control     llm_submit_pairs_batch, llm_download_batch_results

Contributing

Contributions to pairwiseLLM are very welcome!

  • Bug reports (with reproducible examples when possible)
  • Feature requests, ideas, and discussion
  • Pull requests improving:
    • functionality
    • documentation
    • examples / vignettes
    • test coverage
  • Backend integrations (e.g., additional LLM providers or local inference engines)
  • Modeling extensions

Reporting issues

If you encounter a problem:

  1. Run:

    devtools::session_info()

  2. Include:

    • reproducible code
    • the error message
    • the model/backend involved
    • your operating system

  3. Open an issue at:
    https://github.com/shmercer/pairwiseLLM/issues


License

MIT License. See LICENSE.


Package Author and Maintainer


Citation

Mercer, S. H. (2025). pairwiseLLM: Pairwise writing quality comparisons with large language models (Version 1.0.0) [R package; Computer software]. https://github.com/shmercer/pairwiseLLM.

Metadata

Version

1.1.0

License

Unknown

Platforms (78)

    Darwin
    FreeBSD
    Genode
    GHCJS
    Linux
    MMIXware
    NetBSD
    none
    OpenBSD
    Redox
    Solaris
    uefi
    WASI
    Windows
  • aarch64-darwin
  • aarch64-freebsd
  • aarch64-genode
  • aarch64-linux
  • aarch64-netbsd
  • aarch64-none
  • aarch64-uefi
  • aarch64-windows
  • aarch64_be-none
  • arm-none
  • armv5tel-linux
  • armv6l-linux
  • armv6l-netbsd
  • armv6l-none
  • armv7a-linux
  • armv7a-netbsd
  • armv7l-linux
  • armv7l-netbsd
  • avr-none
  • i686-cygwin
  • i686-freebsd
  • i686-genode
  • i686-linux
  • i686-netbsd
  • i686-none
  • i686-openbsd
  • i686-windows
  • javascript-ghcjs
  • loongarch64-linux
  • m68k-linux
  • m68k-netbsd
  • m68k-none
  • microblaze-linux
  • microblaze-none
  • microblazeel-linux
  • microblazeel-none
  • mips-linux
  • mips-none
  • mips64-linux
  • mips64-none
  • mips64el-linux
  • mipsel-linux
  • mipsel-netbsd
  • mmix-mmixware
  • msp430-none
  • or1k-none
  • powerpc-linux
  • powerpc-netbsd
  • powerpc-none
  • powerpc64-linux
  • powerpc64le-linux
  • powerpcle-none
  • riscv32-linux
  • riscv32-netbsd
  • riscv32-none
  • riscv64-linux
  • riscv64-netbsd
  • riscv64-none
  • rx-none
  • s390-linux
  • s390-none
  • s390x-linux
  • s390x-none
  • vc4-none
  • wasm32-wasi
  • wasm64-wasi
  • x86_64-cygwin
  • x86_64-darwin
  • x86_64-freebsd
  • x86_64-genode
  • x86_64-linux
  • x86_64-netbsd
  • x86_64-none
  • x86_64-openbsd
  • x86_64-redox
  • x86_64-solaris
  • x86_64-uefi
  • x86_64-windows