ArXiv TLDR

The Partial Testimony of Logs: Evaluation of Language Model Generation under Confounded Model Choice

arXiv:2605.01311

Jikai Jin, Vasilis Syrgkanis

cs.LG · econ.EM · stat.AP · stat.ML

TLDR

A three-source design combines observational logs (OBS), a small randomized experiment (EXP), and an offline simulator (SIM) to debias language model evaluation: EXP and SIM alone identify causal model values, while OBS only reduces estimation error.

Key contributions

  • Addresses bias in offline LM evaluation from usage logs due to confounded model choice.
  • Introduces a three-source design: observational logs, randomized experiments, and offline simulators.
  • Proves an identification theorem: the randomized experiment and the simulator together recover causal model values; observational logs enter only afterward, to reduce estimation error.
  • Evaluates six estimator families, finding no single dominant method across all regimes.
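The confounding problem behind the first two bullets can be shown with a toy simulation. This is a minimal illustrative sketch, not the paper's estimator: the latent user trait, the selection rule, and all numbers are made-up assumptions. A single trait drives both which model a user picks and how leniently they score it, so the raw OBS gap is inflated, while randomized assignment (EXP) recovers the true gap.

```python
import random

random.seed(0)

# Hypothetical setup: model "A" is truly better by 0.1 for every user,
# but a lenient user (high trait) both prefers A and scores everything higher.
def true_score(model, trait):
    base = 0.6 if model == "A" else 0.5
    return base + 0.3 * trait

n = 50_000

# OBS: confounded logs — lenient users self-select into model A.
obs = {"A": [], "B": []}
for _ in range(n):
    trait = random.random()
    chosen = "A" if trait > 0.5 else "B"
    obs[chosen].append(true_score(chosen, trait))

# EXP: randomized experiment — assignment overrides model choice.
exp = {"A": [], "B": []}
for _ in range(n):
    trait = random.random()
    assigned = random.choice(["A", "B"])
    exp[assigned].append(true_score(assigned, trait))

mean = lambda xs: sum(xs) / len(xs)
obs_gap = mean(obs["A"]) - mean(obs["B"])  # biased: ~0.25, not 0.1
exp_gap = mean(exp["A"]) - mean(exp["B"])  # unbiased: ~0.1
print(f"OBS gap: {obs_gap:.3f}  EXP gap: {exp_gap:.3f}")
```

The OBS comparison mixes self-selected populations (A's raters average trait 0.75, B's 0.25), adding roughly 0.3 × 0.5 = 0.15 of spurious advantage on top of the true 0.1; randomization removes that term.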

Why it matters

This paper offers a practical solution to a common problem in language model evaluation: debiasing real-world usage logs. By combining different data sources, it yields a more robust and causally valid way to assess model performance while reducing reliance on costly randomized experiments, which matters for efficient LM development.

Original Abstract

Offline evaluation of language models from usage logs is biased when model choice is confounded: the same user-side factors that influence which model is used can also influence how its output is judged, so raw comparisons of logged scores mix self-selected populations rather than estimating a common quantity of interest. A small randomized experiment can break this bias by overriding model choice, but in practice such experiments are scarce and costly. We study a three-source design that combines a large confounded observational log (OBS) for scale, a small randomized experiment (EXP) for unconfounded scoring, and an offline simulator (SIM) that replays candidate models on cached contexts. Our main result is an identification theorem showing that the randomized experiment and the simulator are together enough to recover causal model values; the observational log enters only afterward, to reduce estimation error rather than to make the causal comparison valid. Six estimator families are evaluated in a controlled semi-synthetic validation and in two real-task cached benchmarks for summarization and coding. No family dominates every regime; relative performance depends on the amount of unbiased EXP supervision and on how closely the target reward aligns with OBS-derived structure.
