ArXiv TLDR

The Endogeneity of Miscalibration: Impossibility and Escape in Scored Reporting

2605.07671

Lauri Lovén, Sasu Tarkoma

cs.GT, cs.AI, cs.MA, econ.TH, math.OC

TLDR

This paper reveals an endogeneity in AI oversight: when scoring is paired with a smooth approval function, no strictly proper scoring rule elicits truthful reports from strategic agents, but sharp, step-function thresholds restore truthfulness.

Key contributions

  • The principal's optimal oversight necessarily uses a non-affine approval function, which makes truthful reporting suboptimal for strategic agents.
  • This impossibility holds for all strictly proper scoring rules, with a closed-form perturbation formula.
  • A constructive escape exists: step-function approval thresholds achieve first-best screening for every rule.
  • The Brier score uniquely achieves welfare equivalence between second-best and first-best, owing to its type-independent inflation cost.
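The inflate-or-not logic behind these contributions can be illustrated with a toy model under the Brier score. The belief p = 0.6, approval weight beta = 0.03, threshold 0.8, and the quadratic form of the smooth approval function are all illustrative assumptions, not parameters from the paper:

```python
import numpy as np

def expected_brier(q, p):
    # Expected (negative) Brier loss when the true belief is p and the
    # report is q; strictly proper, so it peaks at q = p.
    return -(p * (1 - q) ** 2 + (1 - p) * q ** 2)

def best_report(p, approval, beta):
    # Grid-search the report maximizing score plus approval benefit.
    qs = np.linspace(0.0, 1.0, 100001)
    return qs[np.argmax(expected_brier(qs, p) + beta * approval(qs))]

p, beta = 0.6, 0.03                        # illustrative belief and approval weight
smooth = lambda q: q ** 2                  # a non-affine smooth approval function
step = lambda q: (q >= 0.8).astype(float)  # step approval with threshold 0.8

# Smooth approval: the first-order condition 2(p - q) + 2*beta*q = 0 gives
# q* = p / (1 - beta) > p, so the agent inflates its report.
print(f"smooth approval: report {best_report(p, smooth, beta):.3f}, belief {p}")

# Step approval: inflating from p to the threshold costs (0.8 - p)^2 = 0.04
# in expected Brier score, more than the beta = 0.03 approval gain, so the
# binary inflate-or-not choice resolves to truthful reporting.
print(f"step approval:   report {best_report(p, step, beta):.3f}, belief {p}")
```

Under the smooth approval the agent's best report strictly exceeds its belief for any beta > 0, while under the step rule the report stays truthful except for types close enough to the threshold that jumping is worthwhile, which is the type-space threshold that enables screening.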

Why it matters

This result matters for AI alignment: smooth scoring-based oversight cannot elicit truthful reports from strategic agents, no matter which strictly proper scoring rule is used. The paper offers a concrete alternative, sharp approval thresholds, that preserves calibration and applies to both scalable AI oversight and marketplace operation.

Original Abstract

Eliciting truthful reports from autonomous agents is a core problem in scalable AI oversight: a principal scores the agent's report using a strictly proper scoring rule, but the agent also benefits from the report through a non-accuracy channel (approval for autonomous action, allocation share, downstream control). The same structure appears in classical mechanism-design settings such as marketplace operation. Our main result is an endogeneity: the principal's optimal oversight necessarily uses a non-affine approval function to screen types, yet any non-affine approval makes truthful reporting suboptimal under the combined objective whenever deviation is undetectable. The principal cannot avoid the perturbation that undermines calibration. This impossibility holds for all strictly proper scoring rules, with a closed-form perturbation formula. A constructive escape exists: a step-function approval threshold achieves first-best screening for every strictly proper scoring rule, because the agent's binary inflate-or-not choice creates a type-space threshold regardless of the generator's curvature. Under the Brier score specifically, the type-independent inflation cost yields a welfare equivalence between second-best and first-best; we prove this equivalence is unique to Brier (the welfare gap under smooth $C^1$ oversight is bounded below by $\Omega(\text{Var}(1/G'')\,(\gamma/\beta)^2)$ for every non-Brier rule). Two instances develop the framework: AI agent oversight (the lead motivating setting) and marketplace operation (a parallel mechanism-design domain). The message for AI alignment is direct: smooth scoring-based oversight cannot elicit truthful reports from a strategic agent; sharp thresholds are the calibration-preserving design.
