The Verification Tax: Fundamental Limits of AI Auditing in the Rare-Error Regime
TLDR
As AI models improve, verifying their calibration becomes fundamentally harder due to a "verification tax," requiring new auditing practices.
Key contributions
- Proves a "verification tax": calibration auditing becomes fundamentally harder as AI models improve.
- Self-evaluation without labels provides zero information for assessing model calibration.
- Identifies a sharp phase transition where miscalibration becomes statistically undetectable.
- Shows active querying can eliminate the Lipschitz constant, simplifying calibration verification.
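The rate in the first bullet is easy to operationalize. Below is a minimal numeric sketch (the function names, the choice L = 1, and the sample values of m and ε are illustrative assumptions; the constants hidden inside Θ(·) are unknown, so these are order-of-magnitude floors, not exact bounds). It also shows the "opposite directions" effect: the floor shrinks like ε^{1/3} while the calibration error being verified shrinks like ε, so relative resolution worsens as models improve.

```python
def verification_floor(L: float, eps: float, m: int) -> float:
    """Order-of-magnitude minimax floor Theta((L*eps/m)^(1/3)) for
    estimating calibration error from m labeled samples."""
    return (L * eps / m) ** (1.0 / 3.0)

def min_labels_to_detect(eps: float) -> float:
    """Phase transition m*eps ~ 1: with fewer labeled samples,
    miscalibration of a model with error rate eps is undetectable."""
    return 1.0 / eps

m = 10_000  # hypothetical number of labeled evaluation items
for eps in (0.3, 0.03, 0.003):  # hypothetical model error rates
    floor = verification_floor(L=1.0, eps=eps, m=m)
    # floor shrinks like eps^(1/3), but the target shrinks like eps,
    # so floor/eps grows as the model improves: the verification tax.
    print(f"eps={eps:<6} floor~{floor:.4f}  floor/eps~{floor / eps:.2f}  "
          f"detection needs m >~ {min_labels_to_detect(eps):,.0f}")
```

Holding the floor at a fixed fraction of ε requires m to grow like 1/ε², which is why the paper argues that gains near benchmark resolution cannot be verified by passive evaluation alone.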
Why it matters
This paper reveals fundamental limits to AI auditing, particularly for calibration in high-performing models. It challenges standard evaluation practices and provides a theoretical framework for understanding the "verification tax." The findings suggest that current methods may be insufficient for frontier models, advocating for new approaches like active querying to ensure reliable AI.
Original Abstract
The most cited calibration result in deep learning -- post-temperature-scaling ECE of 0.012 on CIFAR-100 (Guo et al., 2017) -- is below the statistical noise floor. We prove this is not a failure of the experiment but a law: the minimax rate for estimating calibration error with model error rate ε is Θ((Lε/m)^{1/3}), and no estimator can beat it. This "verification tax" implies that as AI models improve, verifying their calibration becomes fundamentally harder -- with the same exponent in opposite directions. We establish four results that contradict standard evaluation practice: (1) self-evaluation without labels provides exactly zero information about calibration, bounded by a constant independent of compute; (2) a sharp phase transition at mε ≈ 1 below which miscalibration is undetectable; (3) active querying eliminates the Lipschitz constant, collapsing estimation to detection; (4) verification cost grows exponentially with pipeline depth at rate L^K. We validate across five benchmarks (MMLU, TruthfulQA, ARC-Challenge, HellaSwag, WinoGrande; ~27,000 items) with 6 LLMs from 5 families (8B-405B parameters, 27 benchmark-model pairs with logprob-based confidence), 95% bootstrap CIs, and permutation tests. Self-evaluation non-significance holds in 80% of pairs. Across frontier models, 23% of pairwise comparisons are indistinguishable from noise, implying that credible calibration claims must report verification floors and prioritize active querying once gains approach benchmark resolution.
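The abstract's claim that 23% of pairwise frontier-model comparisons are indistinguishable from noise rests on permutation tests over per-item scores. Below is a hedged sketch of such a test; the 10 equal-width bins, the per-item swap design, and the permutation count are my illustrative assumptions, not the paper's exact protocol.

```python
import numpy as np

def ece(conf: np.ndarray, correct: np.ndarray, bins: int = 10) -> float:
    """Standard binned Expected Calibration Error."""
    edges = np.linspace(0.0, 1.0, bins + 1)
    idx = np.clip(np.digitize(conf, edges[1:-1]), 0, bins - 1)
    total = len(conf)
    err = 0.0
    for b in range(bins):
        mask = idx == b
        if mask.any():
            err += mask.sum() / total * abs(conf[mask].mean() - correct[mask].mean())
    return err

def paired_permutation_pvalue(conf_a, corr_a, conf_b, corr_b,
                              n_perm: int = 2000, seed: int = 0) -> float:
    """Paired permutation test: is |ECE_A - ECE_B| explainable by noise?
    Assumes both models were scored on the same items, in the same order."""
    rng = np.random.default_rng(seed)
    observed = abs(ece(conf_a, corr_a) - ece(conf_b, corr_b))
    count = 0
    for _ in range(n_perm):
        swap = rng.random(len(conf_a)) < 0.5  # swap the two models per item
        ca, cb = np.where(swap, conf_b, conf_a), np.where(swap, conf_a, conf_b)
        ka, kb = np.where(swap, corr_b, corr_a), np.where(swap, corr_a, corr_b)
        if abs(ece(ca, ka) - ece(cb, kb)) >= observed:
            count += 1
    return (count + 1) / (n_perm + 1)

if __name__ == "__main__":
    # Synthetic demo: model A is well calibrated, model B slightly overconfident.
    rng = np.random.default_rng(1)
    n = 2000
    conf = rng.uniform(0.5, 1.0, n)
    corr_a = (rng.random(n) < conf).astype(float)
    corr_b = (rng.random(n) < conf * 0.97).astype(float)
    print(paired_permutation_pvalue(conf, corr_a, conf, corr_b))
```

With a small ECE gap and few items, such a test frequently fails to reject, which is the paper's point: near the verification floor, observed calibration differences are within noise.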