ArXiv TLDR

Reasoning Is Not Free: Robust Adaptive Cost-Efficient Routing for LLM-as-a-Judge

arXiv: 2605.10805

Wenbo Zhang, Lijinghua Zhang, Liner Xiang, Hengrui Cai

cs.AI, cs.CL, stat.ML

TLDR

RACER dynamically routes between reasoning and non-reasoning LLM judges to optimize accuracy and cost, especially under distribution shift.

Key contributions

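  • Reasoning LLM judges substantially improve accuracy on tasks requiring structured verification (e.g., math and coding), but offer limited or even negative gains on simpler evaluations while costing significantly more compute.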
  • Introduces RACER, a robust adaptive routing system for LLM judges to balance accuracy and cost.
  • RACER formulates routing as a constrained distributionally robust optimization problem.
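  • RACER models distribution shift with a KL-divergence uncertainty set, admits an efficient primal-dual algorithm with theoretical guarantees (a unique optimal policy and linear convergence), and achieves superior accuracy-cost trade-offs under distribution shift in experiments.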
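
To make the budget-constrained routing idea concrete, here is a minimal sketch. It is not the paper's algorithm: it swaps in a simple Lagrangian threshold rule with bisection on the multiplier, and it ignores distribution shift entirely. The function name `route_with_budget` and the inputs `gain` (estimated accuracy gain from using the reasoning judge), `extra_cost` (its additional cost, assumed strictly positive), and `budget` are all hypothetical.

```python
import numpy as np

def route_with_budget(gain, extra_cost, budget, iters=60):
    """Pick which items go to the reasoning judge under a total extra-cost budget.

    Illustrative assumption: item i is routed to the reasoning judge when
    gain[i] - mu * extra_cost[i] > 0, where mu >= 0 is a Lagrange multiplier
    (a "price" on cost). Bisection finds the smallest mu that meets the budget.
    """
    gain = np.asarray(gain, dtype=float)
    extra_cost = np.asarray(extra_cost, dtype=float)  # assumed > 0

    def decide(mu):
        return gain - mu * extra_cost > 0

    # If routing every item with positive gain already fits, no price is needed.
    if extra_cost[decide(0.0)].sum() <= budget:
        return decide(0.0), 0.0

    # At mu = hi no item is selected, so hi is always feasible.
    lo, hi = 0.0, max((gain / extra_cost).max(), 0.0) + 1.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if extra_cost[decide(mid)].sum() <= budget:
            hi = mid  # feasible: try a lower price
        else:
            lo = mid  # over budget: raise the price
    return decide(hi), hi

# Toy data (made up for illustration).
mask, mu = route_with_budget(
    gain=[0.30, 0.02, 0.25, -0.05, 0.10],
    extra_cost=[2.0, 1.0, 1.0, 1.0, 2.0],
    budget=3.0,
)
print(mask, mu)
```

The threshold form reflects the general intuition that reasoning should be spent where the accuracy gain per unit of cost is highest. A robust variant would replace the point estimates in `gain` with worst-case values over an uncertainty set.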

Why it matters

LLM judges are costly, and reasoning isn't always beneficial. This paper shows when reasoning helps and proposes RACER to choose between reasoning and non-reasoning judges under a fixed budget. It offers a robust approach to cost-efficient, accurate LLM evaluation, especially under distribution shift.

Original Abstract

Reasoning-capable large language models (LLMs) have recently been adopted as automated judges, but their benefits and costs in LLM-as-a-Judge settings remain unclear. Through controlled comparisons between reasoning and non-reasoning judges, we show that explicit reasoning substantially improves judgment accuracy on tasks requiring structured verification (e.g., math and coding), while offering limited or even negative gains on simpler evaluations and incurring significantly higher computational cost. These findings motivate that reasoning should be used selectively rather than universally, with awareness of possible distribution shift. We propose a Robust Adaptive Cost-Efficient Routing (RACER), which dynamically selects between reasoning and non-reasoning judges under a fixed budget by formulating routing as a constrained distributionally robust optimization problem. RACER explicitly accounts for distribution shift via a KL-divergence uncertainty set, admits an efficient primal-dual algorithm, and enjoys theoretical guarantees including uniqueness of the optimal policy and linear convergence. Extensive experiments show that RACER achieves superior accuracy-cost trade-offs under distribution shift.
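
For readers unfamiliar with the terminology, a constrained distributionally robust problem with a KL-divergence uncertainty set can be written schematically as below. This is a generic sketch, not the paper's exact formulation: the symbols (routing policy \pi, accuracy a, cost c, budget B, nominal distribution P, radius \rho) are illustrative, and whether the cost constraint is taken under P or under the worst case is an assumption here. The second line is the standard dual of a KL-constrained worst-case expectation, which reduces the inner problem to a one-dimensional minimization; it applies to the inner minimum above by taking \ell = -a.

```latex
% Schematic only: \pi, a, c, B, P, \rho are illustrative symbols, not the paper's notation.
\max_{\pi}\; \min_{Q:\, \mathrm{KL}(Q \,\|\, P) \le \rho}\; \mathbb{E}_{Q}\big[a(X, \pi(X))\big]
\quad \text{s.t.} \quad \mathbb{E}_{P}\big[c(X, \pi(X))\big] \le B

% Standard dual of the KL worst case, for a loss \ell with finite moment generating function:
\sup_{Q:\, \mathrm{KL}(Q \,\|\, P) \le \rho} \mathbb{E}_{Q}\big[\ell(X)\big]
= \inf_{\lambda > 0} \Big\{ \lambda \rho + \lambda \log \mathbb{E}_{P}\big[e^{\ell(X)/\lambda}\big] \Big\}
```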
