Diagnosing LLM Judge Reliability: Conformal Prediction Sets and Transitivity Violations

April 16, 2026 | arXiv:2604.15302

cs.AI, cs.CL, cs.LG

TLDR

This paper introduces a diagnostic toolkit using transitivity analysis and conformal prediction sets to assess the per-instance reliability of LLM judges for NLG evaluation.

Key contributions

Transitivity analysis reveals widespread per-input inconsistency in LLM judges, despite low aggregate violation rates.
Conformal prediction sets provide theoretically-guaranteed coverage and serve as per-instance reliability indicators.
Prediction set width shows consistent cross-judge agreement, capturing document-level difficulty over judge-specific noise.
Reliability varies significantly by criterion: relevance is most reliable, while fluency and consistency are least reliable.
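
The transitivity check in the first contribution can be illustrated with a small sketch. It assumes the judge's output for one input has been reduced to pairwise preferences between candidate summaries, with no ties and every pair judged exactly once; the names `beats` and `three_cycle_stats` are hypothetical and not taken from the paper's released code. A triad (a, b, c) is a directed 3-cycle when a beats b, b beats c, and c beats a.

```python
from itertools import combinations


def beats(prefs, a, b):
    """Return True if the judge preferred a over b.

    prefs maps an ordered pair (x, y) to True when x was preferred over y.
    Assumption: each unordered pair appears exactly once, in either order.
    """
    if (a, b) in prefs:
        return prefs[(a, b)]
    return not prefs[(b, a)]


def three_cycle_stats(prefs):
    """Return (number of cyclic triads, total number of triads) for one input."""
    items = sorted({x for pair in prefs for x in pair})
    cyclic = total = 0
    for a, b, c in combinations(items, 3):
        total += 1
        # A triad is cyclic if the preferences loop in either direction.
        forward = beats(prefs, a, b) and beats(prefs, b, c) and beats(prefs, c, a)
        backward = beats(prefs, b, a) and beats(prefs, c, b) and beats(prefs, a, c)
        if forward or backward:
            cyclic += 1
    return cyclic, total


# Hypothetical judgments for one document: s1 > s2, s2 > s3, but s3 > s1.
prefs = {("s1", "s2"): True, ("s2", "s3"): True, ("s1", "s3"): False}
cyclic, total = three_cycle_stats(prefs)
print(cyclic, total)
```

Dividing `cyclic` by `total` gives a per-input violation rate, and checking `cyclic > 0` flags inputs with at least one directed 3-cycle. Whether this matches the paper's exact definition of its aggregate rate is an assumption.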

Why it matters

LLM judges are widely used, but their per-instance reliability is poorly understood. This paper offers a toolkit for diagnosing reliability at the level of individual inputs and shows that it depends more on the evaluation criterion than on the judge. That indicates where automatic NLG evaluation can be trusted and where it cannot.

Original Abstract

LLM-as-judge frameworks are increasingly used for automatic NLG evaluation, yet their per-instance reliability remains poorly understood. We present a two-pronged diagnostic toolkit applied to SummEval: $\textbf{(1)}$ a transitivity analysis that reveals widespread per-input inconsistency masked by low aggregate violation rates ($\bar{\rho} = 0.8$-$4.1\%$), with $33$-$67\%$ of documents exhibiting at least one directed 3-cycle; and $\textbf{(2)}$ split conformal prediction sets over 1-5 Likert scores providing theoretically-guaranteed $\geq (1-\alpha)$ coverage, with set width serving as a per-instance reliability indicator ($r_s = +0.576$, $N = 1{,}918$, $p < 10^{-100}$, pooled across all judges). Critically, prediction set width shows consistent cross-judge agreement ($\bar{r} = 0.32$-$0.38$), demonstrating it captures document-level difficulty rather than judge-specific noise. Across four judges and four criteria, both diagnostics converge: criterion matters more than judge, with relevance judged most reliably (avg. set size $\approx 3.0$) and coherence moderately so (avg. set size $\approx 3.9$), while fluency and consistency remain unreliable (avg. set size $\approx 4.9$). We release all code, prompts, and cached results.
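
As a rough illustration of the split conformal step described in the abstract, the sketch below builds prediction sets over the 1-5 Likert scale. It assumes an absolute-residual nonconformity score, |human score - judge score|, computed on a held-out calibration split. The abstract does not state which score function the paper uses, so that choice, the function names, and the toy data are all assumptions.

```python
import math


def conformal_quantile(scores, alpha):
    """Finite-sample split conformal quantile of the calibration scores.

    Uses the ceil((n + 1) * (1 - alpha))-th smallest score; if that rank
    exceeds n, the quantile is infinite and every label is included.
    """
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf
    return sorted(scores)[k - 1]


def prediction_set(judge_score, qhat, labels=(1, 2, 3, 4, 5)):
    """All Likert labels whose distance to the judge's score is within qhat."""
    return [y for y in labels if abs(y - judge_score) <= qhat]


# Hypothetical calibration split: judge scores paired with human scores.
cal_judge = [4, 3, 5, 2, 4, 3, 5, 1, 4, 2]
cal_human = [4, 4, 5, 2, 3, 3, 4, 1, 2, 2]
residuals = [abs(h - j) for h, j in zip(cal_human, cal_judge)]

qhat = conformal_quantile(residuals, alpha=0.2)  # -> 1
print(prediction_set(3, qhat))  # -> [2, 3, 4]
```

The length of the returned list is the set width that the abstract uses as a per-instance reliability indicator. With this simple score every test item shares the same threshold, so widths differ only near the ends of the scale; a score that adapts to each input would be needed for width to vary by document.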

