MCJudgeBench: A Benchmark for Constraint-Level Judge Evaluation in Multi-Constraint Instruction Following
Jaeyun Lee, Junyoung Koh, Zeynel Tok, Hunar Batra, Ronald Clark
TLDR
MCJudgeBench is a new benchmark for evaluating LLM judges at the constraint level in multi-constraint instruction following, revealing that high overall correctness does not imply reliable per-constraint judgments or stability under perturbations.
Key contributions
- Evaluates LLM judges at the individual constraint level, not just overall response judgments.
- Provides per-constraint gold labels in {yes, partial, no} and controlled response-side perturbations (see the instance sketch after this list).
- Includes prompt variants and metrics for correctness and intrinsic/procedural inconsistency.
- Reveals that high overall correctness doesn't guarantee reliability across all label categories or low inconsistency.
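To make the benchmark's structure concrete, here is a minimal sketch of what a constraint-level instance and a per-constraint correctness metric could look like. The field names, the `MCJudgeInstance` class, and the `judge` callable are illustrative assumptions, not the paper's actual data format or API; only the label set {yes, partial, no} and the instruction/response/constraint-list/perturbation components come from the paper's description.

```python
from dataclasses import dataclass

# Per-constraint gold labels used by the benchmark.
LABELS = ("yes", "partial", "no")

@dataclass
class ConstraintJudgment:
    constraint: str   # one explicit requirement extracted from the instruction
    gold_label: str   # "yes", "partial", or "no"

@dataclass
class MCJudgeInstance:
    instruction: str                       # the multi-constraint instruction
    response: str                          # candidate response to be judged
    constraints: list[ConstraintJudgment]  # explicit constraint list with gold labels
    perturbation: str | None = None        # applied response-side perturbation, if any (hypothetical field)

def per_constraint_accuracy(instances, judge):
    """Fraction of constraints where the judge's label matches the gold label.

    `judge(instruction, response, constraint)` is a hypothetical callable
    returning one of "yes", "partial", "no".
    """
    correct = total = 0
    for inst in instances:
        for c in inst.constraints:
            pred = judge(inst.instruction, inst.response, c.constraint)
            correct += int(pred == c.gold_label)
            total += 1
    return correct / total if total else 0.0
```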
Why it matters
This paper introduces a benchmark for evaluating LLM judges at the constraint level, moving beyond overall-response assessments. It shows that current LLM judges have multi-dimensional reliability issues: they perform unevenly across label categories and can be unstable under prompt and response perturbations. This work supports the development of more robust and trustworthy LLM evaluation systems.
Original Abstract
Multi-constraint instruction following requires verifying whether a response satisfies multiple individual requirements, yet LLM judges are often assessed only through overall-response judgments. We introduce MCJudgeBench, a benchmark for constraint-level judge evaluation in multi-constraint instruction following. Each instance includes an instruction, a candidate response, an explicit constraint list, per-constraint gold labels in {yes, partial, no}, and controlled response-side perturbations. The evaluation protocol further includes evaluation prompt variants to test judge stability. We evaluate proprietary and open-source LLM judges using both correctness and inconsistency metrics, distinguishing intrinsic inconsistency under stochastic decoding from procedural inconsistency under prompt and response perturbations. Our results show that judge reliability has multiple dimensions: strong overall performance does not guarantee equally reliable detection across label categories, especially for rarer partial and no cases. Judges with higher correctness do not always have lower inconsistency. Evaluation with reasoning improves correctness but does not uniformly improve stability. These findings motivate evaluating LLM judges at the constraint level to study these failure modes.
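The abstract distinguishes intrinsic inconsistency (under stochastic decoding) from procedural inconsistency (under prompt and response perturbations) but does not reproduce the formulas here. Below is a minimal sketch, assuming inconsistency is measured as pairwise disagreement among repeated constraint-level judgments of the same item; `judge` is a hypothetical callable returning a label in {yes, partial, no}, and the optional `prompt` argument stands in for an evaluation prompt variant.

```python
from itertools import combinations

def pairwise_disagreement(labels):
    """Fraction of judgment pairs that disagree for the same constraint."""
    pairs = list(combinations(labels, 2))
    return sum(a != b for a, b in pairs) / len(pairs) if pairs else 0.0

def intrinsic_inconsistency(judge, instruction, response, constraint, n_samples=5):
    # Repeated stochastic-decoding runs on an identical input.
    labels = [judge(instruction, response, constraint) for _ in range(n_samples)]
    return pairwise_disagreement(labels)

def procedural_inconsistency(judge, instruction, response, constraint, prompt_variants):
    # Same input judged under different evaluation-prompt variants;
    # response-side perturbations could be folded in analogously when
    # the perturbation leaves the gold label unchanged.
    labels = [judge(instruction, response, constraint, prompt=p) for p in prompt_variants]
    return pairwise_disagreement(labels)
```

Under this reading, a judge can be accurate on average yet still flip labels across samples or prompt variants, which is why the paper reports correctness and inconsistency as separate dimensions.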