ArXiv TLDR

Beyond Accuracy: Policy Invariance as a Reliability Test for LLM Safety Judges

arXiv:2605.06161

Shihao Weng, Yang Feng, Xiaofei Xie

cs.AI cs.SE

TLDR

LLM safety judges are unreliable; their verdicts depend on policy wording, not just agent behavior, leading to flawed safety evaluations.

Key contributions

  • Introduces "policy invariance" as a crucial reliability test for LLM safety judges.
  • Operationalizes invariance into three testable principles: rubric-semantics invariance, rubric-threshold invariance, and ambiguity-aware calibration.
  • Reveals that current LLM judges respond with comparable strength to meaningful policy shifts and to meaningless rewrites, with content-preserving rewrites flipping up to 9.1% of verdicts above baseline jitter.
  • Proposes Policy Invariance Score and Judge Card protocol to audit judge reliability beyond accuracy.
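To make the audit idea concrete, here is a minimal sketch of a policy-invariance check. The paper's actual Policy Invariance Score is not defined in this summary, so the `invariance_score` formula below is a hypothetical stand-in: it rewards judges whose verdicts stay stable under meaning-preserving rewrites but still move under genuine normative shifts.

```python
# Hypothetical policy-invariance audit sketch; the paper's real
# Policy Invariance Score may be defined differently.

def flip_rate(base_verdicts, new_verdicts):
    """Fraction of cases whose verdict changes between two policy variants."""
    assert len(base_verdicts) == len(new_verdicts)
    flips = sum(b != n for b, n in zip(base_verdicts, new_verdicts))
    return flips / len(base_verdicts)

def invariance_score(base, equivalent_rewrite, normative_shift):
    """Illustrative score in [0, 1]: higher means the judge ignores
    meaning-preserving rewrites (low rewrite flip rate) while still
    responding to real normative shifts (high shift flip rate)."""
    rewrite_flips = flip_rate(base, equivalent_rewrite)
    shift_flips = flip_rate(base, normative_shift)
    # Share of total movement attributable to the meaningful shift.
    sensitivity = shift_flips / (shift_flips + rewrite_flips + 1e-9)
    return (1 - rewrite_flips) * sensitivity

# Toy verdicts for the same six trajectories under three policy variants:
base = ["safe", "unsafe", "safe", "safe", "unsafe", "safe"]
rewrite = ["safe", "unsafe", "safe", "unsafe", "unsafe", "safe"]    # 1 spurious flip
shift = ["unsafe", "unsafe", "unsafe", "safe", "unsafe", "unsafe"]  # 3 real flips

print(round(flip_rate(base, rewrite), 3))            # → 0.167
print(round(flip_rate(base, shift), 3))              # → 0.5
print(round(invariance_score(base, rewrite, shift), 3))  # → 0.625
```

A judge that flips as often under rewrites as under shifts (the failure mode the paper reports) would score near zero here, regardless of its raw accuracy.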

Why it matters

This paper highlights a critical flaw in current LLM safety evaluation, showing that judge verdicts are highly sensitive to policy wording. It provides a novel framework and tools to measure and improve the reliability of these judges, moving beyond simple accuracy metrics. This is essential for building truly trustworthy and robust AI systems.

Original Abstract

LLM-as-a-Judge pipelines have become the de facto evaluator for agent safety, yet existing benchmarks treat their verdicts as ground-truth proxies without checking whether the verdicts depend on the agent's behavior or merely on how the evaluation policy happens to be worded. We argue that any trustworthy safety judge must satisfy a basic property we call policy invariance, and we operationalize it as three testable principles: rubric-semantics invariance under certified-equivalent rewrites, rubric-threshold invariance under intentional strict-to-lenient shifts, and ambiguity-aware calibration so that verdict instability concentrates on genuinely ambiguous cases. Instantiating these principles as a stress-test protocol with four agent-class judges on trajectories drawn from ASSEBench and R-Judge, we surface a previously unmeasured failure mode: today's judges respond to meaningful normative shifts and to meaningless structural rewrites with comparable strength, and cannot tell the two apart. Content-preserving policy rewrites flip up to 9.1% of verdicts above baseline jitter, and 18-43% of all observed flips occur on unambiguous cases under such rewrites, so existing safety scores conflate what the agent did with how the evaluator was prompted. Beyond the diagnosis, we contribute the Policy Invariance Score and the Judge Card reporting protocol, which expose an order-of-magnitude spread in judge reliability that is invisible to accuracy-only leaderboards. We release the protocol and code so that future agent-safety benchmarks can audit their own evaluators rather than trust them by default.
