Exploiting LLM-as-a-Judge Disposition on Free Text Legal QA via Prompt Optimization
Mohamed Hesham Elganayni, Runsheng Chen, Sebastian Nagl, Matthias Grabmair
TLDR
This paper shows that automatic prompt optimization for LLM-as-a-Judge evaluation of legal QA outperforms human-centered prompt design, and that lenient judges yield better, more transferable prompts than strict ones.
Key contributions
- Automatic prompt optimization consistently beats human-centered design for LLM-as-a-Judge in legal QA.
- Lenient LLM judges provide more effective and consistent feedback for prompt optimization than strict judges.
- Prompts optimized with lenient judge feedback transfer better to strict judges than the reverse.
- Judge disposition (lenient vs. strict) during optimization significantly impacts prompt generalizability.
Why it matters
This paper demonstrates that algorithmically optimizing prompts for LLM-as-a-Judge evaluations can surpass human-centered design. It highlights the critical role of judge disposition, showing lenient judges lead to more generalizable prompts. These findings offer a path to more robust and efficient LLM evaluation frameworks.
Original Abstract
This work explores the role of prompt design and judge selection in LLM-as-a-Judge evaluations of free text legal question answering. We examine whether automatic task prompt optimization improves over human-centered design, whether optimization effectiveness varies by judge feedback style, and whether optimized prompts transfer across judges. We systematically address these questions on the LEXam benchmark by optimizing task prompts using the ProTeGi method with feedback from two judges (Qwen3-32B, DeepSeek-V3) across four task models, and then testing cross-judge transfer. Automatic optimization consistently outperforms the baseline, with lenient judge feedback yielding higher and more consistent gains than strict judge feedback. Prompts optimized with lenient feedback transfer better to strict judges than the reverse direction. Analysis reveals that lenient judges provide permissive feedback, yielding prompts with broader applicability, whereas strict judges produce restrictive feedback, leading to judge-specific overfitting. Our findings demonstrate algorithmically optimizing prompts on training data can outperform human-centered prompt design and that judges' dispositions during optimization shape prompt generalizability. Code and optimized prompts are available at https://github.com/TUMLegalTech/icail2026-llm-judge-gaming.
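To make the optimization setup concrete, here is a minimal toy sketch of a ProTeGi-style greedy loop driven by judge feedback. Everything in it is a hypothetical stand-in: `judge_score`, `propose_edits`, and the cue-matching scoring are illustrative mocks, not the paper's actual task models (the paper uses Qwen3-32B and DeepSeek-V3 as judges on LEXam, with LLM-generated textual critiques rather than keyword cues). The lenient judge gives partial credit and the strict judge is all-or-nothing, which loosely mirrors the dispositions studied in the paper.

```python
# Toy sketch of judge-feedback-driven prompt optimization.
# All functions are illustrative stand-ins, NOT the paper's implementation.

def judge_score(prompt: str, lenient: bool) -> float:
    """Mock judge: rewards prompts containing key instruction cues.
    A lenient judge gives partial credit; a strict judge does not."""
    cues = ["cite", "reason step by step", "answer"]
    hits = sum(cue in prompt for cue in cues)
    if lenient:
        return hits / len(cues)               # partial credit
    return 1.0 if hits == len(cues) else 0.0  # all-or-nothing

def propose_edits(prompt: str) -> list[str]:
    """Mock 'textual gradient': candidate prompt expansions.
    In ProTeGi these come from an LLM critiquing failed examples."""
    additions = ["cite the relevant statute", "reason step by step"]
    return [prompt + " " + a for a in additions if a not in prompt]

def optimize(prompt: str, lenient: bool, steps: int = 4) -> str:
    """Greedy search over prompt edits, keeping the best-scoring candidate."""
    best, best_score = prompt, judge_score(prompt, lenient)
    for _ in range(steps):
        for cand in propose_edits(best):
            score = judge_score(cand, lenient)
            if score > best_score:
                best, best_score = cand, score
    return best
```

In this toy setting the effect the paper reports falls out directly: the lenient judge's partial credit provides a usable gradient, so optimization steadily improves the prompt, while the strict judge's all-or-nothing score gives no signal until every cue is present, so greedy search stalls at the starting prompt.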