ArXiv TLDR

RTLC -- Research, Teach-to-Learn, Critique: A three-stage prompting paradigm inspired by the Feynman Learning Technique that lifts LLM-as-judge accuracy on JudgeBench with no fine-tuning

arXiv:2605.13695

Andrea Morandi

cs.CL, cs.AI

TLDR

RTLC, a three-stage prompting paradigm inspired by the Feynman Learning Technique, lifts Claude 3.7 Sonnet's LLM-as-judge accuracy on JudgeBench from 64.6% to 78.6% with no fine-tuning.

Key contributions

  • RTLC is a three-stage prompting method (Research, Teach-to-Learn, Critique) for LLM-as-judge.
  • Inspired by the Feynman Learning Technique, it uses a pedagogical scaffold and self-critique.
  • Boosts Claude 3.7 Sonnet's JudgeBench accuracy by 14.0 pp (from 64.6% to 78.6%) without fine-tuning.
  • Outperforms N=10 self-consistency majority voting (77.7%) and a zero-shot first-candidate baseline (74.0%); ablations attribute most of the gain (+9.4 pp) to the Teach-to-Learn scaffold (see the sketch below).
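
For concreteness, here is a minimal sketch of the pipeline as the abstract describes it: a fixed pedagogical scaffold, N=10 candidate verdicts at temperature 0.4, and one critique pass at temperature 0. The `call_llm` helper and the scaffold wording are illustrative assumptions, not the authors' released prompts.

```python
# Minimal sketch of the RTLC pipeline under stated assumptions: call_llm is a
# hypothetical black-box chat call, and the scaffold text is illustrative.

def call_llm(prompt: str, temperature: float) -> str:
    """Placeholder for one black-box LLM call; swap in your own client."""
    raise NotImplementedError

# Feynman Learning Technique ported to prompting: study -> teach -> find gaps
# -> simplify (wording below is an assumption, not the paper's exact prompt).
FEYNMAN_SCAFFOLD = (
    "Study the question and both candidate answers.\n"
    "Teach: explain the problem and each answer as if to a novice.\n"
    "Find gaps: flag any step you could not explain simply.\n"
    "Simplify: restate your analysis, then end with 'VERDICT: A' or 'VERDICT: B'."
)

def rtlc_judge(question: str, answer_a: str, answer_b: str, n: int = 10) -> str:
    # Stage 1: wrap the pairwise item in the fixed pedagogical scaffold.
    prompt = (
        f"{FEYNMAN_SCAFFOLD}\n\nQuestion:\n{question}"
        f"\n\nAnswer A:\n{answer_a}\n\nAnswer B:\n{answer_b}"
    )
    # Stage 2: draw N independent candidate verdicts at temperature 0.4.
    candidates = [call_llm(prompt, temperature=0.4) for _ in range(n)]
    # Stage 3: the same model acts as its own critic, cross-comparing the
    # candidate set against the question to emit one verdict at temperature 0.
    critique = (
        f"{prompt}\n\nHere are {n} draft judgments:\n"
        + "\n---\n".join(candidates)
        + "\n\nCross-compare them against the question, identify flaws,"
          " and give one final 'VERDICT: A' or 'VERDICT: B'."
    )
    return call_llm(critique, temperature=0.0)
```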

Why it matters

On objective-correctness items, even strong instruction-tuned judges barely beat random guessing on JudgeBench. RTLC closes much of that gap with prompting alone: no fine-tuning, retrieval, or external tools. That makes it a cheap, drop-in way to get more reliable automatic evaluation of open-ended generation.

Original Abstract

LLM-as-a-judge is now the default measurement instrument for open-ended generation, but on the public JudgeBench benchmark even strong instruction-tuned judges barely scrape past random on objective-correctness pairwise items. We introduce RTLC, a three-stage prompting recipe -- Research, Teach-to-Learn, Critique -- that promotes a single black-box LLM into an ensemble-of-thought judge with no fine-tuning, retrieval, or external tools. Stage 1 wraps the input in a fixed pedagogical scaffold porting the Feynman Learning Technique (study $\to$ teach $\to$ find gaps $\to$ simplify) into LLM prompting. Stage 2 draws N=10 independent candidate verdicts at temperature 0.4. Stage 3 acts as its own critic, cross-comparing the candidate set against the original question to emit one critiqued verdict at temperature 0. On JudgeBench-GPT (350 hard pairwise items), Claude 3.7 Sonnet's pairwise accuracy climbs from 64.6% (single-shot vanilla prompt) to 78.6% (RTLC critique-of-10) -- an absolute 14.0-percentage-point gain. RTLC also beats N=10 self-consistency majority voting (77.7%) and a zero-shot first candidate (74.0%). A clean three-step ablation attributes +9.4 pp to the Teach-to-Learn scaffold, +3.7 pp to N=10 marginalisation, and +0.9 pp to explicit critique. We discuss the cost-accuracy frontier (RTLC sits above self-consistency at every working point), the error-budget breakdown across the four JudgeBench categories (knowledge, reasoning, math, coding), and how RTLC composes orthogonally with post-hoc judge-score calibration, with the two interventions compounding multiplicatively in practice.
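
For comparison, the N=10 self-consistency baseline the abstract reports (77.7%) replaces the critique pass with a simple majority vote over the same candidates. A sketch, assuming verdicts can be parsed with an illustrative 'VERDICT: A/B' format; the tie-break rule is also an assumption, since the abstract does not specify one.

```python
import re
from collections import Counter

def parse_verdict(text: str) -> str | None:
    """Extract 'A' or 'B' from a draft judgment (format is an assumption)."""
    m = re.search(r"VERDICT:\s*([AB])", text)
    return m.group(1) if m else None

def self_consistency(candidates: list[str]) -> str:
    """Majority vote over the candidate verdicts; no critique stage."""
    parsed = [parse_verdict(c) for c in candidates]
    votes = Counter(v for v in parsed if v is not None)
    if not votes:
        return "A"  # degenerate fallback when nothing parses (assumption)
    # On a tie, most_common keeps insertion order, so the earlier-seen verdict
    # wins -- an arbitrary tie-break, since even N permits ties.
    return votes.most_common(1)[0][0]
```

Per the ablation, the explicit critique pass accounts for the remaining +0.9 pp gap between this vote (77.7%) and the full RTLC pipeline (78.6%).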
