ArXiv TLDR

Rubric-Grounded RL: Structured Judge Rewards for Generalizable Reasoning

2605.08061

Manish Bhattarai, Ismael Boureima, Nishath Rajiv Ranasinghe, Scott Pakin, Dan O'Malley

cs.AI

TLDR

Introduces Rubric-Grounded RL, a framework using LLM judges and structured, multi-criterion rewards to improve generalizable reasoning in LLMs.

Key contributions

  • Formalizes Rubric-Grounded RL, using LLM judges for structured, multi-criterion reward decomposition.
  • Optimizes policies with partial-credit signals from judge-generated rubrics, improving training efficiency (see the reward sketch after this list).
  • Achieves 71.7% normalized reward on held-out rubric evaluation using Llama-3.1-8B-Instruct.
  • Demonstrates improved performance on four external reasoning benchmarks (GSM8K, MATH, GPQA Main, and GPQA Diamond).
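
To make the partial-credit idea concrete, here is a minimal sketch (not the paper's implementation) of how weighted, per-criterion judge scores could be collapsed into a single rubric reward. The criterion names, weights, and 0-1 score scale are illustrative assumptions.

```python
# Minimal sketch (not the paper's implementation): collapse per-criterion judge
# scores into one partial-credit reward. Criterion names, weights, and the 0-1
# score scale are illustrative assumptions.

def rubric_reward(judge_scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of per-criterion scores; returns a value in [0, 1]."""
    total_weight = sum(weights.values())
    if total_weight == 0:
        return 0.0
    weighted_sum = sum(w * judge_scores.get(criterion, 0.0)
                       for criterion, w in weights.items())
    return weighted_sum / total_weight

# Hypothetical rubric for one scientific-QA response.
weights = {"factual_accuracy": 0.4, "reasoning_steps": 0.3,
           "completeness": 0.2, "clarity": 0.1}
judge_scores = {"factual_accuracy": 1.0, "reasoning_steps": 0.5,
                "completeness": 1.0, "clarity": 1.0}
print(rubric_reward(judge_scores, weights))  # 0.85: partial credit, not 0 or 1
```

Because each criterion contributes partial credit, two responses that would both fail a binary correctness check can still receive different rewards.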

Why it matters

This paper introduces an RL framework that optimizes against structured, multi-criterion rewards from an LLM judge, moving beyond binary outcomes or single holistic scores. Grading each response along several task-specific criteria gives the policy a denser, partial-credit signal, enabling more granular optimization and better generalization. The reported results improve both held-out rubric evaluation and external reasoning benchmarks not drawn from the training corpus, suggesting a promising path toward more robust and transferable reasoning in LLMs.

Original Abstract

We argue that decomposing reward into weighted, verifiable criteria and using an LLM judge to score them provides a partial-credit optimization signal: instead of a binary outcome or a single holistic score, each response is graded along multiple task-specific criteria. We formalize \emph{rubric-grounded reinforcement learning (RL)}: a framework in which the policy is optimized against a structured, multi-criterion reward produced by a frozen LLM judge that conditions on auxiliary grounding the policy never sees. We instantiate the framework by deriving rubrics from an Office of Scientific and Technical Information (OSTI)-derived corpus of roughly 100,000 scientific and technical documents and training Llama-3.1-8B-Instruct with Group Relative Policy Optimization (GRPO). With GRPO-based training, the model achieves $71.7\%$ normalized reward on held-out rubric evaluation. The GRPO-tuned policy also improves over the base model on four reasoning benchmarks not derived from the training corpus -- GSM8K, MATH, GPQA Main, and GPQA Diamond. These results provide evidence that structured, document-grounded rewards can improve held-out rubric performance and induce transferable reasoning behaviors beyond the corpus used to construct the training environment.
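
The abstract trains the policy with Group Relative Policy Optimization (GRPO). As a rough sketch of how rubric rewards would feed into GRPO, the snippet below computes the standard group-relative advantage by normalizing each sampled response's reward against its group's mean and standard deviation; the group size and reward values are illustrative assumptions, not numbers from the paper.

```python
# Sketch of GRPO's group-relative advantage applied to rubric rewards,
# assuming the standard formulation: sample several responses per prompt and
# normalize each reward by the group's mean and standard deviation.
# Group size and reward values below are illustrative.

import statistics

def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Advantage_i = (reward_i - group mean) / (group std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Four sampled responses to one prompt, each scored by the rubric judge.
rubric_rewards = [0.85, 0.40, 0.70, 0.55]
print(group_relative_advantages(rubric_rewards))
# Responses above the group mean receive positive advantages; the graded rubric
# reward gives a richer ranking signal than a binary pass/fail outcome would.
```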
