RubricEM: Meta-RL with Rubric-guided Policy Decomposition beyond Verifiable Rewards
Gaotang Li, Bhavana Dalvi Mishra, Zifeng Wang, Jun Yan, Yanfei Chen + 7 more
TLDR
RubricEM is a meta-RL framework that uses rubrics to guide policy decomposition and reflection for training research agents without verifiable rewards.
Key contributions
- Uses rubrics as a shared interface to structure policy execution, judge feedback, and agent memory.
- Decomposes research trajectories into stage-aware planning, evidence gathering, review, and synthesis.
- Employs Stage-Structured GRPO for dense, stagewise rubric-based credit assignment.
- Trains a reflection meta-policy to distill judged trajectories into reusable guidance for future attempts.
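The stagewise credit assignment above can be sketched in miniature. In vanilla GRPO, each rollout's scalar reward is normalized against the group mean and standard deviation; Stage-Structured GRPO instead applies this group-relative normalization per stage, using rubric-judge scores for each stage of each rollout. The sketch below is an illustration under assumptions, not the paper's implementation: the stage names, the `[0, 1]` score scale, and the `stage_structured_advantages` helper are all hypothetical.

```python
from statistics import mean, pstdev

def stage_structured_advantages(group_scores):
    """Group-relative advantages, computed separately per stage.

    group_scores: a group of rollouts for the same prompt; each rollout is a
    dict mapping a stage name (e.g. "plan", "search", "review", "synthesize";
    hypothetical labels) to a rubric-judge score in [0, 1].

    Returns one advantage per (rollout, stage): the rollout's stage score,
    normalized by the mean and std of that stage's scores across the group,
    as in GRPO but stagewise.
    """
    stages = group_scores[0].keys()
    advantages = [dict() for _ in group_scores]
    for stage in stages:
        scores = [rollout[stage] for rollout in group_scores]
        mu, sigma = mean(scores), pstdev(scores)
        for i, s in enumerate(scores):
            # Small epsilon guards against a zero-variance stage.
            advantages[i][stage] = (s - mu) / (sigma + 1e-6)
    return advantages
```

A rollout that plans well but searches poorly thus receives a positive advantage on the planning stage and a negative one on the search stage, giving denser feedback than a single trajectory-level scalar.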
Why it matters
Training deep research agents is hard: their outputs lack verifiable rewards, and their trajectories span many tool-augmented decisions. RubricEM tackles this by using rubrics to structure policy execution, judge feedback, and memory, enabling effective long-horizon optimization.
Original Abstract
Training deep research agents, namely systems that plan, search, evaluate evidence, and synthesize long-form reports, pushes reinforcement learning beyond the regime of verifiable rewards. Their outputs lack ground-truth answers, their trajectories span many tool-augmented decisions, and standard post-training offers little mechanism for turning past attempts into reusable experience. In this work, we argue that rubrics should serve not merely as final-answer evaluators, but as the shared interface that structures policy execution, judge feedback, and agent memory. Based on this view, we introduce RubricEM, a rubric-guided reinforcement learning framework that combines stagewise policy decomposition with reflection-based meta-policy evolution. RubricEM first makes research trajectories stage-aware by conditioning planning, evidence gathering, review, and synthesis on self-generated rubrics. It then assigns credit with Stage-Structured GRPO, which uses stagewise rubric judgments to provide denser semantic feedback for long-horizon optimization. In parallel, RubricEM trains a shared-backbone reflection meta-policy that distills judged trajectories into reusable rubric-grounded guidance for future attempts. The resulting RubricEM-8B achieves strong performance across four long-form research benchmarks, outperforming comparable open models and approaching proprietary deep-research systems. Beyond final performance, we perform thorough analyses to understand the key ingredients of RubricEM.