ArXiv TLDR

RationalRewards: Reasoning Rewards Scale Visual Generation Both Training and Test Time

arXiv:2604.11626

Haozhe Wang, Cong Wei, Weiming Ren, Jiaming Liu, Fangzhen Lin + 1 more

cs.AI, cs.LG

TLDR

RationalRewards uses explicit, multi-dimensional critiques to improve visual generation at both training and test time, outperforming scalar rewards.

Key contributions

  • Introduces RationalRewards, a reward model that generates explicit, multi-dimensional critiques for visual generation.
  • Proposes PARROT, a framework to train rationale-producing reward models from readily available preference data.
  • Improves visual generators via fine-grained RL rewards and a novel test-time critique-and-refine loop (sketched after this list).
  • Achieves state-of-the-art preference prediction among open-source models with 10-20x less training data.
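
The test-time loop is the most portable idea here, so a minimal sketch helps fix its shape. The structure below is inferred from the abstract's description of the Generate-Critique-Refine loop; the `Critique` dataclass and the `generate`, `criticize`, and `revise` callables are hypothetical stand-ins for the paper's actual components, not its API.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional, Tuple

@dataclass
class Critique:
    score: float      # overall preference score from the reward model
    rationale: str    # explicit multi-dimensional critique text

def generate_critique_refine(
    prompt: str,
    generate: Callable[[str], Any],             # text-to-image generator
    criticize: Callable[[Any, str], Critique],  # rationale-producing reward model
    revise: Callable[[str, str], str],          # turns a critique into a prompt revision
    max_rounds: int = 3,
    score_threshold: float = 0.9,
) -> Tuple[Optional[Any], float]:
    """Refine outputs by revising the prompt; no generator weights change."""
    best_image, best_score = None, float("-inf")
    for _ in range(max_rounds):
        image = generate(prompt)
        critique = criticize(image, prompt)
        if critique.score > best_score:
            best_image, best_score = image, critique.score
        if critique.score >= score_threshold:
            break  # already judged good enough; stop early
        # Targeted revision: the critique says *what* is wrong, so the
        # reviser can address specific failures rather than resampling blindly.
        prompt = revise(prompt, critique.rationale)
    return best_image, best_score
```

Keeping the best-scoring image across rounds guards against a revision that makes things worse, which a single-pass prompt rewrite cannot do.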

Why it matters

This paper transforms reward models from passive evaluators into active optimization tools by making them reason before scoring. That improves visual generation quality at both training time, where structured rationales serve as fine-grained RL rewards, and test time, where the critique-and-refine loop matches or exceeds RL-based fine-tuning on several benchmarks without any parameter updates.
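
On the training-time side, the abstract only says that structured rationales yield interpretable, fine-grained rewards; it does not specify how they are aggregated. One plausible reading, shown below purely as an assumption, collapses per-dimension critique scores into the scalar reward an RL update consumes; the dimension names and weights are invented for illustration.

```python
# Hypothetical critique dimensions and weights -- illustrative assumptions,
# not the paper's actual reward design.
DIMENSION_WEIGHTS = {
    "prompt_alignment": 0.4,
    "visual_quality": 0.3,
    "composition": 0.2,
    "artifact_freedom": 0.1,  # scored so that higher is always better
}

def fine_grained_reward(dim_scores: dict[str, float]) -> float:
    """Collapse per-dimension critique scores into one RL reward.

    Unlike an unexplained scalar, each component stays inspectable, so
    reward hacking along a single axis shows up in training logs.
    """
    return sum(w * dim_scores[d] for d, w in DIMENSION_WEIGHTS.items())
```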

Original Abstract

Most reward models for visual generation reduce rich human judgments to a single unexplained score, discarding the reasoning that underlies preference. We show that teaching reward models to produce explicit, multi-dimensional critiques before scoring transforms them from passive evaluators into active optimization tools, improving generators in two complementary ways: at training time, structured rationales provide interpretable, fine-grained rewards for reinforcement learning; at test time, a Generate-Critique-Refine loop turns critiques into targeted prompt revisions that improve outputs without any parameter updates. To train such a reward model without costly rationale annotations, we introduce Preference-Anchored Rationalization (PARROT), a principled framework that recovers high-quality rationales from readily available preference data through anchored generation, consistency filtering, and distillation. The resulting model, RationalRewards (8B), achieves state-of-the-art preference prediction among open-source reward models, competitive with Gemini-2.5-Pro, while using 10-20x less training data than comparable baselines. As an RL reward, it consistently improves text-to-image and image-editing generators beyond scalar alternatives. Most strikingly, its test-time critique-and-refine loop matches or exceeds RL-based fine-tuning on several benchmarks, suggesting that structured reasoning can unlock latent capabilities in existing generators that suboptimal prompts fail to elicit.
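
The abstract names PARROT's three stages (anchored generation, consistency filtering, distillation) without detail, so the sketch below is only one reading of the data-construction step. The `PreferencePair` and `Rationale` types and the `rationalize` callable are hypothetical; in particular, checking the teacher's own verdict against the human label during filtering is an inference, not a stated fact.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterable

@dataclass
class PreferencePair:
    prompt: str
    chosen: Any    # human-preferred image
    rejected: Any  # dispreferred image

@dataclass
class Rationale:
    text: str      # multi-dimensional critique of both images
    verdict: str   # which image the rationale itself ends up preferring

def build_parrot_corpus(
    pairs: Iterable[PreferencePair],
    rationalize: Callable[[PreferencePair], Rationale],  # teacher model, shown the preference label
) -> list[tuple[PreferencePair, str]]:
    """Recover training rationales from preference data.

    Anchored generation: the teacher sees which image humans preferred and
    explains why. Consistency filtering: keep only rationales whose own
    verdict agrees with that label. The surviving (pair, rationale) examples
    are then distilled into the smaller reward model.
    """
    corpus = []
    for pair in pairs:
        rationale = rationalize(pair)
        if rationale.verdict == "chosen":  # consistent with the human label
            corpus.append((pair, rationale.text))
    return corpus
```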
