ArXiv TLDR

Visual Preference Optimization with Rubric Rewards

arXiv:2604.13029

Ya-Qi Yu, Fangyu Hong, Xiangyang Qu, Hao Wang, Gaojie Wu + 13 more

cs.CV · cs.AI

TLDR

rDPO introduces rubric-based preference optimization for visual reasoning, using instance-specific checklist rubrics to produce criterion-level feedback on on-policy responses.

Key contributions

  • Proposes rDPO, a preference optimization framework using instance-specific rubrics for visual reasoning.
  • Rubric-based prompting substantially improves a 30B-A3B judge on reward modeling benchmarks, bringing it close to GPT-5.4 (criterion-level scoring is sketched after this list).
  • Rubric-based filtering raises the downstream macro average from 81.14 to 82.69, whereas outcome-based filtering drops it to 75.82.
  • On a comprehensive scalability benchmark, rDPO reaches 61.01, clearly beating the style-constrained baseline (52.36) and surpassing the base model (59.48).
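The digest contains no code, so as a rough illustration of what criterion-level rubric scoring could look like, here is a minimal Python sketch. Everything in it is an assumption rather than a detail from the paper: the Criterion structure, the yes/no judging prompt, and the 0.8/0.2 weighting of essential versus additional criteria are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Criterion:
    text: str        # e.g. "mentions the object to the left of the red car"
    essential: bool  # essential criteria dominate the aggregate score

def rubric_score(response: str, rubric: list[Criterion], judge) -> float:
    """Score one response against an instance-specific checklist rubric by
    asking the judge model a yes/no question per criterion (all hypothetical)."""
    ess_hits = ess_total = add_hits = add_total = 0
    for c in rubric:
        prompt = (f"Criterion: {c.text}\nResponse: {response}\n"
                  "Does the response satisfy the criterion? Answer yes or no.")
        passed = judge(prompt).strip().lower().startswith("yes")
        if c.essential:
            ess_total += 1
            ess_hits += passed
        else:
            add_total += 1
            add_hits += passed
    # Assumed aggregation: essential criteria gate the score, additional refine it.
    ess = ess_hits / ess_total if ess_total else 1.0
    add = add_hits / add_total if add_total else 0.0
    return 0.8 * ess + 0.2 * add
```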

Why it matters

Existing preference optimization pipelines struggle with fine-grained visual reasoning. rDPO supplies instance-specific rubrics that give criterion-level feedback, yielding large gains over outcome-based signals and offering a stronger recipe for training complex multimodal models.

Original Abstract

The effectiveness of Direct Preference Optimization (DPO) depends on preference data that reflect the quality differences that matter in multimodal tasks. Existing pipelines often rely on off-policy perturbations or coarse outcome-based signals, which are not well suited to fine-grained visual reasoning. We propose rDPO, a preference optimization framework based on instance-specific rubrics. For each image-instruction pair, we create a checklist-style rubric of essential and additional criteria to score responses from any possible policies. The instruction-rubric pool is built offline and reused during the construction of on-policy data. On public reward modeling benchmarks, rubric-based prompting massively improves a 30B-A3B judge and brings it close to GPT-5.4. On public downstream benchmarks, rubric-based filtering raises the macro average to 82.69, whereas outcome-based filtering drops it to 75.82 from 81.14. When evaluating scalability on a comprehensive benchmark, rDPO achieves 61.01, markedly outperforming the style-constrained baseline (52.36) and surpassing the 59.48 base model. Together, these results show that visual preference optimization benefits from combining on-policy data construction with instance-specific criterion-level feedback.
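To make the on-policy data construction concrete, here is a hedged sketch of how rubric-based filtering might turn sampled responses into DPO pairs. It reuses the hypothetical rubric_score above; the sample count, the margin threshold, and the best-versus-worst pairing rule are illustrative assumptions, not the paper's procedure.

```python
# Hypothetical sketch: on-policy preference-pair construction with
# rubric-based filtering. `policy` generates a response for an
# (image, instruction) pair; `judge` is the rubric-prompted judge model.
def build_dpo_pair(image, instruction, rubric, policy, judge,
                   n_samples: int = 8, min_margin: float = 0.2):
    """Sample responses on-policy, score each against the instance rubric,
    and keep a (chosen, rejected) pair only when the gap is clear."""
    candidates = [policy(image, instruction) for _ in range(n_samples)]
    scored = sorted(((rubric_score(r, rubric, judge), r) for r in candidates),
                    reverse=True)
    best_score, chosen = scored[0]
    worst_score, rejected = scored[-1]
    if best_score - worst_score < min_margin:
        return None  # filtered out: no reliable quality difference to learn from
    return {"prompt": (image, instruction), "chosen": chosen, "rejected": rejected}
```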
