ArXiv TLDR

Personalized RewardBench: Evaluating Reward Models with Human Aligned Personalization

arXiv: 2604.07343

Qiyao Ma, Dechen Gao, Rui Cai, Boqi Zhao, Hanchu Zhou + 2 more

cs.CL · cs.LG

TLDR

Personalized RewardBench evaluates reward models' ability to capture individual user preferences, revealing that state-of-the-art models struggle with personalization while the benchmark itself strongly predicts downstream performance.

Key contributions

  • Introduces Personalized RewardBench (PRB) to evaluate reward models' capacity to model personalized preferences.
  • PRB uses human-evaluated, user-specific chosen/rejected response pairs, ensuring personal preference is the key differentiator (see the evaluation sketch after this list).
  • Reveals current state-of-the-art reward models achieve only 75.94% accuracy on personalized preferences.
  • Demonstrates PRB's strong correlation with downstream LLM performance in Best-of-N sampling and PPO.
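To make the evaluation protocol concrete, here is a minimal sketch of pairwise accuracy over user-specific chosen/rejected pairs, assuming a generic reward-scoring callable; the `PreferencePair` fields and `score` interface are illustrative placeholders, not the paper's actual data format or API.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class PreferencePair:
    user_rubric: str   # user-specific preference rubric (assumed field, for illustration)
    prompt: str
    chosen: str        # response constructed to follow the rubric
    rejected: str      # response constructed to violate it

def pairwise_accuracy(
    pairs: List[PreferencePair],
    score: Callable[[str, str, str], float],  # reward model score for (rubric, prompt, response)
) -> float:
    """Fraction of pairs where the reward model scores the chosen response above the rejected one."""
    correct = sum(
        score(p.user_rubric, p.prompt, p.chosen)
        > score(p.user_rubric, p.prompt, p.rejected)
        for p in pairs
    )
    return correct / len(pairs)
```

Because both responses in a pair are built to be comparably correct, relevant, and helpful, a reward model that only ranks general quality has little signal to exploit here, which is what makes the reported 75.94% peak accuracy meaningful.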

Why it matters

This paper addresses a critical gap in evaluating reward models for personalized LLM alignment. By showing current models struggle and providing a robust benchmark, it paves the way for developing more user-centric and truly aligned LLMs. This is crucial for advancing pluralistic alignment.

Original Abstract

Pluralistic alignment has emerged as a critical frontier in the development of Large Language Models (LLMs), with reward models (RMs) serving as a central mechanism for capturing diverse human values. While benchmarks for general response quality are prevalent, evaluating how well reward models account for individual user preferences remains an open challenge. To bridge this gap, we introduce Personalized RewardBench, a novel benchmark designed to rigorously assess reward models' capacity to model personalized preferences. We construct chosen and rejected response pairs based on strict adherence to (or violation of) user-specific rubrics, ensuring that preference distinctions are uniquely tailored to the individual. In particular, human evaluations confirm that the primary discriminative factor between pairs is strictly personal preference, with both responses maintaining high general quality (e.g., correctness, relevance and helpfulness). Extensive testing reveals that existing state-of-the-art reward models struggle significantly with personalization, peaking at an accuracy of just 75.94%. Crucially, because an effective reward model benchmark should predict a reward model's performance on downstream tasks, we conduct experiments demonstrating that our benchmark exhibits a significantly higher correlation with downstream performance in both Best-of-N (BoN) sampling and Proximal Policy Optimization (PPO) compared to existing baselines. These findings establish Personalized RewardBench as a robust and accurate proxy for evaluating reward models' performance in downstream applications.
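For a concrete sense of the Best-of-N (BoN) downstream setting the abstract refers to, here is a minimal sketch of reward-guided candidate selection; `generate` and `reward` are placeholder callables standing in for an LLM sampler and a reward model, not functions from the paper.

```python
from typing import Callable, List

def best_of_n(
    prompt: str,
    generate: Callable[[str], str],       # draws one candidate response from the LLM (placeholder)
    reward: Callable[[str, str], float],  # reward model score for (prompt, response) (placeholder)
    n: int = 8,
) -> str:
    """Sample n candidate responses and return the one the reward model scores highest."""
    candidates: List[str] = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda resp: reward(prompt, resp))
```

In this setting, a reward model that cannot distinguish personalized preferences will keep selecting generically good but impersonal responses no matter how large N gets, which is the intuition behind PRB's correlation with BoN performance.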
