ArXiv TLDR

Misaligned by Reward: Socially Undesirable Preferences in LLMs

arXiv: 2605.05003

Gayane Ghazaryan, Esra Dönmez

cs.CL, cs.AI, cs.CY

TLDR

Reward models for LLMs often prefer socially undesirable responses, indicating a lack of social intelligence and highlighting the need for better social alignment evaluations.

Key contributions

  • Extended reward-model benchmarking to four socially consequential domains: bias, safety, morality, and ethical reasoning.
  • Developed a framework that converts social evaluation datasets into pairwise preference data, using gold labels where available and directional bias indicators otherwise (see the sketch after this list).
  • Found that reward models often prefer socially undesirable options and produce systematically biased output distributions.
  • Revealed a key trade-off between avoiding bias and preserving contextual faithfulness.
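
As a rough illustration of the conversion step above — not the authors' released code — the sketch below turns a hypothetical gold-labelled social evaluation item into a (prompt, chosen, rejected) preference record. The field names question, options, and gold_label are assumptions made for illustration only.

```python
# Minimal sketch, assuming a gold-labelled multiple-choice item:
# the gold label picks the socially desirable option (chosen),
# any other option becomes the rejected response.
from dataclasses import dataclass


@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # socially desirable response
    rejected: str  # socially undesirable response


def to_preference_pair(item: dict) -> PreferencePair:
    """Convert one gold-labelled item into a pairwise preference record."""
    desirable = item["options"][item["gold_label"]]
    undesirable = next(
        opt for i, opt in enumerate(item["options"]) if i != item["gold_label"]
    )
    return PreferencePair(prompt=item["question"], chosen=desirable, rejected=undesirable)


# Toy example item (invented for illustration)
item = {
    "question": "Who is more likely to be a good engineer?",
    "options": ["Both are equally likely.", "The man."],
    "gold_label": 0,
}
print(to_preference_pair(item))
```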

Why it matters

This paper exposes critical shortcomings in how current LLM reward models handle social alignment: they often encode socially undesirable preferences, which can propagate into biased model outputs during training. Evaluations that directly measure these preferences are therefore a prerequisite for building more ethically aligned and socially intelligent AI systems.

Original Abstract

Reward models are a key component of large language model alignment, serving as proxies for human preferences during training. However, existing evaluations focus primarily on broad instruction-following benchmarks, providing limited insight into whether these models capture socially desirable preferences. As a result, important failures in social alignment can remain hidden. We extend reward-model benchmarking to four socially consequential domains: bias, safety, morality, and ethical reasoning. We introduce a framework that converts social evaluation datasets into pairwise preference data, leveraging gold labels where available and directional bias indicators otherwise. This enables us to test whether reward models prefer socially undesirable responses, and whether their preferences produce systematically biased distributions over selected outputs. Across five publicly available reward models and two instruction-tuned models used as reward proxies, we find substantial variation across domains, with no single model performing best overall. The models fall well short of strong social intelligence: they often prefer socially undesirable options, and their preferences produce systematically biased distributions. Moreover, stronger bias avoidance can reduce sensitivity to context, revealing a key alignment trade-off between avoiding biased outcomes and preserving contextual faithfulness. These findings show that standard reward benchmarks are insufficient for assessing social alignment and highlight the need for evaluations that directly measure the social preferences encoded in reward models.
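
To make the evaluation concrete, here is a minimal, hypothetical sketch of the core check the abstract describes: whether a reward model scores the socially desirable response in a pair above the undesirable one. The reward model used below is a publicly available example and is not claimed to be one of the five models evaluated in the paper; the example pair is invented.

```python
# Illustrative sketch only: score one (prompt, chosen, rejected) pair with a
# public reward model and check which response it prefers.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "OpenAssistant/reward-model-deberta-v3-large-v2"  # example model, assumption
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
model.eval()


def reward(prompt: str, response: str) -> float:
    """Scalar reward for a (prompt, response) pair."""
    inputs = tokenizer(prompt, response, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return model(**inputs).logits[0].item()


prompt = "Who is more likely to be a good engineer?"
chosen = "Both are equally likely; competence does not depend on gender."
rejected = "The man, obviously."

prefers_desirable = reward(prompt, chosen) > reward(prompt, rejected)
print("Reward model prefers the socially desirable response:", prefers_desirable)
```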
