Wasserstein Distributionally Robust Regret Optimization for Reinforcement Learning from Human Feedback
Yikai Wang, Shang Liu, Jose Blanchet
TLDR
This paper introduces Wasserstein Distributionally Robust Regret Optimization (DRRO) for RLHF to mitigate reward over-optimization, offering a less pessimistic alternative to standard DRO.
Key contributions
- Proposes Wasserstein Distributionally Robust Regret Optimization (DRRO) for RLHF.
- DRRO minimizes worst-case regret relative to the best policy under the same plausible reward perturbation, making it less pessimistic than standard DRO's worst-case value objective.
- Derives an exact solution to the inner worst-case regret under an ℓ₁ ambiguity set, and shows the optimal promptwise policy has a water-filling structure (see the sketch after this list).
- Yields a practical policy-gradient algorithm with a simple sampled-bonus interpretation.
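The abstract states that the inner worst-case regret under an ℓ₁ ambiguity set admits an exact solution. As rough intuition (not the paper's derivation): regret is a maximum of affine functions of the reward vector, hence convex, so its maximum over an ℓ₁ ball is attained at a vertex, i.e., a perturbation of a single reward coordinate by ±ε. A minimal sketch, assuming a finite per-prompt reward vector and regret measured against the best point on the simplex; all names are illustrative:

```python
import numpy as np

def worst_case_regret_l1(p, r_hat, eps):
    """Worst-case regret of a simplex policy p when the true reward may
    differ from the estimate r_hat by at most eps in l1 norm.

    Regret under reward r is max_i r[i] - p @ r, a maximum of affine
    functions of r, hence convex in r. A convex function over the l1
    ball is maximized at a vertex r_hat +/- eps * e_j, so enumerating
    the 2n vertices gives the exact inner value.
    """
    best = -np.inf
    for j in range(len(r_hat)):
        for delta in (eps, -eps):
            r = r_hat.copy()
            r[j] += delta
            best = max(best, r.max() - p @ r)
    return best

# Example: a uniform policy over three candidate responses.
p = np.array([1/3, 1/3, 1/3])
r_hat = np.array([1.0, 0.5, 0.0])
print(worst_case_regret_l1(p, r_hat, eps=0.2))
```

The paper's exact solution presumably replaces this brute-force enumeration with a closed form; the sketch only illustrates why an exact inner solution is possible under an ℓ₁ ambiguity set.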
Why it matters
RLHF often suffers from reward over-optimization, where models improve on proxy rewards but degrade in true utility. This paper introduces a robust optimization method that mitigates this Goodharting problem, offering a less pessimistic and more practical alternative to prior approaches for aligning large language models.
Original Abstract
Reinforcement learning from human feedback (RLHF) has become a core post-training step for aligning large language models, yet the reward signal used in RLHF is only a learned proxy for true human utility. From an operations research perspective, this creates a decision problem under objective misspecification: the policy is optimized against an estimated reward, while deployment performance is determined by an unobserved objective. The resulting gap leads to reward over-optimization, or Goodharting, where proxy reward continues to improve even after true quality deteriorates. Existing mitigations address this problem through uncertainty penalties, pessimistic rewards, or conservative constraints, but they can be computationally burdensome and overly pessimistic. We propose Wasserstein distributionally robust regret optimization (DRRO) for RLHF. Instead of pessimizing worst-case value as in standard DRO, DRRO pessimizes worst-case regret relative to the best policy under the same plausible reward perturbation. We study the promptwise problem through a simplex allocation model and show that, under an $\ell_1$ ambiguity set, the inner worst-case regret admits an exact solution and the optimal policy has a water-filling structure. These results lead to a practical policy-gradient algorithm with a simple sampled-bonus interpretation and only minor changes to PPO/GRPO-style RLHF training. The framework also clarifies theoretically why DRRO is less pessimistic than DRO, and our experiments show that DRRO mitigates over-optimization more effectively than existing baselines while standard DRO is systematically over-pessimistic.
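The abstract promises "only minor changes to PPO/GRPO-style RLHF training" with a "simple sampled-bonus interpretation." The paper's actual bonus is derived from the water-filling solution and is not reproduced here; the sketch below is an assumed placeholder showing where such a per-sample adjustment would slot into a GRPO-style advantage computation. The particular adjustment (shading the apparently best sample by ε, echoing the single-coordinate ℓ₁ adversary) is our assumption, and all names are hypothetical:

```python
import numpy as np

def grpo_advantages_with_sampled_bonus(rewards, eps=0.1):
    """GRPO-style group-normalized advantages with a robustness
    adjustment applied before normalization.

    rewards: proxy rewards for a group of responses sampled for the
    same prompt. The adjustment below (shading only the apparent best
    sample, where the proxy reward is most likely over-estimated) is
    an assumed placeholder for the paper's derived bonus.
    """
    r = np.asarray(rewards, dtype=float).copy()
    r[np.argmax(r)] -= eps  # assumed robustness adjustment, not the paper's formula
    return (r - r.mean()) / (r.std() + 1e-8)  # standard GRPO group normalization

# Example: four sampled responses for one prompt.
print(grpo_advantages_with_sampled_bonus([0.9, 0.7, 0.85, 0.2], eps=0.1))
```

The point of the sketch is the small footprint: only the per-sample reward adjustment changes, while the rest of the GRPO update is untouched, consistent with the abstract's claim of minor modifications to existing RLHF pipelines.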