
RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback

arXiv:2309.00267

Harrison Lee, Samrat Phatale, Hassan Mansoor, Thomas Mesnard, Johan Ferret, and 6 others

cs.CL, cs.AI, cs.LG

TLDR

RLAIF leverages AI-generated feedback to train language models with reinforcement learning, matching or surpassing traditional human feedback approaches while improving scalability.

Key contributions

  • Demonstrated that RLAIF matches RLHF performance on summarization and on helpful and harmless dialogue generation, using AI-generated preference labels.
  • Showed that RLAIF can outperform a supervised fine-tuned baseline even when the AI labeler is the same size as the policy model, or even the exact same checkpoint as the initial policy.
  • Introduced direct-RLAIF (d-RLAIF), which bypasses reward model training by querying an off-the-shelf LLM for rewards directly during RL, achieving better results than canonical RLAIF (see the sketch after this list).
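
To make the d-RLAIF idea concrete, here is a minimal sketch of the reward step: an off-the-shelf LLM is prompted to rate a response on a 1-10 scale, and that rating becomes the scalar RL reward, with no reward model in the loop. The paper derives a soft score from the LLM's likelihoods over the rating tokens; this sketch uses a simpler parsed rating instead. `query_llm`, `d_rlaif_reward`, and `fake_llm` are hypothetical names standing in for any LLM API.

```python
import re
from typing import Callable

def d_rlaif_reward(prompt: str, response: str,
                   query_llm: Callable[[str], str]) -> float:
    """Score a (prompt, response) pair with an off-the-shelf LLM and
    use the parsed rating as the RL reward (no reward model needed)."""
    rating_prompt = (
        "Rate the quality of the following response to the given prompt "
        "on a scale from 1 to 10. Answer with a single number.\n\n"
        f"Prompt: {prompt}\n\nResponse: {response}\n\nRating:"
    )
    reply = query_llm(rating_prompt)
    match = re.search(r"\d+", reply)
    rating = int(match.group()) if match else 1   # fall back to the lowest score
    rating = max(1, min(rating, 10))              # clamp to the 1-10 scale
    return (rating - 1) / 9.0                     # normalize to [0, 1]

if __name__ == "__main__":
    # Placeholder for a real LLM call (e.g. an API request); always answers "7".
    fake_llm = lambda _: "7"
    print(d_rlaif_reward("Summarize the article ...", "A short summary.", fake_llm))
```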

Why it matters

This paper addresses the high cost and limited scalability of collecting human feedback for training large language models by replacing it with AI-generated feedback. By showing that AI feedback can effectively guide reinforcement learning toward aligned, high-quality outputs, and by introducing a more efficient direct-feedback method, the work points toward more scalable and cost-effective model alignment strategies.

Original Abstract

Reinforcement learning from human feedback (RLHF) has proven effective in aligning large language models (LLMs) with human preferences, but gathering high-quality preference labels is expensive. RL from AI Feedback (RLAIF), introduced in Bai et al., offers a promising alternative that trains the reward model (RM) on preferences generated by an off-the-shelf LLM. Across the tasks of summarization, helpful dialogue generation, and harmless dialogue generation, we show that RLAIF achieves comparable performance to RLHF. Furthermore, we take a step towards "self-improvement" by demonstrating that RLAIF can outperform a supervised fine-tuned baseline even when the AI labeler is the same size as the policy, or even the exact same checkpoint as the initial policy. Finally, we introduce direct-RLAIF (d-RLAIF) - a technique that circumvents RM training by obtaining rewards directly from an off-the-shelf LLM during RL, which achieves superior performance to canonical RLAIF. Our results suggest that RLAIF can achieve performance on-par with using human feedback, offering a potential solution to the scalability limitations of RLHF.
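
For contrast with d-RLAIF, canonical RLAIF's labeling step can be sketched as follows: an off-the-shelf LLM picks the better of two candidate responses, and the resulting (prompt, chosen, rejected) triples are used to train the reward model. This is an illustrative sketch, not the paper's exact prompt; `query_llm` is the same hypothetical stand-in as above, and `ai_preference_label` and `build_preference_dataset` are names introduced here.

```python
from typing import Callable, Iterable, List, Tuple

def ai_preference_label(prompt: str, response_a: str, response_b: str,
                        query_llm: Callable[[str], str]) -> Tuple[str, str]:
    """Ask an off-the-shelf LLM which candidate response is better;
    return the pair ordered as (chosen, rejected)."""
    labeling_prompt = (
        "Given a prompt and two candidate responses, answer with a single "
        "letter, A or B, indicating the better response.\n\n"
        f"Prompt: {prompt}\n\nResponse A: {response_a}\n\n"
        f"Response B: {response_b}\n\nPreferred response:"
    )
    reply = query_llm(labeling_prompt).strip().upper()
    if reply.startswith("A"):
        return response_a, response_b
    return response_b, response_a

def build_preference_dataset(
    triples: Iterable[Tuple[str, str, str]],
    query_llm: Callable[[str], str],
) -> List[Tuple[str, str, str]]:
    """Turn (prompt, response_a, response_b) triples into
    (prompt, chosen, rejected) examples for reward-model training."""
    return [(p,) + ai_preference_label(p, a, b, query_llm)
            for p, a, b in triples]
```

The paper also mitigates position bias by querying with the candidates in both orders; that step is omitted here for brevity.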
