Training language models to follow instructions with human feedback
Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright + 15 more
TLDR
This paper presents InstructGPT, a family of GPT-3 models fine-tuned with human feedback to follow user intent, yielding outputs that are more helpful, more truthful, and less toxic.
Key contributions
- Collected human-labeled demonstrations and rankings of model outputs, used to fine-tune GPT-3 first with supervised learning and then with reinforcement learning from human feedback (see the reward-model sketch after this list).
- InstructGPT (1.3B parameters) outperforms the much larger 175B GPT-3 model in human preference evaluations.
- Improved truthfulness and reduced toxic outputs with minimal loss in performance on standard NLP benchmarks.
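The reward model at the heart of this pipeline is trained on the labeler rankings with a pairwise comparison loss. The sketch below shows the shape of that objective only; the `RewardModel` class, hidden size, and toy tensors are illustrative assumptions, not the paper's code, which attaches a scalar head to a 6B GPT-3 model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Scores a (prompt, response) pair with a single scalar.

    Illustrative stand-in: a linear layer over precomputed features
    rather than a full language-model backbone.
    """
    def __init__(self, hidden_size: int = 768):
        super().__init__()
        self.scorer = nn.Linear(hidden_size, 1)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: final-token hidden states, shape (batch, hidden_size)
        return self.scorer(features).squeeze(-1)

def pairwise_ranking_loss(r_chosen: torch.Tensor,
                          r_rejected: torch.Tensor) -> torch.Tensor:
    """-log sigmoid(r_chosen - r_rejected), averaged over comparison pairs."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Toy usage: random features stand in for a language-model encoder.
reward_model = RewardModel()
feats_chosen = torch.randn(4, 768)    # labeler-preferred responses
feats_rejected = torch.randn(4, 768)  # dispreferred responses
loss = pairwise_ranking_loss(reward_model(feats_chosen),
                             reward_model(feats_rejected))
loss.backward()
```

Training on comparisons rather than absolute scores means only reward differences matter, which sidesteps calibration differences between labelers; the paper also processes all pairs from a single prompt's ranking together rather than shuffling them across batches.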
Why it matters
This work addresses a critical limitation of large language models: making them bigger does not make them reliably follow user intent or produce safe, helpful outputs. By incorporating human feedback into the training loop, the authors demonstrate a practical, scalable route to substantially better alignment, which is essential for deploying language models responsibly in real-world applications.
Original Abstract
Making language models bigger does not inherently make them better at following a user's intent. For example, large language models can generate outputs that are untruthful, toxic, or simply not helpful to the user. In other words, these models are not aligned with their users. In this paper, we show an avenue for aligning language models with user intent on a wide range of tasks by fine-tuning with human feedback. Starting with a set of labeler-written prompts and prompts submitted through the OpenAI API, we collect a dataset of labeler demonstrations of the desired model behavior, which we use to fine-tune GPT-3 using supervised learning. We then collect a dataset of rankings of model outputs, which we use to further fine-tune this supervised model using reinforcement learning from human feedback. We call the resulting models InstructGPT. In human evaluations on our prompt distribution, outputs from the 1.3B parameter InstructGPT model are preferred to outputs from the 175B GPT-3, despite having 100x fewer parameters. Moreover, InstructGPT models show improvements in truthfulness and reductions in toxic output generation while having minimal performance regressions on public NLP datasets. Even though InstructGPT still makes simple mistakes, our results show that fine-tuning with human feedback is a promising direction for aligning language models with human intent.
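For readers who want the objective behind the abstract's RL step: in the paper's notation, the RL stage maximizes the reward-model score while a per-token KL penalty keeps the learned policy close to the supervised fine-tuned baseline, with an optional pretraining term (the "PPO-ptx" variant):

```latex
\operatorname{objective}(\phi) =
  \mathbb{E}_{(x,y)\sim D_{\pi_{\phi}^{\mathrm{RL}}}}\!\left[
    r_{\theta}(x,y)
    - \beta \log \frac{\pi_{\phi}^{\mathrm{RL}}(y \mid x)}{\pi^{\mathrm{SFT}}(y \mid x)}
  \right]
  + \gamma\, \mathbb{E}_{x\sim D_{\mathrm{pretrain}}}\!\left[
    \log \pi_{\phi}^{\mathrm{RL}}(x)
  \right]
```

Here r_θ is the learned reward model, β scales the KL penalty against the SFT policy π^SFT, and γ weights the pretraining gradients mixed back in to limit the performance regressions on public NLP datasets mentioned above.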