Tom Henighan
5 papers · Latest:
Discovering Language Model Behaviors with Model-Written Evaluations
This paper introduces a method to automatically generate high-quality evaluations using language models themselves, revealing new and unexpected behaviors as models scale.
Constitutional AI: Harmlessness from AI Feedback
Constitutional AI trains harmless AI assistants using AI-generated feedback guided by a set of human-defined principles, minimizing the need for human-labeled data.
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
This paper demonstrates that reinforcement learning from human feedback (RLHF) can effectively fine-tune language models to be both helpful and harmless, improving performance across NLP tasks while maintaining specialized skills.
Language Models are Few-Shot Learners
GPT-3, a 175 billion parameter language model, demonstrates strong few-shot learning abilities across diverse NLP tasks without task-specific fine-tuning.
Scaling Laws for Neural Language Models
This paper identifies power-law scaling relationships between language model performance and factors like model size, dataset size, and compute, enabling optimal training strategies under fixed compute budgets.