Zac Hatfield-Dodds
3 papers · Latest:
Natural Language Processing
Discovering Language Model Behaviors with Model-Written Evaluations
This paper introduces a method to automatically generate high-quality evaluations using language models themselves, revealing new and unexpected behaviors as models scale.
2212.09251
Natural Language Processing
Constitutional AI: Harmlessness from AI Feedback
Constitutional AI trains harmless AI assistants using AI-generated feedback guided by a set of human-defined principles, minimizing the need for human-labeled data.
2212.08073
Natural Language Processing
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
This paper demonstrates that reinforcement learning from human feedback (RLHF) can fine-tune language models to be both helpful and harmless, improving performance on almost all NLP evaluations while remaining compatible with training for specialized skills such as Python coding and summarization.
2204.05862