Nicholas Joseph
4 papers · Latest:
Discovering Language Model Behaviors with Model-Written Evaluations
This paper introduces a method to automatically generate high-quality evaluations using language models themselves, revealing new and unexpected behaviors as models scale.
Constitutional AI: Harmlessness from AI Feedback
Constitutional AI trains harmless AI assistants using AI-generated feedback guided by a set of human-defined principles, minimizing the need for human-labeled data.
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
This paper demonstrates that reinforcement learning from human feedback (RLHF) can effectively fine-tune language models to be both helpful and harmless, improving performance across NLP tasks while maintaining specialized skills.
Evaluating Large Language Models Trained on Code
Codex, a GPT model fine-tuned on GitHub code, significantly outperforms prior models in generating correct Python programs from docstrings, demonstrating strong code synthesis capabilities.