ArXiv TLDR

LIMA: Less Is More for Alignment

arXiv:2305.11206

Chunting Zhou, Pengfei Liu, Puxin Xu, Srini Iyer, Jiao Sun + 10 more

cs.CL · cs.AI · cs.LG

TLDR

LIMA shows that fine-tuning a 65B LLaMa model on just 1,000 carefully curated examples, with no reinforcement learning or preference modeling, can produce outputs competitive with state-of-the-art models, highlighting the dominant role of pretraining over large-scale instruction tuning.

Key contributions

  • Fine-tuned a 65B parameter LLaMa model on only 1,000 high-quality prompts and responses using a standard supervised loss, without reinforcement learning or preference modeling (a minimal sketch of this setup follows the list).
  • LIMA effectively learned complex response formats and generalized well to unseen tasks.
  • In a controlled human study, LIMA's responses were equivalent to or strictly preferred over GPT-4's in 43% of cases, Bard's in 58%, and DaVinci003's in 65%.
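
The recipe described in the paper is plain supervised fine-tuning: each curated prompt is paired with its response and the model is trained with the standard next-token cross-entropy loss, with no reward model or RL step. The sketch below illustrates that setup; it is not the authors' code, and the model name, toy examples, and hyperparameters are placeholders (LIMA fine-tunes a 65B LLaMa on roughly 1,000 curated pairs).

```python
# Minimal sketch of LIMA-style alignment (not the authors' code): standard
# supervised fine-tuning of a pretrained causal LM on a small set of curated
# prompt-response pairs, with no reinforcement learning or preference modeling.
# Assumes the Hugging Face `transformers` and `torch` packages; the model name,
# toy examples, and hyperparameters are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # tiny placeholder so the sketch runs; LIMA uses LLaMa 65B

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.train()

# A curated alignment set is just (prompt, response) text pairs.
examples = [
    ("Plan a three-day trip to Kyoto.", "Day 1: ... Day 2: ... Day 3: ..."),
    ("Explain what a hash map is.", "A hash map stores key-value pairs by ..."),
]

def encode(prompt: str, response: str) -> dict:
    """Concatenate prompt and response into one causal-LM training sequence."""
    text = prompt + "\n" + response + tokenizer.eos_token
    enc = tokenizer(text, truncation=True, max_length=1024, return_tensors="pt")
    # Standard supervised loss: labels are the input ids, shifted internally by
    # the model for next-token prediction. Masking the prompt tokens is a common
    # variant, omitted here for brevity.
    enc["labels"] = enc["input_ids"].clone()
    return enc

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

for epoch in range(3):  # a fixed, small number of passes over the curated set
    for prompt, response in examples:
        batch = encode(prompt, response)
        loss = model(**batch).loss  # cross-entropy over the token sequence
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```

The point of the sketch is what is absent: nothing beyond the standard cross-entropy objective is used, so any quality in the outputs comes from the pretrained model plus the careful curation of the small fine-tuning set.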

Why it matters

This paper challenges the prevailing notion that large-scale instruction tuning and reinforcement learning are essential for aligning large language models. It argues that most knowledge is acquired during pretraining, so a small, carefully selected fine-tuning set suffices to produce high-quality, aligned outputs. This insight can drastically reduce the data and resources needed for alignment, making advanced language models more accessible and efficient to deploy.

Original Abstract

Large language models are trained in two stages: (1) unsupervised pretraining from raw text, to learn general-purpose representations, and (2) large scale instruction tuning and reinforcement learning, to better align to end tasks and user preferences. We measure the relative importance of these two stages by training LIMA, a 65B parameter LLaMa language model fine-tuned with the standard supervised loss on only 1,000 carefully curated prompts and responses, without any reinforcement learning or human preference modeling. LIMA demonstrates remarkably strong performance, learning to follow specific response formats from only a handful of examples in the training data, including complex queries that range from planning trip itineraries to speculating about alternate history. Moreover, the model tends to generalize well to unseen tasks that did not appear in the training data. In a controlled human study, responses from LIMA are either equivalent or strictly preferred to GPT-4 in 43% of cases; this statistic is as high as 58% when compared to Bard and 65% versus DaVinci003, which was trained with human feedback. Taken together, these results strongly suggest that almost all knowledge in large language models is learned during pretraining, and only limited instruction tuning data is necessary to teach models to produce high quality output.
