ArXiv TLDR

Transcending Scaling Laws with 0.1% Extra Compute

arXiv: 2210.11399

Yi Tay, Jason Wei, Hyung Won Chung, Vinh Q. Tran, David R. So + 11 more

cs.CL · cs.AI · cs.LG

TLDR

UL2R, a brief stage of continued training with UL2's mixture-of-denoisers objective, substantially improves large language model performance and scaling efficiency at only 0.1% extra compute, yielding roughly 2x computational savings and emergent abilities on hard benchmarks.

Key contributions

  • Introduces UL2R, a lightweight continued-training stage that applies UL2's mixture-of-denoisers objective to existing large LMs such as PaLM (see the sketch after this list).
  • Demonstrates roughly 2x computational savings at 540B scale: U-PaLM matches the final PaLM 540B model's performance at about half its compute budget, saving ~4.4 million TPUv4 hours.
  • Shows that the improved scaling curve yields emergent abilities on challenging BIG-Bench tasks and stronger few-shot performance across diverse NLP tasks.
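To make the objective concrete, here is a minimal sketch of how mixture-of-denoisers training examples can be constructed. The R/S/X denoiser settings follow the UL2 paper's description, but the paradigm tokens, T5-style sentinels, and helper functions are illustrative assumptions, not the authors' implementation.

```python
import random

# A minimal sketch of UL2-style mixture-of-denoisers example construction,
# assuming T5-style sentinel tokens. The paradigm tokens and R/S/X settings
# follow the UL2 paper's description; everything else is illustrative.

# (paradigm token, mean corrupted-span length, corruption rate)
DENOISERS = [
    ("[R]", 3, 0.15),    # R-denoising: short spans, low corruption (T5-like)
    ("[X]", 12, 0.50),   # X-denoising: extreme (long spans / heavy corruption)
    ("[S]", None, None), # S-denoising: sequential, prefix-LM style
]

def corrupt_spans(tokens, mean_span, rate):
    """Replace random contiguous spans with sentinels; the target lists
    each sentinel followed by the tokens it replaced."""
    inputs, targets, i, sid = [], [], 0, 0
    while i < len(tokens):
        # Start a span with probability rate/mean_span, so roughly a
        # `rate` fraction of tokens ends up masked overall.
        if random.random() < rate / mean_span:
            span = max(1, min(int(random.expovariate(1 / mean_span)),
                              len(tokens) - i))
            sentinel = f"<extra_id_{sid}>"
            inputs.append(sentinel)
            targets.append(sentinel)
            targets.extend(tokens[i:i + span])
            i += span
            sid += 1
        else:
            inputs.append(tokens[i])
            i += 1
    return inputs, targets

def make_example(tokens):
    """Sample one denoiser and build a (mode-tagged input, target) pair."""
    mode, mean_span, rate = random.choice(DENOISERS)
    if mode == "[S]":
        # S-denoising: the first half is the prefix, the rest is the target.
        cut = len(tokens) // 2
        return [mode] + tokens[:cut], tokens[cut:]
    inp, tgt = corrupt_spans(tokens, mean_span, rate)
    return [mode] + inp, tgt

# Example usage:
text = "the quick brown fox jumps over the lazy dog".split()
model_input, target = make_example(text)
print(model_input, "->", target)
```

Each training example is tagged with a paradigm token, so at inference time the same token can steer the model toward the matching behavior (for instance, span infilling).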

Why it matters

The paper challenges the conventional wisdom that improving large language models requires massive additional compute and data. By showing that a small amount of targeted continued training can markedly improve a model's efficiency and capabilities, it points toward more accessible and cost-effective progress in large-scale language modeling, making state-of-the-art performance both more attainable and less environmentally costly.

Original Abstract

Scaling language models improves performance but comes with significant computational costs. This paper proposes UL2R, a method that substantially improves existing language models and their scaling curves with a relatively tiny amount of extra compute. The key idea is to continue training a state-of-the-art large language model (e.g., PaLM) on a few more steps with UL2's mixture-of-denoiser objective. We show that, with almost negligible extra computational costs and no new sources of data, we are able to substantially improve the scaling properties of large language models on downstream metrics. In this paper, we continue training PaLM with UL2R, introducing a new set of models at 8B, 62B, and 540B scale which we call U-PaLM. Impressively, at 540B scale, we show an approximately 2x computational savings rate where U-PaLM achieves the same performance as the final PaLM 540B model at around half its computational budget (i.e., saving ~4.4 million TPUv4 hours). We further show that this improved scaling curve leads to 'emergent abilities' on challenging BIG-Bench tasks -- for instance, U-PaLM does much better than PaLM on some tasks or demonstrates better quality at much smaller scale (62B as opposed to 540B). Overall, we show that U-PaLM outperforms PaLM on many few-shot setups, i.e., English NLP tasks (e.g., commonsense reasoning, question answering), reasoning tasks with chain-of-thought (e.g., GSM8K), multilingual tasks (MGSM, TydiQA), MMLU and challenging BIG-Bench tasks. Finally, we provide qualitative examples showing the new capabilities of U-PaLM for single and multi-span infilling.
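The infilling capability mentioned at the end of the abstract follows directly from the denoising training format: spans to fill are marked in the prompt, and the model generates their contents after the corresponding sentinels. A hypothetical example (the mode and sentinel tokens are assumptions borrowed from UL2/T5 conventions; the paper's exact prompt format may differ):

```python
# Hypothetical multi-span infilling prompt for a U-PaLM-style model.
# "[X]" and "<extra_id_N>" follow UL2/T5 conventions and are assumptions,
# not the paper's verbatim format.
prompt = ("[X] The Eiffel Tower, completed in <extra_id_0>, "
          "stands in <extra_id_1>, France.")
# A well-trained model would be expected to continue with something like:
#   "<extra_id_0> 1889 <extra_id_1> Paris"
```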
