ArXiv TLDR

TinyStories: How Small Can Language Models Be and Still Speak Coherent English?

2305.07759

Ronen Eldan, Yuanzhi Li

cs.CL cs.AI cs.LG

TLDR

TinyStories demonstrates that language models with fewer than 10 million parameters, trained on a simple synthetic dataset, can generate coherent, fluent English stories, and introduces a novel GPT-4-based framework for evaluating them.

Key contributions

  • Introduced TinyStories, a synthetic dataset of simple short stories generated by GPT-3.5 and GPT-4, using only words that typical 3 to 4-year-olds understand.
  • Showed that tiny LMs with fewer than 10M parameters and minimal architecture can produce multi-paragraph, grammatically correct, and coherent stories.
  • Proposed a new GPT-4-based evaluation paradigm treating model outputs as student stories graded on grammar, creativity, and consistency, providing richer assessment than standard benchmarks.
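
The grading paradigm in the last bullet can be sketched as a prompt-plus-parser pair. The `build_grading_prompt` and `parse_scores` helpers below are illustrative assumptions, not the paper's actual prompts; in practice the prompt would be sent to GPT-4 and the reply fed to the parser.

```python
import re

# Dimensions the paper grades on (grammar, creativity, consistency).
RUBRIC = ("grammar", "creativity", "consistency")

def build_grading_prompt(story: str) -> str:
    """Construct a teacher-style grading prompt (illustrative wording,
    not the paper's exact prompt)."""
    return (
        "The following is a story written by a student. Grade it as a "
        "teacher would, giving an integer score from 1 to 10 for each of: "
        + ", ".join(RUBRIC) + ".\n\n"
        f"Story:\n{story}\n\n"
        "Reply with one line per dimension, e.g. 'grammar: 8'."
    )

def parse_scores(reply: str) -> dict:
    """Extract per-dimension integer scores from the grader's free-form reply."""
    scores = {}
    for dim in RUBRIC:
        m = re.search(rf"{dim}\s*:\s*(\d+)", reply, re.IGNORECASE)
        if m:
            scores[dim] = int(m.group(1))
    return scores
```

The multidimensional result (one score per rubric entry) is what distinguishes this setup from single-number benchmark accuracy.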

Why it matters

This paper challenges the prevailing belief that coherent English generation requires large, complex language models by showing that small models trained on carefully designed simple data can achieve impressive fluency and basic reasoning. The GPT-4-based evaluation method also offers a more nuanced and practical way to assess language model capabilities, which matters for research in low-resource settings and for understanding how language abilities emerge in neural networks.
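
The sub-10M-parameter claim is easy to sanity-check with back-of-envelope arithmetic. Below is a minimal sketch (my own illustrative accounting, not the paper's exact configurations), assuming tied embeddings, a 4x-wide MLP, and ignoring positional embeddings and biases:

```python
def count_params(vocab_size: int, d_model: int, n_layers: int,
                 d_ff=None, tied_embeddings: bool = True) -> int:
    """Rough parameter count for a GPT-style decoder-only transformer."""
    d_ff = d_ff or 4 * d_model              # common convention: MLP width 4x d_model
    embed = vocab_size * d_model            # token embedding matrix
    if not tied_embeddings:
        embed += vocab_size * d_model       # separate output projection
    attn = 4 * d_model * d_model            # Q, K, V, and output projections
    mlp = 2 * d_model * d_ff                # up- and down-projections
    norms = 4 * d_model                     # two LayerNorms (weight + bias each)
    return embed + n_layers * (attn + mlp + norms)
```

With a hypothetical small config (10k vocabulary, width 256, 8 layers) the total lands under 10M, while GPT-2 (small)-like dimensions (50k vocabulary, width 768, 12 layers) land near the "around 125M" scale the abstract mentions.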

Original Abstract

Language models (LMs) are powerful tools for natural language processing, but they often struggle to produce coherent and fluent text when they are small. Models with around 125M parameters such as GPT-Neo (small) or GPT-2 (small) can rarely generate coherent and consistent English text beyond a few words even after extensive training. This raises the question of whether the ability to produce coherent English text only emerges at larger scales (with hundreds of millions of parameters or more) and with complex architectures (with many layers of global attention). In this work, we introduce TinyStories, a synthetic dataset of short stories that contain only words a typical 3 to 4-year-old usually understands, generated by GPT-3.5 and GPT-4. We show that TinyStories can be used to train and evaluate LMs that are much smaller than the state-of-the-art models (below 10 million total parameters), or have much simpler architectures (with only one transformer block), yet still produce fluent and consistent stories spanning several paragraphs that are diverse, have almost perfect grammar, and demonstrate reasoning capabilities. We also introduce a new paradigm for the evaluation of language models: a framework that uses GPT-4 to grade the content generated by these models as if it were stories written by students and graded by a (human) teacher. This new paradigm overcomes a flaw of standard benchmarks, which often require the model's output to be very structured, and moreover provides a multidimensional score for the model, covering capabilities such as grammar, creativity, and consistency. We hope that TinyStories can facilitate the development, analysis, and research of LMs, especially for low-resource or specialized domains, and shed light on the emergence of language capabilities in LMs.
