ArXiv TLDR

Efficient Pre-Training with Token Superposition

arXiv: 2605.06546

Bowen Peng, Théo Gigant, Jeffrey Quesnelle

cs.CL

TLDR

Token-Superposition Training (TST) is a simple, drop-in method that significantly improves LLM pre-training efficiency, cutting total pre-training time by up to 2.5x at the 10B scale.

Key contributions

  • Introduces Token-Superposition Training (TST) for more efficient LLM pre-training.
  • TST is a simple, drop-in method requiring no changes to model architecture, optimizer, or data.
  • Employs a two-phase approach: a superposition phase trained with a multi-hot cross-entropy (MCE) objective, followed by a recovery phase of standard training (see the sketch after this list).
  • Achieves up to a 2.5x reduction in total pre-training time at the 10B-total / 1B-active (A1B) mixture-of-experts scale, while outperforming baselines on loss and downstream evaluations.
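
The digest includes no code, but a minimal sketch of what the multi-hot cross-entropy (MCE) objective could look like in PyTorch may help. The uniform weighting over each bag and the tensor shapes here are illustrative assumptions, not the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def multi_hot_cross_entropy(logits: torch.Tensor, target_bags: torch.Tensor) -> torch.Tensor:
    """Cross-entropy against a multi-hot target: each position predicts a
    *bag* of tokens instead of a single next token.

    logits:      (batch, n_bags, vocab) model outputs.
    target_bags: (batch, n_bags, bag_size) token ids in each target bag.
    """
    bag_size = target_bags.size(-1)
    # Spread uniform probability mass over the tokens of each bag
    # (duplicate ids within a bag accumulate via scatter_add_).
    weights = torch.full(target_bags.shape, 1.0 / bag_size,
                         dtype=logits.dtype, device=logits.device)
    multi_hot = torch.zeros_like(logits).scatter_add_(-1, target_bags, weights)
    # Standard soft-target cross-entropy.
    log_probs = F.log_softmax(logits, dim=-1)
    return -(multi_hot * log_probs).sum(dim=-1).mean()
```

Note that with bag_size = 1 this reduces to ordinary cross-entropy, which is consistent with the recovery phase simply reverting to standard training.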

Why it matters

Large Language Model pre-training is prohibitively expensive and inefficient. This paper presents a practical, non-invasive method to drastically cut pre-training time and cost, making large-scale model development more accessible and sustainable.

Original Abstract

Pre-training of Large Language Models is often prohibitively expensive and inefficient at scale, requiring complex and invasive modifications in order to achieve high data throughput. In this work, we present Token-Superposition Training (TST), a simple drop-in method that significantly improves the data throughput per FLOPs during pre-training without modifying the parallelism, optimizer, tokenizer, data, or model architecture. TST is done in two phases: (i) A highly efficient superposition phase where we combine many contiguous tokens into one bag and train using a multi-hot cross-entropy (MCE) objective, and (ii) a recovery phase where we revert back to standard training. We extensively evaluate TST on the scale of 270M and 600M parameters and validate on 3B and a 10B A1B mixture of experts model, demonstrating that it is highly robust in different settings. Ultimately, TST consistently outperforms baseline loss and downstream evaluations, and under equal-loss settings, TST yields up to a 2.5x reduction in total pre-training time at the 10B A1B scale.
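
For the superposition phase itself, the abstract says many contiguous tokens are combined into one bag. A plausible sketch, assuming the bag's input representation is the mean of its token embeddings (the paper may combine them differently), could look like:

```python
import torch

def bag_tokens(token_ids: torch.Tensor, bag_size: int) -> torch.Tensor:
    """Group contiguous tokens into bags, shortening the sequence by a
    factor of bag_size (assumes seq_len is divisible by bag_size).

    token_ids: (batch, seq_len) -> (batch, seq_len // bag_size, bag_size)
    """
    batch, seq_len = token_ids.shape
    return token_ids.view(batch, seq_len // bag_size, bag_size)

def superpose(embedding: torch.nn.Embedding, bags: torch.Tensor) -> torch.Tensor:
    """Superpose each bag into one input vector by averaging its token
    embeddings, an illustrative choice rather than the authors' exact method.

    bags: (batch, n_bags, bag_size) -> (batch, n_bags, d_model)
    """
    return embedding(bags).mean(dim=-2)
```

Since the model then processes seq_len // bag_size positions per sequence, each forward pass covers bag_size times more data at roughly the same cost, which is presumably where the throughput-per-FLOP gain in phase (i) comes from; phase (ii) sets bag_size back to 1 and trains normally.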
