ArXiv TLDR

Dense vs Sparse Pretraining at Tiny Scale: Active-Parameter vs Total-Parameter Matching

2605.13769

Abdalrahman Wael

cs.CL cs.LG

TLDR

This paper compares dense and MoE transformers at tiny scale, finding that MoE outperforms dense models when matching active parameters but not when matching total parameters.

Key contributions

  • Compares dense and MoE transformers in a sub-25M-parameter regime under a shared LLaMA-style training recipe.
  • MoE outperforms dense models when matching active parameters (validation-loss gap of 0.0758 in the MoE's favor).
  • Dense models still outperform MoE when matching total parameters (validation-loss gap of 0.0180 in the dense model's favor).
  • The dense total-match advantage narrows sharply over training, while the MoE active-match advantage grows.

Why it matters

This research clarifies how MoE and dense models compare at tiny scale, which matters for resource-constrained settings. It shows that whether MoE helps depends on how the parameter budget is defined (active vs. total parameters), guiding efficient model design.
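
To make the active-vs-total distinction concrete, the sketch below counts feed-forward parameters for a dense block and for a 4-expert, top-2 routed block. The widths, the SwiGLU-style 3-matrix FFN, and the router shape are illustrative assumptions, not the paper's actual configuration.

```python
# Back-of-the-envelope parameter counting (illustrative assumptions, not the paper's numbers).

def dense_ffn_params(d_model: int, d_ff: int) -> int:
    """Parameters in one SwiGLU-style feed-forward block: gate, up, and down projections."""
    return 3 * d_model * d_ff

def moe_ffn_params(d_model: int, d_ff: int, n_experts: int, top_k: int) -> tuple[int, int]:
    """Return (active, total) parameters for a routed-expert feed-forward block."""
    expert = dense_ffn_params(d_model, d_ff)
    router = d_model * n_experts            # linear router over experts
    total = n_experts * expert + router     # everything stored in memory
    active = top_k * expert + router        # only the top-k experts run per token
    return active, total

# Hypothetical configuration: 4 experts, top-2 routing.
active, total = moe_ffn_params(d_model=512, d_ff=1376, n_experts=4, top_k=2)
print(f"MoE FFN: {active:,} active vs {total:,} total parameters")
print(f"Dense FFN at the same width: {dense_ffn_params(512, 1376):,} parameters")
# An "active-match" dense baseline is width-resized to hit the active count;
# a "total-match" baseline is resized to hit the total count.
```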

Original Abstract

We study dense and mixture-of-experts (MoE) transformers in a tiny-scale pretraining regime under a shared LLaMA-style decoder training recipe. The sparse model replaces dense feed-forward blocks with Mixtral-style routed experts. Dense baselines are modestly width-resized to tightly match either active or total parameter budgets, while tokenizer, data, optimizer, schedule, depth, context length, normalization style, and evaluation protocol are held fixed. Our best sparse recipe uses four experts, top-2 routing, Switch-style load balancing, and router z-loss. In a three-seed full-data comparison, the dense active-match model reaches 1.6545 +/- 0.0012 best validation loss, the MoE reaches 1.5788 +/- 0.0020, and the dense total-match model reaches 1.5608 +/- 0.0025. This yields a matched-active gap of 0.0758 +/- 0.0021 in the MoE's favor and a matched-total gap of 0.0180 +/- 0.0020 in the dense model's favor. Across training, the matched-active advantage grows while the matched-total dense advantage narrows sharply. In this sub-25M-parameter regime, MoE therefore improves validation loss under active-parameter matching but does not surpass dense training at equal total stored capacity.
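
For readers unfamiliar with the recipe described in the abstract, here is a minimal, self-contained sketch of a Mixtral-style top-2 routed MoE feed-forward layer with a Switch-style load-balancing loss and a router z-loss. It is not the authors' implementation; the dimensions, the SiLU expert MLP, and the auxiliary-loss weights in the usage comment are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Top2MoE(nn.Module):
    """Sketch of a Mixtral-style routed FFN: 4 experts, top-2 routing.
    Not the authors' code; widths and activation are illustrative assumptions."""

    def __init__(self, d_model: int = 256, d_ff: int = 688, n_experts: int = 4, top_k: int = 2):
        super().__init__()
        self.n_experts, self.top_k = n_experts, top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor):
        # x: (tokens, d_model) -- a flattened batch of token representations.
        logits = self.router(x)                               # (tokens, n_experts)
        probs = logits.softmax(dim=-1)
        weights, chosen = probs.topk(self.top_k, dim=-1)      # top-2 experts per token
        weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalize the two gates

        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            for slot in range(self.top_k):
                mask = chosen[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])

        # Switch-style load-balancing loss: fraction of routing assignments per
        # expert times the mean router probability for that expert.
        assign_frac = F.one_hot(chosen, self.n_experts).float().sum(1).mean(0) / self.top_k
        lb_loss = self.n_experts * (assign_frac * probs.mean(dim=0)).sum()

        # Router z-loss: penalizes large router logits for numerical stability.
        z_loss = (torch.logsumexp(logits, dim=-1) ** 2).mean()
        return out, lb_loss, z_loss


y, lb, z = Top2MoE()(torch.randn(8, 256))
# Add something like 0.01 * lb + 0.001 * z to the LM loss (weights are assumed).
```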
