
UniPool: A Globally Shared Expert Pool for Mixture-of-Experts

arXiv: 2605.06665

Minbin Huang, Han Shi, Chuanyang Zheng, Yimeng Wu, Guoxuan Chen + 3 more

cs.LG · cs.AI

TLDR

UniPool introduces a globally shared expert pool for MoE architectures, significantly improving efficiency and performance by decoupling expert capacity from model depth.

Key contributions

  • Replaces rigid per-layer expert allocation with a single globally shared expert pool for MoE models (sketched in code after this list).
  • Uses a pool-level auxiliary loss and NormRouter for stable, balanced training of the shared pool.
  • Consistently improves validation loss and perplexity across five LLaMA-architecture model scales.
  • Matches or outperforms vanilla MoE while using only 41.6%-66.7% of the vanilla expert-parameter budget, enabling sublinear scaling of expert parameters with depth.
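
To make the shared-pool design concrete, here is a minimal PyTorch sketch: one global expert pool instantiated once, with each transformer layer owning only a router into it. All names (SharedExpertPool, UniPoolMoELayer, pool_size) are illustrative assumptions rather than the paper's implementation, and a plain softmax top-k router stands in for the paper's NormRouter; the dispatch loop favors clarity over speed.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Expert(nn.Module):
    """A standard two-layer feed-forward expert."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x):
        return self.net(x)


class SharedExpertPool(nn.Module):
    """A single global pool of experts, shared by every layer."""
    def __init__(self, pool_size: int, d_model: int, d_ff: int):
        super().__init__()
        self.experts = nn.ModuleList(
            Expert(d_model, d_ff) for _ in range(pool_size)
        )


class UniPoolMoELayer(nn.Module):
    """A per-layer top-k router dispatching tokens into the shared pool."""
    def __init__(self, pool: SharedExpertPool, d_model: int, top_k: int = 2):
        super().__init__()
        self.pool = pool  # shared reference, not a per-layer expert copy
        self.router = nn.Linear(d_model, len(pool.experts), bias=False)
        self.top_k = top_k

    def forward(self, x):  # x: (num_tokens, d_model)
        probs = F.softmax(self.router(x), dim=-1)      # (tokens, pool_size)
        weights, idx = probs.topk(self.top_k, dim=-1)  # top-k experts per token
        weights = weights / weights.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        for k in range(self.top_k):                    # loop form for clarity
            for e in idx[:, k].unique().tolist():
                mask = idx[:, k] == e
                out[mask] += weights[mask, k:k + 1] * self.pool.experts[e](x[mask])
        return out, probs  # probs also feed the pool-level auxiliary loss


# Usage: the pool is built once and shared, so expert parameters no longer
# grow with the number of layers.
pool = SharedExpertPool(pool_size=16, d_model=512, d_ff=2048)
blocks = nn.ModuleList(UniPoolMoELayer(pool, d_model=512) for _ in range(12))
```

Because every layer routes into the same module list, the expert-parameter count is set by pool_size alone, independent of depth, which is what makes the pool size an explicit depth-scaling hyperparameter.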

Why it matters

Traditional MoE models inefficiently tie expert capacity to each layer. UniPool offers a more efficient and scalable alternative, demonstrating that expert parameters can grow sublinearly with depth while matching or outperforming vanilla MoE baselines. This could lead to more efficient and powerful large language models.

Original Abstract

Modern Mixture-of-Experts (MoE) architectures allocate expert capacity through a rigid per-layer rule: each transformer layer owns a separate expert set. This convention couples depth scaling with linear expert-parameter growth and assumes that every layer needs isolated expert capacity. However, recent analyses and our routing probe challenge this allocation rule: replacing a deeper layer's learned top-k router with uniform random routing drops downstream accuracy by only 1.0-1.6 points across multiple production MoE models. Motivated by this redundancy, we propose UniPool, an MoE architecture that treats expert capacity as a global architectural budget by replacing per-layer expert ownership with a single shared pool accessed by independent per-layer routers. To enable stable and balanced training under sharing, we introduce a pool-level auxiliary loss that balances expert utilization across the entire pool, and adopt NormRouter to provide sparse and scale-stable routing into the shared expert pool. Across five LLaMA-architecture model scales (182M, 469M, 650M, 830M, and 978M parameters) trained on 30B tokens from the Pile, UniPool consistently improves validation loss and perplexity over the matched vanilla MoE baselines. Across these scales, UniPool reduces validation loss by up to 0.0386 relative to vanilla MoE. Beyond raw loss improvement, our results identify pool size as an explicit depth-scaling hyperparameter: reduced-pool UniPool variants using only 41.6%-66.7% of the vanilla expert-parameter budget match or outperform layer-wise MoE at the tested scales. This shows that, under a shared-pool design, expert parameters need not grow linearly with depth; they can grow sublinearly while remaining more efficient and effective than vanilla MoE. Further analysis shows that UniPool's benefits compose with finer-grained expert decomposition.
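
The pool-level auxiliary loss described in the abstract can be illustrated with a short sketch. The version below assumes a Switch-Transformer-style load-balancing term, computed once over router statistics aggregated from all layers rather than per layer; the function name and exact formulation are assumptions, not the paper's published equations.

```python
import torch


def pool_level_aux_loss(layer_probs, layer_topk_idx, pool_size):
    """Balance expert utilization across the entire shared pool.

    layer_probs:    list of (tokens, pool_size) router softmax outputs,
                    one tensor per transformer layer.
    layer_topk_idx: matching list of (tokens, top_k) selected-expert ids.
    """
    probs = torch.cat(layer_probs, dim=0)              # aggregate over layers
    idx = torch.cat(layer_topk_idx, dim=0).reshape(-1)
    # f_i: fraction of all (token, layer) assignments dispatched to expert i
    counts = torch.zeros(pool_size, device=probs.device)
    counts.scatter_add_(0, idx, torch.ones_like(idx, dtype=counts.dtype))
    f = counts / counts.sum()
    # P_i: mean router probability mass placed on expert i, pool-wide
    p = probs.mean(dim=0)
    # Minimized when both dispatch counts and probabilities are uniform
    return pool_size * torch.sum(f * p)
```

Because the statistics are pooled across layers, an expert that one layer over-uses is still penalized even if each individual layer looks locally balanced, which is the point of balancing at the pool level rather than per layer.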
