ArXiv TLDR

EMO: Pretraining Mixture of Experts for Emergent Modularity

arXiv: 2605.06663

Ryan Wang, Akshita Bhagia, Sewon Min

cs.CL

TLDR

EMO is a Mixture-of-Experts model whose pretraining induces emergent modularity, enabling selective use of expert subsets for memory-efficient LLM deployment.

Key contributions

  • Introduces EMO, an MoE that restricts tokens within a document to select experts from a shared pool, so coherent expert groupings emerge during pretraining (see the routing sketch after this list).
  • Matches standard MoE performance as a full model while enabling efficient selective expert use.
  • Retaining only 25% of EMO's experts incurs just a 1% absolute performance drop (3% at 12.5%), whereas standard MoEs break under the same restriction.
  • EMO's expert subsets specialize at a semantic level (e.g., math, code), in contrast to the low-level syntactic specialization seen in standard MoEs.
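To make the key idea concrete, here is a toy routing sketch, not EMO's actual implementation: it assumes the document's shared pool is chosen by averaging router scores over the document's tokens (a hypothetical heuristic; in the paper, pools emerge during pretraining from document boundaries alone), then runs ordinary top-k routing restricted to that pool. The function name and parameters (`route_with_document_pool`, `pool_size`) are illustrative.

```python
import torch
import torch.nn.functional as F

def route_with_document_pool(router_logits, pool_size, top_k):
    # router_logits: (seq_len, num_experts) router scores for one document.
    # Step 1 (assumed heuristic): pick the document's shared expert pool
    # from the mean router score over its tokens.
    doc_scores = router_logits.mean(dim=0)           # (num_experts,)
    pool = doc_scores.topk(pool_size).indices        # (pool_size,)

    # Step 2: mask every expert outside the pool, then do standard
    # per-token top-k routing inside it.
    masked = torch.full_like(router_logits, float("-inf"))
    masked[:, pool] = router_logits[:, pool]
    weights, experts = masked.topk(top_k, dim=-1)    # (seq_len, top_k)
    weights = F.softmax(weights, dim=-1)
    return experts, weights, pool

# 16 experts, pool of 4 per document, 2 active experts per token.
logits = torch.randn(10, 16)
experts, weights, pool = route_with_document_pool(logits, pool_size=4, top_k=2)
assert set(experts.flatten().tolist()) <= set(pool.tolist())
```

The final assert shows the constraint at work: every token's chosen experts lie inside its document's pool, while a different document would get a different pool.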

Why it matters

This paper addresses the inefficiency of monolithic LLMs and the limitations of standard MoEs in memory-constrained settings. EMO's emergent modularity yields significant memory savings: a deployment can load only the expert subsets relevant to its domain without severe performance degradation. This opens new avenues for building more efficient, composable, and domain-specific large language models.
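As a rough illustration of how selective expert use saves memory, the sketch below, our assumption rather than the paper's code, instantiates only a retained 25% subset of experts and masks the router so dropped experts can never be selected; the `PrunedMoELayer` class and its per-token dispatch loop are hypothetical and written for clarity, not efficiency.

```python
import torch
import torch.nn as nn

class PrunedMoELayer(nn.Module):
    """Toy pruned MoE layer: only a chosen subset of experts is kept in
    memory, and the router is masked so dropped experts can never win."""
    def __init__(self, d_model, num_experts, keep, top_k=2):
        super().__init__()
        self.keep = sorted(keep)                       # global ids of retained experts
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts)  # router still scores all ids
        # Only retained experts are instantiated -> the memory saving.
        self.experts = nn.ModuleDict(
            {str(e): nn.Linear(d_model, d_model) for e in self.keep}
        )

    def forward(self, x):                              # x: (num_tokens, d_model)
        logits = self.router(x)
        masked = torch.full_like(logits, float("-inf"))
        masked[:, self.keep] = logits[:, self.keep]    # dropped experts can't be chosen
        w, idx = masked.topk(self.top_k, dim=-1)
        w = torch.softmax(w, dim=-1)
        out = torch.zeros_like(x)
        for t in range(x.shape[0]):                    # naive per-token dispatch
            for k in range(self.top_k):
                expert = self.experts[str(int(idx[t, k]))]
                out[t] += w[t, k] * expert(x[t])
        return out

layer = PrunedMoELayer(d_model=32, num_experts=64, keep=range(0, 64, 4))  # keep 25%
print(layer(torch.randn(5, 32)).shape)  # torch.Size([5, 32])
```

Because the dropped experts are never constructed, their weights need not be loaded at all; in a 14B-total, 1B-active model, that is where the bulk of the memory saving comes from.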

Original Abstract

Large language models are typically deployed as monolithic systems, requiring the full model even when applications need only a narrow subset of capabilities, e.g., code, math, or domain-specific knowledge. Mixture-of-Experts (MoEs) seemingly offer a potential alternative by activating only a subset of experts per input, but in practice, restricting inference to a subset of experts for a given domain leads to severe performance degradation. This limits their practicality in memory-constrained settings, especially as models grow larger and sparser. We introduce EMO, an MoE designed for modularity (the independent use and composition of expert subsets) without requiring human-defined priors. Our key idea is to encourage tokens from similar domains to rely on similar experts. Since tokens within a document often share a domain, EMO restricts them to select experts from a shared pool, while allowing different documents to use different pools. This simple constraint enables coherent expert groupings to emerge during pretraining using document boundaries alone. We pretrain a 1B-active, 14B-total EMO on 1T tokens. As a full model, it matches standard MoE performance. Crucially, it enables selective expert use: retaining only 25% (12.5%) of experts incurs just a 1% (3%) absolute drop, whereas standard MoEs break under the same setting. We further find that expert subsets in EMO specialize at semantic levels (e.g., domains such as math or code), in contrast to the low-level syntactic specialization observed in standard MoEs. Altogether, our results demonstrate a path toward modular, memory-efficient deployment of large, sparse models and open new opportunities for composable architectures.
