ArXiv TLDR

Generalization and Scaling Laws for Mixture-of-Experts Transformers

arXiv:2604.09175

Mansour Zoubeirou a Mayaki

cs.LG · cs.AI · math.ST · stat.ML

TLDR

A new theory of generalization and scaling for MoE Transformers separates active per-input capacity from routing combinatorics, yielding scaling laws for model size, data size, and compute-optimal tradeoffs.

Key contributions

  • Develops a theory separating active per-input capacity from routing combinatorics in MoE Transformers.
  • Derives a sup-norm covering-number bound and a generalization bound for MoE architectures (a schematic shape is sketched after this list).
  • Proves a constructive approximation theorem: error can decrease either by scaling active capacity or by increasing the number of experts, depending on which bottleneck dominates.
  • Establishes neural scaling laws for MoE model size, data size, and compute-optimal tradeoffs.
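
As a rough shape of these results (notation assumed here for illustration, not the paper's exact statement): with $N_{\mathrm{act}}$ active parameters, $n$ samples, target smoothness $\beta$, intrinsic dimension $d$, and $|\mathcal{R}|$ admissible routing patterns, bounds of this type typically decompose as

$$
\|\hat f - f^\star\|^2 \;\lesssim\; \underbrace{N_{\mathrm{act}}^{-2\beta/d}}_{\text{approximation}} \;+\; \underbrace{\frac{N_{\mathrm{act}} \log N_{\mathrm{act}} + \log |\mathcal{R}|}{n}}_{\text{estimation + routing overhead}}.
$$

Balancing the two terms recovers the familiar dense-network rate $n^{-2\beta/(2\beta+d)}$ once capacity is measured in active parameters; the routing combinatorics enter only through the additive $\log |\mathcal{R}|$ term.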

Why it matters

This work provides a statistical reference point for understanding MoE Transformers. It clarifies which scaling behaviors are certified by worst-case analysis and which must arise from data-dependent routing structure or optimization dynamics, guiding future research and development of these models.

Original Abstract

We develop a theory of generalization and scaling for Mixture-of-Experts (MoE) Transformers that cleanly separates \emph{active} per-input capacity from routing combinatorics. By conditioning on fixed routing patterns and union-bounding across them, we derive a sup-norm covering-number bound whose metric entropy scales with the active parameter budget and incurs a MoE-specific routing overhead. Combined with a standard ERM analysis for squared loss, this yields a generalization bound under a $d$-dimensional manifold data model and $C^\beta$ targets, showing that approximation and estimation trade off as in dense networks once active parameters are accounted for appropriately. We further prove a constructive approximation theorem for MoE architectures, showing that, under the approximation construction, error can decrease either by scaling active capacity or by increasing the number of experts, depending on the dominant bottleneck. From these results we derive neural scaling laws for model size, data size, and compute-optimal tradeoffs. Overall, our results provide a transparent statistical reference point for reasoning about MoE scaling, clarifying which behaviors are certified by worst-case theory and which must arise from data-dependent routing structure or optimization dynamics.
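
To make the routing-combinatorics overhead concrete, here is a minimal Python sketch (our own illustration under assumed notation, not code or constants from the paper). It upper-bounds $\log |\mathcal{R}|$ for per-token top-$k$ routing and plugs it into the schematic bound above; `schematic_bound`, its exponents, and the example configuration are all placeholders.

```python
import math

def log_routing_patterns(num_experts: int, top_k: int,
                         moe_layers: int, tokens: int) -> float:
    """Upper bound on log |R| for per-token top-k routing:
    each of `tokens` tokens in each of `moe_layers` MoE layers
    selects one of C(num_experts, top_k) expert subsets, so
    |R| <= C(num_experts, top_k) ** (moe_layers * tokens)."""
    subsets = math.comb(num_experts, top_k)
    return moe_layers * tokens * math.log(subsets)

def schematic_bound(n_active: float, n_samples: float,
                    log_routing: float, beta: float, d: float) -> float:
    """Toy risk shape: approximation term N_act^(-2*beta/d) plus an
    estimation term (N_act * log N_act + log|R|) / n. Placeholder
    constants and exponents, not the paper's exact statement."""
    approx = n_active ** (-2.0 * beta / d)
    estimation = (n_active * math.log(n_active) + log_routing) / n_samples
    return approx + estimation

if __name__ == "__main__":
    # Hypothetical MoE configuration: 64 experts, top-2 routing,
    # 12 MoE layers, 2048-token context.
    log_r = log_routing_patterns(num_experts=64, top_k=2,
                                 moe_layers=12, tokens=2048)
    print(f"log |R| <= {log_r:.0f} nats")
    for n_act in (1e6, 1e7, 1e8):
        risk = schematic_bound(n_act, n_samples=1e9, log_routing=log_r,
                               beta=2.0, d=16.0)
        print(f"N_act = {n_act:.0e}: schematic risk ~ {risk:.3e}")
```

The toy numbers illustrate the abstract's point: the routing overhead grows only logarithmically in the number of expert subsets (and linearly in tokens and layers), so the metric entropy remains dominated by the active parameter budget.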
