ArXiv TLDR

DECO: Sparse Mixture-of-Experts with Dense-Comparable Performance on End-Side Devices

2605.10933

Chenyang Song, Weilin Zhao, Xu Han, Chaojun Xiao, Yingfa Chen + 1 more

cs.LG cs.CL

TLDR

DECO is a sparse MoE architecture that matches dense Transformer performance at the same total parameter budget, delivering a 3.00x inference speedup with low storage overhead on end-side devices.

Key contributions

  • Introduces DECO, a sparse MoE architecture that matches dense Transformer performance under identical total parameter budgets on end-side devices.
  • Employs ReLU-based routing with learnable expert-wise scaling to adaptively balance the contributions of routed and shared experts (a minimal sketch follows this list).
  • Features NormSiLU activation for a stable routed-expert activation ratio and higher intrinsic sparsity (see the sketch after the original abstract below).
  • Delivers a 3.00x speedup over dense inference on real hardware while activating only 20% of experts.
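
To make the routing idea concrete, here is a minimal PyTorch sketch of ReLU-based routing with learnable expert-wise scaling over non-gated MLP experts plus a shared expert. The class and parameter names (`ReLURoutedMoE`, `expert_scale`) and the exact way shared and routed outputs are combined are assumptions for illustration, not the paper's released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ReLURoutedMoE(nn.Module):
    """Sketch: ReLU routing with learnable per-expert scaling (illustrative only)."""

    def __init__(self, d_model: int, d_ff: int, num_experts: int):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts, bias=False)
        # Learnable expert-wise scaling applied to each routed expert's weight.
        self.expert_scale = nn.Parameter(torch.ones(num_experts))
        # Non-gated MLP experts (the abstract notes these pair well with ReLU routing).
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])
        # One always-active shared expert.
        self.shared_expert = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # ReLU routing: most expert weights are exactly zero, so sparsity is
        # learned rather than imposed by a fixed top-k cutoff.
        gate = F.relu(self.router(x)) * self.expert_scale  # (..., num_experts)
        out = self.shared_expert(x)
        for i, expert in enumerate(self.experts):
            w = gate[..., i : i + 1]
            if torch.any(w > 0):  # skip experts no token routed to
                out = out + w * expert(x)
        return out
```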

Why it matters

Standard MoE models carry a large total-parameter footprint that creates storage and memory-access bottlenecks on end-side devices. DECO offers a sparse MoE that matches dense-model performance while significantly reducing computational cost and keeping storage comparable to a dense model. This makes high-performance AI more accessible for resource-constrained edge applications.

Original Abstract

While Mixture-of-Experts (MoE) scales model capacity without proportionally increasing computation, its massive total parameter footprint creates significant storage and memory-access bottlenecks, which hinder efficient end-side deployment that simultaneously requires high performance, low computational cost, and small storage overhead. To achieve these properties, we present DECO, a sparse MoE architecture designed to match the performance of dense Transformers under identical total parameter budgets and training tokens. DECO utilizes the differentiable and flexible ReLU-based routing enhanced by learnable expert-wise scaling, which adaptively balances the contributions of routed and shared experts. Furthermore, we introduce NormSiLU, an activation function that normalizes inputs prior to SiLU operators, producing a more stable trend of routed-expert activation ratio and a higher intrinsic sparsity level. We also identify an empirical advantage in using non-gated MLP experts with ReLU-based routing, indicating the possibility of MoE architecture simplification. Experiments demonstrate that DECO, activating only 20% of experts, matches dense performance and outperforms established MoE baselines. Our specialized acceleration kernel delivers a 3.00× speedup on real hardware compared with dense inference. Codes and checkpoints will be released.
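
The abstract describes NormSiLU only as normalizing inputs before the SiLU operator. Below is a rough PyTorch sketch of one way that could look; the RMS-style normalization and the learnable per-dimension weight are assumptions for illustration, not the paper's definition.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NormSiLU(nn.Module):
    """Sketch: normalize inputs before SiLU (normalization choice is assumed)."""

    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # RMS-style normalization over the feature dimension, then SiLU.
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return F.silu(x * rms * self.weight)
```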
