FaaSMoE: A Serverless Framework for Multi-Tenant Mixture-of-Experts Serving

arXiv:2604.26881

Minghe Wang, Trever Schirmer, Mohammadreza Malekabbasi, David Bermbach

cs.DC · cs.LG

TLDR

FaaSMoE is a serverless framework for multi-tenant Mixture-of-Experts (MoE) serving that deploys experts as stateless FaaS functions, cutting resource usage to under one third of a full-model baseline.

Key contributions

  • FaaSMoE uses FaaS to address expert underutilization when serving MoE models in multi-tenant environments.
  • Decouples the MoE control and execution planes, deploying experts as stateless FaaS functions for on-demand, scale-to-zero invocation (see the sketch after this list).
  • Supports configurable expert granularity within functions, trading per-expert elasticity against invocation overhead.
  • Cuts resource usage by more than two-thirds relative to a full-model baseline.
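
To make the control/execution split concrete, here is a minimal Python sketch of what the control plane of one MoE layer might look like: the router runs locally, and only the activated experts are invoked as stateless functions through a FaaS gateway. The gateway URL, the per-expert endpoint, and `invoke_expert` are illustrative assumptions, not FaaSMoE's actual interface.

```python
# Hedged sketch: an MoE control plane invoking experts as FaaS functions.
# The gateway URL and endpoint names are hypothetical.
import concurrent.futures
import json
import urllib.request

import numpy as np

GATEWAY_URL = "http://faas-gateway:8080"  # hypothetical FaaS gateway


def invoke_expert(expert_id: int, hidden: np.ndarray) -> np.ndarray:
    """Call one stateless expert function; the platform cold-starts it on
    demand and can scale it to zero between calls."""
    payload = json.dumps({"hidden": hidden.tolist()}).encode()
    req = urllib.request.Request(
        f"{GATEWAY_URL}/expert-{expert_id}",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return np.asarray(json.loads(resp.read())["output"])


def moe_layer(hidden: np.ndarray, gate_logits: np.ndarray, top_k: int = 2) -> np.ndarray:
    """Control plane: pick the top-k experts, invoke only those, and
    combine their outputs by normalized gate weight."""
    top = np.argsort(gate_logits)[-top_k:]
    exp = np.exp(gate_logits[top] - gate_logits[top].max())
    weights = exp / exp.sum()
    # Activated experts run as parallel, independent function invocations,
    # so experts that are not routed to consume no resources anywhere.
    with concurrent.futures.ThreadPoolExecutor(max_workers=top_k) as pool:
        outputs = list(pool.map(lambda e: invoke_expert(e, hidden), top))
    return sum(w * out for w, out in zip(weights, outputs))
```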

Why it matters

Mixture-of-Experts (MoE) models are powerful but hard to deploy efficiently in multi-tenant environments: every expert must stay resident in memory even though only a few are activated per input, leaving most of the provisioned resources idle. FaaSMoE's serverless architecture closes this gap, cutting resource consumption by more than two-thirds and making scalable, cost-effective MoE serving practical. This enables broader adoption of large, efficient AI models on shared cloud infrastructure.

Original Abstract

Mixture-of-Experts (MoE) models offer high capacity with efficient inference cost by activating a small subset of expert models per input. However, deploying MoE models requires all experts to reside in memory, creating a gap between the resource used by activated experts and the provisioned resources. This underutilization is further pronounced in multi-tenant scenarios. In this paper, we propose FaaSMoE, a multi-tenant MoE serving architecture built on Function-as-a-Service (FaaS) platforms. FaaSMoE decouples the control and execution planes of MoE by deploying experts as stateless FaaS functions, enabling on-demand and scale-to-zero expert invocation across tenants. FaaSMoE further supports configurable expert granularity within functions, trading off per-expert elasticity for reduced invocation overhead. We implement a prototype with an open-source edge-oriented FaaS platform and evaluate it using Qwen1.5-MoE-A2.7B under multi-tenant workloads. Compared to a full-model baseline, FaaSMoE uses less than one third of the resources, demonstrating a practical and resource-efficient path towards scalable MoE serving in a multi-tenant environment.
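
The abstract's granularity knob is easy to illustrate: packing several experts into one function means a single invocation can serve multiple activated experts, trading per-expert elasticity for fewer calls and cold starts. The grouping scheme and names below are assumptions for illustration, not the paper's implementation.

```python
# Hedged sketch of configurable expert granularity: `granularity` experts
# share one function, so co-activated neighbors need only one invocation.
def function_for_expert(expert_id: int, granularity: int) -> str:
    """Map an expert to the (hypothetical) function hosting its group."""
    return f"expert-group-{expert_id // granularity}"


def plan_invocations(activated: list[int], granularity: int) -> dict[str, list[int]]:
    """Group activated experts by hosting function: larger `granularity`
    means fewer invocations, but coarser units of scaling."""
    plan: dict[str, list[int]] = {}
    for e in sorted(activated):
        plan.setdefault(function_for_expert(e, granularity), []).append(e)
    return plan


# granularity=1: one function per expert -> three invocations,
# maximum per-expert elasticity.
print(plan_invocations([2, 3, 12], granularity=1))
# {'expert-group-2': [2], 'expert-group-3': [3], 'expert-group-12': [12]}

# granularity=4: four experts per function -> two invocations,
# since experts 2 and 3 share a function.
print(plan_invocations([2, 3, 12], granularity=4))
# {'expert-group-0': [2, 3], 'expert-group-3': [12]}
```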
