Mixtral of Experts
Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary + 21 more
TLDR
Mixtral 8x7B is a Sparse Mixture of Experts (SMoE) language model that matches or exceeds much larger models such as Llama 2 70B and GPT-3.5 by routing each token through only two of the eight experts at every layer, keeping the active parameter count low.
Key contributions
- Introduces Mixtral 8x7B, an SMoE model with 8 experts per layer, dynamically routing each token through 2 of them (a minimal routing sketch follows this list).
- Gives each token access to 47B parameters while using only 13B active parameters during inference, improving compute efficiency.
- Outperforms or matches larger models on benchmarks including math, code generation, and multilingual tasks.
- Provides an instruction-tuned variant, Mixtral 8x7B - Instruct, that surpasses GPT-3.5 Turbo, Claude-2.1, Gemini Pro, and Llama 2 70B chat models on human evaluation benchmarks.
- Both the base and instruct models are released under the Apache 2.0 license.
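To make the top-2 routing concrete, here is a minimal sketch of a sparse mixture-of-experts feed-forward layer in PyTorch. It is illustrative only, not the released Mixtral implementation: the class name `MoELayer`, the SiLU MLP experts (Mixtral uses SwiGLU blocks), and the exact gating details are assumptions.

```python
# Minimal sketch of a top-2 sparse MoE feed-forward layer (illustrative only,
# not the released Mixtral code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    def __init__(self, dim: int, hidden: int, n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(dim, n_experts, bias=False)  # gating network
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.SiLU(), nn.Linear(hidden, dim))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (n_tokens, dim). Each token independently picks its own top-k experts.
        logits = self.router(x)                                # (n_tokens, n_experts)
        weights, idx = torch.topk(logits, self.top_k, dim=-1)  # top-2 experts per token
        weights = F.softmax(weights, dim=-1)                   # renormalize over the chosen experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e                          # tokens whose k-th choice is expert e
                if mask.any():
                    out[mask] += weights[mask, k : k + 1] * expert(x[mask])
        return out

# Usage: route a batch of token states through the sparse layer.
layer = MoELayer(dim=64, hidden=256)
tokens = torch.randn(10, 64)
print(layer(tokens).shape)  # torch.Size([10, 64])
```

Because only the two selected experts run per token, the compute per token scales with 2 experts rather than all 8, which is the source of the efficiency claim above.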
Why it matters
Mixtral demonstrates that sparse mixture-of-experts architectures can deliver state-of-the-art performance with far fewer active parameters, making large language models more efficient and scalable. By outperforming much larger dense models on key benchmarks and releasing both models openly, it advances accessible high-performance NLP and sets a strong precedent for efficient model design.
Original Abstract
We introduce Mixtral 8x7B, a Sparse Mixture of Experts (SMoE) language model. Mixtral has the same architecture as Mistral 7B, with the difference that each layer is composed of 8 feedforward blocks (i.e. experts). For every token, at each layer, a router network selects two experts to process the current state and combine their outputs. Even though each token only sees two experts, the selected experts can be different at each timestep. As a result, each token has access to 47B parameters, but only uses 13B active parameters during inference. Mixtral was trained with a context size of 32k tokens and it outperforms or matches Llama 2 70B and GPT-3.5 across all evaluated benchmarks. In particular, Mixtral vastly outperforms Llama 2 70B on mathematics, code generation, and multilingual benchmarks. We also provide a model fine-tuned to follow instructions, Mixtral 8x7B - Instruct, that surpasses GPT-3.5 Turbo, Claude-2.1, Gemini Pro, and Llama 2 70B - chat model on human benchmarks. Both the base and instruct models are released under the Apache 2.0 license.
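As a rough check on the 47B-total versus 13B-active figures quoted above, the sketch below does a back-of-the-envelope parameter count. The hyperparameters (model dim 4096, SwiGLU hidden dim 14336, 32 layers, 8 experts, grouped-query attention with 8 KV heads, 32k vocabulary) are the published Mistral 7B / Mixtral values as commonly reported, assumed here rather than taken from this summary, and attention and embedding terms are lumped into a coarse estimate.

```python
# Back-of-the-envelope parameter count for Mixtral 8x7B
# (hyperparameters assumed, not stated in this summary).
dim, hidden, layers = 4096, 14336, 32
n_experts, top_k = 8, 2
vocab, n_kv_heads, head_dim = 32_000, 8, 128

expert = 3 * dim * hidden                                   # SwiGLU: gate, up, down projections
attn = 2 * dim * dim + 2 * dim * (n_kv_heads * head_dim)    # Wq, Wo full; Wk, Wv grouped-query
per_layer_shared = attn + n_experts * dim                   # attention + router logits
embeddings = 2 * vocab * dim                                # input embedding + output head

total = layers * (per_layer_shared + n_experts * expert) + embeddings
active = layers * (per_layer_shared + top_k * expert) + embeddings

print(f"total  ~ {total / 1e9:.1f}B parameters")   # ~ 46.7B, consistent with the quoted 47B
print(f"active ~ {active / 1e9:.1f}B per token")   # ~ 12.9B, consistent with the quoted 13B
```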