Mixture of Experts (MoE)

Model architecture activating only a subset of parameters per token for efficient scaling.

A technique for increasing neural network efficiency by using only a subset of the network's parameters (known as an expert) to process a given input token or example. A gating network, also called a router, sends each input token or example to the appropriate expert or experts.
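The routing step described above can be sketched in a few lines of plain Python. This is a minimal illustration of top-k gating under stated assumptions, not the implementation of any particular model: the names `top_k_route` and `moe_layer`, the toy experts, and the random gate weights are all hypothetical, and real systems add details omitted here, such as gating noise and load balancing.

```python
import math
import random

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_route(token, gate_weights, k=2):
    """Return [(expert_index, weight), ...] for the k highest-scoring experts."""
    # One gating logit per expert: dot product of the token with that expert's gate vector.
    logits = [sum(w * x for w, x in zip(row, token)) for row in gate_weights]
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Normalize over the selected experts only, so their weights sum to 1.
    weights = softmax([logits[i] for i in top])
    return list(zip(top, weights))

def moe_layer(token, gate_weights, experts, k=2):
    """Weighted sum of the selected experts' outputs; unselected experts never run."""
    out = [0.0] * len(token)
    for idx, weight in top_k_route(token, gate_weights, k):
        expert_out = experts[idx](token)
        out = [o + weight * y for o, y in zip(out, expert_out)]
    return out

# Toy setup (all values are made up for illustration).
random.seed(0)
dim, n_experts = 4, 8
gate_weights = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_experts)]
# Hypothetical experts: each one simply scales its input by a different factor.
experts = [lambda x, s=i + 1: [s * v for v in x] for i in range(n_experts)]

token = [0.5, -1.0, 0.25, 2.0]
routes = top_k_route(token, gate_weights, k=2)
output = moe_layer(token, gate_weights, experts, k=2)
print(routes)
print(output)
```

With 8 experts and k=2, only two expert functions are called per token, which is where the compute savings come from; the gating network itself still scores every expert.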

For details, see either of the following papers:

Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

Mixture-of-Experts with Expert Choice Routing

Examples

1. Mixtral 8x7B (Mistral AI) has 8 expert FFN blocks in each layer and activates 2 of them per token. It has about 47B total parameters but uses only about 13B per token, and Mistral AI reports that it matches or outperforms GPT-3.5 on most standard benchmarks, with faster and cheaper inference than a dense model of the same total size.
2. GPT-4 is widely reported, though not confirmed by OpenAI, to be an MoE model with multiple experts. If so, each token would incur the inference cost of only a fraction of the total parameter count.
3. DeepSeek V3 (671B total parameters, 37B active per token) reports performance comparable to GPT-4-class models at a stated cost of about $5.6M for its final training run. That figure excludes earlier research and ablation experiments, but it suggests that MoE architectures can reach frontier quality at much lower training cost.
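To make the total-versus-active distinction concrete, the short calculation below uses the DeepSeek V3 figures quoted above. The helper name `active_fraction` is hypothetical, and the result is only a rough proxy: per-token compute scales approximately with active parameters, while memory still has to hold all of the parameters.

```python
def active_fraction(total_params_b, active_params_b):
    """Share of parameters used per token (both arguments in billions)."""
    return active_params_b / total_params_b

# DeepSeek V3 figures from the list above: 671B total, 37B active per token.
frac = active_fraction(671, 37)
print(f"{frac:.1%}")  # prints 5.5%
```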

Real-world uses

Created for this library

1. An LLM team adopts a mixture-of-experts architecture to scale parameter count without proportionally scaling inference cost.
2. A search platform team uses mixture-of-experts to route different query types through specialized experts trained on relevant data.
3. A research lab uses mixture-of-experts to specialize different experts on different languages within a multilingual translation model.


