Glossary term
Architecture
Model architecture activating only a subset of parameters per token for efficient scaling.
A scheme to increase neural network efficiency by using only a subset of its parameters (known as an expert) to process a given input token or example. A gating network routes each input token or example to the proper expert(s).
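The routing step described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the method of any particular model: the function name `moe_layer`, the array shapes, and the choice to make each expert a single linear map are all hypothetical, and the sketch omits the gating noise and load-balancing terms that production systems typically add.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def moe_layer(tokens, gate_w, expert_ws, top_k=2):
    """Route each token to its top_k experts and mix their outputs.

    tokens:    (n_tokens, d)       input vectors
    gate_w:    (d, n_experts)      gating network weights (assumed linear)
    expert_ws: (n_experts, d, d)   one linear map per expert (assumed; real
                                   experts are usually feed-forward networks)
    """
    logits = tokens @ gate_w                                  # (n_tokens, n_experts)
    top_idx = np.argsort(logits, axis=-1)[:, -top_k:]         # chosen expert ids
    top_logits = np.take_along_axis(logits, top_idx, axis=-1)
    weights = softmax(top_logits, axis=-1)                    # normalize over chosen experts only

    out = np.zeros_like(tokens)
    for e in range(expert_ws.shape[0]):
        # Which tokens picked expert e, and in which of their top_k slots.
        token_ids, slot = np.nonzero(top_idx == e)
        if token_ids.size == 0:
            continue  # this expert does no work for this batch
        expert_out = tokens[token_ids] @ expert_ws[e]
        out[token_ids] += weights[token_ids, slot][:, None] * expert_out
    return out, top_idx

# Toy usage: 4 tokens of width 8, 4 experts, 2 active per token.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(4, 8))
gate_w = rng.normal(size=(8, 4))
expert_ws = rng.normal(size=(4, 8, 8))
out, chosen = moe_layer(tokens, gate_w, expert_ws, top_k=2)
```

Only the experts listed in `chosen` run for a given token, which is where the efficiency gain comes from: the other experts' parameters are never touched for that token.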
For details, see the following paper:
Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer
Mixtral 8x7B (Mistral AI) uses 8 expert FFNs in each layer, with 2 activated per token. It matches or exceeds GPT-3.5 on most standard benchmarks with about 47B total parameters but only about 13B active per token, enabling faster and cheaper inference.
GPT-4 is widely reported, though not confirmed by OpenAI, to be an MoE model with multiple experts. If so, the architecture would let OpenAI serve GPT-4-class capability at an inference cost that scales with the active parameters rather than the total parameter count.
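DeepSeek V3 (an MoE model with 671B total parameters, 37B active per token) reports GPT-4-class performance at a stated cost of about $5.6M for its final training run, a figure that excludes earlier research and ablation experiments. This suggests that MoE architectures can deliver frontier-quality models at much lower training cost.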
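The gap between total and active parameters in the examples above follows from simple arithmetic. The sketch below assumes a deliberately simplified accounting model in which every parameter is either shared (attention, embeddings) or belongs to exactly one of `n_experts` equally sized experts, so that total = shared + n_experts × expert and active = shared + top_k × expert. The function name `split_parameters` is hypothetical, and the model ignores refinements such as always-on shared experts.

```python
def split_parameters(total_b, active_b, n_experts, top_k):
    """Solve for shared and per-expert parameter counts (in billions).

    Assumed model (a simplification):
        total  = shared + n_experts * per_expert
        active = shared + top_k     * per_expert
    """
    if n_experts <= top_k:
        raise ValueError("n_experts must be greater than top_k")
    per_expert = (total_b - active_b) / (n_experts - top_k)
    shared = active_b - top_k * per_expert
    return shared, per_expert

# Mixtral 8x7B, using the figures Mistral AI published:
# 46.7B total parameters, 12.9B active, 8 experts, 2 active per token.
shared, per_expert = split_parameters(46.7, 12.9, n_experts=8, top_k=2)
active_fraction = 12.9 / 46.7
print(f"shared ≈ {shared:.2f}B, per expert ≈ {per_expert:.2f}B, "
      f"active fraction ≈ {active_fraction:.0%}")
```

Under these assumptions the split comes out to roughly 1.6B shared parameters and 5.6B per expert, which is why the total is well below the 8 × 7B = 56B that the model's name might suggest.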
Created for this library
An LLM team adopts a mixture-of-experts architecture to scale parameter count without proportionally scaling inference cost.
A search platform team uses mixture-of-experts to route different query types through specialized experts trained on relevant data.
A research lab uses mixture-of-experts to specialize different experts on different languages within a multilingual translation model.
Definition source: Google for Developers Machine Learning Glossary | Creative Commons Attribution 4.0 License