Mamba

Selective state space model (SSM) architecture. Its SSM parameters are functions of the input, which lets the model selectively retain or discard information at each token. It achieves 4-5x higher inference throughput than Transformers of similar size while matching their perplexity.
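
To make "input-dependent parameters" concrete, here is a minimal NumPy sketch of a selective SSM recurrence. It is an illustration under stated assumptions, not the reference implementation: the projection names (W_B, W_C, W_delta) are hypothetical, the discretization is the simplified form A_bar = exp(delta * A), B_bar = delta * B, and the loop is sequential rather than the hardware-aware parallel scan a real implementation would use.

```python
import numpy as np

def softplus(x):
    # Numerically stable log(1 + exp(x)); keeps the step size positive.
    return np.logaddexp(0.0, x)

def selective_scan(x, A, W_B, W_C, W_delta):
    """Sequential selective SSM over one sequence (illustrative sketch).

    x:       (T, D) input sequence
    A:       (D, N) state matrix entries, assumed negative for stability
    W_B:     (D, N) hypothetical projection producing B_t from x_t
    W_C:     (D, N) hypothetical projection producing C_t from x_t
    W_delta: (D, D) hypothetical projection producing the step size delta_t
    Returns the outputs y of shape (T, D) and the final state h of shape (D, N).
    """
    T, D = x.shape
    N = A.shape[1]
    h = np.zeros((D, N))            # fixed-size state, independent of T
    y = np.empty((T, D))
    for t in range(T):
        # Selectivity: delta, B and C are recomputed from the current token.
        delta = softplus(x[t] @ W_delta)       # (D,)
        B = x[t] @ W_B                         # (N,)
        C = x[t] @ W_C                         # (N,)
        A_bar = np.exp(delta[:, None] * A)     # (D, N) per-token decay
        B_bar = delta[:, None] * B[None, :]    # (D, N) per-token input gate
        h = A_bar * h + B_bar * x[t][:, None]  # state update
        y[t] = h @ C                           # readout
    return y, h
```

The only thing carried from one token to the next is h, whose shape does not depend on sequence length. A small delta leaves the state nearly unchanged (the token is mostly ignored), while a large delta decays the old state and writes the current token in more strongly.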

Examples

1. Mamba (Gu and Dao, 2023) achieves 4-5x higher inference throughput than comparably sized Transformers because it needs no KV cache, which frees memory for much larger batch sizes. Mamba-3B matches Transformer-6B perplexity while being 40% cheaper to run.
2. Mamba-2 (Dao and Gu, 2024) introduces State Space Duality (SSD), which lets the same layer be computed either as an SSM recurrence or in an attention-like form and supports larger state dimensions. Mistral AI released Codestral Mamba, a pure Mamba code model, using this architecture.
3. IBM Granite 4.0 combines Mamba-2 layers with attention layers in a hybrid architecture informed by the Bamba research, showing that the SSM-attention hybrid pattern is entering enterprise production models in 2024-2025.
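
The "no KV cache" point in example 1 can be made concrete with back-of-the-envelope arithmetic. The sketch below compares per-sequence inference memory for a Transformer KV cache, which grows with sequence length, against an SSM recurrent state, which does not. The layer counts and dimensions are illustrative assumptions, not the published configuration of any model, and the SSM figure ignores the small convolution buffer that real Mamba layers also keep.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    # Keys and values (factor 2) are stored for every layer, head and token.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

def ssm_state_bytes(n_layers, d_inner, d_state, bytes_per_elem=2):
    # One (d_inner x d_state) state per layer, regardless of sequence length.
    return n_layers * d_inner * d_state * bytes_per_elem

# Hypothetical configurations, chosen only to show the scaling behavior.
for seq_len in (1024, 4096, 16384):
    kv = kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=32, head_dim=128)
    ssm = ssm_state_bytes(n_layers=64, d_inner=5120, d_state=16)
    print(f"{seq_len:>6} tokens: KV cache {kv / 2**20:8.1f} MiB, "
          f"SSM state {ssm / 2**20:6.1f} MiB")
```

Because the per-sequence state stays constant, the memory freed at long context lengths can go to more concurrent sequences, which is where the batch-size and throughput advantage comes from.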
