Hybrid Model

Architecture combining Transformer attention layers with SSM or other non-attention layers to balance capability and efficiency.

1.
AI21 Labs' Jamba uses a Mamba plus attention hybrid architecture (43% Mamba-2, 7% attention, 50% MLP) and scored 65.4 on Arena Hard, outperforming both pure Transformer and pure SSM models at the same scale.
2.
NVIDIA research (2024) validated that hybrid SSM-attention models match pure Transformers on five-shot MMLU and other benchmarks while retaining the inference efficiency of SSMs, driving adoption of hybrid architectures in production.
3.
IBM Granite 4.0, Microsoft Phi-Mamba, and Falcon Mamba all use hybrid attention-SSM architectures, suggesting that hybrid models will increasingly replace pure Transformer architectures for medium-scale (7-70B) production deployments.

Loading…