Grouped Query Attention (GQA)

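Attention variant in which query heads are split into groups and each group shares a single key/value head, reducing KV cache memory. It sits between multi-head attention (MHA, one key/value head per query head) and multi-query attention (MQA, one key/value head for all query heads).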
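The sharing pattern can be sketched in a few lines of plain Python. This is a minimal illustration, not any model's actual implementation: the names `kv_head_for_query_head` and `gqa` are hypothetical, the grouping of consecutive query heads is an assumed convention, and masking, batching, and projections are omitted.

```python
import math


def kv_head_for_query_head(q_head, n_q_heads, n_kv_heads):
    """Index of the key/value head that query head `q_head` reads from.

    Assumes consecutive query heads form a group (an illustrative convention).
    """
    if n_q_heads % n_kv_heads != 0:
        raise ValueError("n_q_heads must be a multiple of n_kv_heads")
    group_size = n_q_heads // n_kv_heads
    return q_head // group_size


def softmax(xs):
    # Subtract the max for numerical stability.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def gqa(queries, keys, values):
    """Grouped-query attention for one sequence, without masking.

    queries:      [n_q_heads][seq_len][head_dim]
    keys, values: [n_kv_heads][seq_len][head_dim]
    returns:      [n_q_heads][seq_len][head_dim]
    """
    n_q, n_kv = len(queries), len(keys)
    d = len(queries[0][0])
    out = []
    for h, q_head in enumerate(queries):
        g = kv_head_for_query_head(h, n_q, n_kv)
        k_head, v_head = keys[g], values[g]  # shared by every head in group g
        head_out = []
        for q in q_head:
            scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in k_head]
            weights = softmax(scores)
            head_out.append(
                [sum(w * v[j] for w, v in zip(weights, v_head)) for j in range(d)]
            )
        out.append(head_out)
    return out
```

Setting `n_kv_heads` equal to `n_q_heads` recovers MHA, and setting it to 1 recovers MQA. Only the key/value tensors are cached during decoding, so the cache scales with `n_kv_heads` rather than `n_q_heads`.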

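1. Llama 2 70B uses GQA with 8 key/value heads shared across 64 query heads (8 query heads per group). This shrinks the KV cache by 8x compared to multi-head attention with the same number of query heads, so the same cache memory budget holds up to 8x more tokens, usable as longer context or larger batches.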
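2. Mistral 7B uses GQA (8 key/value heads shared across 32 query heads) to reduce inference memory footprint, cutting the KV cache by 4x relative to standard MHA. At batch size 1 the model weights dominate memory, so the saving matters most at long contexts and larger batch sizes, where an MHA-sized cache would crowd a single 24GB consumer GPU.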
3. Google's PaLM 2 is reported to use GQA to improve inference efficiency at scale, although its full architecture details have not been published. The technique has been widely adopted in LLM architectures released from 2023 onward.
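The memory savings in these examples follow from simple arithmetic, sketched below. The helper `kv_cache_bytes` is hypothetical, and the Llama 2 70B figures (80 layers, head dimension 128, 4096-token context, 16-bit cache values) are assumptions taken from the publicly released model configuration rather than from this entry.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Approximate KV cache size in bytes.

    The leading factor of 2 counts one key tensor plus one value tensor
    per layer; bytes_per_value=2 assumes fp16/bf16 cache entries.
    """
    return (2 * n_layers * n_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)


GIB = 2 ** 30

# Assumed Llama 2 70B shape: 80 layers, 64 query heads, 8 KV heads, head_dim 128.
gqa_bytes = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128, seq_len=4096)
mha_bytes = kv_cache_bytes(n_layers=80, n_kv_heads=64, head_dim=128, seq_len=4096)

print(gqa_bytes / GIB)        # 1.25 (GiB with GQA)
print(mha_bytes / GIB)        # 10.0 (GiB with MHA-style caching)
print(mha_bytes / gqa_bytes)  # 8.0
```

The ratio depends only on the number of query heads per key/value head, which is why 64 query heads over 8 key/value heads gives 8x and 32 over 8 gives 4x. Weights and activations are not included in this estimate.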
