Flash Attention

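An IO-aware exact attention algorithm that reduces memory usage and increases speed by tiling the computation so that each block fits in on-chip GPU SRAM, avoiding reads and writes of the full attention matrix in slower GPU memory (HBM).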
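The tiling idea can be illustrated with a minimal NumPy sketch. This is not the actual CUDA kernel: it only shows how attention can be computed block by block with a running ("online") softmax, so the full score matrix is never stored at once. The function names `naive_attention` and `tiled_attention` and the block sizes are hypothetical choices made for this illustration.

```python
import numpy as np

def naive_attention(q, k, v):
    """Reference implementation: materialises the full (n x n) score matrix."""
    scale = 1.0 / np.sqrt(q.shape[-1])
    s = (q @ k.T) * scale
    s = s - s.max(axis=-1, keepdims=True)  # subtract row max for stability
    p = np.exp(s)
    return (p / p.sum(axis=-1, keepdims=True)) @ v

def tiled_attention(q, k, v, block_q=32, block_k=32):
    """Same result, computed one (block_q x block_k) tile of scores at a time.

    Block sizes are illustrative; a real kernel would pick them so the
    tiles fit in on-chip SRAM.
    """
    n, d = q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.empty((n, v.shape[1]))
    for i in range(0, n, block_q):
        q_blk = q[i:i + block_q]
        rows = q_blk.shape[0]
        m = np.full(rows, -np.inf)         # running row-wise max of scores
        l = np.zeros(rows)                 # running softmax denominator
        acc = np.zeros((rows, v.shape[1])) # running unnormalised output
        for j in range(0, k.shape[0], block_k):
            s = (q_blk @ k[j:j + block_k].T) * scale
            m_new = np.maximum(m, s.max(axis=-1))
            p = np.exp(s - m_new[:, None])
            # Rescale earlier partial results to the new running max.
            correction = np.exp(m - m_new)
            l = l * correction + p.sum(axis=-1)
            acc = acc * correction[:, None] + p @ v[j:j + block_k]
            m = m_new
        out[i:i + block_q] = acc / l[:, None]
    return out

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((100, 16)) for _ in range(3))
assert np.allclose(tiled_attention(q, k, v), naive_attention(q, k, v))
```

The final assertion checks that the tiled version matches the naive one, which is what "exact" means here: tiling changes the order of computation and the memory traffic, not the result.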

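1. Flash Attention (Tri Dao et al., Stanford, 2022) computes exact attention with memory that grows linearly rather than quadratically in sequence length. The paper reports results at sequence lengths up to 64K, and the approach helped make long-context models, of the kind exemplified by Claude and Gemini, feasible to train on GPUs such as the A100 at acceptable cost.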
2. Flash Attention 2 (2023) is roughly 2x faster than Flash Attention 1, achieved by reducing non-matmul FLOPs and improving parallelisation. It is used in vLLM, HuggingFace Transformers, and most production inference stacks.
3. Flash Attention 3 (2024) is optimised for the H100 (Hopper) Tensor Core architecture, adding asynchronous pipelining and FP8 support. It reaches up to 75% of theoretical peak FLOPS on H100 for FP16 attention computation.
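To see why avoiding the full attention matrix matters at a 64K context, here is a rough back-of-the-envelope calculation. It assumes FP16 scores (2 bytes each) and FP32 per-row softmax statistics (4 bytes each), counted for a single head and a single sequence; the helper names are hypothetical and the figures ignore all other activations.

```python
def attention_matrix_bytes(seq_len, bytes_per_element=2):
    """Memory for one full (seq_len x seq_len) score matrix, FP16 assumed."""
    return seq_len * seq_len * bytes_per_element

def softmax_stats_bytes(seq_len, bytes_per_element=4):
    """Memory for one per-row softmax statistic vector, FP32 assumed."""
    return seq_len * bytes_per_element

n = 64 * 1024  # 64K tokens
print(attention_matrix_bytes(n) / 2**30)  # 8.0  (GiB per head, per sequence)
print(softmax_stats_bytes(n) / 2**10)     # 256.0 (KiB per head, per sequence)
```

Under these assumptions the quadratic term alone costs 8 GiB per head, while the per-row statistics a tiled kernel keeps instead cost a few hundred KiB.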
