Sliding Window Attention

Attention pattern where each token attends only to a local window of neighbouring tokens, reducing quadratic complexity.

1.
Mistral 7B uses sliding window attention with a 4k window size and rolling KV cache - enabling effective processing of long documents at O(n) memory rather than O(n^2) for standard attention.
2.
Longformer (Allen AI) combines sliding window local attention with global attention on task-specific tokens (e.g., [CLS]) - used for long document classification and named entity recognition on full research papers.
3.
BigBird (Google) extends this pattern with random attention tokens added to the sliding window - achieving 97% of full-attention accuracy on question answering over full-length Wikipedia articles.

Loading…