Glossary term
Glossary term
Architecture
Mechanism that lets a model weigh the relevance of each input token when producing each output token.
Google Translate uses attention to align source and target language tokens during neural machine translation - the model attends to relevant source words when generating each translated word.
Longformer (Allen AI) replaces standard O(n^2) attention with a sparse sliding-window attention pattern, enabling document-level understanding of 4,096-token legal contracts without memory overflow.
Flash Attention (Tri Dao, 2022) rewrites GPU-level attention computation to be IO-aware, reducing memory usage by 10-20x and enabling 64k+ token context windows in practical training runs.