Glossary term
Glossary term
Architecture
Attention computed in parallel across multiple learned subspaces, capturing different relationship patterns simultaneously.
GPT-4's 128 attention heads each specialise in different linguistic relationships - some heads track syntactic dependencies, others coreference chains, and others positional distance patterns.
A code-generation model's attention heads specialise differently - some track variable scoping, others function signatures, and others API call patterns - enabling structured code with correct variable references.
Research from Anthropic's interpretability team ('In-Context Vectors') shows specific attention heads in Claude are responsible for in-context learning, copy-suppression, and factual recall - enabling targeted ablation studies.