Self-Attention

Attention applied within a single sequence so each token can attend to all other tokens in the same sequence.

1.
In GPT-4's decoder, each generated token attends to all prior tokens in the context window via causal self-attention - this is the mechanism that enables coherent long-form text generation.
2.
BERT uses bidirectional self-attention in its encoder so each token attends to all surrounding tokens simultaneously - enabling richer representations for classification and NER tasks.
3.
Sparse self-attention in BigBird (Google) reduces complexity from O(n^2) to O(n) by combining local, global, and random attention patterns - enabling genomic sequence modelling over 4k+ base pairs.

Loading…