Glossary term
Architecture
Mechanism for injecting sequence position information into token embeddings since Transformers have no built-in order awareness.
A technique to add information about the position of a token in a sequence to the token's embedding. Transformer models use positional encoding to better understand the relationship between different parts of the sequence.
A common implementation of positional encoding uses sinusoidal functions. (Specifically, each embedding dimension is assigned a fixed frequency, and the token's position in the sequence determines where along that sine or cosine wave the value is sampled; the amplitude is constant.) This technique enables a Transformer model to learn to attend to different parts of the sequence based on their position.
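The sinusoidal scheme can be sketched in a few lines. The following is a minimal illustration using only the Python standard library, assuming an even embedding size and the base constant 10000 from the original Transformer paper; the function names sinusoidal_positional_encoding and add_positions are hypothetical, chosen for this example.

```python
import math


def sinusoidal_positional_encoding(seq_len, d_model, base=10000.0):
    """Return a seq_len x d_model table of fixed (non-learned) encodings."""
    if d_model % 2 != 0:
        raise ValueError("this sketch assumes an even d_model")
    table = []
    for pos in range(seq_len):
        row = [0.0] * d_model
        for i in range(0, d_model, 2):
            # Each (sin, cos) pair of dimensions shares one frequency;
            # later pairs oscillate more slowly as i grows.
            angle = pos / (base ** (i / d_model))
            row[i] = math.sin(angle)
            row[i + 1] = math.cos(angle)
        table.append(row)
    return table


def add_positions(token_embeddings):
    """Add the positional table element-wise to a list of token vectors."""
    seq_len = len(token_embeddings)
    d_model = len(token_embeddings[0])
    pe = sinusoidal_positional_encoding(seq_len, d_model)
    return [
        [x + p for x, p in zip(token, position)]
        for token, position in zip(token_embeddings, pe)
    ]
```

Because the table depends only on position and dimension index, it can be computed once and reused for any input of the same length and width.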
The original Transformer used sinusoidal positional encodings: fixed mathematical patterns that let the model distinguish token positions without learned parameters.
Learned absolute positional embeddings (used in BERT and GPT-2) add trainable position vectors to token embeddings, but they do not generalise to sequences longer than those seen during training, because positions beyond the trained maximum have no trained vector.
ALiBi (Attention with Linear Biases, Press et al. 2021) adds position-based biases directly to attention scores rather than to embeddings. It is used in MosaicML's MPT-7B and in BloombergGPT to support length extrapolation.
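To make the contrast with embedding-based schemes concrete, here is a rough sketch of an ALiBi-style bias using only the Python standard library. It assumes the number of heads is a power of two (the simple case for the slope schedule) and, as a simplification for this example, folds the causal mask into the bias as negative infinity. The function names alibi_slopes and alibi_bias are hypothetical.

```python
def alibi_slopes(num_heads):
    """Per-head slopes: a geometric sequence starting at 2^(-8 / num_heads).

    Assumes num_heads is a power of two.
    """
    start = 2 ** (-8 / num_heads)
    return [start ** (h + 1) for h in range(num_heads)]


def alibi_bias(num_heads, seq_len):
    """Return bias[h][i][j], to be added to head h's attention score
    for query position i and key position j before the softmax."""
    neg_inf = float("-inf")
    bias = []
    for slope in alibi_slopes(num_heads):
        head = []
        for i in range(seq_len):
            row = []
            for j in range(seq_len):
                if j <= i:
                    # Penalty grows linearly with the distance i - j.
                    row.append(slope * (j - i))
                else:
                    # Assumption of this sketch: future keys are masked here.
                    row.append(neg_inf)
            head.append(row)
        bias.append(head)
    return bias
```

The bias depends only on the distance between query and key, not on absolute position, so the same rule applies unchanged to sequences longer than those used in training.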
Created for this library
An NLP team uses sinusoidal positional encoding in its transformer so token order is encoded explicitly in inputs.
A research team experiments with learned positional encodings to see if they capture position better than fixed sinusoidal encodings for long documents.
A code-completion team uses relative positional encodings in its transformer so the model focuses on relative token distances.
Definition source: Google for Developers Machine Learning Glossary | Creative Commons Attribution 4.0 License