Glossary term
Glossary term
Architecture
Neural network architecture based on self-attention that processes sequences in parallel rather than sequentially.
A neural network architecture developed at Google that relies on self-attention mechanisms to transform a sequence of input embeddings into a sequence of output embeddings without relying on convolutions or recurrent neural networks. A Transformer can be viewed as a stack of self-attention layers.
A Transformer can include any of the following:
an encoder
a decoder
both an encoder and decoder
An encoder transforms a sequence of embeddings into a new sequence of the same length. An encoder includes N identical layers, each of which contains two sub-layers. These two sub-layers are applied at each position of the input embedding sequence, transforming each element of the sequence into a new embedding. The first encoder sub-layer aggregates information from across the input sequence. The second encoder sub-layer transforms the aggregated information into an output embedding.
A decoder transforms a sequence of input embeddings into a sequence of output embeddings, possibly with a different length. A decoder also includes N identical layers with three sub-layers, two of which are similar to the encoder sub-layers. The third decoder sub-layer takes the output of the encoder and applies the self-attention mechanism to gather information from it.
The blog post Transformer: A Novel Neural Network Architecture for Language Understanding provides a good introduction to Transformers.
See LLMs: What's a large language model? in Machine Learning Crash Course for more information.
BERT (Google, 2018) and GPT-3 (OpenAI, 2020) are both Transformer-based models. BERT powers Google Search's understanding of query intent across 70+ languages, processing billions of queries daily.
The Transformer architecture underlies every major LLM in production including GPT-4, Claude, Gemini, and Llama - all encoder-only, decoder-only, or encoder-decoder variants of the original Vaswani et al. 2017 design.
NVIDIA's Megatron-LM framework trains Transformer models at scale - used by Turing NLG (530B parameters) and BloombergGPT (50B) with tensor and pipeline parallelism across thousands of GPUs.
Created for this library
A document understanding team uses a transformer-based model for long-context contract review.
A code-completion vendor uses a transformer model to stream code suggestions token by token in the developer's IDE.
A search-quality team uses a transformer-based ranker on top of a feature-engineered baseline for long-tail queries.
Definition source: Google for Developers Machine Learning Glossary | Creative Commons Attribution 4.0 License