Transformer

Neural network architecture based on self-attention that processes sequences in parallel rather than sequentially.

A neural network architecture developed at Google that relies on self-attention mechanisms to transform a sequence of input embeddings into a sequence of output embeddings without relying on convolutions or recurrent neural networks. A Transformer can be viewed as a stack of self-attention layers.
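As a rough illustration of the self-attention mechanism described above, the following minimal sketch computes scaled dot-product self-attention in plain Python. The helper names (`matmul`, `transpose`, `softmax`, `self_attention`), the toy embeddings, and the identity projection matrices are assumptions made for this example; real implementations use learned projections, multiple attention heads, and a tensor library.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    """Swap rows and columns."""
    return [list(col) for col in zip(*m)]

def softmax(row):
    """Turn a list of scores into non-negative weights that sum to 1."""
    peak = max(row)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - peak) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention over a sequence of embeddings x."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    scale = math.sqrt(len(k[0]))
    # Every position scores every other position in one matrix product,
    # which is why the sequence can be processed in parallel.
    scores = matmul(q, transpose(k))
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, v), weights

# Hypothetical toy input: 3 positions, 2-dimensional embeddings.
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]  # stand-in for learned projections
outputs, weights = self_attention(embeddings, identity, identity, identity)
```

Each row of `weights` says how much one position draws on every position in the sequence, and `outputs` has one new embedding per input embedding.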

A Transformer can include any of the following:

an encoder

a decoder

both an encoder and decoder

An encoder transforms a sequence of embeddings into a new sequence of the same length. An encoder includes N identical layers, each of which contains two sub-layers. These two sub-layers are applied at each position of the input embedding sequence, transforming each element of the sequence into a new embedding. The first encoder sub-layer aggregates information from across the input sequence. The second encoder sub-layer transforms the aggregated information into an output embedding.
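The two encoder sub-layers can be sketched as follows, assuming a simplified attention step with no learned projections and a small ReLU feed-forward network. The function names and weight values are hypothetical. The residual additions follow the original Transformer design, while layer normalization is omitted to keep the sketch short.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b):
    """Element-wise sum of two equally shaped matrices (residual connection)."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def softmax(row):
    peak = max(row)
    exps = [math.exp(s - peak) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_sublayer(x):
    """First sub-layer: aggregate information from across the sequence.
    Simplification: queries, keys, and values are the embeddings themselves."""
    scale = math.sqrt(len(x[0]))
    scores = matmul(x, [list(col) for col in zip(*x)])
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, x)

def feed_forward_sublayer(x, w1, w2):
    """Second sub-layer: transform each position independently."""
    hidden = [[max(0.0, h) for h in row] for row in matmul(x, w1)]  # ReLU
    return matmul(hidden, w2)

def encoder(x, layers):
    """Apply N layers with the same structure; each layer has its own weights."""
    for w1, w2 in layers:
        x = add(x, attention_sublayer(x))
        x = add(x, feed_forward_sublayer(x, w1, w2))
    return x

# Hypothetical toy input and weights: 3 positions, 2-dimensional embeddings.
sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w1 = [[0.5, -0.5, 0.1, 0.0], [0.2, 0.3, -0.1, 0.4]]
w2 = [[0.1, 0.0], [0.0, 0.1], [0.2, -0.2], [-0.1, 0.3]]
encoded = encoder(sequence, [(w1, w2), (w1, w2)])  # N = 2
```

The output `encoded` has the same number of positions as `sequence`, matching the statement that an encoder produces a new sequence of the same length.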

A decoder transforms a sequence of input embeddings into a sequence of output embeddings, possibly with a different length. A decoder also includes N identical layers, each with three sub-layers, two of which are similar to the encoder sub-layers. The third decoder sub-layer takes the output of the encoder and applies attention over it (commonly called cross-attention or encoder-decoder attention) to gather information from it.
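The decoder sub-layer that attends over the encoder output, commonly called cross-attention, can be sketched as below. This is a minimal illustration under stated assumptions: the function names and toy values are hypothetical, learned projections are left out, and the masking used in the decoder's own self-attention sub-layer is not shown.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    peak = max(row)
    exps = [math.exp(s - peak) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(decoder_states, encoder_output):
    """Queries come from the decoder; keys and values come from the encoder."""
    scale = math.sqrt(len(encoder_output[0]))
    scores = matmul(decoder_states, [list(col) for col in zip(*encoder_output)])
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, encoder_output), weights

# Hypothetical toy values: 4 encoder positions, 2 decoder positions.
encoder_output = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
decoder_states = [[1.0, 0.0], [0.0, 1.0]]
context, attn = cross_attention(decoder_states, encoder_output)
```

Here `context` has one row per decoder position while each row of `attn` spans all encoder positions, which is how the decoder's sequence length can differ from the encoder's.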

The blog post Transformer: A Novel Neural Network Architecture for Language Understanding provides a good introduction to Transformers.

See LLMs: What's a large language model? in Machine Learning Crash Course for more information.

Examples

1. BERT (Google, 2018) and GPT-3 (OpenAI, 2020) are both Transformer-based models. Google has described using BERT in Search to better understand query intent in more than 70 languages.
2. The Transformer architecture underlies most major LLMs in production, including GPT-4, Claude, Gemini, and Llama. Transformer-based models are encoder-only, decoder-only, or encoder-decoder variants of the original design by Vaswani et al. (2017).
3. NVIDIA's Megatron-LM framework trains Transformer models at scale using tensor and pipeline parallelism. Together with DeepSpeed, it was used to train Megatron-Turing NLG (530B parameters) across thousands of GPUs. Other large Transformer-based models, such as BloombergGPT (50B parameters), are likewise trained with distributed parallelism across many GPUs.

Real-world uses

Created for this library

1. A document understanding team uses a transformer-based model for long-context contract review.
2. A code-completion vendor uses a transformer model to stream code suggestions token by token in the developer's IDE.
3. A search-quality team uses a transformer-based ranker on top of a feature-engineered baseline for long-tail queries.
