Glossary term
Foundations
Tokenisation algorithm that iteratively merges the most frequent adjacent byte or character pairs to form a vocabulary of subword units.
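The merge loop described above can be sketched in a few lines of Python. This is a minimal character-level illustration, not the implementation used by any particular tokeniser: the function names `train_bpe`, `merge_pair`, and `encode` are hypothetical, ties between equally frequent pairs are broken alphabetically (an assumption made here for determinism), and real tokenisers typically operate on bytes and add pre-tokenisation rules.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    merged = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a list of words."""
    # Start from single characters; store each distinct word with its frequency.
    words = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pair_counts = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        # Most frequent pair wins; ties are broken alphabetically (assumption).
        best = min(pair_counts, key=lambda p: (-pair_counts[p], p))
        merges.append(best)
        # Apply the new merge rule to every word in the working corpus.
        new_words = Counter()
        for symbols, freq in words.items():
            new_words[merge_pair(symbols, best)] += freq
        words = new_words
    return merges


def encode(word, merges):
    """Tokenise `word` by applying the learned merges in the order they were learned."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Toy corpus: the frequent suffix "est" and stem "low" become subword units.
corpus = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
merges = train_bpe(corpus, num_merges=4)
print(merges)                    # [('e', 's'), ('es', 't'), ('l', 'o'), ('lo', 'w')]
print(encode("lowest", merges))  # ['low', 'est']
```

Note that "lowest" never appears in the toy corpus, yet it still splits into two learned subword units. This is the property that lets a BPE vocabulary cover unseen words without an unknown-token fallback.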
GPT-4 uses a BPE tokeniser (tiktoken's cl100k_base encoding) with a 100,277-token vocabulary, trading off vocabulary coverage against sequence length. A typical English word tokenises to 1.3 tokens on average, compared with 1.8 tokens under the original GPT-2 BPE vocabulary.
Llama 3's BPE tokeniser has a 128,000-token vocabulary (four times the size of Llama 2's 32,000), improving multilingual coverage and reducing the average number of tokens per non-English word by 30%. Because inference cost scales with token count, this directly reduces inference costs.
Hugging Face's tokenizers library implements BPE in Rust for fast tokenisation, processing 10+ million tokens per second on a CPU. It is widely used in production LLM serving stacks to tokenise inputs before model inference.