Byte Pair Encoding (BPE)

Tokenisation algorithm that iteratively merges the most frequent adjacent byte or character pairs to form a vocabulary of subword units.

1.
GPT-4 uses a BPE tokeniser (tiktoken cl100k) with a 100,277-token vocabulary, balancing vocabulary coverage and sequence length. A typical English word tokenises to 1.3 tokens on average, compared to 1.8 tokens with the original GPT-2 BPE vocabulary.
2.
Llama 3's BPE tokeniser has a 128,000-token vocabulary (4x larger than Llama 2's 32,000), improving multilingual coverage and reducing the average number of tokens per non-English word by 30%, directly reducing inference costs.
3.
Hugging Face's tokenizers library implements BPE with Rust-based fast tokenisation, processing 10+ million tokens per second on a CPU - used in all production LLM serving stacks to tokenise inputs before model inference.

Byte Pair EncodingBPE