Glossary term
Foundations
In language models, a token that is a substring of a word, which may be the entire word.
For example, a word like "itemize" might be broken up into the pieces "item" (a root word) and "ize" (a suffix), each of which is represented by its own token. Splitting uncommon words into such pieces, called subwords, allows language models to operate on the word's more common constituent parts, such as prefixes and suffixes.
Conversely, common words like "going" might not be broken up and might be represented by a single token.
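The splitting described above can be illustrated with a minimal sketch. It assumes a tiny hand-written vocabulary (`TOY_VOCAB`) and uses a simple greedy longest-match-first rule; both are hypothetical choices made for illustration, not the algorithm or vocabulary of any particular tokenizer.

```python
def split_into_subwords(word, vocab, unknown="[UNK]"):
    """Greedily split `word` into the longest pieces found in `vocab`.

    Illustrative only: real tokenizers learn their vocabulary from data
    and may use different splitting rules.
    """
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate from the right until it matches a vocabulary entry.
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            # No vocabulary entry starts at this position.
            return [unknown]
        pieces.append(word[start:end])
        start = end
    return pieces


# Hypothetical toy vocabulary; a real one would hold tens of thousands of entries.
TOY_VOCAB = {"item", "ize", "going", "go", "ing"}

print(split_into_subwords("itemize", TOY_VOCAB))  # ['item', 'ize']
print(split_into_subwords("going", TOY_VOCAB))    # ['going']
```

Because "going" is itself in the toy vocabulary, the longest match covers the whole word and it stays a single token, while "itemize" falls back to its two known pieces.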
Created for this library
An NLP team builds its subword vocabulary with byte-pair encoding (BPE) so its tokenizer can represent rare and out-of-vocabulary words as sequences of known pieces.
A multilingual translation team uses subword tokens to share vocabulary across languages and reduce model size.
A code-completion team uses subword tokens so the tokenizer handles long identifiers and rare API names compactly.
Definition source: Google for Developers Machine Learning Glossary | Creative Commons Attribution 4.0 License