Subword Token

In language models, a token that is a substring of a word, which may be the entire word.

For example, a word like "itemize" might be broken up into the pieces "item" (a root word) and "ize" (a suffix), each of which is represented by its own token. Splitting uncommon words into such pieces, called subwords, allows language models to operate on the word's more common constituent parts, such as prefixes and suffixes.

Conversely, common words like "going" might not be broken up and might be represented by a single token.
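The two cases above can be illustrated with a minimal sketch. It assumes a tiny hand-written vocabulary (`VOCAB` is hypothetical; real tokenizers learn theirs from a corpus) and uses a simple greedy longest-match rule, which is only one of several ways a tokenizer might choose its pieces.

```python
# Hypothetical toy vocabulary, chosen only to mirror the examples above.
VOCAB = {"item", "ize", "going", "go", "ing"}


def split_into_subwords(word, vocab=VOCAB):
    """Split a word greedily into the longest pieces found in vocab.

    Characters that match no vocabulary entry are emitted one at a time,
    so the function always terminates and never drops input.
    """
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest remaining substring first, then shrink it.
        for end in range(len(word), start, -1):
            piece = word[start:end]
            if piece in vocab or end == start + 1:
                pieces.append(piece)
                start = end
                break
    return pieces


print(split_into_subwords("itemize"))  # ['item', 'ize']
print(split_into_subwords("going"))    # ['going']
```

With this vocabulary, the uncommon word is split into two tokens, while the common word stays whole because it has its own vocabulary entry.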

Real-world uses

Created for this library

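1. An NLP team trains a byte-pair encoding (BPE) tokenizer so that rare and out-of-vocabulary words are represented as sequences of known subword tokens rather than as a single unknown token.
2. A multilingual translation team uses subword tokens to share one vocabulary across languages, which keeps the vocabulary and its embedding table smaller than separate per-language word vocabularies would.
3. A code-completion team uses subword tokens so that long identifiers and rare API names are encoded as a few reusable pieces instead of each needing its own vocabulary entry.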
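As a rough illustration of how a BPE vocabulary of subword tokens could be learned, here is a simplified sketch. The function names (`learn_bpe_merges`, `merge_pair`, `apply_merges`) and the word counts are made up for this example, and production tokenizers typically add details omitted here, such as word-boundary markers and byte-level fallback.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace each adjacent occurrence of pair in symbols with one merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe_merges(word_counts, num_merges):
    """Learn up to num_merges merge rules from a {word: count} mapping."""
    # Every word starts out as a sequence of single characters.
    corpus = {tuple(word): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for symbols, count in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break  # Every word is already a single symbol.
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite the corpus with the most frequent pair merged.
        new_corpus = {}
        for symbols, count in corpus.items():
            merged = merge_pair(symbols, best)
            new_corpus[merged] = new_corpus.get(merged, 0) + count
        corpus = new_corpus
    return merges


def apply_merges(word, merges):
    """Tokenize a word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Made-up word frequencies, for illustration only.
counts = {"going": 10, "doing": 6, "itemize": 1}
merges = learn_bpe_merges(counts, num_merges=4)
print(merges)
print(apply_merges("going", merges))
print(apply_merges("itemizing", merges))
```

Because frequent character sequences are merged first, common words tend to collapse into one or two tokens, while rare words remain as several smaller pieces. A word never seen during training can still be tokenized as long as its characters are known.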

Definition source: Google for Developers Machine Learning Glossary | Creative Commons Attribution 4.0 License
