Glossary term
Glossary term
Foundations
A system or algorithm that translates a sequence of input data into tokens.
Most modern foundation models are multimodal. A tokenizer for a multimodal system must translate each input type into the appropriate format. For example, given input data consisting of both text and graphics, the tokenizer might translate input text into subwords and input images into small patches. The tokenizer must then convert all the tokens into a single unified embedding space, which enables the model to "understand" a stream of multimodal input.
Created for this library
An NLP team uses a subword tokenizer trained on its domain corpus so rare technical terms are split into meaningful subwords.
A multilingual translation team uses a shared tokenizer across languages so the model's vocabulary is compact.
A code-completion team uses a tokenizer trained on code so identifiers and operators are tokenized efficiently.
Definition source: Google for Developers Machine Learning Glossary | Creative Commons Attribution 4.0 License