Tokenizer

A system or algorithm that translates a sequence of input data into tokens.

Most modern foundation models are multimodal. A tokenizer for a multimodal system must translate each input type into the appropriate format. For example, given input data consisting of both text and graphics, the tokenizer might translate input text into subwords and input images into small patches. The tokenizer must then convert all the tokens into a single unified embedding space, which enables the model to "understand" a stream of multimodal input.
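The text-to-subwords and image-to-patches steps described above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular model does it: the vocabulary VOCAB, the greedy longest-match rule, and the function names tokenize_text and tokenize_image are all assumptions made for the example. The final step of mapping tokens into a shared embedding space is left out.

```python
from typing import List, Set

# Hypothetical toy vocabulary; a real tokenizer learns its vocabulary from data.
VOCAB: Set[str] = {"token", "izer", "un", "believ", "able", "s"}


def tokenize_text(word: str, vocab: Set[str] = VOCAB) -> List[str]:
    """Split one word into subwords by greedy longest match (an assumed rule)."""
    pieces: List[str] = []
    start = 0
    while start < len(word):
        # Try the longest candidate first; fall back to a single character.
        for end in range(len(word), start, -1):
            piece = word[start:end]
            if piece in vocab or end == start + 1:
                pieces.append(piece)
                start = end
                break
    return pieces


def tokenize_image(image: List[List[int]], patch: int) -> List[List[List[int]]]:
    """Split a 2D grid of pixel values into non-overlapping patch x patch tiles.

    Rows or columns that do not fill a whole tile are dropped.
    """
    height, width = len(image), len(image[0])
    patches: List[List[List[int]]] = []
    for top in range(0, height - patch + 1, patch):
        for left in range(0, width - patch + 1, patch):
            patches.append([row[left:left + patch] for row in image[top:top + patch]])
    return patches
```

For example, tokenize_text("tokenizers") splits the word into "token", "izer", and "s", and a 4 x 4 image passed to tokenize_image with patch=2 yields four 2 x 2 patches. In a multimodal model, each subword and each patch would then be mapped to a vector of the same dimensionality.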

Real-world uses

Created for this library

1. An NLP team uses a subword tokenizer trained on its domain corpus so rare technical terms are split into meaningful subwords.
2. A multilingual translation team uses a shared tokenizer across languages so the model's vocabulary is compact.
3. A code-completion team uses a tokenizer trained on code so identifiers and operators are tokenized efficiently.
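To make "trained on its domain corpus" concrete, here is a rough sketch of how a subword vocabulary might be learned with a byte-pair-encoding-style procedure. BPE is one common choice and is an assumption here; the uses above do not name an algorithm. The function name learn_bpe_merges and the whitespace word splitting are likewise illustrative.

```python
from collections import Counter
from typing import Dict, List, Tuple


def learn_bpe_merges(corpus: List[str], num_merges: int) -> List[Tuple[str, str]]:
    """Learn up to num_merges BPE-style merges from whitespace-split words."""
    # Represent each distinct word as a tuple of symbols, weighted by frequency.
    words: Dict[Tuple[str, ...], int] = Counter(
        tuple(word) for line in corpus for word in line.split()
    )
    merges: List[Tuple[str, str]] = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across the whole corpus.
        pairs: Counter = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Pick the most frequent pair; ties are broken by the pair itself.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        # Rewrite every word with the chosen pair fused into one symbol.
        merged_words: Dict[Tuple[str, ...], int] = Counter()
        for symbols, freq in words.items():
            out: List[str] = []
            i = 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged_words[tuple(out)] += freq
        words = merged_words
    return merges
```

Running this on a corpus where a technical term recurs often causes its characters to be merged early, so the term ends up as one or a few subwords rather than many single characters. The same idea applies to identifiers and operators when the corpus is source code.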

Definition source: Google for Developers Machine Learning Glossary | Creative Commons Attribution 4.0 License
