Glossary term
Agentic Systems
Model cascading is a system that picks the ideal model for a specific inference query.
Imagine a group of models, ranging from very large (many parameters) to much smaller (far fewer parameters). Very large models consume more computational resources at inference time than smaller models. However, very large models can typically handle more complex requests than smaller models. Model cascading estimates the complexity of the inference query and then picks the appropriate model to perform the inference. The main motivation for model cascading is to reduce inference costs by selecting smaller models by default and selecting a larger model only for more complex queries.
Imagine that a small model runs on a phone and a larger version of that model runs on a remote server. Good model cascading reduces cost and latency by enabling the smaller model to handle simple requests and only calling the remote model to handle complex requests.
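The selection step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: the `Model` class, the `estimate_complexity` heuristic, the keyword list, and the 0.5 threshold are all hypothetical, and a real system would more likely use a learned complexity or confidence estimate than word counts and keywords.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class Model:
    """Hypothetical wrapper around an inference endpoint."""
    name: str
    generate: Callable[[str], str]


def estimate_complexity(query: str) -> float:
    """Toy heuristic (an assumption): score a query from 0.0 to 1.0.

    Longer queries and queries containing reasoning keywords score higher.
    """
    keywords = ("why", "prove", "compare", "explain", "step by step")
    length_score = min(len(query.split()) / 50, 1.0)
    keyword_score = 0.5 if any(k in query.lower() for k in keywords) else 0.0
    return min(length_score + keyword_score, 1.0)


def cascade(query: str, small: Model, large: Model,
            threshold: float = 0.5) -> Tuple[str, str]:
    """Send the query to the small model unless it looks complex."""
    model = large if estimate_complexity(query) >= threshold else small
    return model.name, model.generate(query)


# Stand-in models; real ones would call an on-device or remote model.
small = Model("small", lambda q: "small answer to: " + q)
large = Model("large", lambda q: "large answer to: " + q)

simple_choice, _ = cascade("What time is it?", small, large)
complex_choice, _ = cascade(
    "Explain why quicksort is faster than bubble sort on average",
    small, large)
```

With these stand-ins, the short factual question stays on the small model and the explanatory question is escalated to the large one. The threshold is the main tuning knob: lowering it sends more traffic to the large model, which raises cost and may improve quality on borderline queries.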
See also model router.
Examples created for this library
A SaaS team uses model cascading to route simple queries to a Flash model and complex queries to a larger model only when needed.
A search team uses model cascading where a cheap recall model retrieves candidates before a heavy ranker scores the top few hundred.
An LLM product team uses model cascading to control cost by using a small model for first-pass classification and a larger model only on harder cases.
Definition source: Google for Developers Machine Learning Glossary | Creative Commons Attribution 4.0 License