Glossary term
Product and Operations
Time taken for a system to respond.
The time it takes for a model to process input and generate a response. A high-latency response takes longer to generate than a low-latency response.
Factors that influence latency of large language models include:
Input and output token lengths
Model complexity
The infrastructure the model runs on
Optimizing for latency is crucial for creating responsive and user-friendly applications.
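As a rough illustration of how the factors above combine, end-to-end latency for a streamed response is often approximated as the time to first token plus the time needed to generate the remaining output tokens. The sketch below assumes that simplified model; the function name and all numbers are hypothetical and are not measurements of any real system.

```python
def estimate_latency_seconds(ttft_s: float, output_tokens: int, tokens_per_second: float) -> float:
    """Approximate end-to-end latency under a simplified model:
    time to first token plus output length divided by decode throughput."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return ttft_s + output_tokens / tokens_per_second


# Hypothetical values: 0.2 s to first token, 100 tokens/sec decode speed.
short_reply = estimate_latency_seconds(0.2, 50, 100.0)
long_reply = estimate_latency_seconds(0.2, 500, 100.0)
print(f"short reply: {short_reply:.2f}s, long reply: {long_reply:.2f}s")
```

Under these assumed numbers, the longer output dominates total latency, which is why output length is usually worth limiting in latency-sensitive applications.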
Groq's LPU (Language Processing Unit) achieves 750 tokens/sec on Llama 3 70B - 10x faster than GPU inference - enabling real-time voice AI applications where >500ms latency breaks the conversational experience.
Cloudflare Workers AI deploys quantised models at the network edge in 300+ PoPs, achieving <50ms time-to-first-token for inference requests globally - used by developer tools requiring sub-100ms AI responses.
Anthropic's Claude 3 Haiku targets <1-second time-to-first-token for most queries, making it suitable for interactive coding assistance in Cursor AI where latency directly affects developer flow state.
Created for this library
A SaaS company tracks 95th-percentile latency on its assistant feature and treats regressions above the SLA as a release blocker.
A trading platform requires inference latency under five milliseconds for its order-routing model.
A mobile app team profiles model latency on the worst-supported phone so the user experience is consistent across the device range.
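The first example above (gating releases on 95th-percentile latency) can be sketched as follows. This is a minimal illustration using the nearest-rank percentile method, which is only one of several percentile definitions. The function names, the sample data, and the SLA threshold are all hypothetical.

```python
import math


def percentile_nearest_rank(samples, pct):
    """Return the pct-th percentile (integer pct in 1..100) using the nearest-rank method."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    # Nearest rank: the smallest rank covering at least pct percent of samples.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def is_release_blocked(latencies_ms, sla_ms):
    """Hypothetical gate: block the release when p95 latency exceeds the SLA."""
    return percentile_nearest_rank(latencies_ms, 95) > sla_ms


# Hypothetical latency samples in milliseconds: 1, 2, ..., 100.
latencies_ms = list(range(1, 101))
p95 = percentile_nearest_rank(latencies_ms, 95)
print(p95, is_release_blocked(latencies_ms, sla_ms=90))
```

Tracking a high percentile rather than the mean keeps occasional slow responses visible, since a handful of slow requests barely moves the average.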
Definition source: Google for Developers Machine Learning Glossary | Creative Commons Attribution 4.0 License