Latency

Time taken for a system to respond.

The time it takes for a model to process input and generate a response. A high-latency response takes longer to generate than a low-latency response.

Factors that influence latency of large language models include:

Input and output token lengths

Model complexity

The infrastructure the model runs on

Optimizing for latency is crucial for creating responsive and user-friendly applications.
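As a rough illustration of how the factors above combine, the sketch below estimates response latency from token counts and processing rates. It is a simplified back-of-the-envelope model, not a description of any particular system: the function name, the parameters, and the split into a fixed overhead, an input-processing phase, and an output-generation phase are all assumptions made for this example.

```python
def estimate_latency_s(
    input_tokens: int,
    output_tokens: int,
    prefill_tokens_per_s: float,
    decode_tokens_per_s: float,
    overhead_s: float = 0.0,
) -> float:
    """Rough latency estimate in seconds (hypothetical model for illustration).

    prefill_tokens_per_s: assumed rate at which input tokens are processed.
    decode_tokens_per_s:  assumed rate at which output tokens are generated.
    overhead_s:           assumed fixed cost (network, queuing, and so on).
    """
    if prefill_tokens_per_s <= 0 or decode_tokens_per_s <= 0:
        raise ValueError("rates must be positive")
    prefill_s = input_tokens / prefill_tokens_per_s  # grows with input length
    decode_s = output_tokens / decode_tokens_per_s   # grows with output length
    return overhead_s + prefill_s + decode_s


# Example with made-up numbers: 1,000 input tokens, 200 output tokens.
estimate = estimate_latency_s(1000, 200, 5000.0, 100.0, overhead_s=0.05)
print(f"estimated latency: {estimate:.2f} s")
```

In this model, model complexity and infrastructure show up only indirectly, through the two rates; real systems have additional effects (batching, caching, contention) that the sketch ignores.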

Examples

1. Groq's LPU (Language Processing Unit) achieves 750 tokens/sec on Llama 3 70B - 10x faster than GPU inference - enabling real-time voice AI applications where >500ms latency breaks the conversational experience.
2. Cloudflare Workers AI deploys quantised models at the network edge in 300+ PoPs, achieving <50ms time-to-first-token for inference requests globally - used by developer tools requiring sub-100ms AI responses.
3. Anthropic's Claude 3 Haiku targets <1-second time-to-first-token for most queries, making it suitable for interactive coding assistance in Cursor AI where latency directly affects developer flow state.
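The examples quote two different measurements: time-to-first-token and tokens per second. A minimal sketch of how both could be measured from any streamed response is shown below. It assumes the response is available as a Python iterable that yields one token at a time; `measure_stream` and `StreamTiming` are hypothetical names for this example and not part of any vendor's API.

```python
import time
from typing import Callable, Iterable, NamedTuple, Optional


class StreamTiming(NamedTuple):
    ttft_s: Optional[float]  # time to first token; None if the stream was empty
    total_s: float           # time until the stream was exhausted
    tokens: int              # number of tokens received
    tokens_per_s: float      # tokens divided by total time (includes the wait for the first token)


def measure_stream(
    stream: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> StreamTiming:
    """Consume a token stream and record basic latency figures."""
    start = clock()
    first_token_at: Optional[float] = None
    count = 0
    for _ in stream:
        now = clock()
        if first_token_at is None:
            first_token_at = now  # the first token has just arrived
        count += 1
    total = clock() - start
    rate = count / total if total > 0 else 0.0
    ttft = None if first_token_at is None else first_token_at - start
    return StreamTiming(ttft, total, count, rate)


def fake_stream():
    """Stand-in for a model response: three tokens with artificial delays."""
    for token in ["Hello", ",", " world"]:
        time.sleep(0.01)
        yield token


timing = measure_stream(fake_stream())
print(timing)
```

The `clock` parameter exists so the function can be driven by a fake clock in tests. Note that the throughput here includes the wait for the first token; published tokens-per-second figures may use a different definition.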

Real-world uses

Created for this library

1. A SaaS company tracks 95th-percentile latency on its assistant feature and treats regressions above the SLA as a release blocker.
2. A trading platform requires inference latency under five milliseconds for its order-routing model.
3. A mobile app team profiles model latency on the worst-supported phone so the user experience is consistent across the device range.
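The first use above depends on computing a 95th-percentile latency and comparing it with a threshold. One way this might look is sketched below, using the nearest-rank percentile definition. The function names, the sample values, and the 500 ms threshold are all made up for illustration; a real monitoring stack may use a different percentile method.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def violates_sla(samples_ms: Sequence[float], sla_ms: float, pct: float = 95.0) -> bool:
    """True if the chosen percentile of the latency samples exceeds the SLA."""
    return percentile(samples_ms, pct) > sla_ms


# Hypothetical latency samples in milliseconds and a hypothetical 500 ms SLA.
latencies_ms = [120, 135, 150, 160, 180, 210, 240, 300, 420, 900]
print("p95:", percentile(latencies_ms, 95), "ms")
print("release blocked:", violates_sla(latencies_ms, sla_ms=500))
```

Tracking a high percentile rather than the mean keeps a small number of very slow responses visible, which is why a single 900 ms outlier in ten samples is enough to trip the check here.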

