Glossary term
Glossary term
Infrastructure and Serving
Cached attention keys and values used to speed up autoregressive generation.
vLLM's PagedAttention manages KV cache as virtual memory pages - enabling 65% higher GPU memory utilisation than standard KV caching and 24x throughput improvement for concurrent LLM requests.
Anthropic's prompt caching feature for Claude 3.5 Sonnet allows teams to cache system prompts and large documents - a legal-tech company caches a 50-page contract in KV cache, reducing per-query cost by 90%.
Google's Gemini 1.5 Pro context caching reduces costs for long-context applications - a video analysis pipeline caches a 1-hour video transcript (500k tokens), paying the cache write cost once and reading cheaply per query.