Glossary term
Infrastructure and Serving
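Inference optimisation that reuses the already-computed key-value (KV) cache for a prompt prefix shared across multiple requests, so that only the tokens after the shared prefix need to be processed again.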
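A minimal sketch of the idea, assuming a block-based cache in which each block of tokens is keyed by its own tokens plus the key of the block before it. The class `PrefixCache`, the block size and the string stand-ins for KV tensors are all hypothetical and not taken from any particular serving engine.

```python
from dataclasses import dataclass, field

BLOCK = 4  # tokens per cache block (hypothetical value for the sketch)


@dataclass
class PrefixCache:
    blocks: dict = field(default_factory=dict)  # block key -> stand-in for KV tensors
    computed: int = 0  # tokens whose KV entries had to be computed
    reused: int = 0    # tokens whose KV entries came from the cache

    def prefill(self, tokens):
        """Process a prompt, reusing cached blocks for any shared prefix."""
        parent = None
        full = len(tokens) - len(tokens) % BLOCK
        for i in range(0, full, BLOCK):
            # Chaining the parent key means a block only matches when the
            # entire prefix before it matches too.
            key = hash((parent, tuple(tokens[i:i + BLOCK])))
            if key in self.blocks:
                self.reused += BLOCK
            else:
                self.blocks[key] = f"kv[{i}:{i + BLOCK}]"
                self.computed += BLOCK
            parent = key
        # In this sketch a trailing partial block is always recomputed.
        self.computed += len(tokens) - full


system_prompt = list(range(8))            # shared prefix: two full blocks
cache = PrefixCache()
cache.prefill(system_prompt + [100, 101, 102])  # first request fills the cache
cache.prefill(system_prompt + [200, 201])       # second request reuses the prefix
print(cache.computed, cache.reused)
```

The second request only pays for its own two trailing tokens; the eight prefix tokens are served from the cache.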
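Anthropic's Claude API offers prompt caching for prompt prefixes above a minimum length (1,024 tokens on most models) - a company that uses a 50-page policy document as context can save up to 90% on the input-token cost of that document when the same prefix is reused across 1,000 requests, because cache reads are billed at a fraction of the normal input price.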
vLLM's automatic prefix caching lets the KV cache for a shared system prompt be computed once and reused by every request in a multi-tenant deployment - reducing prefill work, and the GPU memory traffic that goes with it, by 60% for a customer-service deployment.
Google Gemini 1.5 Pro's context caching lets a video analysis pipeline cache a 1-hour video transcript, paying the write cost once (plus storage for as long as the cache is kept) and reading cheaply for each of 1,000 subsequent Q&A queries about the video.
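The savings in examples like these can be estimated with simple arithmetic. The sketch below assumes a pricing scheme in which the first request pays a premium to write the prefix to the cache and later requests read it at a discount; the multipliers (1.25x write, 0.10x read), the 25,000-token document and the 100-token question are illustrative assumptions, and it further assumes every request after the first hits the cache and that there is no separate storage charge.

```python
def savings_fraction(prefix_tokens, suffix_tokens, requests,
                     write_mult=1.25, read_mult=0.10):
    """Fraction of input-token cost saved by caching the shared prefix.

    Costs are measured in units of the normal per-token input price, so the
    price itself cancels out. All multipliers are assumptions of this sketch.
    """
    uncached = requests * (prefix_tokens + suffix_tokens)
    cached = (
        prefix_tokens * write_mult                       # one cache write
        + (requests - 1) * prefix_tokens * read_mult     # cache reads
        + requests * suffix_tokens                       # uncached question tokens
    )
    return 1 - cached / uncached


# Hypothetical workload: a 25,000-token policy document, 100-token questions.
many = savings_fraction(25_000, 100, requests=1_000)
single = savings_fraction(25_000, 100, requests=1)
print(f"1,000 requests: {many:.1%} saved; 1 request: {single:.1%} saved")
```

Under these assumptions the saving approaches, but does not quite reach, the discount on cached reads, and a prefix that is only ever used once costs more with caching than without it.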