Glossary term
Infrastructure and Serving
Open-source LLM inference and serving engine that uses PagedAttention, which stores the KV cache in fixed-size blocks allocated on demand, to achieve near-zero KV cache memory waste and high throughput.
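The block-based idea behind this definition can be illustrated with a toy allocator. This is a minimal sketch, not vLLM's implementation: the names `PagedAllocator`, `contiguous_waste`, `BLOCK_SIZE`, and `MAX_SEQ_LEN` are hypothetical, and the block size and maximum length are illustrative values. It compares the memory left unused when every request reserves a full maximum-length region with the memory left unused when blocks are handed out one at a time as tokens arrive.

```python
BLOCK_SIZE = 16      # tokens per KV block (illustrative assumption)
MAX_SEQ_LEN = 2048   # slots a contiguous allocator reserves per request (illustrative)


def contiguous_waste(seq_lens, max_len=MAX_SEQ_LEN):
    """Fraction of reserved slots unused when each request reserves max_len slots."""
    reserved = max_len * len(seq_lens)
    return 1 - sum(seq_lens) / reserved


class PagedAllocator:
    """Toy block table: each sequence maps to a list of physical block ids."""

    def __init__(self, num_blocks, block_size=BLOCK_SIZE):
        self.block_size = block_size
        self.free = list(range(num_blocks))  # pool of free physical blocks
        self.tables = {}                     # seq_id -> list of physical block ids
        self.lengths = {}                    # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        n = self.lengths.get(seq_id, 0)
        # A new block is needed for the first token or when the last block is full.
        if n % self.block_size == 0:
            if not self.free:
                raise MemoryError("no free KV blocks")
            self.tables.setdefault(seq_id, []).append(self.free.pop())
        self.lengths[seq_id] = n + 1

    def release(self, seq_id):
        # Return all blocks of a finished sequence to the free pool.
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

    def waste(self):
        """Fraction of allocated slots that hold no token (only last blocks can be partial)."""
        allocated = sum(len(t) for t in self.tables.values()) * self.block_size
        used = sum(self.lengths.values())
        return 1 - used / allocated if allocated else 0.0


if __name__ == "__main__":
    seq_lens = [300, 57, 1200, 410]  # made-up sequence lengths
    alloc = PagedAllocator(num_blocks=256)
    for seq_id, length in enumerate(seq_lens):
        for _ in range(length):
            alloc.append_token(seq_id)
    print(f"contiguous waste: {contiguous_waste(seq_lens):.1%}")
    print(f"paged waste:      {alloc.waste():.1%}")
```

In this toy model the only unused slots under paging are those in each sequence's last, partially filled block, which is why the waste stays small regardless of how much sequence lengths vary.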
vLLM is used by Together AI, Anyscale, and enterprise teams to serve Llama 3.1 at scale. PagedAttention delivers 2-4x higher throughput than FasterTransformer and Orca at comparable latency, and vLLM reaches approximately 12,500 tokens per second for Llama 3.1 8B on a single H100 80GB GPU.
A European bank deploys vLLM to serve Llama 3.1 70B on an 8xH100 cluster for an internal document-analysis assistant, handling 5,000 daily requests at 600+ tokens per second aggregate throughput and sub-500 ms latency.
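vLLM powers the serving infrastructure at Replicate and RunPod. Dynamic KV block allocation reduces KV cache memory waste from 60-80% (with contiguous pre-allocation in earlier systems) to under 4%, and continuous batching keeps the GPU busy by admitting new requests as soon as running ones finish instead of waiting for a whole static batch to complete.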
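The effect of continuous batching on GPU utilization can be sketched with a small simulation. This is a toy model under stated assumptions, not vLLM's scheduler: every decode step costs the same regardless of how many slots are occupied, prefill is ignored, each request generates at least one token, and the function names and request lengths are hypothetical.

```python
def static_batching(output_lens, batch_size):
    """Each batch runs until its longest request finishes; finished slots sit idle."""
    steps = busy = 0
    for i in range(0, len(output_lens), batch_size):
        batch = output_lens[i:i + batch_size]
        steps += max(batch)   # the batch occupies the GPU for its longest request
        busy += sum(batch)    # slot-steps that actually produced a token
    utilization = busy / (steps * batch_size) if steps else 0.0
    return steps, utilization


def continuous_batching(output_lens, batch_size):
    """A slot freed by a finished request is refilled from the queue at the next step."""
    queue = list(output_lens)
    running = []              # remaining tokens for each in-flight request
    steps = busy = 0
    while queue or running:
        # Admit waiting requests into any free slots before the step runs.
        while queue and len(running) < batch_size:
            running.append(queue.pop(0))
        steps += 1
        busy += len(running)
        # Each running request emits one token; drop those that just finished.
        running = [r - 1 for r in running if r > 1]
    utilization = busy / (steps * batch_size) if steps else 0.0
    return steps, utilization


if __name__ == "__main__":
    lens = [10, 200, 30, 50, 400, 20, 15, 100]  # made-up output lengths in tokens
    for name, fn in [("static", static_batching), ("continuous", continuous_batching)]:
        steps, util = fn(lens, batch_size=4)
        print(f"{name:>10}: {steps} steps, {util:.0%} slot utilization")
```

The gap between the two schedulers widens as output lengths become more varied, since static batching ties every slot to the longest request in its batch.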