Glossary term
Infrastructure and Serving
High-performance LLM and multimodal model serving framework from the LMSYS team featuring RadixAttention for efficient prefix caching in agentic and multi-turn workloads.
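SGLang's RadixAttention reuses the computed KV cache across requests that share a common prefix, such as a system prompt. In agentic deployments where the same 10,000-token system prompt is sent with every request, the prefix is prefilled once. On later requests, as long as the prefix remains cached, SGLang computes only the tokens that follow it, which reduces prefill computation for the shared prefix to near zero.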
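The prefix-reuse idea can be illustrated with a toy token-level trie. This is a minimal sketch, not SGLang's implementation: the `PrefixCache` class and its methods are hypothetical names, the trie stores no real KV tensors, and it omits the path compression and eviction that a production radix tree would need. It only counts how many tokens would still require prefill.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    # Maps the next token id to the child node standing in for that token's cached KV entry.
    children: dict = field(default_factory=dict)


class PrefixCache:
    """Toy token-level trie used as a KV cache index (hypothetical, not SGLang's API)."""

    def __init__(self):
        self.root = Node()

    def match_prefix(self, tokens):
        """Return how many leading tokens already have cached entries."""
        node, matched = self.root, 0
        for tok in tokens:
            child = node.children.get(tok)
            if child is None:
                break
            node, matched = child, matched + 1
        return matched

    def insert(self, tokens):
        """Record that cached entries now exist for every prefix of `tokens`."""
        node = self.root
        for tok in tokens:
            node = node.children.setdefault(tok, Node())

    def prefill(self, tokens):
        """Simulate one request: return the number of tokens that still need prefill."""
        cached = self.match_prefix(tokens)
        self.insert(tokens)
        return len(tokens) - cached


# Stand-in for a 10,000-token system prompt shared by every request.
system_prompt = list(range(10_000))
cache = PrefixCache()

first = cache.prefill(system_prompt + [20_001, 20_002, 20_003])
second = cache.prefill(system_prompt + [30_001, 30_002])
print(first, second)  # prints "10003 2"
```

In this sketch the first request pays for the full prompt, while the second pays only for its two new tokens because the shared prefix is already indexed.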
SGLang joined the PyTorch ecosystem in March 2025 and provides day-one support for DeepSeek V3 and R1 models on both NVIDIA and AMD GPUs. In benchmarks running GPT-OSS-120B on two H100 GPUs, SGLang showed the most stable per-token latency (4 to 21 ms) across varying load patterns.
LLaVA-1.6's official serving demo is powered by SGLang, which handles multimodal inputs (text plus image tokens) with the same RadixAttention optimisations used for text-only models. In multi-image workloads, it has demonstrated up to 6.4x higher throughput than vLLM.