Glossary term
Infrastructure and Serving
Inference technique that dynamically inserts new requests into a running batch as sequences complete, maximising GPU utilisation.
vLLM implements continuous batching (also called iteration-level scheduling): when one sequence in a batch finishes, a new request takes its slot at the next decoding iteration. Benchmarks have reported up to 23x higher throughput than static batching.
Together AI and Anyscale deploy continuous batching to serve Llama 3.1 70B at 100+ concurrent users per H100 GPU. The technique is essential to the economics of multi-tenant LLM serving.
NVIDIA TensorRT-LLM implements in-flight batching (equivalent to continuous batching), reducing GPU idle time from 40% under static batching to under 5% for variable-length LLM request streams.
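The scheduling difference described in the entries above can be illustrated with a toy simulation. This is a minimal sketch, not the scheduler of vLLM or TensorRT-LLM: it assumes every active sequence produces exactly one token per decode step, ignores prefill cost and memory limits, and all function names are hypothetical.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Decode steps when each batch runs until its longest sequence finishes."""
    steps = 0
    for start in range(0, len(lengths), batch_size):
        # Slots of shorter sequences sit idle until the longest one is done.
        steps += max(lengths[start:start + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Decode steps when a freed slot is refilled before the next iteration."""
    waiting = deque(lengths)   # output lengths of requests not yet started
    running = []               # tokens still to generate per active sequence
    steps = 0
    while waiting or running:
        # Admit waiting requests into any free slots.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        # One decode iteration: every active sequence emits one token,
        # and sequences that have finished leave the batch.
        running = [left - 1 for left in running if left - 1 > 0]
        steps += 1
    return steps


def utilisation(lengths, batch_size, steps):
    """Fraction of batch slots that produced a token (assumed toy metric)."""
    return sum(lengths) / (steps * batch_size)


# Hypothetical workload: output lengths in tokens, two slots per batch.
lengths = [2, 8, 3, 8, 2, 1]
batch_size = 2

static_steps = static_batching_steps(lengths, batch_size)
continuous_steps = continuous_batching_steps(lengths, batch_size)

print("static:", static_steps, utilisation(lengths, batch_size, static_steps))
print("continuous:", continuous_steps,
      utilisation(lengths, batch_size, continuous_steps))
```

For this made-up workload the static schedule needs 18 decode steps and the continuous one needs 13, so slot utilisation rises from about 67% to about 92%. The figures only illustrate the mechanism; real gains depend on the length distribution, batch size, and memory management.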