Continuous Batching

Inference technique that dynamically inserts new requests into a running batch as sequences complete, maximising GPU utilisation.

1. vLLM implements continuous batching (also called iteration-level scheduling): when one sequence in a batch finishes, a new request is inserted at the next decoding iteration, with reported throughput gains of up to 23x over static batching.
2. Together AI and Anyscale deploy continuous batching to serve Llama 3.1 70B at 100+ concurrent users per H100 GPU; the technique is essential to the economics of multi-tenant LLM serving.
3. NVIDIA TensorRT-LLM implements in-flight batching (equivalent to continuous batching), reducing GPU idle time from 40% under static batching to under 5% for variable-length LLM request streams.
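
The slot-refill behaviour described above can be illustrated with a toy simulation. This is only a sketch under simplifying assumptions, not how vLLM or TensorRT-LLM schedule work internally: each request is reduced to a number of output tokens, every decode iteration costs the same, and prefill and KV-cache memory limits are ignored. The function names and the sample workload are hypothetical.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Decode iterations when each batch runs until its longest sequence ends."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        # Finished sequences keep occupying their slot until the batch drains.
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Decode iterations when free slots are refilled before every iteration."""
    waiting = deque(lengths)
    running = []  # tokens still to generate for each active sequence
    steps = 0
    while waiting or running:
        # Admit waiting requests into any free slots.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        # One iteration: every active sequence emits one token;
        # sequences that reach zero remaining tokens release their slot.
        running = [r - 1 for r in running if r - 1 > 0]
        steps += 1
    return steps


def utilisation(lengths, batch_size, steps):
    """Fraction of slot-iterations that produced a useful token."""
    return sum(lengths) / (batch_size * steps)


# Hypothetical workload: output lengths (in tokens) of eight requests.
lengths = [2, 8, 3, 8, 1, 8, 2, 8]
batch_size = 4

static_steps = static_batching_steps(lengths, batch_size)
continuous_steps = continuous_batching_steps(lengths, batch_size)

print(static_steps, round(utilisation(lengths, batch_size, static_steps), 3))
print(continuous_steps, round(utilisation(lengths, batch_size, continuous_steps), 3))
```

On this workload the static schedule needs 16 iterations and the continuous schedule 13, because short requests free their slots for waiting ones instead of idling until the longest sequence in the batch finishes. The size of the gap depends entirely on how much output lengths vary, so the numbers here say nothing about real-world speedups.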
