Throughput

Amount of work completed per time unit.

1.
Together AI benchmarks Llama 3.1 70B at 120 tokens/sec/request with 100 concurrent users using vLLM on H100 GPUs - used to size infrastructure for a contact-centre AI handling 10,000 concurrent sessions.
2.
AWS Bedrock auto-scales inference throughput dynamically - a retail client processes 50,000 product-description generation requests during a Black Friday sale without pre-provisioning dedicated capacity.
3.
Anyscale's serving benchmark shows 10,000 tokens/sec aggregate throughput for Mistral 7B on a 4xA100 node - used by a document-processing company to calculate nodes needed for 1M documents/day.

Loading…