Inference Server

Service that hosts trained models behind a network API (typically HTTP or gRPC) and returns predictions or generated text on request.
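
To make the definition concrete, here is a minimal sketch of the request/response loop using only the Python standard library. It is not how vLLM, Triton, or NIM are implemented. The `toy_model` function, the `/predict` path, and the `features` / `prediction` JSON fields are all hypothetical names chosen for illustration.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def toy_model(features):
    """Hypothetical stand-in for a real model: returns the mean of the inputs."""
    return sum(features) / len(features)


def handle_request(raw_body: bytes) -> bytes:
    """Decode a JSON request, run the model, and encode a JSON response."""
    payload = json.loads(raw_body)
    prediction = toy_model(payload["features"])
    return json.dumps({"prediction": prediction}).encode("utf-8")


class PredictHandler(BaseHTTPRequestHandler):
    """HTTP wrapper around handle_request (assumed route: POST /predict)."""

    def do_POST(self):
        if self.path != "/predict":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        body = handle_request(self.rfile.read(length))
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(host: str = "127.0.0.1", port: int = 8000) -> None:
    """Block forever, answering prediction requests on host:port."""
    HTTPServer((host, port), PredictHandler).serve_forever()
```

Calling `serve()` starts the server. A POST to `/predict` with the body `{"features": [1.0, 2.0, 3.0]}` would then return `{"prediction": 2.0}`. Production servers add what this sketch leaves out: loading real model weights, batching, GPU scheduling, and concurrency.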

1. vLLM (UC Berkeley) reports up to 24x higher throughput than baseline Hugging Face Transformers serving, largely thanks to PagedAttention. It is deployed by Together AI, Anyscale, and enterprise teams to serve Llama 3.1 at production scale.
2. NVIDIA Triton Inference Server is deployed by AWS, Google, and Microsoft to serve computer-vision and NLP models in production. It supports dynamic batching, model ensembles, and multi-framework model loading.
3. NVIDIA NIM microservices are containerised inference servers for NVIDIA-optimised models, including LLMs. BioNeMo users use them to deploy protein structure prediction models with a single Docker pull on supported NVIDIA GPUs.
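
Dynamic batching, mentioned in the Triton example, can be illustrated with a small simulation. The sketch below is not Triton's scheduler. It assumes a simple policy: dispatch a batch when it is full, or when a new request arrives after the oldest queued request has waited longer than a delay budget. `Request`, `form_batches`, `max_batch_size`, and `max_delay_ms` are hypothetical names for this illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Request:
    request_id: int
    arrival_ms: float  # arrival time in milliseconds


def form_batches(
    requests: List[Request], max_batch_size: int, max_delay_ms: float
) -> List[List[int]]:
    """Group requests (sorted by arrival time) into batches of request ids.

    Assumed policy: a batch is dispatched when it reaches max_batch_size,
    or when the next request arrives more than max_delay_ms after the
    oldest request still waiting in the batch.
    """
    batches: List[List[int]] = []
    current: List[Request] = []
    for req in requests:
        # The oldest queued request has exceeded its delay budget: flush first.
        if current and req.arrival_ms - current[0].arrival_ms > max_delay_ms:
            batches.append([r.request_id for r in current])
            current = []
        current.append(req)
        # The batch is full: dispatch immediately.
        if len(current) == max_batch_size:
            batches.append([r.request_id for r in current])
            current = []
    if current:  # flush whatever is left at the end of the trace
        batches.append([r.request_id for r in current])
    return batches
```

For arrivals at 0, 1, 2, 3, 50, and 51 ms with `max_batch_size=4` and `max_delay_ms=5`, this returns `[[0, 1, 2, 3], [4, 5]]`: one full batch from the initial burst and a smaller batch for the two later requests. Running the model once per batch instead of once per request is what improves accelerator utilisation, at the cost of a bounded amount of added latency.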
