Glossary term
Infrastructure and Serving
Service that hosts models and returns predictions or generated text.
vLLM (UC Berkeley) reports up to 24x higher throughput than naive Hugging Face Transformers serving via PagedAttention, its memory-management scheme for the attention key-value cache. It is deployed by Together AI, Anyscale, and enterprise teams to serve Llama 3.1 at production scale.
NVIDIA Triton Inference Server is deployed by AWS, Google, and Microsoft to serve computer-vision and NLP models in production. It supports dynamic batching, model ensembles, and multi-framework model loading.
NVIDIA NIM microservices are containerised inference servers for NVIDIA-optimised LLMs. BioNeMo users rely on them to deploy protein structure prediction models with a single Docker pull on a supported NVIDIA GPU.
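The batching idea behind the dynamic batching mentioned in the Triton entry can be sketched in a few lines. This is a simplified illustration only, not Triton's implementation or API: the names `Request`, `toy_model`, and `serve_in_batches` are hypothetical, the "model" just returns input lengths, and a real server would also bound how long a request waits in the queue before a batch is dispatched.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence


@dataclass
class Request:
    """One queued inference request (hypothetical structure)."""
    request_id: int
    payload: str


def toy_model(batch: Sequence[str]) -> List[int]:
    """Stand-in for a real model: returns the length of each input."""
    return [len(text) for text in batch]


def serve_in_batches(
    pending: Sequence[Request],
    model: Callable[[Sequence[str]], List[int]],
    max_batch_size: int,
) -> Dict[int, int]:
    """Group pending requests into batches and run the model once per batch.

    Returns a mapping from request_id to that request's model output.
    """
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be at least 1")

    results: Dict[int, int] = {}
    for start in range(0, len(pending), max_batch_size):
        batch = pending[start:start + max_batch_size]
        # One model call serves every request in the batch.
        outputs = model([request.payload for request in batch])
        # Route each output back to the request that produced it.
        for request, output in zip(batch, outputs):
            results[request.request_id] = output
    return results


if __name__ == "__main__":
    queue = [Request(i, "x" * i) for i in range(5)]
    print(serve_in_batches(queue, toy_model, max_batch_size=2))
```

With five queued requests and a maximum batch size of 2, the model is invoked three times rather than five. Amortising per-call overhead in this way is the reason serving systems batch requests.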