Glossary term
Infrastructure and Serving
NVIDIA's open-source library for optimising and deploying LLM inference on NVIDIA GPUs, with support for FP8 and INT4 quantisation and for tensor and pipeline parallelism.
NVIDIA TensorRT-LLM is used on AWS SageMaker, Azure Machine Learning, and Google Cloud to serve Llama 3, Mistral, and custom models with FP8 or INT4 quantisation, targeting high throughput on H100 and B200 GPUs.
A financial institution uses TensorRT-LLM to deploy a Llama 3.1 70B model with FP8 precision on two NVLink-connected H100 GPUs, achieving 40% lower latency than vLLM on the same hardware for low-concurrency inference workloads.
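The arithmetic behind a two-GPU FP8 deployment like the one above can be sketched roughly as follows. This is a back-of-the-envelope estimate, not a TensorRT-LLM API call: it assumes about 70 billion parameters, 80 GB of memory per H100, and an even split of weights across GPUs, and it ignores activations, KV cache, and runtime overhead. The names `weight_gb_per_gpu` and `headroom_gb` are illustrative only.

```python
# Rough weight-memory estimate per GPU. All figures are illustrative assumptions.
PARAMS = 70e9        # assumed parameter count for a 70B-class model
GPU_MEMORY_GB = 80   # assumed H100 memory capacity in GB
NUM_GPUS = 2         # number of GPUs the weights are split across

# Bytes needed to store one weight at each precision.
BYTES_PER_PARAM = {"FP16": 2.0, "FP8": 1.0, "INT4": 0.5}


def weight_gb_per_gpu(precision: str, params: float = PARAMS,
                      num_gpus: int = NUM_GPUS) -> float:
    """Weight memory per GPU in GB (10^9 bytes), assuming an even split."""
    return params * BYTES_PER_PARAM[precision] / num_gpus / 1e9


def headroom_gb(precision: str) -> float:
    """Memory left per GPU for KV cache and activations after loading weights."""
    return GPU_MEMORY_GB - weight_gb_per_gpu(precision)


if __name__ == "__main__":
    for precision in BYTES_PER_PARAM:
        print(f"{precision}: {weight_gb_per_gpu(precision):.1f} GB weights per GPU, "
              f"{headroom_gb(precision):.1f} GB headroom")
```

Under these assumptions, FP16 weights would take about 70 GB of each 80 GB GPU, while FP8 halves that to about 35 GB, leaving considerably more room for the KV cache. Real deployments should be sized from measured memory use rather than this estimate.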
SqueezeBits benchmarks (2024) report that TensorRT-LLM consistently outperforms vLLM and SGLang on NVIDIA B200 GPUs, a result attributed to deeper architecture-specific kernel optimisations. This makes it a strong choice for single-user or low-concurrency production deployments.