Glossary term
Infrastructure and Serving
Text Generation Inference (TGI): Hugging Face's open-source production serving framework for LLMs, with built-in continuous batching, tensor parallelism, and quantisation support.
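As a rough illustration of what talking to such a server looks like, the sketch below builds a request for TGI's `/generate` HTTP route, which accepts a JSON body with an `inputs` string and a `parameters` object. The base URL, the helper names `build_generate_request` and `generate`, and the default token limit are assumptions made for this example, not part of TGI itself.

```python
import json
import urllib.request


def build_generate_request(base_url, prompt, max_new_tokens=64):
    """Build (but do not send) a POST request for a TGI /generate route.

    base_url is a hypothetical address of a running TGI server,
    for example "http://localhost:8080".
    """
    payload = {
        "inputs": prompt,
        "parameters": {"max_new_tokens": max_new_tokens},
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(base_url, prompt, max_new_tokens=64, timeout=30.0):
    """Send the request and return the generated text.

    Requires a reachable TGI server; the response is expected to be a
    JSON object with a "generated_text" field.
    """
    request = build_generate_request(base_url, prompt, max_new_tokens)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.loads(response.read().decode("utf-8"))
    return body["generated_text"]
```

Calling `generate("http://localhost:8080", "Summarise this article: ...")` would return the completion as a string, provided a server is listening at that address. Batching of concurrent requests happens on the server side, so the client needs no special handling for it.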
Hugging Face TGI powers the Inference Endpoints product used by 50,000+ enterprise customers to deploy open-source models on cloud infrastructure with production-grade monitoring, auto-scaling, and model caching.
Mistral AI used TGI to serve the first public release of Mistral 7B, providing an immediately deployable serving stack that the community could use before custom serving solutions were available.
A media company uses TGI with GPTQ 4-bit quantisation to serve Llama 3.1 70B on a 2xA100 server at 60% of the cost of a 4xA100 full-precision deployment, while still meeting its quality requirements for content summarisation at scale.
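The hardware saving in the last example follows largely from weight-memory arithmetic, which can be sketched as below. This is a back-of-the-envelope estimate, not a sizing tool: it counts weights only (ignoring the KV cache, activations, and the small overhead of GPTQ scales and zero points), and it assumes the 80 GB A100 variant and 16-bit weights for "full precision", neither of which the example states.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Approximate memory for model weights alone, in gigabytes (1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


def headroom_gb(num_gpus, gpu_memory_gb, weights_gb):
    """Memory left across all GPUs after loading weights (for KV cache etc.)."""
    return num_gpus * gpu_memory_gb - weights_gb


PARAMS = 70e9            # Llama 3.1 70B, rounded parameter count
A100_MEMORY_GB = 80.0    # assumption: 80 GB A100 variant

fp16_weights = weight_memory_gb(PARAMS, 16)  # 16-bit baseline (assumed)
gptq_weights = weight_memory_gb(PARAMS, 4)   # 4-bit GPTQ

print(f"16-bit weights: {fp16_weights:.0f} GB")
print(f"4-bit weights:  {gptq_weights:.0f} GB")
print(f"Headroom on 2 GPUs, 4-bit:  {headroom_gb(2, A100_MEMORY_GB, gptq_weights):.0f} GB")
print(f"Headroom on 4 GPUs, 16-bit: {headroom_gb(4, A100_MEMORY_GB, fp16_weights):.0f} GB")
```

Under these assumptions the 4-bit weights take about a quarter of the 16-bit footprint, which is what lets the model fit on two GPUs with room to spare for request batching.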