Glossary term
Infrastructure and Serving
The process of running a trained model to produce outputs from inputs. Inference governance covers access control, logging, latency, cost, data exposure, model versioning, and output handling. Inference environments should be governed like any other production system, with security controls, observability, capacity planning, vendor dependency review, and incident response.
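As a rough illustration of a few of these governance controls, the sketch below wraps an inference call with an access check and an audit record. Everything in it is hypothetical and not drawn from any particular platform: the caller allow-list, the version label, the `toy_model` stand-in, and the `governed_infer` helper are assumptions made for the example.

```python
import hashlib
import time

ALLOWED_CALLERS = {"triage-service"}  # hypothetical allow-list (access control)
MODEL_VERSION = "demo-model-1.0"      # hypothetical version label


def toy_model(text):
    """Stand-in for a real trained model; it only counts words."""
    return {"word_count": len(text.split())}


def governed_infer(caller, text, audit_log):
    """Run inference behind an access check and append an audit record."""
    if caller not in ALLOWED_CALLERS:
        raise PermissionError(f"caller {caller!r} is not allowed")
    start = time.perf_counter()
    output = toy_model(text)
    latency_ms = (time.perf_counter() - start) * 1000
    audit_log.append({
        "caller": caller,
        "model_version": MODEL_VERSION,
        # Store a hash rather than the raw input to limit data exposure.
        "input_sha256": hashlib.sha256(text.encode("utf-8")).hexdigest(),
        "latency_ms": latency_ms,
    })
    return output


audit_log = []
result = governed_infer("triage-service", "patient scan ready", audit_log)
```

A real deployment would more likely send these records to a logging or observability system than to an in-memory list, and would add cost tracking and output handling on top.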
In traditional machine learning, inference is the process of making predictions by applying a trained model to unlabeled examples. See Supervised Learning in the Intro to ML course to learn more.
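A minimal sketch of this idea, assuming a tiny logistic-regression model whose weights and bias were already learned elsewhere. The parameter values and function names are made up for illustration. No training happens here: inference only applies the fixed parameters to new, unlabeled examples.

```python
import math

# Hypothetical parameters, assumed to come from an earlier training run.
WEIGHTS = [0.8, -0.5]
BIAS = -0.1


def predict_proba(features):
    """Return the model's estimated probability of the positive class."""
    z = BIAS + sum(w * x for w, x in zip(WEIGHTS, features))
    return 1.0 / (1.0 + math.exp(-z))


def predict(features, threshold=0.5):
    """Inference: apply the fixed, trained parameters to one unlabeled example."""
    return 1 if predict_proba(features) >= threshold else 0


unlabeled_examples = [[2.0, 1.0], [0.0, 3.0]]
predictions = [predict(x) for x in unlabeled_examples]
print(predictions)  # prints [1, 0]
```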
In large language models, inference is the process of using a trained model to generate a response to an input prompt.
Inference has a somewhat different meaning in statistics. See the Wikipedia article on statistical inference for details.
NVIDIA Triton Inference Server, vLLM, and TensorRT-LLM are widely used inference frameworks for LLM deployment.
Amazon Bedrock, Azure OpenAI Service, and Google Vertex AI provide managed inference for foundation models with enterprise security controls.
OpenAI's API, Anthropic's API, and Mistral AI's La Plateforme are commercial inference endpoints subject to enterprise procurement and contractual controls.
Created for this library
An ML platform team monitors inference latency and cost across production models to budget capacity ahead of seasonal peaks.
A retail recommendation team runs inference in real time for the homepage carousel and batch inference for nightly emails.
A medical AI team optimizes inference throughput on its triage model so it can keep up with hospital scan volumes.
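The use cases above turn on two serving patterns and two measurements: real-time inference, where latency per request matters, and batch inference, where throughput matters. The sketch below shows one way to time both, using a trivial stand-in `model` function. All names here are hypothetical, and a production system would normally rely on its serving framework's own metrics.

```python
import time


def model(x):
    """Stand-in for a trained model (hypothetical): doubles its input."""
    return 2 * x


def infer_one(x):
    """Real-time style: score a single request and measure its latency."""
    start = time.perf_counter()
    y = model(x)
    latency_s = time.perf_counter() - start
    return y, latency_s


def infer_batch(xs):
    """Batch style: score many stored examples and report throughput."""
    xs = list(xs)
    start = time.perf_counter()
    ys = [model(x) for x in xs]
    elapsed_s = time.perf_counter() - start
    # Guard against a zero reading on very coarse clocks.
    throughput = len(xs) / elapsed_s if elapsed_s > 0 else float("inf")
    return ys, throughput


y, latency_s = infer_one(21)
ys, throughput = infer_batch(range(1000))
print(f"latency: {latency_s * 1000:.4f} ms, throughput: {throughput:.0f} examples/s")
```

Tracking latency percentiles and sustained throughput over time, rather than single readings like these, is what typically feeds capacity planning.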
Definition source: Google for Developers Machine Learning Glossary | Creative Commons Attribution 4.0 License