Glossary term
Infrastructure and Serving
Technique that reduces the numerical precision of a model's parameters to improve efficiency.
Overloaded term that could be used in any of the following ways:
Implementing quantile bucketing on a particular feature.
Transforming data into zeroes and ones for quicker storing, training, and inferring. As Boolean data is more robust to noise and errors than other formats, quantization can improve model correctness. Quantization techniques include rounding, truncating, and binning.
Reducing the number of bits used to store a model's parameters. For example, suppose a model's parameters are stored as 32-bit floating-point numbers. Quantization converts those parameters from 32 bits down to 4, 8, or 16 bits. Quantization reduces the following:
Compute, memory, disk, and network usage
Time to infer a prediction
Power consumption
However, quantization sometimes decreases the correctness of a model's predictions.
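The bit-reduction sense of the term can be illustrated with a minimal sketch. It assumes a symmetric scheme with one shared scale per list of weights, which is only one of several possible schemes and is not taken from any particular library. The function names and sample weights are hypothetical.

```python
from array import array

def quantize_int8(weights):
    """Map floats to signed 8-bit integers using one shared scale (symmetric scheme)."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the int8 range used here.
    q = array("b", (max(-127, min(127, round(w / scale))) for w in weights))
    return q, scale

def dequantize(q, scale):
    """Recover approximate float values from the stored integers."""
    return [v * scale for v in q]

weights = [0.12, -0.5, 0.33, 0.9, -0.07]   # hypothetical parameters
fp32 = array("f", weights)                 # 32-bit float storage
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

fp32_bytes = len(fp32) * fp32.itemsize
int8_bytes = len(q) * q.itemsize
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print(f"storage: {fp32_bytes} bytes -> {int8_bytes} bytes")
print(f"largest rounding error: {max_error:.5f} (bound: {scale / 2:.5f})")
```

The storage drops by a factor of four, while each restored value differs from the original by at most half a quantization step. That rounding error is the source of the possible loss in prediction correctness noted above.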
llama.cpp implements 4-bit quantization for models stored in its GGUF file format, enabling Llama 3 70B to run on a MacBook Pro M3 with 96GB of unified memory and making 70B-parameter models usable on consumer hardware.
GPTQ 4-bit quantization is used by TheBloke on Hugging Face to publish quantized versions of open-source LLMs, which have been downloaded 50M+ times by developers running local agents on 24GB consumer GPUs.
Microsoft deployed Phi-3-mini with INT4 quantization on Android and iOS devices, achieving GPT-3.5-class reasoning in a 2.3GB model that fits on smartphones. The model is used in Microsoft Edge's on-device AI features.
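A rough back-of-the-envelope calculation shows why bit width decides whether a 70B-parameter model fits in a given amount of memory. The sketch below is a weights-only estimate under simplifying assumptions: it ignores quantization scales, activations, the KV cache, and runtime overhead, so real files and memory use are somewhat larger. The helper name is hypothetical.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Weights-only size in gigabytes (10**9 bytes); ignores all runtime overhead."""
    return num_params * bits_per_param / 8 / 1e9

params_70b = 70e9  # parameter count of a 70B model
sizes = {bits: weight_memory_gb(params_70b, bits) for bits in (32, 16, 8, 4)}

for bits, gb in sizes.items():
    print(f"{bits:>2}-bit: {gb:.0f} GB")
# prints:
# 32-bit: 280 GB
# 16-bit: 140 GB
#  8-bit: 70 GB
#  4-bit: 35 GB
```

Under these assumptions, only the 8-bit and 4-bit versions come in under 96GB, and the 4-bit version leaves the most room for everything the estimate leaves out.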
Created for this library
An LLM team applies quantization to reduce model size and serving cost so it fits its production latency budget.
A mobile app team applies quantization to its on-device model so it runs efficiently on lower-end devices.
An ML platform team adopts quantization in its production serving stack to cut inference cost across many models.
Definition source: Google for Developers Machine Learning Glossary | Creative Commons Attribution 4.0 License