Glossary term
Infrastructure and Serving
Activation-aware Weight Quantisation (AWQ) - post-training, weight-only quantisation technique that uses activation magnitudes to identify the small fraction of salient weight channels and protects them by rescaling before quantisation.
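The definition above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names are hypothetical, the exponent `alpha` is fixed at 0.5 (the AWQ paper searches for it), and per-row symmetric round-to-nearest stands in for the group-wise quantisers used in practice. The idea shown is that multiplying an input channel's weights by a scale `s` and dividing its activations by `s` leaves the exact output unchanged, while reducing the rounding error that reaches the output through channels with large activations.

```python
def quantize_row(row, bits=4):
    """Symmetric round-to-nearest quantisation of one weight row (returned dequantised)."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(w) for w in row)
    if max_abs == 0:
        return list(row)
    step = max_abs / qmax
    return [round(w / step) * step for w in row]


def awq_scales(activations, alpha=0.5):
    """Per-input-channel scale s_j = mean(|x_j|) ** alpha over calibration samples.

    alpha is an assumed fixed value here; it is normally tuned.
    """
    n_samples = len(activations)
    n_channels = len(activations[0])
    scales = []
    for j in range(n_channels):
        mean_abs = sum(abs(x[j]) for x in activations) / n_samples
        scales.append(max(mean_abs, 1e-8) ** alpha)
    return scales


def matvec(W, x):
    """Plain matrix-vector product; rows of W are output channels."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def awq_quantize(W, scales, bits=4):
    """Quantise W * diag(s): salient input channels are scaled up before rounding."""
    return [quantize_row([w * s for w, s in zip(row, scales)], bits) for row in W]


def awq_forward(Wq_scaled, scales, x):
    """Apply the inverse scale to the activations, then the quantised scaled weights."""
    return matvec(Wq_scaled, [xi / s for xi, s in zip(x, scales)])


# Toy calibration set: input channel 1 carries much larger activations than channel 0.
calibration = [[1.0, 16.0], [1.0, 16.0]]
W = [[0.3, 0.1]]
x = [1.0, 16.0]

scales = awq_scales(calibration)
exact = matvec(W, x)[0]
plain = matvec([quantize_row(row) for row in W], x)[0]
scaled = awq_forward(awq_quantize(W, scales), scales, x)[0]

print("plain round-to-nearest error:", abs(plain - exact))
print("activation-aware error:      ", abs(scaled - exact))
```

In this toy case the small weight on the high-activation channel is the salient one: scaling it up before rounding moves the rounding error onto the channel whose activations are small.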
AWQ (Lin et al. 2023, MIT) reports lower perplexity than GPTQ at the same 4-bit precision by scaling salient channels before quantisation - used by TinyChat for edge deployment and supported by LMDeploy.
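Llama 3.1 70B AWQ-quantised models (about 4 bits/weight, roughly 35 GB of weights) fit on a dual-RTX-3090 machine (48 GB VRAM in total) and run at around 25 tokens/sec, depending on the inference engine and context length - used by startups needing frontier-scale intelligence on budget hardware.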
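The open-source AutoAWQ library produces 4-bit AWQ checkpoints from supported Hugging Face models in a few lines of Python, and the results load in Hugging Face Transformers and vLLM rather than in Ollama or llama.cpp, which use GGUF quantisation formats - used by enterprise teams to create deployment-ready models from fine-tuned checkpoints.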
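The dual-RTX-3090 sizing above can be checked with a short back-of-the-envelope calculation. This is a sketch under stated assumptions, not a measurement: the helper names are hypothetical, 70e9 is a round parameter count, and the group size of 128 with a 16-bit scale and 4-bit zero point per group is an assumed typical configuration. It covers weights only, so the KV cache and activations need additional memory.

```python
def effective_bits(weight_bits=4, group_size=128, scale_bits=16, zero_bits=4):
    """Bits per weight including per-group scale and zero-point overhead (assumed layout)."""
    return weight_bits + (scale_bits + zero_bits) / group_size


def weight_memory_gb(n_params, bits_per_weight):
    """Weight storage in GB (10**9 bytes); excludes KV cache and activations."""
    return n_params * bits_per_weight / 8 / 1e9


params = 70e9        # assumed round figure for a 70B-parameter model
vram_total_gb = 48   # two 24 GB cards

for label, bits in [("FP16", 16), ("4-bit + group overhead", effective_bits()), ("ideal 4-bit", 4)]:
    gb = weight_memory_gb(params, bits)
    verdict = "fits" if gb < vram_total_gb else "does not fit"
    print(f"{label}: {bits:.2f} bits/weight -> {gb:.1f} GB ({verdict} in {vram_total_gb} GB)")
```

The gap between the quantised weight footprint and 48 GB is what remains for the KV cache, which is why usable context length on such a machine is limited.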