Glossary term
Infrastructure and Serving
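Pruning is a model-compression technique that removes low-importance weights (unstructured pruning) or whole units such as attention heads and neurons (structured pruning) to produce smaller, faster models.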
SparseGPT (Frantar and Alistarh, 2023) prunes 50% of the weights of GPT-3-scale models (OPT-175B, BLOOM-176B) in one shot, without retraining, with minimal perplexity loss. It also supports the 2:4 semi-structured sparsity pattern that the sparse tensor cores in NVIDIA Ampere GPUs accelerate.
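Ampere sparse tensor cores accelerate a 2:4 pattern, in which at most two of every four consecutive weights are nonzero. The sketch below illustrates that pattern only: it picks the survivors by plain magnitude, which is an assumption made for brevity and not SparseGPT's actual selection criterion (SparseGPT uses second-order information and updates the remaining weights). The function name `prune_2_of_4` is hypothetical.

```python
def prune_2_of_4(row):
    """Zero the two smallest-magnitude weights in each group of four."""
    if len(row) % 4 != 0:
        raise ValueError("row length must be a multiple of 4")
    pruned = []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        # Indices of the two largest-magnitude entries in this group.
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        pruned.extend(w if i in keep else 0.0 for i, w in enumerate(group))
    return pruned


weights = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.4, 0.01]
sparse = prune_2_of_4(weights)
print(sparse)  # [0.9, 0.0, 0.0, -0.7, 0.0, 0.3, -0.4, 0.0]
```

Exactly half of the weights are zeroed, and the fixed two-per-group layout is what lets hardware skip them; unstructured 50% sparsity has the same parameter count but no such regularity.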
Apple uses structured pruning to create compact on-device models for iPhone - feature-importance-based pruning of MobileNetV3 reduces model size by 30% while retaining 98% of accuracy for face detection.
LLM-Pruner (2023) prunes 20% of LLaMA parameters by removing low-importance attention heads and MLP neurons, shrinking the model from about 13GB to about 10GB. LoRA fine-tuning then recovers roughly 95% of the original performance.
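Structured pruning shrinks a model because whole neurons are deleted, so the weight matrices themselves get smaller. A minimal sketch for a two-layer MLP follows. The importance score used here (product of a neuron's incoming and outgoing weight norms) is an assumption chosen for simplicity, not LLM-Pruner's gradient-based criterion, and `prune_neurons` is a hypothetical helper.

```python
import math


def prune_neurons(w_in, w_out, fraction):
    """Drop the lowest-importance hidden neurons from a two-layer MLP.

    w_in:  hidden x input  (row j produces hidden neuron j)
    w_out: output x hidden (column j consumes hidden neuron j)
    """
    hidden = len(w_in)

    def importance(j):
        # Assumed score: norm of incoming weights times norm of outgoing weights.
        in_norm = math.sqrt(sum(w * w for w in w_in[j]))
        out_norm = math.sqrt(sum(row[j] ** 2 for row in w_out))
        return in_norm * out_norm

    n_drop = int(hidden * fraction)
    ranked = sorted(range(hidden), key=importance)  # least important first
    keep = sorted(ranked[n_drop:])
    # Remove the same neurons from both layers so the shapes stay compatible.
    new_in = [w_in[j] for j in keep]
    new_out = [[row[j] for j in keep] for row in w_out]
    return new_in, new_out, keep


w_in = [[1.0, 2.0], [0.01, 0.02], [0.5, -0.5], [2.0, 0.0], [0.3, 0.3]]
w_out = [[0.5, 0.9, -1.0, 0.2, 0.4]]
new_in, new_out, keep = prune_neurons(w_in, w_out, fraction=0.2)
print(keep)  # [0, 2, 3, 4]
```

Removing 20% of the hidden neurons leaves dense but smaller matrices, which is why the saving shows up directly in model size without any sparse-matrix support. Fine-tuning afterwards (LoRA in LLM-Pruner's case) compensates for the removed capacity.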