Knowledge Distillation

Training a smaller student model to mimic the outputs or internal representations of a larger teacher model.
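
As a minimal sketch of what "mimic the outputs" can mean in practice, the snippet below computes a soft-target distillation loss: the KL divergence between temperature-softened teacher and student distributions, scaled by the squared temperature. The function names, the temperature value, and the example logits are illustrative assumptions, not taken from any of the models named here. Real training setups usually combine this term with an ordinary supervised loss.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; a higher temperature gives a softer distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by temperature^2."""
    p = softmax(teacher_logits, temperature)  # teacher soft targets
    q = softmax(student_logits, temperature)  # student predictions
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2

# Hypothetical logits for a 3-class problem (assumed values, for illustration only).
teacher = [4.0, 1.0, 0.5]
close_student = [3.5, 1.2, 0.4]  # roughly agrees with the teacher
far_student = [0.5, 4.0, 1.0]    # prefers a different class

print(distillation_loss(close_student, teacher))
print(distillation_loss(far_student, teacher))
```

The student that ranks classes like the teacher gets the lower loss, which is the signal the student is trained to minimise.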

1. DistilBERT (Hugging Face) is 40% smaller and 60% faster than BERT while retaining 97% of its language-understanding performance via knowledge distillation - used in mobile NLP applications where full BERT is too slow.
2. Microsoft Phi-2 (2.7B parameters) is reported to match or outperform considerably larger models on maths and coding benchmarks after training on 'textbook quality' synthetic data generated by larger models, a data-level form of distillation - suited to low-latency deployments and available through Azure AI.
3. OpenAI's o1-mini is a smaller sibling of o1 that approaches its mathematical reasoning at lower inference cost. OpenAI has not published the training recipe, so calling it distilled from o1 is an inference - but it illustrates a common pattern where a frontier model trains a smaller production model via output distillation.
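
The output-distillation pattern mentioned in the last example can be sketched with a toy setup: a teacher labels unlabeled inputs, and a student is trained only on those teacher-generated labels. Everything here is a hypothetical stand-in. `teacher_label` is a fixed rule playing the role of a large model, the student is a one-dimensional logistic regression, and the hyperparameters are arbitrary. It is not the procedure used for any model listed above.

```python
import math
import random

def teacher_label(x):
    """Hypothetical stand-in for a large teacher: returns a hard label for input x."""
    return 1 if 3.0 * x - 1.0 > 0 else 0

def train_student(inputs, labels, epochs=100, lr=0.1):
    """Fit a 1-D logistic-regression student to teacher-generated labels with SGD."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in zip(inputs, labels):
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))  # student's predicted probability
            w -= lr * (p - y) * x                     # gradient step on the log loss
            b -= lr * (p - y)
    return w, b

rng = random.Random(0)  # fixed seed so the run is repeatable
unlabeled = [rng.uniform(-2.0, 2.0) for _ in range(200)]
synthetic_labels = [teacher_label(x) for x in unlabeled]  # the teacher generates the training data
w, b = train_student(unlabeled, synthetic_labels)

def student_label(x):
    """The student's hard prediction after training."""
    return 1 if w * x + b > 0 else 0

# Measure how often the student agrees with the teacher on fresh inputs.
held_out = [rng.uniform(-2.0, 2.0) for _ in range(200)]
agreement = sum(student_label(x) == teacher_label(x) for x in held_out) / len(held_out)
print(agreement)
```

The student never sees the teacher's internals, only its outputs, which is why this style of distillation works even when the teacher is available only through an API.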
