Glossary term
Training and Fine-Tuning
Training a smaller student model to mimic the outputs or internal representations of a larger teacher model.
DistilBERT (Hugging Face) is 40% smaller and 60% faster than BERT while retaining 97% of its language-understanding performance, achieved through knowledge distillation. It is used in mobile NLP applications where full BERT is too slow.
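Microsoft Phi-2 (2.7B parameters) is reported by Microsoft to match or outperform models up to 25x its size on maths and coding benchmarks, largely by training on 'textbook quality' synthetic data generated by a larger teacher model. This is distillation through training data rather than through matching the teacher's logits. It is available through Azure AI for low-latency deployments.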
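OpenAI's o1-mini is a smaller sibling of o1 that OpenAI reports comes close to o1 on mathematical reasoning at lower inference cost. It is often cited as an example of output distillation, a common pattern where a frontier model's outputs train a smaller production model, although OpenAI has not published the training details.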
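As a rough illustration of the definition above, the classic soft-label form of distillation can be sketched in plain Python. This is a minimal sketch, not the recipe behind any of the models listed: the function names, the temperature of 2.0, and the 50/50 weighting `alpha` are assumptions chosen for the example. The student is penalised both for diverging from the teacher's temperature-softened distribution and for missing the true label.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; higher temperature gives a softer distribution."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, true_label,
                      temperature=2.0, alpha=0.5):
    """Weighted sum of a soft-target term and a hard-label term.

    temperature and alpha are illustrative defaults, not values from the source.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # KL(teacher || student) on the softened distributions
    kl = sum(pt * math.log(pt / ps)
             for pt, ps in zip(p_teacher, p_student) if pt > 0)
    # Ordinary cross-entropy against the ground-truth label (temperature 1)
    ce = -math.log(softmax(student_logits)[true_label])
    # The temperature**2 factor keeps the soft-term gradients on a comparable scale
    return alpha * temperature ** 2 * kl + (1 - alpha) * ce

teacher = [4.0, 1.0, 0.5]
close_student = distillation_loss([3.8, 1.1, 0.4], teacher, true_label=0)
far_student = distillation_loss([0.5, 1.0, 4.0], teacher, true_label=0)
print(close_student < far_student)
```

A student whose logits track the teacher's receives the lower loss. Output distillation, as in the last example above, skips the logit term entirely and simply trains the student on text the teacher generated.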