Model Distillation

Knowledge distillation compresses large language models (LLMs) into smaller, deployable models by training a small student model to reproduce a larger teacher model's outputs.

1. Phi-1.5 (Microsoft, 1.3B) is trained largely on LLM-generated 'textbook quality' synthetic data rather than raw web crawls - a form of implicit distillation. Microsoft reports performance comparable to models about 5x larger on common-sense reasoning benchmarks.
2. OpenAI's GPT-3.5-turbo is sometimes described as a distillation of GPT-4 - trained on GPT-4 outputs to capture similar capability at lower inference cost - but OpenAI has not confirmed this. Its lower serving cost is reflected in the price difference of roughly 10x or more between the two models.
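3. Gemma 2 2B (Google) is trained with knowledge distillation from a larger teacher model rather than plain next-token prediction - Google reports that the distilled models are competitive with models 2-3x larger. Gemini Nano, the model family behind Android on-device AI features, is likewise distilled from larger Gemini models.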
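
For context, the classic logit-matching form of knowledge distillation can be sketched as below. This is a minimal illustration, assuming the common formulation of a KL divergence between temperature-softened teacher and student distributions, scaled by the squared temperature. It is not the training recipe of any model listed above, whose details are mostly unpublished; the function names and the toy logits are hypothetical. Training on teacher-generated text, as in the Phi-1.5 example, is a different variant that uses ordinary next-token loss on synthetic data.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Log of the temperature-scaled softmax, computed stably."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max before exponentiating to avoid overflow
    log_total = m + math.log(sum(math.exp(z - m) for z in scaled))
    return [z - log_total for z in scaled]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    if len(student_logits) != len(teacher_logits):
        raise ValueError("student and teacher must share a vocabulary size")
    log_p = log_softmax(teacher_logits, temperature)  # teacher (target)
    log_q = log_softmax(student_logits, temperature)  # student (prediction)
    kl = sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    return kl * temperature ** 2


# Hypothetical logits over a 3-token vocabulary (illustrative values only).
teacher = [4.0, 1.0, 0.5]
good_student = [3.8, 1.1, 0.4]   # roughly agrees with the teacher
poor_student = [0.5, 1.0, 4.0]   # prefers a different token

print("close student loss:", distillation_loss(good_student, teacher))
print("poor student loss: ", distillation_loss(poor_student, teacher))
```

A student whose logits track the teacher's gets a loss near zero, while one that ranks tokens differently gets a larger loss. In practice this term is usually combined with the ordinary cross-entropy loss on ground-truth labels.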
