Definition
Model distillation (also called knowledge distillation) is a machine learning technique where a smaller "student" model is trained to reproduce the behavior of a larger "teacher" model. The student learns not just from raw training data but from the teacher's outputs, including the probability distributions and reasoning patterns that the teacher has learned. This transfer of knowledge allows the student to achieve performance approaching the teacher's level despite having far fewer parameters.
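In the classic formulation of knowledge distillation, the student is trained against two signals at once: the teacher's softened output probability distribution (the "soft labels") and the ground-truth labels. The sketch below is a minimal illustration, assuming a PyTorch classification setup; the temperature and alpha values are illustrative defaults, not recommendations.

```python
# Minimal sketch of a soft-label distillation loss (PyTorch assumed).
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    # Soften both output distributions with the temperature so the student
    # can learn from the relative probabilities the teacher assigns.
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    soft_loss = F.kl_div(soft_student, soft_targets, reduction="batchmean")
    soft_loss = soft_loss * temperature ** 2  # standard rescaling for soft targets

    # Ordinary cross-entropy against the ground-truth labels.
    hard_loss = F.cross_entropy(student_logits, labels)

    # alpha controls how much the student imitates the teacher vs. the raw labels.
    return alpha * soft_loss + (1 - alpha) * hard_loss

# Dummy example: a batch of 8 examples over a 100-class output space.
student_logits = torch.randn(8, 100)
teacher_logits = torch.randn(8, 100)
labels = torch.randint(0, 100, (8,))
loss = distillation_loss(student_logits, teacher_logits, labels)
```

For large language models, the same idea is often applied at the sequence level: the teacher generates outputs, and the student is fine-tuned on those outputs rather than on token-level probability distributions.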
The technique has become especially important in the era of large language models, where the most capable models are too expensive or too slow for many production use cases. Distillation provides a practical path from "this works great in the demo with GPT-4" to "this works well enough in production at a cost we can sustain."
Why It Matters for Product Managers
Cost economics are one of the biggest constraints on AI product viability. A feature that costs $0.10 per API call might work during prototyping, but at 10 million calls a day (roughly one request each from 10 million daily users) that becomes $1 million per day. Model distillation is often the solution that makes the economics work, reducing per-request costs by an order of magnitude or more while preserving the quality that users expect.
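To make the arithmetic concrete, here is a back-of-the-envelope comparison of daily spend before and after distillation. The per-call prices and the tenfold reduction are illustrative assumptions, not vendor pricing.

```python
# Back-of-the-envelope cost comparison; numbers are illustrative assumptions.
daily_calls = 10_000_000           # e.g. 10M daily users making ~1 request each
teacher_cost_per_call = 0.10       # prototype served by a large frontier model
student_cost_per_call = 0.01       # distilled student, ~10x cheaper per request

teacher_daily = daily_calls * teacher_cost_per_call   # $1,000,000 per day
student_daily = daily_calls * student_cost_per_call   # $100,000 per day

print(f"Teacher: ${teacher_daily:,.0f}/day vs. student: ${student_daily:,.0f}/day")
print(f"Annualized savings: ${(teacher_daily - student_daily) * 365:,.0f}")
```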
Beyond cost, distillation also reduces latency. Smaller models respond faster, which directly improves the user experience for interactive AI features. PMs building real-time AI experiences like chat interfaces, autocomplete, or inline suggestions need the speed that distilled models provide. Understanding the distillation tradeoff -- how much quality to give up for how much cost and latency improvement -- is a critical product decision.
How It Works in Practice
Common Pitfalls
Related Concepts
Model distillation starts with a Foundation Model as the teacher and uses Fine-Tuning techniques to train the student. Synthetic Data generated by the teacher model often serves as the primary training signal. Distilled models are often deployed for Edge Inference, where full-size Large Language Models would be too resource-intensive.