Comment by NitpickLawyer
5 hours ago
For completeness' sake, I'll add a bit more.
The concept of distillation is not new in ML, and there are nuances to it. Traditionally you have access to the bigger model, and for LLMs specifically you can train the small model against the teacher's entire distribution of output logits at each step. This trains the small model to output scores for every token in a similar fashion to the large model. There's "more to learn" from the entire distribution than from just the chosen token.
But since API providers don't expose logits, the next best thing is to train on the sampled outputs themselves. That's more like a "poor man's distillation". It's still good, and as you mentioned it has worked fairly well for models catching up. But a lab that develops both the big model and the small model can do better (or you could choose to distill from an existing open model).
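Roughly, the difference between the two objectives can be sketched like this (a toy example in plain Python; the function names, temperature value, and logits are all made up for illustration, not taken from any particular library or lab's recipe):

```python
import math

def softmax(logits, temperature=1.0):
    """Turn raw logits into a probability distribution."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def soft_distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """Full distillation: KL(teacher || student) over the whole vocabulary,
    so the student is pushed to match the teacher's entire score distribution."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def hard_label_loss(teacher_logits, student_logits):
    """'Poor man's distillation': only the teacher's chosen token is visible
    (as in API outputs), so the student gets plain cross-entropy on it."""
    target = max(range(len(teacher_logits)), key=lambda i: teacher_logits[i])
    q = softmax(student_logits)
    return -math.log(q[target])

teacher = [2.0, 1.5, 0.1, -1.0]  # toy logits over a 4-token vocabulary
student = [1.0, 0.2, 0.3, -0.5]
print(soft_distillation_loss(teacher, student))
print(hard_label_loss(teacher, student))
```

The soft loss sees that the teacher also gave the second token a decent score, while the hard loss only knows the argmax token, which is the "more to learn" part.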