Comment by ACCount36
5 days ago
No, training a smaller model from a larger, more capable model (or an ensemble of models) is the "usual" kind of distillation.
"Self-distillation" refers to distilling from a model into a copy of itself. Which is of limited use - unless you can steer the teacher, and want the student to internalize that steering.
The reason for doing self-distillation here is that we both have access to a richer representation (the logit stream) and want to capture richer behavior - not the answers themselves, but the better reasoning techniques that are downstream of better prompts.
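Roughly what that looks like, as a minimal sketch (assuming a PyTorch/HuggingFace-style causal LM; `steer_prompt` and `plain_prompt` are hypothetical helpers that wrap the same question in the richer vs. plain prompt, and the logit alignment is simplified):

```python
# Minimal sketch of logit-level self-distillation with a "steered" teacher.
# Assumptions (not from the comment above): a HuggingFace-style causal LM,
# and hypothetical helpers steer_prompt / plain_prompt.
import copy
import torch
import torch.nn.functional as F

def self_distill_step(student, tokenizer, question, optimizer, temperature=1.0):
    # Teacher is a frozen copy of the student itself.
    teacher = copy.deepcopy(student).eval()
    for p in teacher.parameters():
        p.requires_grad_(False)

    # Teacher sees the steered (richer) prompt, student sees the plain one.
    teacher_ids = tokenizer(steer_prompt(question), return_tensors="pt").input_ids
    student_ids = tokenizer(plain_prompt(question), return_tensors="pt").input_ids

    with torch.no_grad():
        # Simplified alignment: compare logits over the trailing span the
        # two prompts share (a real setup needs proper token alignment).
        teacher_logits = teacher(teacher_ids).logits[:, -student_ids.shape[1]:, :]

    student_logits = student(student_ids).logits

    # Match the full logit stream, not just sampled answers.
    loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature**2

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```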
Self-distillation and mutual distillation are also used in MoE models. What you can do is freeze all but one expert and then train the model. If you want to do that again, you first have to do self/mutual distillation to spread the training result onto the other experts.
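A rough sketch of that freeze-then-spread loop, assuming a toy MoE layer with an `experts` ModuleList and an HF-style model that returns logits (names here are illustrative, not any particular library's API):

```python
# Phase 1: train only one expert. Phase 2: self-distill from a frozen copy of
# the just-trained model so the new behavior spreads to the other experts.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

def freeze_all_but_one(moe_layer: nn.Module, trainable_idx: int):
    # Phase 1: only one expert receives gradient updates.
    for i, expert in enumerate(moe_layer.experts):
        for p in expert.parameters():
            p.requires_grad_(i == trainable_idx)

def spread_via_self_distillation(model, moe_layer, input_ids, optimizer):
    # Phase 2: teacher = frozen copy of the model after the expert update;
    # all experts are unfrozen in the student so the result spreads.
    teacher = copy.deepcopy(model).eval()
    for p in teacher.parameters():
        p.requires_grad_(False)
    for p in moe_layer.parameters():
        p.requires_grad_(True)

    with torch.no_grad():
        teacher_logits = teacher(input_ids).logits
    student_logits = model(input_ids).logits

    loss = F.kl_div(
        F.log_softmax(student_logits, dim=-1),
        F.softmax(teacher_logits, dim=-1),
        reduction="batchmean",
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```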