Comment by ACCount36

5 days ago

And then you use self-distillation to wire the improved prompts back into the LLM. Bam, free metacognitive skills.

Self-distillation generally refers to training a smaller model, right? I suppose for full metacognition you would use it to fine-tune the existing model based on its older self?

  • No, training a smaller model off a more capable larger model (or an ensemble of models) is the "usual" distillation.

    "Self-distillation" refers to distilling from a model into a copy of itself. Which is of limited use - unless you can steer the teacher, and want the student to internalize that steering.

    The reason for doing self-distillation here is that we both have access to a richer representation (the logit stream) and want to capture a richer behavior: not the answers themselves, but the better reasoning techniques that are downstream of better prompts. A rough sketch of that setup is below.
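
    Roughly, that setup could look like the following minimal sketch, assuming a Hugging Face causal LM, KL divergence on the logits, and a toy single-example loop; the model name, the "improved" prompt, and the hyperparameters are placeholders, not a reference recipe:

    ```python
    # Self-distillation sketch: the teacher is the same model reading an improved
    # prompt, the student is the same model without it, and the student is trained
    # to match the teacher's logit stream over the shared suffix.
    import torch
    import torch.nn.functional as F
    from transformers import AutoModelForCausalLM, AutoTokenizer

    name = "gpt2"  # placeholder; any causal LM works the same way
    tok = AutoTokenizer.from_pretrained(name)
    student = AutoModelForCausalLM.from_pretrained(name)
    teacher = AutoModelForCausalLM.from_pretrained(name)  # frozen copy of itself
    teacher.eval()
    for p in teacher.parameters():
        p.requires_grad_(False)

    improved_prompt = "Think step by step and check your work.\n"  # the "steering"
    question = "Q: What is 17 * 24?\nA:"

    # Teacher sees the improved prompt; student sees only the bare question.
    # (Assumes the question tokenizes identically with and without the prefix.)
    t_ids = tok(improved_prompt + question, return_tensors="pt").input_ids
    s_ids = tok(question, return_tensors="pt").input_ids

    opt = torch.optim.AdamW(student.parameters(), lr=1e-5)
    T = 2.0  # distillation temperature

    with torch.no_grad():
        # Teacher logits at the positions covering the bare question.
        t_logits = teacher(t_ids).logits[:, -s_ids.shape[1]:, :]
    s_logits = student(s_ids).logits

    # Standard distillation loss, KL(teacher || student) over the logit stream:
    # the student internalizes the behavior the prompt induced, without ever
    # seeing the prompt itself.
    loss = F.kl_div(
        F.log_softmax(s_logits / T, dim=-1),
        F.softmax(t_logits / T, dim=-1),
        reduction="batchmean",
    ) * T * T

    opt.zero_grad()
    loss.backward()
    opt.step()
    ```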

    • Self-distillation and mutual distillation are also used in MoE models. What you can do is freeze all but one expert and then train the model. If you want to do it again, you first have to do self/mutual distillation to spread the training result onto the other experts; a toy sketch of that workflow follows.
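
      A toy sketch of that workflow, using a made-up soft-routing MoE layer; "spreading the training result" is read crudely here as pulling the other experts toward the newly trained expert's outputs, which is just one way to cash out self/mutual distillation, not a reference implementation:

      ```python
      import torch
      import torch.nn as nn
      import torch.nn.functional as F

      class TinyMoE(nn.Module):
          """Minimal soft-routing MoE layer, purely for illustration."""
          def __init__(self, dim=64, n_experts=4):
              super().__init__()
              self.router = nn.Linear(dim, n_experts)
              self.experts = nn.ModuleList([nn.Linear(dim, dim) for _ in range(n_experts)])

          def forward(self, x):
              weights = F.softmax(self.router(x), dim=-1)                # (batch, n_experts)
              outs = torch.stack([e(x) for e in self.experts], dim=-1)   # (batch, dim, n_experts)
              return torch.einsum("bde,be->bd", outs, weights)

      moe = TinyMoE()

      # Step 1: freeze everything except expert 0 (router frozen too, as a
      # simplification) and run the normal task-specific training loop.
      for pname, p in moe.named_parameters():
          p.requires_grad_(pname.startswith("experts.0."))
      # ... task-specific training of expert 0 goes here ...

      # Step 2: self/mutual distillation -- pull the other experts toward the
      # newly trained expert's behavior, so the next training round starts from
      # experts that all carry the result.
      teacher = moe.experts[0]
      for p in teacher.parameters():
          p.requires_grad_(False)
      student_params = [p for e in list(moe.experts)[1:] for p in e.parameters()]
      opt = torch.optim.AdamW(student_params, lr=1e-4)

      for _ in range(100):
          x = torch.randn(32, 64)              # stand-in for real hidden states
          with torch.no_grad():
              target = teacher(x)
          # Each remaining expert matches the trained expert's outputs.
          loss = sum(F.mse_loss(e(x), target) for e in list(moe.experts)[1:])
          opt.zero_grad()
          loss.backward()
          opt.step()
      ```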