
Comment by cztomsik

2 months ago

Nope, it's quite obvious why distillation works. If you just predict the next token, the only information you can use to compute the loss is THE expected token. Whereas if you distill, you can also use the (typically top-few) logits from the teacher.

"My name is <?>" without distillation has only one valid answer (from the dataset) and everything else is wrong.

Whereas with distillation, you also get lots of other names (from the teacher), and you can put some weight on them too. That way, the model learns faster, because it gets more information in each update.

(So instead of "My name is Foo", the model learns "My name is <some name, but in this case Foo>")