← Back to context

Comment by girvo

17 hours ago

> I suspect nobody is doing real student teacher distillation

It gets used for quantisation, basically recovering accuracy for lower quants (Nvidia calls it QAD). Can’t speak to how widespread it is though

Yes absolutely! I should have been more specific - I don’t believe people are using it to train 30B models from 300B models (and I’d love to learn that I’m off about this)