Comment by ljosifov
1 day ago
I imagine - and certainly hope - that everyone trains on everything else. As for distillation: of course, if one has bigger or other models providing true posterior token probabilities in the (0, 1) interval, rather than 1-hot-N targets that are 0 for all 200K vocabulary entries except the desired output token (which gets 1), then one should use the former instead of the latter. It's amazing that such a simple and straightforward idea faced so much resistance (the paper was rejected), from the supposedly most open-minded and knowledge-devoted community (academia), and on the wrong grounds ('will have no impact on industry' - in fact it has had tremendous impact on industry; a better rejection would have been 'duh, it is obvious'). We are not trying to torture the model and the GPU cluster into learning from zero when the knowledge is already available. :-)
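A minimal sketch of the contrast the comment draws, assuming a PyTorch setup: the one-hot loss trains against a single target index, while the distillation loss trains against the teacher's full posterior over tokens, softened by an illustrative temperature T (the function names, the batch shape, and T=2.0 are assumptions, not anything from the comment).

```python
import torch
import torch.nn.functional as F

def one_hot_loss(student_logits, target_ids):
    # Standard next-token objective: the target is a single index,
    # i.e. a 1-hot vector over the whole vocabulary.
    return F.cross_entropy(student_logits, target_ids)

def distillation_loss(student_logits, teacher_logits, T=2.0):
    # Soft-target objective: the teacher's posterior over all tokens
    # (every entry strictly between 0 and 1) is the training signal.
    teacher_probs = F.softmax(teacher_logits / T, dim=-1)
    student_log_probs = F.log_softmax(student_logits / T, dim=-1)
    # KL(teacher || student); the T^2 factor follows the standard
    # distillation formulation of Hinton et al. (2015).
    return F.kl_div(student_log_probs, teacher_probs,
                    reduction="batchmean") * (T * T)

# Toy usage: 4 positions, a 200K-token vocabulary as in the comment.
vocab = 200_000
student_logits = torch.randn(4, vocab)
teacher_logits = torch.randn(4, vocab)
target_ids = torch.randint(0, vocab, (4,))

print(one_hot_loss(student_logits, target_ids))
print(distillation_loss(student_logits, teacher_logits))
```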