Comment by Springtime
3 months ago
My understanding of distilling is one model 'teaching' another: in this case the main R1 model is used to fine-tune the open-weight Llama model (and a Qwen variant as well). I'm not aware of a comparative analysis against vanilla Llama, though they benchmarked their distilled versions against other models in their GitHub readme, and the distilled Llama 70B model scores higher than Claude 3.5 Sonnet and o1-mini in all but one test.
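For context, this style of distillation usually amounts to supervised fine-tuning of the student on the teacher's generated outputs rather than anything exotic. A rough sketch of one training step, assuming a Hugging Face-style causal LM for the student (the function and variable names are illustrative, not DeepSeek's actual pipeline):

    # Hedged sketch: distillation as SFT on teacher-generated completions.
    # `student` is assumed to be a causal LM that returns a .loss when
    # given labels (e.g. a transformers AutoModelForCausalLM); the prompt
    # and teacher_completion strings come from the teacher model (R1).
    import torch

    def distill_step(student, tokenizer, prompt, teacher_completion, optimizer):
        """One step: train the student to reproduce the teacher's output."""
        # Concatenate prompt and teacher completion into one training sequence.
        text = prompt + teacher_completion
        ids = tokenizer(text, return_tensors="pt").input_ids

        # Standard next-token prediction loss over the whole sequence.
        outputs = student(ids, labels=ids)
        loss = outputs.loss

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()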