Thanks for what you're doing. Of all the various companies and orgs posting chatter about deep learning, I've come to really appreciate your efforts (and Anthropic), because you're USING MATH. :)
I have some understanding of applied math, continuous and discrete, and while I don't keep up to date with developments in deep learning/AI in general, I always look forward to unsloth posts because they tend to center on achieving a desirable result thanks to proper application of good old-fashioned "understanding the damn math and, once you do, doing the obvious". :)
Reminds me of learning how to optimize twiddle factors when writing a performant FFT (beyond divide-and-conquer, one also uses trig identities and some algebra to reduce the number of multiplies), or of learning about elliptic minimal Q-factor (EMQF) filters -- clever IIR filters that give a sharp frequency response using less than 50% (maybe even less than 25%) of the computation traditionally required, by optimizing for *coefficients with lots of zeros in the base-2 representation*. And computers, it turns out, can multiply numbers by zero really fast. ;-)
The throughline to me is that if you pause and think deeply about "wait, what are we really doing here?" and look at the whole math stack, and think about what computers are good at, sometimes you can achieve great results.
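As a toy illustration of the "zeros in the base-2 representation" point (my own sketch, not taken from any actual EMQF implementation): a fixed-point multiply by an integer coefficient reduces to one shift-and-add per set bit, so coefficients whose binary representations are mostly zeros cost almost nothing:

```python
def mul_by_sparse_coeff(x: int, coeff: int) -> int:
    """Multiply x by a non-negative integer coefficient using only
    shifts and adds. The loop body does work once per SET bit in
    coeff, so mostly-zero coefficients need very few operations."""
    result = 0
    bit = 0
    while coeff >> bit:
        if (coeff >> bit) & 1:
            result += x << bit  # add x shifted up to this bit position
        bit += 1
    return result

# 68 = 0b1000100 has only two set bits -> just two shift-adds
print(mul_by_sparse_coeff(3, 68))  # 204
```

Real filter designs do this with fixed-point hardware rather than Python ints, of course, but the operation count intuition is the same.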
Oh thank you a lot!!
I always keep maths at the center of everything we do :) It's literally humanity's ultimate superpower if we can express everything in mathematical terms!
I'll keep writing up blog posts with more maths!!
thanks for your efforts!
how practical do you think grpo is? (for most people)
here's my thoughts
- grpo starts off slow, with super small loss (likely because the rewards on all observations are the same)
- as you mentioned, some sft on reasoning data ought to help speed things up
- unless you're a lab with a gazillion gpus, wouldn't you be better off taking your non-reasoning dataset and converting it into a high quality reasoning dataset using frontier models (maybe deepseek)? could grpo be cheaper or better accuracy?
- maybe you do tons of sft and when you've reached the frontier models' perf on your task, then perhaps grpo could help more exploration
would be great to hear your thoughts
Thanks! Yes, synthetic data generation and data augmentation are also very useful! A trick one could employ is to first generate 1000s of possible answers, then select the top 10 to be used in GRPO - it's kinda like o3 with majority voting!
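A minimal sketch of that select-the-best-candidates idea (the function names, the toy generator, and the reward function here are all placeholders of my own, not any particular library's API): sample many candidate answers, score each with a reward/verifier function, and keep only the top few as the GRPO group:

```python
import random

def top_k_candidates(prompt, generate, reward, n_samples=1000, k=10):
    """Sample n_samples candidate answers for a prompt, score each
    with a reward function, and return the k highest-scoring ones
    to use as the GRPO group."""
    candidates = [generate(prompt) for _ in range(n_samples)]
    scored = sorted(candidates, key=reward, reverse=True)
    return scored[:k]

# Toy usage: "answers" are random ints and the reward prefers
# closeness to 42, standing in for a real model + verifier.
random.seed(0)
best = top_k_candidates(
    "what is 6 * 7?",
    generate=lambda p: random.randint(0, 100),
    reward=lambda ans: -abs(ans - 42),
)
print(best)  # the 10 sampled ints closest to 42
```

In a real pipeline the generator would be model sampling and the reward would be whatever verifier or rubric the GRPO run already uses; the filtering step just keeps the group from being dominated by identical-reward duds, which is exactly the slow-start problem mentioned above.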