Comment by lennxa
1 day ago
thanks for your efforts!
how practical do you think grpo is? (for most people)
here's my thoughts:
- grpo starts off slow, with super small loss (likely because the rewards on all observations are the same)
- as you mentioned, some sft on reasoning data ought to help speed things up
- unless you're a lab with a gazillion gpus, wouldn't you be better off taking your non-reasoning dataset and converting it into a high quality reasoning dataset using frontier models (maybe deepseek)? could grpo be cheaper or give better accuracy?
- maybe you do tons of sft, and once you've reached the frontier models' perf on your task, grpo could help with more exploration
would be great to hear your thoughts
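One way to read the first point above: GRPO is usually described as scoring each sampled completion relative to its own group (reward minus group mean, divided by group standard deviation). If every completion in a group gets the same reward, every advantage is zero and there is nothing to learn from that group. A minimal sketch, assuming that normalization; the function name and `eps` value are made up for illustration and are not taken from any particular library:

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Hypothetical helper: normalize rewards within one group of completions."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    # eps avoids division by zero when all rewards are identical
    return [(r - mean) / (std + eps) for r in rewards]

# Early in training: every completion fails, so all rewards match.
same = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

# Later: some completions succeed, so the group carries a signal.
mixed = group_relative_advantages([0.0, 1.0, 0.0, 1.0])

print(same)   # all zeros: no learning signal from this group
print(mixed)  # negative for failures, positive for successes
```

Under this assumption, a bit of sft first would help simply because it makes mixed-reward groups show up sooner.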
Thanks! Yes, synthetic data generation and data augmentation are also very useful! One trick is to first generate 1000s of possible answers, then select the top 10 to use in GRPO. It's kinda like o3 with majority voting!
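The generate-then-filter trick could look roughly like this. Sketch only: `generate` and `score` are hypothetical stand-ins for your model's sampler and your reward function, and the toy versions at the bottom exist just to make the example runnable.

```python
import heapq
import random

def select_top_k(prompt, generate, score, n_samples=1000, k=10):
    """Sample n_samples candidate answers and keep the k highest-scoring ones."""
    candidates = [generate(prompt) for _ in range(n_samples)]
    # nlargest returns the k best candidates, best first
    return heapq.nlargest(k, candidates, key=score)

# Toy stand-ins (assumptions, not a real model or reward):
rng = random.Random(0)

def toy_generate(prompt):
    return str(rng.randint(0, 99))  # pretend the model guesses a number

def toy_score(answer):
    return -abs(int(answer) - 42)   # closer to 42 scores higher

top = select_top_k("what is 6 * 7?", toy_generate, toy_score)
print(top)
```

The 10 survivors would then be the completions fed into the GRPO step for that prompt.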