Comment by sidkshatriya
2 days ago
Does this mean that you can only do GRPO training on models that already produce reasoning traces in <think>...</think>?
Oh, not at all!! You can actually get a model to generate the <think>...</think> tokens itself! That's how DeepSeek trained R1 Zero, which essentially gave the model reasoning skills!
Won't you have to use a distilled DeepThink model then? Because the training phase with GRPO requires it to put its reasoning within <think></think> for the least loss.
Oh no no!! The trick for GRPO is you essentially let the model "learn" how to do reasoning itself!!!
The <think> tokens are just a formatting convention. You could use <reasoning>, <thinking>, or [reasoning] in the system prompt instead, for example.
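A minimal sketch of what such a system prompt could look like (the tag names here are just an assumption; any delimiter works as long as your format check expects the same one):

```python
# Hypothetical system prompt: the <reasoning>/<answer> tags are arbitrary choices,
# not anything the base model was pretrained on.
SYSTEM_PROMPT = """Respond in the following format:
<reasoning>
...your step-by-step reasoning...
</reasoning>
<answer>
...your final answer...
</answer>"""
```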
Models already have hidden latent CoT-style reasoning within them, and GRPO helps induce that behavior. For instance, see https://x.com/asankhaya/status/1838375748165628053 where a sampling technique (CoT decoding) can actually improve the performance of the model.
Oh yep! The DeepSeek paper also mentions how large enough LLMs inherently have reasoning capabilities, and the goal of GRPO is to accentuate those latent skills!
Nah, you can just request that in your prompt and then fail answers that are incorrect and/or don't include the think trace
Yes exactly! You can in fact add that as a reward function for style and format checking!
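A minimal sketch of what such reward functions could look like (the function names, tag choice, and scores are assumptions for illustration; the exact calling convention depends on your GRPO trainer, e.g. TRL's GRPOTrainer expects one score per sampled completion):

```python
import re

# Hypothetical reward functions for GRPO. The <think> tag and the score values
# are arbitrary choices, not fixed by the algorithm itself.

def format_reward(completion: str) -> float:
    """Reward completions that wrap their reasoning in <think>...</think>."""
    return 1.0 if re.search(r"<think>.*?</think>", completion, flags=re.DOTALL) else 0.0

def correctness_reward(completion: str, expected_answer: str) -> float:
    """Reward completions whose text outside the reasoning block contains the expected answer."""
    answer_part = re.sub(r"<think>.*?</think>", "", completion, flags=re.DOTALL)
    return 2.0 if expected_answer.strip() in answer_part else 0.0

def total_reward(completion: str, expected_answer: str) -> float:
    # GRPO then compares these rewards across the group of completions
    # sampled for the same prompt, so only relative differences matter.
    return format_reward(completion) + correctness_reward(completion, expected_answer)
```

In practice you would pass a list of such functions to your trainer, which scores every completion in the group with them and uses the group-relative advantages for the policy update.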