Comment by sidkshatriya
2 days ago
Does this mean that you can only do GRPO training on models that already produce reasoning traces in <think>...</think>?
Oh, not at all!! You can actually get a model to generate the <think>...</think> tokens itself! That's how DeepSeek trained R1 Zero, which essentially gave the model reasoning skills!
Won't you have to use a distilled DeepThink model then? Because the training phase with GRPO requires it to put its reasoning within <think></think> for the least loss.
Oh no no!! The trick for GRPO is you essentially let the model "learn" how to do reasoning itself!!!
The <think> tokens are just a formatting convention. You could use <reasoning>, <thinking>, or [reasoning] in the system prompt instead, for example.
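A minimal sketch of what such a system prompt could look like (the tag names here are just an assumption; any delimiter works as long as your format check expects the same one):

```python
# Hypothetical system prompt: the <reasoning>/<answer> tags are arbitrary choices,
# not anything the base model was pretrained on.
SYSTEM_PROMPT = """Respond in the following format:
<reasoning>
...your step-by-step reasoning...
</reasoning>
<answer>
...your final answer...
</answer>"""
```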
Models already have hidden latent CoT-style reasoning within them, and GRPO helps induce that behavior. For instance, see https://x.com/asankhaya/status/1838375748165628053 where a sampling technique (CoT decoding) can actually improve the performance of the model.
Oh yep! The DeepSeek paper also mentions how large enough LLMs inherently have reasoning capabilities, and the goal of GRPO is to accentuate those latent skills!
Nah, you can just request that in your prompt and then fail answers that are incorrect and/or don't include the think trace
Yes exactly! You can in fact add that as a reward function for style and format checking!
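A minimal sketch of what such reward functions could look like (the function names, tag choice, and scores are assumptions for illustration; the exact calling convention depends on your GRPO trainer, e.g. TRL's GRPOTrainer expects one score per sampled completion):

```python
import re

# Hypothetical reward functions for GRPO. The <think> tag and the score values
# are arbitrary choices, not fixed by the algorithm itself.

def format_reward(completion: str) -> float:
    """Reward completions that wrap their reasoning in <think>...</think>."""
    return 1.0 if re.search(r"<think>.*?</think>", completion, flags=re.DOTALL) else 0.0

def correctness_reward(completion: str, expected_answer: str) -> float:
    """Reward completions whose text outside the reasoning block contains the expected answer."""
    answer_part = re.sub(r"<think>.*?</think>", "", completion, flags=re.DOTALL)
    return 2.0 if expected_answer.strip() in answer_part else 0.0

def total_reward(completion: str, expected_answer: str) -> float:
    # GRPO then compares these rewards across the group of completions
    # sampled for the same prompt, so only relative differences matter.
    return format_reward(completion) + correctness_reward(completion, expected_answer)
```

In practice you would pass a list of such functions to your trainer, which scores every completion in the group with them and uses the group-relative advantages for the policy update.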