Comment by danielhanchen
2 days ago
Yes you're correct!
Very good question on SFT vs GRPO!
Assume the dataset I have is "What is 2+2?", "The answer is 4".
1. If you have very high quality labelled data, SFT should work fine, i.e. "What is 2+2? Let me think about it....., The answer is 4"
2. If you only have the input "What is 2+2" and just the answer "4", but nothing in between, GRPO can be very helpful! GRPO can help produce the reasoning traces automatically - you will need to provide some scoring / reward functions though. For example, if the answer == 4, give +1 score (see the sketch after this list).
3. You can combine SFT and GRPO! Do SFT first, then GRPO - this usually makes GRPO converge faster!
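To make the reward-function idea in point 2 concrete, here is a minimal sketch, assuming the TRL GRPOTrainer convention where completions (and any extra dataset columns, like a hypothetical answer column) arrive as lists and the function returns one score per completion - the names here are illustrative, not Unsloth's actual code:

    # Minimal sketch of a correctness reward (illustrative, not Unsloth's code).
    # Assumes the TRL GRPOTrainer convention: `completions` plus extra dataset
    # columns (here a hypothetical `answer` column) arrive as lists, and the
    # function returns one float score per completion.
    def correctness_reward(completions, answer, **kwargs):
        scores = []
        for completion, expected in zip(completions, answer):
            # +1 if the expected answer (e.g. "4") appears in the model's output
            scores.append(1.0 if str(expected) in completion else 0.0)
        return scores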
Does this mean that you can only do GRPO on models that already have reasoning traces in <think>...</think>?
Oh not at all!! You can actually get a model to generate the <think>...</think> tokens itself! That's how DeepSeek trained R1 Zero, which essentially made the model develop reasoning skills!
Won't you have to use a distilled DeepThink model then? Because the GRPO training phase requires the reasoning to be within <think></think> for the lowest loss.
Models already have hidden latent CoT-style reasoning within them; GRPO helps induce that behavior. For instance see https://x.com/asankhaya/status/1838375748165628053 where a sampling technique (CoT decoding) can actually improve performance of the model.
Oh yep! The DeepSeek paper also mentioned how large enough LLMs inherently have reasoning capabilities, and the goal of GRPO is to accentuate those latent skills!
Nah, you can just request that in your prompt and then fail answers that are incorrect and/or don't include the think trace
Yes exactly! You can in fact add that as a reward function for style and format checking!
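For example, a rough format-checking reward could look like the sketch below - the regex and the +0.5 score are just assumptions for illustration, not Unsloth defaults:

    import re

    # Hypothetical format reward: give partial credit when the completion
    # wraps its reasoning in <think>...</think> before the final answer.
    THINK_PATTERN = re.compile(r"<think>.*?</think>", re.DOTALL)

    def format_reward(completions, **kwargs):
        # +0.5 if the think trace is present, 0 otherwise
        return [0.5 if THINK_PATTERN.search(c) else 0.0 for c in completions]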
Can you give some real-world examples of when this would be useful? Does this work for tasks requiring tool calling as well?
Yes, tool calling is a prime example!! I.e. you have some specific task, and the final output involves some tools, but sadly the steps to call the tools / the stuff in between / the thinking process is missing.
You can employ GRPO and maybe add an actual Python environment for the model to learn to act in (a rough sketch of that idea is below).
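One hypothetical way such an environment-based reward could look - the expected_output column, the result variable, and the use of exec are all made up for illustration, none of this is a real Unsloth/TRL API, and exec should only ever run in a throwaway sandbox:

    # Illustrative only: run the Python the model emitted as its "tool call"
    # and reward it when the result matches the known final answer.
    def tool_use_reward(completions, expected_output, **kwargs):
        scores = []
        for completion, expected in zip(completions, expected_output):
            try:
                namespace = {}
                exec(completion, namespace)       # the model's emitted code
                result = namespace.get("result")  # assume it stores its answer in `result`
                scores.append(1.0 if result == expected else 0.0)
            except Exception:
                scores.append(-0.5)               # penalize traces that don't even run
        return scores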
I'm waiting for https://github.com/huggingface/trl/pull/2810 to land. I think this should work with the existing unsloth setup without changes.