Comment by sidkshatriya
1 day ago
This is what I understood from the blog post (please correct me if I am wrong):
Unsloth allows you to give it a transformer model and additional training data to do LoRA/QLoRA. LoRA/QLoRA keeps the weights of the model constant but outputs some low-rank adjustments to the weights, which serve as the weight "delta".
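Roughly, as a toy sketch (the shapes and names here are made up for illustration, not Unsloth's actual code):

```python
import torch

# Frozen base weight of one linear layer: W never changes during training.
W = torch.randn(4096, 4096)

# LoRA trains two small matrices with rank r << 4096.
r, alpha = 16, 32
A = torch.randn(r, 4096) * 0.01   # trainable
B = torch.zeros(4096, r)          # trainable, zero-init so the delta starts at 0

# The effective weight is the frozen base plus a low-rank "delta".
W_effective = W + (alpha / r) * (B @ A)
```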
Typically one would do SFT with the training data. But Unsloth also allows you to do RL (reinforcement learning), specifically GRPO, on the model + training data you give it! The output of GRPO here is again in the form of LoRA/QLoRA weights.
You have found a way to reduce the memory requirements for GRPO.
Question: How does one decide whether the training data should be used for SFT (supervised fine-tuning) or GRPO? When will you get better results with SFT and when with GRPO?
Yes you're correct!
Very good question on SFT vs GRPO!
Assume the dataset I have is "What is 2+2?", "The answer is 4".
1. If you have very high quality labelled data, SFT should work fine. I.e. "What is 2+2? Let me think about it....., The Answer is 4"
2. If you only have the input "What is 2+2" and just the answer "4", but nothing in between, GRPO could be very helpful! GRPO can help produce the reasoning traces automatically - you will need to provide some scoring / reward functions though. For example, if the answer == 4, add +1 to the score (see the reward-function sketch after this list).
3. You can combine SFT and GRPO! Do SFT first, then GRPO - this most likely makes GRPO converge faster!
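For point 2, a minimal sketch of such a reward function (the signature follows the reward-function convention in TRL's GRPOTrainer, which Unsloth builds on; the `answer` column name is an assumption):

```python
import re

def correctness_reward(completions, answer, **kwargs):
    """Score each sampled completion: +1 if the last number in it matches
    the reference answer, 0 otherwise. `answer` is an assumed dataset column."""
    scores = []
    for completion, ref in zip(completions, answer):
        numbers = re.findall(r"-?\d+", completion)  # e.g. "... The Answer is 4" -> ["2", "2", "4"]
        scores.append(1.0 if numbers and numbers[-1] == str(ref) else 0.0)
    return scores
```

GRPO samples several completions per prompt and nudges the model toward the higher-scoring ones, which is how the reasoning "in between" gets discovered even though it was never in your data.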
Does this mean that you can only do GRPO on models that already produce reasoning traces in <think>...</think>?
Oh not at all!! You can actually convert a model to generate the <think>...</think> tokens itself! That's how DeepSeek trained R1 Zero, which essentially made the model have reasoning skills!
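One common way to encourage that (a rough sketch; the tag pattern and bonus value are illustrative, not a fixed recipe) is to add a format reward alongside the correctness reward:

```python
import re

THINK_PATTERN = re.compile(r"<think>.+?</think>", re.DOTALL)

def format_reward(completions, **kwargs):
    """Small bonus whenever a completion wraps its reasoning in
    <think>...</think>, so the model learns to emit the tags itself."""
    return [0.5 if THINK_PATTERN.search(c) else 0.0 for c in completions]
```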
Models already have hidden latent CoT-style reasoning within them; GRPO would help induce that behavior. For instance, see https://x.com/asankhaya/status/1838375748165628053 where a sampling technique (CoT decoding) can actually improve performance of the model.
Nah, you can just request that in your prompt and then fail answers that are incorrect and/or don't include the think trace
Can you give some real-world examples of when this would be useful? Does this work for tasks requiring tool calling as well?
Yes, tool calling is a prime example!! I.e. you have some specific task, and the final output involves some tools, but sadly the steps to call the tools / the stuff in between / the thinking process is missing.
You can employ GRPO and maybe add an actual Python environment for the model to learn to act in.
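A rough sketch of what that reward could look like - extract the model's tool call (here, a Python expression in made-up <tool> tags), execute it, and reward matches against the expected result. The tag format and the `expected` column are assumptions, not anything Unsloth ships:

```python
import re

def tool_call_reward(completions, expected, **kwargs):
    """Reward completions whose <tool>...</tool> block, when executed,
    produces the expected result. `expected` is an assumed dataset column."""
    scores = []
    for completion, target in zip(completions, expected):
        match = re.search(r"<tool>(.*?)</tool>", completion, re.DOTALL)
        if not match:
            scores.append(0.0)      # no tool call at all
            continue
        try:
            # A real setup would sandbox this; bare eval() is only for the sketch.
            result = eval(match.group(1), {"__builtins__": {}})
            scores.append(1.0 if result == target else 0.2)  # partial credit for a runnable call
        except Exception:
            scores.append(0.0)      # the tool call crashed
    return scores
```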
Is it established whether GRPO is essential for this to work as it does, or could other RLHF-class methods provide similar results? My initial (possibly mistaken) impression was that GRPO was one way of mitigating the lack of enormous hardware resources.
Yep so GRPO is much more memory efficient than PPO, but other RL type algorithms can work fine as well!