Comment by danielhanchen
2 days ago
Oh no no!! The trick for GRPO is you essentially let the model "learn" how to do reasoning itself!!!
The <think> tags are only a formatting choice and are optional. You could just as well ask for <reasoning>, <thinking>, or [reasoning] in the system prompt, for example.
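To make the "tags are arbitrary" point concrete, here is a minimal sketch of a format reward that takes the tag pair as a parameter. The names `make_format_reward` and `make_system_prompt` are hypothetical, the closing tag `[/reasoning]` is my assumption, and this is not tied to any particular GRPO library; it only shows that the reward checks for whichever tags the system prompt asked for.

```python
import re


def make_system_prompt(open_tag: str, close_tag: str) -> str:
    """Build a system prompt that asks for reasoning inside the given tags (hypothetical helper)."""
    return (
        f"Work through the problem between {open_tag} and {close_tag}, "
        "then give your final answer."
    )


def make_format_reward(open_tag: str, close_tag: str):
    """Return a reward function that scores 1.0 when a completion is
    '<open>reasoning<close> answer' and 0.0 otherwise (hypothetical helper)."""
    pattern = re.compile(
        r"^\s*" + re.escape(open_tag)      # completion starts with the opening tag
        + r"(.+?)" + re.escape(close_tag)  # non-empty reasoning, then the closing tag
        + r"\s*(\S.*)$",                   # followed by a non-empty answer
        re.DOTALL,
    )

    def reward(completion: str) -> float:
        return 1.0 if pattern.match(completion) else 0.0

    return reward


# The same reward logic works for any tag pair you pick.
reward_think = make_format_reward("<think>", "</think>")
reward_bracket = make_format_reward("[reasoning]", "[/reasoning]")
```

Only the format is rewarded here; what goes between the tags is left for the model to figure out during training, which is the point of the comment above.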