Comment by danielhanchen

1 year ago

Oh no no!! The trick for GRPO is you essentially let the model "learn" how to do reasoning itself!!!

The <think> tokens are optional for formatting reasons. You could use <reasoning> or <thinking> or [reasoning] for example in the system prompt.