Comment by codelion
1 day ago
Models already have hidden latent CoT style reasoning within them, GRPO would help induce that behavior. For instance see https://x.com/asankhaya/status/1838375748165628053 where a sampling technique (CoT decoding) can actual improve performance of the model.
Oh yep! The deepseek paper also mentioned how large enough LLMs inherently have responding capabilities and the goal of GRPO is to accentuate latent skills!