← Back to context

Comment by codelion

1 day ago

Models already have hidden latent CoT style reasoning within them, GRPO would help induce that behavior. For instance see https://x.com/asankhaya/status/1838375748165628053 where a sampling technique (CoT decoding) can actual improve performance of the model.

Oh yep! The deepseek paper also mentioned how large enough LLMs inherently have responding capabilities and the goal of GRPO is to accentuate latent skills!