Comment by imjonse
1 day ago
Is it established whether GRPO is essential for this to work as it does, or could other RLHF-class methods provide similar results? My initial (possibly mistaken) impression was that GRPO was one of the ways of mitigating the lack of enormous hardware resources.
Yep, GRPO is much more memory efficient than PPO, mainly because it doesn't need a separate value (critic) model, but other RL-type algorithms can work fine as well!
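To make the memory point concrete, here is a rough sketch of the group-relative advantage idea as I understand it: instead of a learned value network (as in PPO), each sampled completion is scored against the other completions for the same prompt. The function name, the example rewards, and the use of the population standard deviation are my own assumptions for illustration; real implementations may differ in these details.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of completions for the same prompt.

    Hypothetical helper for illustration only: the group mean acts as the
    baseline, so no separate value (critic) model has to be kept in memory.
    """
    mean = statistics.fmean(rewards)
    # Assumption: population std; some implementations use the sample std.
    std = statistics.pstdev(rewards)
    # eps avoids division by zero when every completion gets the same reward.
    return [(r - mean) / (std + eps) for r in rewards]

# Made-up rewards for four completions of a single prompt (1.0 = correct).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Completions that score above the group mean get a positive advantage and those below get a negative one, which is the signal the policy update uses. If all rewards in a group are identical, every advantage comes out as zero and that group contributes no learning signal.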