Comment by theOGognf

2 months ago

I feel like both this comment and the parent comment highlight how RL has been going through a cycle of misunderstanding lately, brought on by another popularity boom now that it's being used to train LLMs

care to correct the misunderstanding?

  • For one, DPO, PPO, and GRPO all use losses that differ from the cross-entropy loss used in SFT (see the sketch after this list).

    They also force exploration as part of the algorithm: PPO and GRPO train on samples drawn from the current policy rather than on fixed targets.

    They can be used for synthetic data generation once the reward model is good enough.
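To make the loss difference concrete, here is a minimal sketch assuming a PyTorch setup. The function names, tensor shapes, and toy inputs are illustrative assumptions, not any particular library's API; the PPO loss shown is the standard clipped surrogate objective.

```python
# Contrast: SFT matches fixed target tokens with cross-entropy, while a
# PPO-style loss reweights tokens the policy itself sampled, using an
# advantage signal derived from a reward model.
import torch
import torch.nn.functional as F

def sft_loss(logits, target_ids):
    """Supervised fine-tuning: plain next-token cross-entropy.
    logits: (batch, seq_len, vocab); target_ids: (batch, seq_len)."""
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        target_ids.reshape(-1),
    )

def ppo_clip_loss(new_logprobs, old_logprobs, advantages, clip_eps=0.2):
    """PPO's clipped surrogate objective. All inputs are (batch, seq_len)
    over tokens sampled from the policy, which is where the exploration
    comes from: there are no fixed targets, only advantage-weighted samples."""
    ratio = torch.exp(new_logprobs - old_logprobs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    # Negate because optimizers minimize; PPO maximizes the surrogate.
    return -torch.min(unclipped, clipped).mean()

# Toy shapes just to show both losses run.
B, T, V = 2, 8, 50
logits = torch.randn(B, T, V)
targets = torch.randint(0, V, (B, T))
print("SFT loss:", sft_loss(logits, targets).item())

new_lp = torch.randn(B, T)
old_lp = new_lp.detach() + 0.1 * torch.randn(B, T)
adv = torch.randn(B, T)
print("PPO loss:", ppo_clip_loss(new_lp, old_lp, adv).item())
```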