Comment by theOGognf
4 days ago
I feel like both this comment and the parent comment highlight how RL has been going through a cycle of misunderstanding recently from another one of its popularity booms due to being used to train LLMs
care to correct the misunderstanding?
For one, DPO, PPO, and GRPO all use losses that differ from the standard SFT cross-entropy objective.
They also force exploration as part of the algorithm.
They can be used for synthetic data generation once the reward model is good enough.
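To make the first point concrete, here is a minimal sketch contrasting the SFT cross-entropy loss with a PPO-style clipped surrogate loss on a single toy token decision. All numbers (logits, old-policy probability, advantage, clip range) are hypothetical, chosen only to illustrate the shape of each objective:

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

# SFT: plain cross-entropy against a fixed target token.
logits = np.array([2.0, 0.5, -1.0])  # toy logits over a 3-token vocab
target = 0
probs = softmax(logits)
sft_loss = -np.log(probs[target])

# PPO: clipped surrogate using a probability ratio and a
# reward-derived advantage estimate instead of a fixed target.
old_prob = 0.30           # prob of the sampled token under the old policy (hypothetical)
new_prob = probs[target]  # prob under the current policy
advantage = 1.5           # advantage estimate (hypothetical)
eps = 0.2                 # PPO clip range
ratio = new_prob / old_prob
ppo_loss = -np.minimum(ratio * advantage,
                       np.clip(ratio, 1 - eps, 1 + eps) * advantage)

print(f"SFT loss: {sft_loss:.3f}")
print(f"PPO clipped surrogate loss: {ppo_loss:.3f}")
```

The key structural difference: SFT pushes probability toward a fixed label, while the PPO objective scales the update by an advantage computed from rewards and clips the policy ratio to keep updates conservative.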
It's reductive, but also roughly correct.
While collecting data according to the current policy is part of RL, 'reductive' is an understatement. It's like saying algebra is all about scalar products. Well, yes, that covers maybe 1% of it.