Comment by whatshisface

4 days ago

RL is barely even a training method; it's more of a dataset generation method.

I feel like both this comment and the parent comment highlight how RL has been going through a cycle of misunderstanding lately, brought on by another of its popularity booms now that it's used to train LLMs.

  • care to correct the misunderstanding?

    • I mean, for one, DPO, PPO, and GRPO all use losses that differ from the cross-entropy loss used in SFT.

      They also force exploration as part of the algorithm.

      They can be used for synthetic data generation once the reward model is good enough.
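
      To make the loss difference concrete, here is a minimal PyTorch sketch contrasting the SFT cross-entropy loss with the PPO clipped surrogate. The tensor shapes, the advantage values, and the sampled log-probs are toy stand-ins for illustration, not anyone's production setup; only the clip range of 0.2 comes from the PPO paper.

          import torch
          import torch.nn.functional as F

          # --- SFT: cross-entropy against a fixed target sequence ---
          logits = torch.randn(4, 10, 32000)          # (batch, seq, vocab), toy values
          targets = torch.randint(0, 32000, (4, 10))  # reference tokens
          sft_loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                                     targets.reshape(-1))

          # --- PPO: clipped surrogate on policy probability ratios ---
          logp_new = -torch.rand(4, 10)               # log-probs of sampled tokens, current policy (toy)
          logp_old = logp_new.detach() - 0.1 * torch.rand(4, 10)  # frozen rollout policy (toy)
          advantages = torch.randn(4, 10)             # e.g. reward minus a baseline (toy)
          eps = 0.2                                   # clip range from the PPO paper

          ratio = torch.exp(logp_new - logp_old)
          ppo_loss = -torch.min(ratio * advantages,
                                torch.clamp(ratio, 1 - eps, 1 + eps) * advantages).mean()

      The SFT loss always pulls toward a fixed target sequence; the PPO loss pulls toward whichever sampled tokens got positive advantage, which is where the forced exploration comes in.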

  • It's reductive, but also roughly correct.

    • While collecting data according to the policy is part of RL, 'reductive' is an understatement. It's like saying algebra is all about scalar products. Well yes, that's maybe 1% of it.
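
      For concreteness, that 1% (sampling rollouts from the current policy and scoring them) looks roughly like the sketch below. The sampler and reward here are toy stand-ins so it actually runs; real code would sample from an LLM and score with a reward model or verifier.

          import random

          def collect_rollouts(sample_fn, reward_fn, prompts, n_samples=4):
              # Sample completions from the *current* policy and score each one.
              rollouts = []
              for prompt in prompts:
                  for _ in range(n_samples):
                      completion = sample_fn(prompt)
                      rollouts.append((prompt, completion, reward_fn(prompt, completion)))
              return rollouts

          # Toy stand-ins so the sketch runs end to end.
          toy_sample = lambda p: p + " " + random.choice(["yes", "no", "maybe"])
          toy_reward = lambda p, c: 1.0 if c.endswith("yes") else 0.0
          data = collect_rollouts(toy_sample, toy_reward, ["Q1:", "Q2:"])
          keep = [r for r in data if r[2] > 0.5]  # rejection-sampling-style filter

      Filtering like that and doing SFT on the survivors really is just dataset generation; the point is that PPO and GRPO don't stop there, they feed the same rollouts into their own losses and repeat the loop.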