Slacker News

Comment by mistercheph

3 days ago

care to correct the misunderstanding?

1 comment


mountainriver  3 days ago

For one, DPO, PPO, and GRPO all use loss functions that differ from the one used in SFT.
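
To make the loss difference concrete, here is a rough sketch contrasting the usual SFT objective (token-level negative log-likelihood) with the DPO loss for a single preference pair. The function names, the inputs (log-probabilities already computed elsewhere), and the `beta=0.1` default are assumptions for illustration, not taken from any particular library.

```python
import math

def sft_loss(token_logprobs):
    """SFT: average negative log-likelihood of the target tokens."""
    return -sum(token_logprobs) / len(token_logprobs)

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO for one (chosen, rejected) pair.

    Each argument is the summed log-probability of a full response under
    the trained policy or the frozen reference model.
    """
    # How much more the policy prefers "chosen" over "rejected",
    # relative to the reference model.
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    x = -beta * margin
    # -log(sigmoid(beta * margin)) == softplus(x), computed stably.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))
```

SFT only ever looks at one target sequence, while DPO compares two responses against a reference model, so the gradient signal is different in kind and not just in scale.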

They also force exploration as a part of the algorithm.
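
As a rough illustration of the exploration point for GRPO specifically: the policy samples a group of completions per prompt, scores each one, and turns the rewards into group-relative advantages. A minimal sketch of that normalization step follows; the function name, the `eps` value, and the use of the population standard deviation are assumptions here, and real implementations may differ.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize the rewards of one sampled group to zero mean, unit scale."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption
    # eps keeps the division finite when every sample got the same reward.
    return [(r - mean) / (std + eps) for r in rewards]
```

If every sample in the group gets the same reward, all advantages are zero and there is nothing to learn from, so the method only works when sampling produces varied completions.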

They can be used for synthetic data generation once the reward model is good enough.
