mistercheph 4 days ago: care to correct the misunderstanding?

mountainriver 3 days ago:
I mean, DPO, PPO, and GRPO all use losses that differ from the one used in SFT, for one.
They also force exploration as part of the algorithm.
And they can be used for synthetic data generation once the reward model is good enough.
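To make the first point concrete, here is a minimal sketch of the difference. SFT minimizes the negative log-likelihood of target tokens, while DPO (as one example of the preference-based losses mentioned above) uses a logistic loss on the margin between the policy's and a reference model's preference for a chosen vs. rejected response. The function names and the scalar log-probability inputs are simplifications for illustration, not any library's API.

```python
import math

def sft_loss(token_logprobs):
    # SFT objective: mean negative log-likelihood of the target tokens
    return -sum(token_logprobs) / len(token_logprobs)

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    # DPO objective: -log(sigmoid(beta * margin)), where the margin compares
    # how much more the policy prefers the chosen response over the rejected
    # one, relative to the frozen reference model
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# When the policy matches the reference, the margin is zero and the
# DPO loss is log(2) -- gradient pressure then pushes the chosen
# response's likelihood up relative to the rejected one's.
```

Note that the DPO loss depends on *pairs* of responses and a reference model, neither of which appears in the SFT objective; that structural difference is what the comment is pointing at.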