Comment by storus 3 months ago:

We might not even need RL, as DPO has shown.

Reply by programjames 3 months ago:
> if you purely use policy optimization, RLHF will be biased towards short horizons
> most RL has some adversarial loss (how do you train your preference network?), which makes the loss landscape fractal, and SGD smooths that landscape incorrectly
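For context on storus's point: DPO (Direct Preference Optimization) sidesteps both the learned preference network and the RL loop by optimizing a closed-form loss on preference pairs directly. A minimal sketch of that per-pair loss, using hypothetical log-probability inputs (the variable names and values below are illustrative, not from any specific implementation):

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for a single preference pair (chosen y_w, rejected y_l).

    logp_w, logp_l         : policy log-probs of the chosen / rejected response
    ref_logp_w, ref_logp_l : same log-probs under the frozen reference model
    beta                   : strength of the implicit KL constraint
    """
    # Implicit reward margin: how much more the policy (relative to the
    # reference) prefers the chosen response over the rejected one.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log sigmoid(margin): minimized as the policy learns to rank y_w above y_l.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# With equal log-probs the margin is zero and the loss is log(2);
# it decreases as the policy separates the chosen from the rejected response.
print(dpo_loss(-1.0, -2.0, -1.5, -1.5))
```

No reward model is trained and no policy rollouts are sampled, which is exactly why it avoids the adversarial preference-network training programjames raises; whether it also avoids the short-horizon bias is the open question in this thread.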