Comment by Davidzheng
7 months ago
RLHF means Reinforcement Learning from Human Feedback. The right/wrong ones are called either RL or RLVR (Verifiable Rewards).