Comment by rishabhaiover
3 days ago
I like to think of RLHF as the technique I used as a student to score good marks on my exams. Once I started working, I realized that out-of-distribution generalization can't be achieved only by practicing in an environment with verifiable rewards.