Comment by anticensor
6 days ago
I think the idea is that they just feed each candidate answer to the RLHF reward model used to train the model and return the one with the highest reward.
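For what it's worth, here is a rough sketch of what I mean, essentially best-of-n selection. Everything in it is an assumption for illustration: `best_of_n` and `toy_reward` are hypothetical names, and `toy_reward` is a word-overlap stand-in for a real reward model, not anything they actually use.

```python
from typing import Callable, Sequence


def best_of_n(
    prompt: str,
    candidates: Sequence[str],
    reward_fn: Callable[[str, str], float],
) -> str:
    """Return the candidate that reward_fn scores highest for the prompt."""
    if not candidates:
        raise ValueError("need at least one candidate answer")
    # Score every candidate with the reward function and keep the top one.
    return max(candidates, key=lambda answer: reward_fn(prompt, answer))


def toy_reward(prompt: str, answer: str) -> float:
    """Hypothetical stand-in for a reward model: counts answer words found in the prompt."""
    prompt_words = set(prompt.lower().split())
    return float(sum(word in prompt_words for word in answer.lower().split()))


if __name__ == "__main__":
    answers = ["no idea", "rlhf is reward-based fine-tuning", "what"]
    print(best_of_n("what is rlhf", answers, toy_reward))
```

In a real setup `reward_fn` would wrap a call to the trained reward model, and the candidates would be n samples drawn from the policy for the same prompt.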