Comment by anticensor
6 months ago
I think the idea is that they just feed each candidate answer to the RLHF reward model that was used to train the model, and return the one with the highest reward.
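For what it's worth, that would amount to best-of-n selection. Here is a rough sketch of what I mean, not anyone's actual implementation: `best_of_n` and `toy_reward` are names I made up, and `toy_reward` is only a stand-in for a real reward model, which would be a learned network rather than a word-overlap count.

```python
from typing import Callable, Sequence


def best_of_n(
    prompt: str,
    candidates: Sequence[str],
    reward_fn: Callable[[str, str], float],
) -> str:
    """Return the candidate that reward_fn scores highest for the prompt."""
    if not candidates:
        raise ValueError("need at least one candidate answer")
    # Score every candidate and keep the top one; on ties, max() keeps the first.
    return max(candidates, key=lambda answer: reward_fn(prompt, answer))


def toy_reward(prompt: str, answer: str) -> float:
    """Hypothetical stand-in for a reward model: counts answer words that also occur in the prompt."""
    prompt_words = set(prompt.lower().split())
    return float(sum(1 for word in answer.lower().split() if word in prompt_words))


if __name__ == "__main__":
    prompt = "what is the capital of france"
    candidates = ["Paris", "the capital of france is Paris", "I don't know"]
    print(best_of_n(prompt, candidates, toy_reward))
```

In a real setup the candidates would be n samples drawn from the model for the same prompt, and `reward_fn` would call the reward model.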