Comment by Bayano2
3 hours ago
This was true circa GPT-2, less true after RLHF, and not true at all after RLVR. It's trying to model the distribution of outputs most likely to solve the problem, not the average distribution.