Comment by Bayano2

3 hours ago

This was true circa GPT2, less true after RLHF and not true at all after RLVR. It's trying to model the distribution of outputs most likely to solve the problem, not the average distribution.

0 comments