Comment by tasn
1 month ago
He's not advocating for "trust us", he's advocating for more information than just the benchmarks.
Unfortunately, I'm not sure what a solution that can't be gamed may even look like (which is what gp is asking for).
The best thing would be blind preference tests for a wide variety of problems across domains, but unfortunately even these can be gamed if desired. The upside is that gaming them requires being explicitly malicious, which I'd imagine would result in whistleblowing at some point. However, Claude's position on leaderboards outside of webdev arena makes me skeptical.
My objection is not towards “advocating for more information”, my objection is towards “so focused on making those scores go up it’s becoming a bit of a perverse incentive”. That type of comment might apply in some other thread about some other release, but it doesn’t belong in this one.