Comment by BrawnyBadger53

1 month ago

The best thing would be blind preference tests for a wide variety of problems across domains, but unfortunately even these can be gamed if desired. The upside is that gaming them requires being explicitly malicious, which I'd imagine would result in whistleblowing at some point. However, Claude's position on leaderboards outside of WebDev Arena makes me skeptical.