Comment by mentalgear
1 month ago
The trick is that the benchmarks must cover a wide enough distribution of tasks so that a well-scoring model is potentially useful to the widest possible span of users.
There would also need to be a guarantee (or some way of checking the model) that model providers don't just train on the benchmarks. Possible solutions are dynamic components (randomized names, numbers, etc.) or keeping parts of the benchmark private.
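A rough sketch of the "dynamic components" idea: names, values, and the expected answer are regenerated per evaluation run, so memorizing a fixed answer key doesn't help. The template and helper names here are made up for illustration, not from any real benchmark:

    import random

    # Hypothetical templated benchmark item: the surface form changes every run,
    # but the underlying skill being tested (a simple word problem) does not.
    NAMES = ["Asha", "Bruno", "Chen", "Dana"]

    def generate_item(rng: random.Random) -> dict:
        name = rng.choice(NAMES)
        a, b = rng.randint(2, 50), rng.randint(2, 50)
        question = f"{name} has {a} apples and buys {b} more. How many apples now?"
        return {"question": question, "answer": str(a + b)}

    def score(model_answer: str, item: dict) -> bool:
        return model_answer.strip() == item["answer"]

    if __name__ == "__main__":
        rng = random.Random()  # fresh seed per run -> fresh surface forms
        item = generate_item(rng)
        print(item["question"])
        # A provider that memorized last run's question strings gains nothing here.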
A common pattern is for benchmark owners to hold back X% of their set so they can independently validate that models perform similarly on the held-back set. See: the FrontierMath / OpenAI brouhaha.
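A minimal sketch of that holdback pattern, assuming an arbitrary split ratio and gap threshold (neither is anything FrontierMath or OpenAI actually publish):

    import random

    def split_benchmark(items: list, holdback_frac: float = 0.2, seed: int = 0):
        """Split items into a public set and a privately held-back set."""
        rng = random.Random(seed)
        shuffled = items[:]
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * (1 - holdback_frac))
        return shuffled[:cut], shuffled[cut:]  # (public, holdback)

    def contamination_flag(public_acc: float, holdback_acc: float,
                           max_gap: float = 0.05) -> bool:
        """Flag a model whose public-set accuracy exceeds its holdback accuracy
        by more than max_gap -- a hint it may have trained on the public part."""
        return (public_acc - holdback_acc) > max_gap

    # contamination_flag(0.91, 0.78) -> True: suspicious gap
    # contamination_flag(0.84, 0.82) -> False: scores look consistent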