
Comment by mentalgear

1 month ago

The trick is that the benchmarks must have a wide enough distribution so that a well-scoring model is potentially useful to the widest possible range of users.

There would also need to be a guarantee (or some way of checking) that model providers don't just train on the benchmarks. Possible solutions are dynamic components (randomized names, numbers, etc.) or keeping parts of the benchmark private.
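
For the dynamic-components idea, even templated items with randomized surface details go a long way, since memorizing one leaked instance no longer helps. A minimal Python sketch (the names and the question template are made up for illustration):

```python
import random

# Hypothetical "dynamic" benchmark item: names and numbers are
# randomized on every evaluation run, so a provider that trained on a
# leaked copy of one instance gains nothing on the next draw.
NAMES = ["Ada", "Bao", "Chidi", "Dara"]

def make_item(rng: random.Random) -> dict:
    name = rng.choice(NAMES)
    apples = rng.randint(3, 50)
    eaten = rng.randint(1, apples - 1)
    question = f"{name} has {apples} apples and eats {eaten}. How many are left?"
    return {"question": question, "answer": str(apples - eaten)}

if __name__ == "__main__":
    rng = random.Random()  # fresh seed per evaluation run
    item = make_item(rng)
    print(item["question"], "->", item["answer"])
```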

A common pattern is for benchmark owners to hold back X% of their set so they can independently validate that models perform similarly on the held-back portion. See: the FrontierMath / OpenAI brouhaha.
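
The hold-back check itself is cheap to run. A rough Python sketch, assuming some `score_model` callable (your evaluation harness) that returns accuracy in [0, 1]:

```python
import random

def split_holdout(items: list, holdout_frac: float = 0.2, seed: int = 0):
    # Keep a random fraction of the set private; publish the rest.
    rng = random.Random(seed)
    shuffled = items[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - holdout_frac))
    return shuffled[:cut], shuffled[cut:]  # (public, held-back)

def looks_contaminated(score_model, items, max_gap: float = 0.05) -> bool:
    # A model that trained on the public split tends to score noticeably
    # higher there than on the held-back split of the same benchmark.
    public, private = split_holdout(items)
    gap = score_model(public) - score_model(private)
    return gap > max_gap
```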