Comment by davebren

3 hours ago

This exploiting of benchmarks isn't that interesting to me since it would be obvious. The main way I assume they're gaming the benchmarks is by creating training data that closely matches the test data, even for ARC where the test data is secret.

3 comments

davebren

jmalicki 3 hours ago

They said they used things like submitted a `conftest.py` - e.g. what would be considered very blatant cheating, not just overfitting/benchmaxxing. Did you read the AI slop in the post?

This is basically a paper about security exploits for the benchmarks. This isn't benchmark hacking like having hand coded hot paths for a microbenchmarks, this is hacking like modifying the benchmark computation code itself at runtime.

davebren 3 hours ago

I get it, but why would anyone trust what these companies say about their model performance anyway. Everyone can see for themselves how well they complete whatever tasks they're interested in.