Comment by rvnx

3 hours ago

The problem is that the benchmark is known in advance. Take Humanity's Last Exam, for example: it's much easier to optimize your model when you've seen the questions before.

From https://lastexam.ai/: "The dataset consists of 2,500 challenging questions across over a hundred subjects. We publicly release these questions, while maintaining a *private test set of held out questions* to assess model overfitting." [emphasis mine]

While the private questions don't seem to be included in the performance results, HLE can presumably flag any LLM that appears to have gamed its scores based on the differential performance between the public and private questions. Since they haven't flagged anyone yet, I think the scores are relatively trustworthy.
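To make the differential check concrete, here's a minimal sketch of what it might look like. HLE hasn't published its actual flagging methodology, so the scores, threshold, and function names below are all hypothetical, for illustration only:

```python
# Sketch of the public-vs-private differential check described above.
# All data and the threshold are hypothetical; HLE has not published
# its exact flagging methodology.

def overfit_gap(public_acc: float, private_acc: float) -> float:
    """Gap between accuracy on released questions and held-out questions.

    A model that merely generalizes should score similarly on both sets;
    a large positive gap suggests the public questions leaked into training.
    """
    return public_acc - private_acc

# Hypothetical scores for illustration only.
models = {
    "model_a": (0.26, 0.25),  # small gap: consistent with honest performance
    "model_b": (0.41, 0.24),  # large gap: candidate for a contamination flag
}

GAP_THRESHOLD = 0.05  # assumed cutoff, not HLE's actual number

for name, (public_acc, private_acc) in models.items():
    gap = overfit_gap(public_acc, private_acc)
    flagged = gap > GAP_THRESHOLD
    print(f"{name}: public={public_acc:.2f} private={private_acc:.2f} "
          f"gap={gap:+.2f} flagged={flagged}")
```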

This. A lot of boosters point to benchmarks as justification for their claims, but any gamer who has spent time in the benchmark trenches knows full well that vendors game known tests for better scores, and that those scores aren't necessarily indicative of superior performance. There's not a doubt in my mind that AI companies are doing the same.

It's the other way around too: HLE questions were selected adversarially to reduce the scores. I'd guess that even if the questions were never released, the scores would still improve as new training data was introduced.

Shouldn't we expect that all of the companies are doing this optimization, though? So we're back to a level playing field.