Comment by Imnimo
6 months ago
My guess is that OpenAI didn't cheat as blatantly as just training on the test set. If they had, surely they could have gotten themselves an even higher mark than 25%. But I do buy the comment that they soft-cheated by using elements of the dataset for validation (which is absolutely still a form of data leakage). Even so, I suspect their reported number is roughly legit, because they report numbers on many benchmarks, and they have a good track record of those numbers holding up to private test sets.
What's much more concerning to me than the integrity of the benchmark number is the general pattern of behavior here from OpenAI and Epoch. We shouldn't accept the secret funding of a benchmark's creation (secret even from the people creating it!). I also don't see how we can trust the integrity of EpochAI going forward. This is basically their only meaningful output, and this is how they handled it?
> If they had, surely they could have gotten themselves an even higher mark than 25%.
There is potentially some limit to LLMs' ability to memorize such complex proofs.
They aren't proofs; they're just numbers. All the questions have numerical answers. That's how they're evaluated.
I think those reasoning models are smart enough not to emit a memorized answer if they can't come up with a CoT to support it.
But OAI could have reported any result; no one was checking. They probably just weren't brave enough to declare math a solved topic.