Comment by optimalsolver

6 months ago

It's been found [0] that slightly varying Putnam problems causes a 30% drop in o1-Preview accuracy, but that hasn't put a dent in OAI's hype.
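
For concreteness, "slightly varying" here means surface-level changes: rename a variable, tweak the constants, keep the underlying problem intact. A toy sketch of the idea in Python (my own illustration, not the paper's actual pipeline; the regex heuristics are assumptions):

```python
import random
import re

def vary_problem(problem: str, seed: int = 0) -> str:
    """Return a surface-level variation of a math problem.

    Illustrative toy only: renames the variable 'x' and bumps small
    integer constants, so a memorized answer no longer matches while
    the problem's difficulty is essentially unchanged.
    """
    rng = random.Random(seed)
    fresh = rng.choice(["t", "u", "w", "y", "z"])
    varied = re.sub(r"\bx\b", fresh, problem)

    def bump(m: re.Match) -> str:
        # Perturb each integer constant by a small random amount.
        return str(int(m.group()) + rng.choice([1, 2, 3]))

    return re.sub(r"\b\d+\b", bump, varied)

print(vary_problem("Find all real x such that x**2 - 5*x + 6 = 0.", seed=42))
# A model that memorized the original's answers {2, 3} is now wrong;
# a model that can actually solve the problem shouldn't care.
```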

There's absolutely no comeuppance for juicing benchmarks, especially ones no one has access to. If performance of o3 doesn't meet expectations, there'll be plenty of people making excuses for it ("You're prompting it wrong!", "That's just not its domain!").

[0] https://openreview.net/forum?id=YXnwlZe0yf&noteId=yrsGpHd0Sf

> If performance of o3 doesn't meet expectations, there'll be plenty of people making excuses for it

I agree, and I can definitely see that happening. But given the incentives and the impact of this technology, it's also quite possible that some other company or community will create another FrontierMath-like benchmark to cross-validate the results.

I also won't rule out that OpenAI faked these results. Time will tell.