
Comment by sillysaurusx

6 months ago

The questions are designed so that such training data is extremely limited. Tao said the relevant literature sometimes amounted to around half a dozen papers at most. That’s not really enough to overfit on without causing other problems.

> That’s not really enough to overfit on without causing other problems.

"Causing other problems" is exactly what I'm worried about. I would not put it past OpenAI to deliberately overfit on a set of benchmarks in order to keep up the illusion that they're still progressing at the rate that the hype has come to expect, then keep the very-dangerous model under wraps for a while to avoid having to explain why it doesn't act as smart as they claimed. We still don't have access to this model (because, as with everything since GPT-2, it's "too dangerous"), so we have no way of independently verifying its utility, which means they have a window where they can claim anything they want. If they release a weaker model than claimed it can always be attributed to guardrails put in place after safety testing confirmed it was dangerous.

We'll see when the model actually becomes available, but in the meantime it's reasonable to guess that it's overfitted.

You're missing the part where 25% of the problems were representative of what top-tier undergrads would solve in competitions. Those problems are not based on material that exists in only half a dozen papers.

Tao saw the hardest problems, but there's no concrete evidence that o3 solved any of the hardest problems.