Comment by lolinder

6 months ago

A co-founder of Epoch left a note in the comments:

> We acknowledge that OpenAI does have access to a large fraction of FrontierMath problems and solutions, with the exception of a unseen-by-OpenAI hold-out set that enables us to independently verify model capabilities. However, we have a verbal agreement that these materials will not be used in model training.

Ouch. A verbal agreement. As the saying goes, those aren't worth the paper they're written on, and that's doubly true when you're dealing with someone with a reputation like Altman's.

And aside from the obvious flaw in it being a verbal agreement, there are many ways in which OpenAI could technically comply with this agreement while still gaining a massive unfair advantage on the benchmarks to the point of rendering them meaningless. For just one example, knowing the benchmark questions can help you select training data that is tailored to excelling at the benchmarks without technically including the actual question in the training data.
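
To make this concrete, here is a purely hypothetical sketch (toy data and illustrative names, not a claim about what any lab actually does) of how knowing the benchmark questions could be used to rank and select training data without those questions ever appearing in it:

  # Purely hypothetical sketch (toy data, illustrative names): rank candidate
  # training documents by similarity to the benchmark questions and keep the
  # most benchmark-adjacent ones, without the questions themselves ever
  # appearing in the training set.
  from sklearn.feature_extraction.text import TfidfVectorizer
  from sklearn.metrics.pairwise import cosine_similarity

  benchmark_questions = [
      "Compute the number of rational points on a given genus-2 curve.",
      "Determine the asymptotic growth rate of a certain recurrence.",
  ]
  candidate_corpus = [
      "A survey of counting rational points on curves over finite fields.",
      "A blog post about sourdough baking.",
      "Lecture notes on asymptotic analysis of generating functions.",
  ]

  vectorizer = TfidfVectorizer().fit(benchmark_questions + candidate_corpus)
  question_vecs = vectorizer.transform(benchmark_questions)
  candidate_vecs = vectorizer.transform(candidate_corpus)

  # Score each candidate document by its closest benchmark question.
  scores = cosine_similarity(candidate_vecs, question_vecs).max(axis=1)
  for score, doc in sorted(zip(scores, candidate_corpus), reverse=True):
      print(f"{score:.2f}  {doc}")
  # The benchmark-adjacent documents rank highest and would be favored for
  # training; the benchmark questions themselves never enter the training set.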

What's even more suspicious is that these tweets from Elliot Glazer indicate that they are still "developing" the hold-out set, even though elsewhere Epoch AI strongly implied this already existed: https://xcancel.com/ElliotGlazer/status/1880809468616950187

It seems to me that o3's 25% benchmark score is 100% data contamination.

  • > I just saw Sam Altman speak at YCNYC and I was impressed. I have never actually met him or heard him speak before Monday, but one of his stories really stuck out and went something like this:

    > "We were trying to get a big client for weeks, and they said no and went with a competitor. The competitor already had a terms sheet from the company were we trying to sign up. It was real serious.

    > We were devastated, but we decided to fly down and sit in their lobby until they would meet with us. So they finally let us talk to them after most of the day.

    > We then had a few more meetings, and the company wanted to come visit our offices so they could make sure we were a 'real' company. At that time, we were only 5 guys. So we hired a bunch of our college friends to 'work' for us for the day so we could look larger than we actually were. It worked, and we got the contract."

    > I think the reason why PG respects Sam so much is he is charismatic, resourceful, and just overall seems like a genuine person.

    https://news.ycombinator.com/item?id=3048944

    • Man, the real ugliness is the comments hooting and hollering for this amoral cynicism:

        Honesty is often overrated by geeks and it is very contextual
      
        He didn't misrepresent anything. They were actually working there, just only for one day.
      
        The effectiveness of deception is not mitigated by your opinions of its likability.
      

      Gross.

    • This sort of "adjusting the truth" is widespread in business. It's not OK, but people should not be shocked by this.

      Also, if marks want to be so gullible, it's on them. It's your money and YOUR due diligence.

  • > What's even more suspicious is that these tweets from Elliot Glazer indicate that they are still "developing" the hold-out set,

    There is nothing suspicious about this and the wording seems to be incorrect.

    A hold-out set is a portion of the overall data that is used to test a model; the model is simply not trained on it. Model developers normally have full access to it.

    There is nothing inherently wrong with training on a full or partial hold-out set. It just means you have done a different split before training again.

    The confusion I see here is that people are equating a hold out set to a blind set. That's a set of data to test against that the model developers (and model) cannot see.

    Even so, blind sets can also go stale after a few runs, and there is nothing wrong with ingesting that blind set, as long as you have a new one to run against.

    Trying to game blind-set tests is nothing new, and it gets found out very quickly.

    What I took from the original article is that the blind set is likely unbalanced, and the model answered more of the easier questions than the hard ones.
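
    A minimal sketch of that distinction, with toy data and illustrative names (scikit-learn's train_test_split stands in for whatever split is actually used):

      # Minimal sketch (toy data, illustrative names): a hold-out set is split
      # off by the developers themselves, so they can see it and re-split it at
      # will; a blind set is carved off and kept away from them entirely.
      from sklearn.datasets import make_classification
      from sklearn.linear_model import LogisticRegression
      from sklearn.model_selection import train_test_split

      X, y = make_classification(n_samples=1000, random_state=0)

      # Blind set: ideally never seen by the model developers at all.
      X_dev, X_blind, y_dev, y_blind = train_test_split(
          X, y, test_size=0.2, random_state=0)

      # Hold-out set: the developers' own test split. They have full access to
      # it and can fold it back in and re-split before the next training run.
      X_train, X_holdout, y_train, y_holdout = train_test_split(
          X_dev, y_dev, test_size=0.2, random_state=1)

      model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
      print("hold-out score:", model.score(X_holdout, y_holdout))
      print("blind score:   ", model.score(X_blind, y_blind))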

    • > The confusion I see here is that people are equating a hold out set to a blind set. That's a set of data to test against that the model developers (and model) cannot see.

      What on earth? This is from Tamay Besiroglu at Epoch:

        Regarding training usage: We acknowledge that OpenAI does have access to a large fraction of FrontierMath problems and solutions, with the exception of a unseen-by-OpenAI hold-out set that enables us to independently verify model capabilities. However, we have a verbal agreement that these materials will not be used in model training. 
      

      So this "confusion" is because Epoch AI specifically told people it was a blind set! Despite the condescending tone, your comment is just plain wrong.


The questions are designed so that such training data is extremely limited. Tao said it was sometimes only around half a dozen papers at most. That’s not really enough to overfit on without causing other problems.

  • > That’s not really enough to overfit on without causing other problems.

    "Causing other problems" is exactly what I'm worried about. I would not put it past OpenAI to deliberately overfit on a set of benchmarks in order to keep up the illusion that they're still progressing at the rate that the hype has come to expect, then keep the very-dangerous model under wraps for a while to avoid having to explain why it doesn't act as smart as they claimed. We still don't have access to this model (because, as with everything since GPT-2, it's "too dangerous"), so we have no way of independently verifying its utility, which means they have a window where they can claim anything they want. If they release a weaker model than claimed it can always be attributed to guardrails put in place after safety testing confirmed it was dangerous.

    We'll see when the model actually becomes available, but in the meantime it's reasonable to guess that it's overfitted.

  • You're missing the part where 25% of the problems were representative of problems top-tier undergrads would solve in competitions. Those problems are not based on material that only exists in half a dozen papers.

    Tao saw the hardest problems, but there's no concrete evidence that o3 solved any of the hardest problems.