Comment by aithrowawaycomm
6 months ago
What's even more suspicious is that these tweets from Elliot Glazer indicate that they are still "developing" the hold-out set, even though elsewhere Epoch AI strongly implied this already existed: https://xcancel.com/ElliotGlazer/status/1880809468616950187
It seems to me that o3's 25% benchmark score is 100% data contamination.
> I just saw Sam Altman speak at YCNYC and I was impressed. I have never actually met him or heard him speak before Monday, but one of his stories really stuck out and went something like this:
> "We were trying to get a big client for weeks, and they said no and went with a competitor. The competitor already had a term sheet from the company we were trying to sign up. It was real serious.
> We were devastated, but we decided to fly down and sit in their lobby until they would meet with us. So they finally let us talk to them after most of the day.
> We then had a few more meetings, and the company wanted to come visit our offices so they could make sure we were a 'real' company. At that time, we were only 5 guys. So we hired a bunch of our college friends to 'work' for us for the day so we could look larger than we actually were. It worked, and we got the contract."
> I think the reason why PG respects Sam so much is he is charismatic, resourceful, and just overall seems like a genuine person.
https://news.ycombinator.com/item?id=3048944
Man, the real ugliness is the comments hooting and hollering for this amoral cynicism:

> This sort of "adjusting the truth" is widespread in business. It's not OK, but people should not be shocked by this.
> Also, if marks want to be so gullible, it's on them. It's your money and YOUR due diligence.

Gross. Nothing says genuine like lying to get a contract.
This was my assumption all along.
> What's even more suspicious is that these tweets from Elliot Glazer indicate that they are still "developing" the hold-out set,
There is nothing suspicious about this and the wording seems to be incorrect.
A hold-out set is a portion of the overall data that is set aside to test a model; the model is simply not trained on it. Model developers normally have full access to it.
There is nothing inherently wrong with later training on a full or partial hold-out set. It just means you make a different split and train again.
The confusion I see here is that people are equating a hold-out set with a blind set: a set of test data that the model developers (and the model) cannot see.
Even so, blind sets can also go stale after a few runs, and there is nothing wrong with ingesting a stale blind set, as long as you have a fresh blind set to run against.
Trying to game blind-set tests is nothing new, and it gets found out very quickly.
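The distinction above can be sketched in a few lines of Python. This is a minimal illustration of the terminology only; the function and names are hypothetical, not from any Epoch AI or benchmark code:

```python
import random

def split_dataset(data, holdout_frac=0.2, seed=42):
    """Shuffle and split data into a training set and a hold-out set.

    The hold-out set is withheld from training, but the developers can
    see it -- and may later re-split the pool and train on it, which is
    the "nothing inherently wrong" case described above.
    """
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - holdout_frac))
    return shuffled[:cut], shuffled[cut:]

examples = list(range(100))
train, holdout = split_dataset(examples)
print(len(train), len(holdout))  # 80 20

# A blind set, by contrast, never enters this pipeline at all: it is
# held by a third party, and the developers only ever see aggregate
# scores, never the items themselves.
```

The point of the sketch is that a hold-out set is defined by *when* it is used (not during training), while a blind set is defined by *who* can see it (nobody on the model side).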
What I took from the original article is that the blind set is likely unbalanced, and the model answered more of the easier questions than the hard ones.
> The confusion I see here is that people are equating a hold-out set with a blind set: a set of test data that the model developers (and the model) cannot see.
What on earth? This is from Tamay Besiroglu at Epoch:
So this "confusion" is because Epoch AI specifically told people it was a blind set! Despite the condescending tone, your comment is just plain wrong.
Your quote literally says hold-out set.