People on here were mocking me openly when I pointed out that you can't be sure LLMs (or any AIs) are actually smart unless you CAN PROVE that the question you're asking isn't in the training set (or adjacent like in this case).
So with this in mind now, let me repeat: Unless you know that the question AND/OR answer are not in the training set or adjacent, do not claim that the AI or similar black box is smart.
I ran a test yesterday on ChatGPT and Copilot: I first asked whether it knew of a specific paper, which it confirmed, and then asked it to derive simple results from it, which it was completely incapable of doing. I know this paper is not widely referenced (i.e. few known results in the public domain), but it has been available for over 15 years, with publicly accessible code written by humans. The training set was so sparse that it had no ability to "understand", or even regurgitate, anything past the summary text, which it listed almost verbatim.
> Our contract specifically prevented us from disclosing information about the funding source and the fact that OpenAI has data access to much but not all of the dataset.
(1) Companies will probably increasingly invest in building their own evals for their own use cases, because it's becoming clear that public (and allegedly private) benchmarks have incentives misaligned by the labs sponsoring them or cheating on them.
(2) Those evals will probably be proprietary "IP", guarded as closely as the code or research itself.
(3) Conversely, public benchmarks are exhausted, and SOMEONE has to invest in funding more frontier benchmarks. So this is probably going to continue.
This kind of thing is so avoidable by anyone who has not sold their soul. The answer is: if a company wants you to do a deal but requires as a condition that you not reveal to anyone that you are doing a deal with that company, you just say no. It's that simple.
My guess is that OpenAI didn't cheat as blatantly as just training on the test set. If they had, surely they could have gotten themselves an even higher mark than 25%. But I do buy the comment that they soft-cheated by using elements of the dataset for validation (which is absolutely still a form of data leakage). Even so, I suspect their reported number is roughly legit, because they report numbers on many benchmarks, and they have a good track record of those numbers holding up to private test sets.
What's much more concerning to me than the integrity of the benchmark number is the general pattern of behavior here from OpenAI and Epoch. We shouldn't accept secretly (even secret to the people doing the creation!) funding the creation of a benchmark. I also don't see how we can trust in the integrity of EpochAI going forward. This is basically their only meaningful output, and this is how they handled it?
Elon definitely still has a grudge against Altman and OpenAI, so when Elon uses his new political power to bludgeon OpenAI to bankruptcy with new regulations and lawsuits, it won't be for the right reasons, but I'll still think Altman and the remaining employees deserve it.
Many of these evals are quite easy to game. Often the actual evaluation part of benchmarking is left up to a good-faith actor, which was usually reasonable in academic settings less polluted by capital. AI labs, however, have disincentives to do a thorough or impartial job, so IMO we should never take their word for it. To verify, we need to be able to run these evals ourselves – this is only sometimes possible, since even when the datasets are public, the exact mechanisms of evaluation are not. In the long run, to be completely resilient to gaming via training, we probably need to follow suit of other fields and have third-party, non-profit, accredited (!!) evaluators whose entire premise is to evaluate, red-team, and generally keep AI safe and competent.
I have been taking a course in AI policy, and o1 and the FrontierMath dataset have been important examples for me when emphasizing the world we are moving toward. It is incredibly sad to learn about the conflict of interest here. For those more knowledgeable: can you explain in plain words whether this revelation compromises OpenAI's claims regarding o3's performance on FrontierMath problems?
It's worse than just an undeclared conflict of interest. They gave OpenAI all questions and solutions behind the scenes. It's hard to chalk this up to only naivete. This is a "sorry you caught me" moment.
They have an oral agreement that OpenAI won't use the benchmark in training. Which means first and foremost you have to consider the possibility that they broke that oral agreement and actually included the problems in the training set. Even if they didn't, the fact that they had the problems means they could have selectively chosen the training set data to specialize in solving that class of problem, while still technically keeping the verbal agreement.
So, yeah, the benchmark needs to be treated as essentially worthless at this point.
If OpenAI wanted the questions/solutions, there is going to be a reason for that. This data is not sitting in an unopened folder on Sam's computer.
There are a lot of ways you can use data to improve a model without directly training on it. A train/test validation loop, for example. Or as a wellspring for synthetic data generation. But all of these ways involve some level of data contamination, it's unavoidable.
It's increasingly odd to see HN activity that assumes the premise: if the latest benchmark results involved a benchmark that can be shown to have any data that OpenAI could have accessed, then the benchmark results were intentionally faked.
Last time this confused a bunch of people who didn't understand what test vs. train data meant, and it resulted in a particular luminary complaining on Twitter, to much guffaws, about how troubling the situation was.
Literally every comment currently, modulo [1], assumes this and then goes several steps further, and a majority are wildly misusing terms that have precise meanings, which explains at least part of their confusion.
[1] modulo the one saying this is irrelevant because we'll know if it's bad when it comes out; which, to be fair, doesn't help us with the narrower suspicion that the FrontierMath results are invalid because the model trained on (most of) the solutions.
Why wouldn't OpenAI cheat? It's an open secret in industry that benchmarks are trained on. Everybody does it, so you need to do that or else your similarly performing model will look worse on paper.
And even if they respect the agreement, using the test set as a validation set can be a huge advantage. That's why validation set and test set are two different terms with precise meanings.
As for "knowing it's bad", most people won't be able to tell a model scoring 25% and 10% apart. People who are using these models to solve math problems are tiny share of users and even tinier share of revenues. What OpenAI needs is to convince investors that there is still progress in capabilities going at high pace, and gaming the benchmarks makes perfect sense in this context. 25% was surprising and appeared to surpass expectations, which is exactly what OpenAI needs.
> Why wouldn't OpenAI cheat? It's an open secret in industry that benchmarks are trained on. Everybody does it, so you need to do that or else your similarly performing model will look worse on paper.
This starts with a fallacious appeal to cynicism combined with an unsubstantiated claim about widespread misconduct. The "everybody does it" argument is a classic rationalization that doesn't actually justify anything. It also misunderstands the reputational and technical stakes - major labs face intense scrutiny of their methods and results, and there's plenty of incestuous movement between labs and plenty of leaks.
> And even they respect the agreement, even using test set as a validation set can be a huge advantage. That's why validation set and test set are two different terms with precise meaning.
This part accidentally stumbles into a valid point about ML methodology while completely missing why it matters. Yes, validation and test sets serve different purposes - that's precisely why reputable labs maintain strict separations between them. The implication that this basic principle somehow proves misconduct is backwards logic.
> People who are using these models to solve math problems are tiny share of users and even tinier share of revenues.
This reveals a fundamental misunderstanding of why math capabilities matter. They're not primarily about serving math users - they're a key metric for abstract reasoning and systematic problem-solving abilities. This is basic ML evaluation theory.
> What OpenAI needs is to convince investors that there is still progress in capabilities going at high pace, and gaming the benchmarks makes perfect sense in this context. 25% was surprising and appeared to surpass expectations, which is exactly what OpenAI needs.
This concludes with pure speculation presented as fact, combined with a conspiracy theory that lacks any actual evidence. It also displays a shallow understanding of how technical due diligence works in major AI investments - investors at this level typically have deep technical expertise, access to extensive testing and validation, and most damningly, given the reductive appeal to incentive structure:
They closed the big round weeks before.
The whole comment reads like someone who has picked up some ML terminology but lacks fundamental understanding of how research evaluation, technical accountability, and institutional incentives actually work in the field. The dismissive tone and casual accusations of misconduct don't help their credibility either.
OpenAI continues to muddy the benchmarks, while Claude continues to improve their intelligence. Claude will win long term. It'd be wise to not rely on OpenAI at all. They are the first comers who will just burn cash and crash out I suspect.
The problem is, any benchmark on a closed model couldn’t be private even in theory, as the model has to be called to run the benchmark, exposing the contents to whoever owns the model thereafter.
HN loves to speculate that OpenAI is some big scam whose seeming ascendance is based on deceptive marketing hype, but o1, to anyone who has tried it seriously is undoubtedly very much within the ballpark of what OpenAI claims it is able to do. If everything they are doing really is just overfitting and gaming the tests, that discrepancy will eventually catch up to them, and people will stop using the APIs and chatgpt
They should at least clarify it. The reason they don’t I feel is simply for the hype and mystique.
There are ways you could game the benchmark without adding it to the training set. By repeatedly evaluating on the dataset itself, it regresses into a validation set, not a test set, even in a black-box setting: you can simply evaluate 100 checkpoints, pick the one that performs best, and rinse and repeat.
I still believe o3 is the real deal, but this gimmick kind of sours my appetite a bit toward those who run the company.
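A minimal sketch of the checkpoint-selection dynamic described above, with purely made-up numbers (nothing here comes from the actual benchmark):

```python
import random

random.seed(0)

TRUE_PASS_RATE = 0.10   # what an unbiased, single-shot eval would show
EVAL_NOISE = 0.03       # run-to-run noise on a small benchmark

def eval_on_benchmark(checkpoint_id: int) -> float:
    """Stand-in for one full eval run: true capability plus noise."""
    return TRUE_PASS_RATE + random.gauss(0, EVAL_NOISE)

# Evaluate 100 checkpoints on the *same* "test" set and report the best one.
scores = {i: eval_on_benchmark(i) for i in range(100)}
best = max(scores, key=scores.get)

print(f"average over checkpoints  : {sum(scores.values()) / len(scores):.3f}")
print(f"reported (best checkpoint): {scores[best]:.3f}")
# The reported number is inflated by selection alone: no checkpoint was ever
# trained on the benchmark, yet the set now behaves like a validation set.
```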
Even if OpenAI does not use these materials to directly train its models, OpenAI can collect or construct more data based on the knowledge points and test points of these questions to gain an unfair competitive advantage.
It's like before the Gaokao, a teacher reads some of the Gaokao questions and then marks the test points in the book for you. This is cheating.
This isn't news, the other popular benchmarks are just as gamed and worthless, it would be shocking if this one wasn't. The other frontier model providers game them just as hard, it's not an OpenAI thing. Any benchmark that a provider themselves mentions is not worth the pixels its written on.
> if they used it in training it should be 100% hit.
Not necessarily, no.
A statistical model will attempt to minimise overall loss, generally speaking.
If it gets 100% accuracy on the training data it's usually an overfit. (Hugging the data points too tightly, thereby failing to predict real life cases)
You are mostly right, but seeing almost perfectly reconstructed images from training sets, it's obvious a model -can- memorize samples. In this case it would reproduce the answers too close to the originals to be just 'accidental'; should be easy to test.
My guess is the samples could be used to find a good-enough stopping point for the o1/o3 models, which is then hardcoded.
which should really be “we now know how to improve associative reasoning but we still need to cheat when it comes to math because the bottom line is that the models can only capture logic associatively, not synthesize deductively, which is what’s needed for math beyond recipe-based reasoning"
“… we have a verbal agreement that these materials will not be used in model training”
Ha ha ha. Even written agreements are routinely violated as long as the potential upside > downside, and all you have is verbal agreement? And you didn’t disclose this?
At the time o3 was released I wrote “this is so impressive that it brings out the pessimist in me”[0], thinking perhaps they were routing API calls to human workers.
Now we see in reality I should’ve been more cynical, as they had access to the benchmark data but verbally agreed (wink wink) not to train on it.
[0: https://news.ycombinator.com/threads?id=agnosticmantis#42476... ]
You can still game a test set without training on it, that’s why you usually have a validation set and a test set that you ideally seldom use. Routinely running an evaluation on the test set can get the humans in the loop to overfit the data
OpenAI doesn't respect copyright so why would they let a verbal agreement get in the way of billion$
Can someone explain to me how they can simply not respect copyright and get away with it? Also, is this a uniquely OpenAI problem, or is it also true of the other LLM makers?
Why do HN commenters want OpenAI to be considered in violation of copyright here? Ok, so imagine you get your wish. Now all the big tech companies enter into billion dollar contracts with each other along with more traditional companies to get access to training data. So we close off the possibility of open development of AI even further. Every tech company with user-generated content over the last 20 years or so is sitting on a treasure trove now.
I’d prefer we go the other direction where something like archive.org archives all publicly accessible content and the government manages this, keeps it up-to-date, and gives cheap access to all of the data to anyone on request. That’s much more “democratizing” than further locking down training data to big companies.
OpenAI's benchmark results looking like Musk's Path of Exile character..
This has me curious about ARC-AGI.
Would it have been possible for OpenAI to have gamed ARC-AGI by seeing the first few examples and then quickly mechanical turking a training set, fine tuning their model, then proceeding with the rest of the evaluation?
Are there other tricks they could have pulled?
It feels like unless a model is being deployed to an impartial evaluator's completely air gapped machine, there's a ton of room for shenanigans, dishonesty, and outright cheating.
> This has me curious about ARC-AGI
In the o3 announcement video, the president of ARC Prize said they'd be partnering with OpenAI to develop the next benchmark.
> mechanical turking a training set, fine tuning their model
You don't need mechanical turking here. You can use an LLM to generate a lot more data that's similar to the official training data, and then you can train on that. It sounds like "pulling yourself up by your bootstraps", but isn't. An approach to do this has been published, and it seems to be scaling very well with the amount of such generated training data (They won the 1st paper award)
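A hedged sketch of that bootstrapping idea in Python; `call_llm` and `make_synthetic_tasks` are hypothetical names, and the published approach is considerably more involved than this:

```python
import json

def call_llm(prompt: str) -> str:
    # Hypothetical stand-in for whatever LLM completion API is used.
    raise NotImplementedError("plug in an actual client here")

def make_synthetic_tasks(official_train_tasks: list, n_new: int) -> list:
    """Ask an LLM for new tasks in the same format as the official training split."""
    few_shot = json.dumps(official_train_tasks[:3])
    prompt = (
        "Here are example ARC-style tasks as JSON grids:\n"
        f"{few_shot}\n"
        f"Invent {n_new} new tasks in the same JSON format, each using a "
        "different underlying transformation rule. Return a JSON list."
    )
    return json.loads(call_llm(prompt))

# The generated tasks are then used for fine-tuning. Only the public
# training split is ever shown to the generator, never the test set.
```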
> OpenAI to have gamed ARC-AGI by seeing the first few examples
Not just a few examples: o3 was evaluated on the "semi-private" test set, which had previously been used for evaluating OAI models, so OAI already had access to it for a long time.
In their benchmark, they have a "tuned" tag attached to their o3 result. I guess we need them to tell us exactly what it means before we can gauge it.
Why would they use the materials in model training? It would defeat the purpose of having a benchmarking set
Compare:
"O3 performs spectacularly on a very hard dataset that was independently developed and that OpenAI does not have access to."
"O3 performs spectacularly on a very hard dataset that was developed for OpenAI and that only OpenAI has access to."
Or let's put it another way: If what they care about is benchmark integrity, what reason would they have for demanding access to the benchmark dataset and hiding the fact that they finance it? The obvious thing to do if integrity is your goal is to fund it, declare that you will not touch it, and be transparent about it.
If you’re a research lab then yes.
If you’re a for profit company trying to raise funding and fend off skepticism that your models really aren’t that much better than any one else’s, then…
It would be dishonest, but as long as no one found out until after you closed your funding round, there’s plenty of reason you might do this.
It comes down to caring about benchmarks and integrity or caring about piles of money.
Judge for yourself which one they chose.
Perhaps they didn’t train on it.
Who knows?
It’s fair to be skeptical though, under the circumstances.
>perhaps they were routing API calls to human workers
Honest question, did they?
How would that even work? Aren’t the responses to the API equally fast as the Web interface? Can any human write a response with the speed of an LLM?
verbal agreement ... that's just saying that you're a little dumb or you're playing dumb cause you're in on it.
Not used in model training probably means it was used in model validation.
A co-founder of Epoch left a note in the comments:
> We acknowledge that OpenAI does have access to a large fraction of FrontierMath problems and solutions, with the exception of a unseen-by-OpenAI hold-out set that enables us to independently verify model capabilities. However, we have a verbal agreement that these materials will not be used in model training.
Ouch. A verbal agreement. As the saying goes, those aren't worth the paper they're written on, and that's doubly true when you're dealing with someone with a reputation like Altman's.
And aside from the obvious flaw in it being a verbal agreement, there are many ways in which OpenAI could technically comply with this agreement while still gaining a massive unfair advantage on the benchmarks to the point of rendering them meaningless. For just one example, knowing the benchmark questions can help you select training data that is tailored to excelling at the benchmarks without technically including the actual question in the training data.
What's even more suspicious is that these tweets from Elliot Glazer indicate that they are still "developing" the hold-out set, even though elsewhere Epoch AI strongly implied this already existed: https://xcancel.com/ElliotGlazer/status/1880809468616950187
It seems to me that o3's 25% benchmark score is 100% data contamination.
> I just saw Sam Altman speak at YCNYC and I was impressed. I have never actually met him or heard him speak before Monday, but one of his stories really stuck out and went something like this:
> "We were trying to get a big client for weeks, and they said no and went with a competitor. The competitor already had a terms sheet from the company were we trying to sign up. It was real serious.
> We were devastated, but we decided to fly down and sit in their lobby until they would meet with us. So they finally let us talk to them after most of the day.
> We then had a few more meetings, and the company wanted to come visit our offices so they could make sure we were a 'real' company. At that time, we were only 5 guys. So we hired a bunch of our college friends to 'work' for us for the day so we could look larger than we actually were. It worked, and we got the contract."
> I think the reason why PG respects Sam so much is he is charismatic, resourceful, and just overall seems like a genuine person.
https://news.ycombinator.com/item?id=3048944
This was my assumption all along.
> What's even more suspicious is that these tweets from Elliot Glazer indicate that they are still "developing" the hold-out set,
There is nothing suspicious about this and the wording seems to be incorrect.
A hold-out set is a percentage of the overall data that is used to test a model. It is just not trained on it. Model developers normally have full access to it.
There is nothing inherently wrong with training on a full/partial hold out set. It just means you have done a different split to train again.
The confusion I see here is that people are equating a hold out set to a blind set. That's a set of data to test against that the model developers (and model) cannot see.
Even so, blind sets can also go stale after a few runs, and nothing is wrong with ingesting that blind set, as long as you have a new blind set to run against.
Trying to game blind-set tests is nothing new, and it gets found out very quickly.
What I took from the original article is that the blind set is likely unbalanced, and the model answered more of the easier questions than the hard ones.
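For concreteness, a tiny illustration of the terminology under those definitions (split sizes are arbitrary assumptions):

```python
import random

problems = [f"problem-{i}" for i in range(1000)]
random.seed(42)
random.shuffle(problems)

train    = problems[:700]     # the model is trained on this
hold_out = problems[700:900]  # not trained on, but developers can see it
blind    = problems[900:]     # neither the developers nor the model ever see it

# A hold-out set can later be folded back into training by re-splitting;
# a blind set cannot, and once it has been evaluated against many times it
# goes "stale" and should be replaced with a fresh blind set.
```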
The questions are designed so that such training data is extremely limited. Tao said it was around half a dozen papers at most, sometimes. That’s not really enough to overfit on without causing other problems.
> That’s not really enough to overfit on without causing other problems.
"Causing other problems" is exactly what I'm worried about. I would not put it past OpenAI to deliberately overfit on a set of benchmarks in order to keep up the illusion that they're still progressing at the rate that the hype has come to expect, then keep the very-dangerous model under wraps for a while to avoid having to explain why it doesn't act as smart as they claimed. We still don't have access to this model (because, as with everything since GPT-2, it's "too dangerous"), so we have no way of independently verifying its utility, which means they have a window where they can claim anything they want. If they release a weaker model than claimed it can always be attributed to guardrails put in place after safety testing confirmed it was dangerous.
We'll see when the model actually becomes available, but in the meantime it's reasonable to guess that it's overfitted.
You're missing the part where 25% of the problems were representative of problems top tier undergrads would solve in competitions. Those problems are not based on material that only exists in half a dozen papers.
Tao saw the hardest problems, but there's no concrete evidence that o3 solved any of the hardest problems.
Why do people keep taking OpenAI's marketing spin at face value? This keeps happening, like when they neglected to mention that their most impressive Sora demo involved extensive manual editing/cleanup work because the studio couldn't get Sora to generate what they wanted.
https://news.ycombinator.com/item?id=40359425
It might be because (very few!) mathematicians like Terence Tao make positive remarks. I think these mathematicians should be very careful to use reproducible and controlled setups that by their nature cannot take place on GPUs in the Azure cloud.
I have nothing against scientists promoting the Coq Proof Assistant. But that's open source, can be run at home and is fully reproducible.
Keep in mind those mathematicians were kept in the dark about the funding: it is incredibly unethical to invite a coauthor to your paper and not tell where the money came from.
It's just incredibly scummy behavior: I imagine some of those mathematicians would have declined the collaboration if the funding were transparent. More so than data contamination, this makes me deeply mistrustful of Epoch AI.
Because they are completely gullible and believe almost everything that OpenAI does without questioning the results.
On each product they release, their top researchers are gradually leaving.
Everyone now knows what happens when you go against or question OpenAI after working for them, which is why you don't see any criticism and more of a cult-like worship.
Once again, "AGI" is a complete scam.
Because the models have continually matched the quality they claim.
Ex. look how much work "very few" has to do in the sibling comment. It's like saying "very few physicists [Einstein/Feynman/Witten]"
Its conveniently impossible to falsify the implication that the inverse of "very few" say not positive things. i.e. that the vast majority say negative things
You have to go through an incredible level of mental gymnastics, involving many months of gated decisions, where the route chosen involved "gee, I know this is susceptible to confirmation bias, but...", to end up wondering why people think the models are real if OpenAI has access to data that includes some set of questions.
> Because the models have continually matched the quality they claim.
That's very far from true.
"Yes, I know that the HuggingFace arena and coding assistant leaderboards both say that OpenAI's new model is really good, but in practice you should use Claude Sonnet instead" was a meme for good reason, as was "I know the benchmarks show that 4o is just as capable as ChatGPT4 but based on our internal evals it seems much worse". The latter to the extent that they had to use dark UI patterns to hide ChatGPT-4 from their users, because they kept using it, and it cost OpenAI much more than 4o.
OpenAI regularly messes with benchmarks to keep the investor money flowing. Slightly varying the wording of benchmark problems causes a 30% drop in o1 accuracy. That doesn't mean "LLMs don't work" but it does mean that you have to be very sceptical of OpenAI benchmark results when comparing them to other AI labs, and this has been the case for a long time.
The FrontierMath case just shows that they are willing to go much farther with their dishonesty than most people thought.
> Tamay from Epoch AI here. We made a mistake in not being more transparent about OpenAI's involvement. We were restricted from disclosing the partnership until around the time o3 launched, and in hindsight we should have negotiated harder for the ability to be transparent to the benchmark contributors as soon as possible. Our contract specifically prevented us from disclosing information about the funding source and the fact that OpenAI has data access to much but not all of the dataset.
Not sure if "integrity of the benchmarks" should even be something that you negotiate over, what's the value of the benchmark if the results cannot be trusted because of undisclosed relationships and sharing of data? Why would they be restricted from disclosing stuff you normally disclose, and how doesn't that raise all sorts of warning flags when proposed even?
>OpenAI has data access to much but not all of the dataset
Their head mathematician says they have the full dataset, except a holdout set which they're currently developing (i.e. doesn't exist yet):
https://www.reddit.com/r/singularity/comments/1i4n0r5/commen...
Thanks for the link. A holdout set which is yet to be used to verify the 25% claim. He also says that he doesn't believe that OpenAI would self-sabotage themselves by tricking the internal benchmarking performance since this will get easily exposed, either by the results from a holdout set or by the public repeating the benchmarks themselves. Seems reasonable to me.
This feels like a done deal. This benchmark should be discarded.
A lot of the comments suggest some type of deliberate cheating on the benchmark. However, even without intentionally trying to game it, if anybody can repeatedly take the same test, they'll be nudged toward overfitting/p-hacking.
For instance, suppose they conduct an experiment and find that changing some hyper-parameter yields a 2% boost. That could just be noise, it could be a genuine small improvement, or it may be a mix of a genuine boost along with some fortunate noise. An effect may be small enough that researchers would need to rely on their gut to interpret it. Researchers may jump on noise while believing they have discovered true optimizations. Enough of these types of nudges, and some serious benchmark gains can materialize.
(Hopefully my comment isn't entirely misguided, I don't know how they actually do testing or how often they probe their test set)
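A toy simulation of that nudging effect, with invented numbers; it is not a claim about how any lab actually operates:

```python
import random

random.seed(1)

TRUE_SCORE = 0.10   # the model's real capability never changes
NOISE = 0.02        # run-to-run noise when scoring on the same test set

def measure() -> float:
    return TRUE_SCORE + random.gauss(0, NOISE)

best_seen = measure()
for _ in range(200):           # 200 hyper-parameter tweaks, all with zero true effect
    candidate = measure()
    if candidate > best_seen:  # keep any tweak that "looks" better
        best_seen = candidate

print(f"true capability: {TRUE_SCORE:.3f}")
print(f"reported score : {best_seen:.3f}")  # inflated purely by selecting on noise
```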
I cringe every time I see "my IQ increased by X points after doing Y" posts on Twitter - yes, you had a practice run on Raven's progressive matrices a month ago, that helped, these have a limited question bank and the effect of Y is marginal. That said, obviously, test taking is a skill (separate from background knowledge and both general/domain-specific ability) and should be trained if you expect to have life-altering events based on tests (i.e., do an LSAT course if you want to go to law school). Conversely, shouldn't be done if you think it will limit you through superstition ("I had a score of X, thus I can only perform around level of X+fudge factor"). For an LLM company a good test score is a valuation-altering event!
OpenAI played themselves here. Now nobody is going to take any of their results on this benchmark seriously, ever again. That o3 result has just disappeared in a poof of smoke. If they had blinded themselves properly then that wouldn't be the case.
Whereas other AI companies now have the opportunity to be first to get a significant result on FrontierMath.
I'd be surprised if any of their in-house benchmark results are taken seriously after this. As an extremely rough estimate, FrontierMath cost five to six figures to assemble [1] - so from an outside view, they clearly have no qualms with turning cash into quasi-guaranteed benchmark results.
[1]: https://epoch.ai/math-problems/submit-problem - the benchmark is comprised of "hundreds" of questions, so at the absolute lowest it cost 300 * 200 = 60,000 dollars.
Conversely, if they didn't cheat, and they funded creation of the test suite to get "clean" problems (while hiding their participation to prevent getting problems somehow tailored to be hard for LLMs specifically), then they have no reason to fear that all this looks fishy, as the test results will soon be vindicated once they give wider access to the model.
I refrain from forming a strong opinion in such situations. My intuition tells me that it's not cheating. But, well, it's intuition (probably based on my belief that the brain is nothing special physics-wise and it doesn't manage to realize unknown quantum algorithms in its warm and messy environment, so that classical computers can reproduce all of its feats when having appropriate algorithms and enough computing power. And math reasoning is just another step on a ladder of capabilities, not something that requires completely different approach). So, we'll see.
> based on my belief that the brain is nothing special physics-wise and it doesn't manage to realize unknown quantum algorithms in its warm and messy environment
Agreed (well as much as intuition goes), but current gen AI is not a brain, much less a human brain. It shows similarities, in particular emerging multi-modal pattern matching capabilities. There is nothing that says that’s all the neocortex does, in fact the opposite is a known truth in neuroscience. We just don’t know all functions yet - we can’t just ignore the massive Chesterton’s fence we don’t understand.
This isn’t even necessarily because the brain is more sophisticated than anything else; we don’t have models for the weather, the immune system, or anything chaotic, really. Look, folding proteins is still a research problem, and that’s at the level of known molecular structure. We greatly overestimate our ability to model & simulate things. Today’s AI is a prime example of our wishful thinking and glossing over ”details”.
> so that classical computers can reproduce all of its feats when having appropriate algorithms and enough computing power.
Sure. That’s a reasonable hypothesis.
> And math reasoning is just another step on a ladder of capabilities, not something that requires completely different approach
You seem to be assuming ”ability” is single axis. It’s like assuming if we get 256 bit registers computers will start making coffee, or that going to the gym will eventually give you wings. There is nothing that suggests this. In fact, if you look at emerging ability in pattern matching that improved enormously, while seeing reasoning on novel problems sitting basically still, that suggests strongly that we are looking at a multi-axis problem domain.
This risk could be mitigated by publishing the test.
Do people actually think OpenAI is gaming benchmarks?
I know they have lost trust and credibility, especially on HN. But this is a company with a giant revenue opportunity to sell products that work.
What works for enterprise is very different from “does it beat this benchmark”.
No matter how nefarious you think sama is, everything points to “build intelligence as rapidly as possible” rather than “spin our wheels messing with benchmarks”.
In fact, even if they did fully lie and game the benchmark - do you even care? As an OpenAI customer, all I care about is that the product works.
I code with o1 for hours every day, so I am very excited for o3 to be released via API. And if they trained on private datasets, I honestly don’t care. I just want to get a better coding partner until I’m irrelevant.
Final thought - why are these contractors owed a right to know where funding came from? I would definitely be proud to know I contributed to the advancement of the field of AI if I was included in this group.
Gaming benchmarks has a lot of utility for OpenAI whether their product works or not.
Many people compare models based on benchmarks. So if OpenAI can appear better than Anthropic, Google, or Meta by gaming benchmarks, it's absolutely in their interest to do so, especially if their product is only slightly behind, because evaluating model quality is very, very tricky business these days.
In particular, if there is a new benchmark, it's doubly in their interest to game it, because they know that other providers will start using and optimizing performance towards that benchmark, in order to "beat" OpenAI and win market share.
On a personal level, their model is getting beat handily by Claude Sonnet 3.5 right now. It doesn't seem to show in the benchmarks. I wonder why?
This is a company that is shedding its coat of ethics and scientific rigor -- so as to be as unencumbered as possible in its footrace to the dollar.
I used to think this, but using o1 quite a bit lately has convinced me otherwise. It's been 1-shotting the fairly non-trivial coding problems I throw at it and is good about outputting large, complete code blocks. By contrast, Claude immediately starts nagging you about hitting usage limits after a few back-and-forths, and has some kind of hack in place to start abbreviating code when conversations get too long, even when explicitly instructed to do otherwise. I would imagine that Anthropic can produce a good test-time compute model as well, but until they have something publicly available, OpenAI has stolen back the lead.
> On a personal level, their model is getting beat handily by Claude Sonnet 3.5 right now. It doesn't seem to show in the benchmarks. I wonder why?
I do use Sonnet 3.5 personally, but this "beat handily" doesn't show on LLM arena. Do OpenAI game that too?
I think “getting beat handily” is an HN bubble concept. It depends on what you're using it for, but I personally prefer 4o for coding. In enterprise usage, I think 4o is smoking 3.5 Sonnet, but that's just my perception from the folks I talk to.
Yes, it looks all but certain that OpenAI gamed this particular benchmark.
Otherwise, they would not have had a contract that prohibited revealing that OpenAI was involved with the project until after the o3 announcements were made and the market had had time to react. There is no reason to have such a specific agreement unless you plan to use the backdoor access to beat the benchmark: otherwise, OpenAI would not have known in advance that o3 would perform well! In fact, if there had been proper blinding in place (which Epoch heads confirmed was not the case), there would have been no reason for secrecy at all.
Google, xAI and Anthropic's test-time compute experiments were really underwhelming: if OpenAI has secret access to benchmarks, that explains why their performance is so different.
> Do people actually think OpenAI is gaming benchmarks?
I was blown away by the ChatGPT release and have generally admired OpenAI, but I wouldn't put it past them.
At this point their entire marketing strategy seems to be vague-posting on X/Twitter and hyping the models so that investors always feel there is something around the corner.
And I don't think they need to do that. Most investors will be throwing money at them either way, but maybe when you're looking to raise _billions_, that's not enough.
> Do people actually think OpenAI is gaming benchmarks?
Yes, they 100% do. So do their main competitors. All of them do.
> Do people actually think OpenAI is gaming benchmarks?
Yes, there's no reason not to do it, only upsides when you try to sell it to enterprises and governments.
Well, I certainly won't object if OpenAI's marketing were based on testimonials from their fanboy customers instead of rigged benchmark scores %)
Your flagrant disregard for ethics and focus on purely utilitarian aspects is so extreme that, in my view, only a few people would agree with you.
People on here were mocking me openly when I pointed out that you can't be sure LLMs (or any AIs) are actually smart unless you CAN PROVE that the question you're asking isn't in the training set (or adjacent like in this case).
So with this in mind now, let me repeat: Unless you know that the question AND/OR answer are not in the training set or adjacent, do not claim that the AI or similar black box is smart.
I ran a test yesterday on ChatGPT and Copilot: I first asked whether it knew of a specific paper (it confirmed it did), then asked it to derive some simple results from that paper, which it was completely incapable of doing. I know this paper is not widely referenced (i.e. few known results in the public domain), but it has been available for over 15 years, with publicly accessible code written by humans. The training data was so sparse that the model had no ability to "understand", or even to regurgitate anything beyond the summary text, which it listed almost verbatim.
It is known that current models have terrible sample efficiency. I've been told that it's better than I thought it was, but it still isn't good.
This all smells like the OpenAI CEO's MO. Stupid drama for stupid reasons.
It doesn't need to be smart to be useful. A lot of the kind of work I do seems to be in the training set.
I don't think the OP is talking about usefulness at all, that is on a completely different dimension I would say.
There's something gross about OpenAI constantly misleading the public.
This maneuver by their CEO will destroy FrontierMath and Epoch AI's reputation
Reminds me of the following proverb:
"The integrity of the upright guides them, but the unfaithful are destroyed by their duplicity."
(Proverbs 11:3)
> Our contract specifically prevented us from disclosing information about the funding source and the fact that OpenAI has data access to much but not all of the dataset.
Man, this is huge.
My takeaways
(1) Companies will probably increasingly invest in building their own evals for their own use cases, because it's becoming clear that public (and allegedly private) benchmarks have misaligned incentives, with labs sponsoring or cheating on them.
(2) Those evals will probably be proprietary "IP", guarded as closely as the code or research itself.
(3) Conversely, public benchmarks are exhausted, and SOMEONE has to invest in funding more frontier benchmarks. So this is probably going to continue.
So, in conclusion, any evaluation of OpenAI models on FrontierMath is now thoroughly invalidated.
I would even go so far as to say this invalidates not only FrontierMath but anything Epoch AI has touched or will touch.
An academic misjudgement like this, a massive conflict of interest plus cheating, makes you untrustworthy in an academic context.
This kind of thing is so avoidable by anyone who has not sold their soul. The answer is: if a company wants you to do a deal but requires as a condition that you not reveal to anyone that you are doing a deal with that company, you just say no. It's that simple.
My guess is that OpenAI didn't cheat as blatantly as just training on the test set. If they had, surely they could have gotten themselves an even higher mark than 25%. But I do buy the comment that they soft-cheated by using elements of the dataset for validation (which is absolutely still a form of data leakage). Even so, I suspect their reported number is roughly legit, because they report numbers on many benchmarks, and they have a good track record of those numbers holding up to private test sets.
What's much more concerning to me than the integrity of the benchmark number is the general pattern of behavior here from OpenAI and Epoch. We shouldn't accept the secret funding of a benchmark's creation (secret even from the people creating it!). I also don't see how we can trust in the integrity of Epoch AI going forward. This is basically their only meaningful output, and this is how they handled it?
> If they had, surely they could have gotten themselves an even higher mark than 25%.
there is potentially some limit to LLMs' ability to memorize such complex proofs
They aren't proofs, they're just numbers. All the questions have numerical answers. That's how they're evaluated.
Elon definitely still has a grudge against Altman and OpenAI, so when Elon uses his new political power to bludgeon OpenAI to bankruptcy with new regulations and lawsuits, it won't be for the right reasons, but I'll still think Altman and the remaining employees deserve it.
Many of these evals are quite easy to game. Often the actual evaluation part of benchmarking is left up to a good-faith actor, which was usually reasonable in academic settings less polluted by capital. AI labs, however, have disincentives to do a thorough or impartial job, so IMO we should never take their word for it. To verify, we need to be able to run these evals ourselves – this is only sometimes possible, as even if the datasets are public, the exact mechanisms of evaluation are not. In the long run, to be completely resilient to gaming via training, we probably need to follow the lead of other fields and have third-party, non-profit, accredited (!!) evaluators whose entire premise is to evaluate, red-team, and generally keep AI safe and competent.
At this point eval results presented by AI companies are a joke and should not be trusted
I have been taking a course in AI policy, and o1 and the FrontierMath dataset have been important markers for me in emphasizing the world we are moving toward. It is incredibly sad to learn about the conflict of interest here. For those more knowledgeable: can you explain in plain words whether this revelation compromises OpenAI's claims regarding o3's performance on the FrontierMath problems?
It's worse than just an undeclared conflict of interest. They gave OpenAI all questions and solutions behind the scenes. It's hard to chalk this up to only naivete. This is a "sorry you caught me" moment.
They have an oral agreement that OpenAI won't use the benchmark in training. Which means first and foremost you have to consider the possibility that they broke that oral agreement and actually included the problems in the training set. Even if they didn't, the fact that they had the problems means they could have selectively chosen the training set data to specialize in solving that class of problem, while still technically keeping the verbal agreement.
So, yeah, the benchmark needs to be treated as essentially worthless at this point.
If OpenAI wanted the questions/solutions, there is going to be a reason for that. This data is not sitting in an unopened folder on Sam's computer.
There are a lot of ways you can use data to improve a model without directly training on it. A train/test validation loop, for example. Or as a wellspring for synthetic data generation. But all of these ways involve some level of data contamination, it's unavoidable.
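A rough sketch of the synthetic-data route, where `llm_generate`, the prompt, and the numbers are purely hypothetical stand-ins: the held-out problems never enter training directly, yet they steer what the model is trained on.

```python
# Hypothetical sketch only: the benchmark problems are never trained on,
# but they seed the generation of look-alike training data.

def llm_generate(prompt: str) -> str:
    """Stand-in for a call to a strong LLM that writes a problem + solution."""
    return f"<synthetic problem derived from a prompt of {len(prompt)} chars>"

def build_synthetic_set(benchmark_problems: list[str], n_variants: int = 50) -> list[str]:
    """Use each held-out problem as a template for new, similar problems."""
    synthetic = []
    for statement in benchmark_problems:
        prompt = (
            "Write a new competition-style math problem exercising the same "
            "ideas and difficulty as the following, with a full solution:\n"
            + statement
        )
        synthetic.extend(llm_generate(prompt) for _ in range(n_variants))
    return synthetic

demo = build_synthetic_set(["Prove that ...", "Let n be ..."], n_variants=2)
print(len(demo), "synthetic problems generated")

# A model fine-tuned on `synthetic` was "not trained on the test set",
# yet the test set has clearly shaped its training distribution.
```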
Related https://news.ycombinator.com/item?id=42761648
It's increasingly odd to see HN activity that assumes the premise: if the latest benchmark results involve a benchmark that can be shown to contain any data OpenAI could have accessed, then the benchmark results were intentionally faked.
Last time around, this confused a bunch of people who didn't understand the difference between test and train data, and it resulted in a particular luminary complaining on Twitter, to much guffawing, about how troubling the situation was.
Literally every comment currently, modulo [1], assumes this and then goes several steps further, and a majority are wildly misusing terms with precise meanings, which explains at least part of their confusion.
[1] i.e. the one saying this is irrelevant because we'll know if it's bad when it comes out, which, to be fair, doesn't actually address the narrow suspicion that the FrontierMath results are invalid because the model trained on (most of) the solutions.
Why wouldn't OpenAI cheat? It's an open secret in industry that benchmarks are trained on. Everybody does it, so you need to do that or else your similarly performing model will look worse on paper.
And even if they respect the agreement, using the test set as a validation set can be a huge advantage. That's why "validation set" and "test set" are two different terms with precise meanings.
As for "knowing it's bad", most people won't be able to tell a model scoring 25% and 10% apart. People who are using these models to solve math problems are tiny share of users and even tinier share of revenues. What OpenAI needs is to convince investors that there is still progress in capabilities going at high pace, and gaming the benchmarks makes perfect sense in this context. 25% was surprising and appeared to surpass expectations, which is exactly what OpenAI needs.
> Why wouldn't OpenAI cheat? It's an open secret in industry that benchmarks are trained on. Everybody does it, so you need to do that or else your similarly performing model will look worse on paper.
This starts with a fallacious appeal to cynicism combined with an unsubstantiated claim about widespread misconduct. The "everybody does it" argument is a classic rationalization that doesn't actually justify anything. It also misunderstands the reputational and technical stakes - major labs face intense scrutiny of their methods and results, and there's plenty of incestuous movement between labs and plenty of leaks.
> And even if they respect the agreement, using the test set as a validation set can be a huge advantage. That's why "validation set" and "test set" are two different terms with precise meanings.
This part accidentally stumbles into a valid point about ML methodology while completely missing why it matters. Yes, validation and test sets serve different purposes - that's precisely why reputable labs maintain strict separations between them. The implication that this basic principle somehow proves misconduct is backwards logic.
> People who are using these models to solve math problems are tiny share of users and even tinier share of revenues.
This reveals a fundamental misunderstanding of why math capabilities matter. They're not primarily about serving math users - they're a key metric for abstract reasoning and systematic problem-solving abilities. This is basic ML evaluation theory.
> What OpenAI needs is to convince investors that there is still progress in capabilities going at high pace, and gaming the benchmarks makes perfect sense in this context. 25% was surprising and appeared to surpass expectations, which is exactly what OpenAI needs.
This concludes with pure speculation presented as fact, combined with a conspiracy theory that lacks any actual evidence. It also displays a shallow understanding of how technical due diligence works in major AI investments - investors at this level typically have deep technical expertise, access to extensive testing and validation, and most damningly, given the reductive appeal to incentive structure:
They closed the big round weeks before.
The whole comment reads like someone who has picked up some ML terminology but lacks fundamental understanding of how research evaluation, technical accountability, and institutional incentives actually work in the field. The dismissive tone and casual accusations of misconduct don't help their credibility either.
Tim Gowers, one of the Fields medallists who contributed problems to the benchmark dataset, isn't happy about being misled about OpenAI's involvement. He retweeted this: https://x.com/Mihonarium/status/1880944026603376865?t=QN3i_X...
OpenAI continues to muddy the benchmarks, while Anthropic continues to improve Claude's intelligence. Claude will win long term. It'd be wise not to rely on OpenAI at all. They are the first movers who will just burn cash and crash out, I suspect.
The problem is, any benchmark on a closed model couldn’t be private even in theory, as the model has to be called to run the benchmark, exposing the contents to whoever owns the model thereafter.
HN loves to speculate that OpenAI is some big scam whose seeming ascendance is based on deceptive marketing hype, but o1, to anyone who has tried it seriously, is undoubtedly very much within the ballpark of what OpenAI claims it is able to do. If everything they are doing really is just overfitting and gaming the tests, that discrepancy will eventually catch up with them, and people will stop using the APIs and ChatGPT.
They should at least clarify it. The reason they don’t I feel is simply for the hype and mystique.
There are ways you could game the benchmark without adding it to the training set. By repeatedly evaluating on the dataset itself, it regresses into a validation set rather than a test set, even in a black-box setting: you can simply evaluate 100 checkpoints, pick the one that performs best, and rinse and repeat.
I still believe o3 is the real deal, BUT this gimmick kind of sours my appetite a bit toward those who run the company.
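As a rough illustration of how much selection alone can inflate a score, here is a tiny simulation. The numbers are assumptions: a 300-problem benchmark and 100 checkpoints that all share the same true 15% solve rate, so any gap is pure selection noise.

```python
import numpy as np

rng = np.random.default_rng(0)

n_problems = 300     # assumed benchmark size
true_rate = 0.15     # assumed true solve rate of every checkpoint
n_checkpoints = 100  # all evaluated on the same fixed "test" set

# Each evaluation is noisy: the observed score fluctuates around the same
# underlying capability purely by chance.
scores = rng.binomial(n_problems, true_rate, size=n_checkpoints) / n_problems

print(f"true solve rate:      {true_rate:.1%}")
print(f"mean observed score:  {scores.mean():.1%}")
print(f"best-of-{n_checkpoints} score:     {scores.max():.1%}")
# Reporting the checkpoint with the best test score reports the max, not the
# mean -- the benchmark has quietly become a validation set.
```

Under these assumptions the best-of-100 score typically lands several points above the true 15%, without any training on the data at all.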
So basically when you need to look good in benchmarks you fund an organization that does benchmarks in which you look good.
Just like toothpaste manufacturers fund dentist's associations etc.
Unrelated to anything but what software is this blog running on? I love the sidenote feature.
Why does it have a customer service popover chat assistant?
The Lightcone Infrastructure forum stack. I don't know why it has an assistant.
Even if OpenAI does not use these materials to train its models directly, it can collect or construct more data based on the knowledge and skills these questions test, gaining an unfair competitive advantage. It's as if, before the Gaokao, a teacher read some of the Gaokao questions and then marked the relevant topics in your textbook for you. This is cheating.
I wonder if more companies should open source their eval model outputs alongside the eval results
We tried doing that here at Skyvern (eval.skyvern.com)
This isn't news; the other popular benchmarks are just as gamed and worthless, and it would be shocking if this one weren't. The other frontier model providers game them just as hard; it's not an OpenAI thing. Any benchmark that a provider itself mentions is not worth the pixels it's written on.
Unless you have been up to the shoulders in the hype-hole of Scam Altman's backside this should not come as the slightest surprise.
“… we have a verbal agreement that these materials will not be used in model training”
What about model testing before releasing it?
so it was overfit
If they had used it in training, it should have been a 100% hit. Most likely they used it to validate and tune parameters.
> If they had used it in training, it should have been a 100% hit.
Not necessarily, no.
A statistical model will attempt to minimise overall loss, generally speaking.
If it gets 100% accuracy on the training data, it's usually overfitting (hugging the data points too tightly, and thereby failing to predict real-life cases).
You are mostly right, but seeing almost perfectly reconstructed images from training sets, it's obvious a model *can* memorize samples. In that case it would reproduce answers too close to the originals to be just 'accidental'. Should be easy to test.
My guess is the samples could be used to find a good-enough stopping point for the o1 and o3 models, which is then hardcoded.
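A minimal sketch of that kind of stopping-point selection; all functions and numbers below are hypothetical stand-ins, not anything OpenAI is known to run.

```python
import copy
import random

# Everything below is a hypothetical stand-in; only the selection logic matters.

def train_one_epoch(model, data):
    model["step"] += 1  # pretend training nudges the model

def evaluate_on_benchmark(model, benchmark):
    return 0.15 + random.gauss(0, 0.02)  # pretend noisy benchmark score

def select_checkpoint(train_data, benchmark, max_epochs=50, patience=3):
    model = {"step": 0}
    best_score, best_state, stale = -1.0, None, 0
    for _ in range(max_epochs):
        train_one_epoch(model, train_data)
        score = evaluate_on_benchmark(model, benchmark)
        if score > best_score:
            best_score, best_state, stale = score, copy.deepcopy(model), 0
        else:
            stale += 1
            if stale >= patience:
                break  # stop where the held-out benchmark score peaked
    return best_state, best_score

if __name__ == "__main__":
    state, score = select_checkpoint(train_data=None, benchmark=None)
    print(f"shipped checkpoint {state} with benchmark score {score:.1%}")

# The benchmark never enters a gradient update, yet it decides which
# checkpoint ships: leakage via model selection rather than via training.
```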
Had they let it hit 100%, it would have been obvious they had the data.
They've surely been careful to avoid that, by only using a portion of it or some other technique.
This don’t really matter much because if the models suck when it comes out evals mean nothing next time
“we now know how to build AGI” --Sam Altman.
which should really be “we now know how to improve associative reasoning but we still need to cheat when it comes to math because the bottom line is that the models can only capture logic associatively, not synthesize deductively, which is what’s needed for math beyond recipe-based reasoning"