Comment by ripped_britches
6 months ago
Do people actually think OpenAI is gaming benchmarks?
I know they have lost trust and credibility, especially on HN. But this is a company with a giant revenue opportunity to sell products that work.
What works for enterprise is very different from “does it beat this benchmark”.
No matter how nefarious you think sama is, everything points to “build intelligence as rapidly as possible” rather than “spin our wheels messing with benchmarks”.
In fact, even if they did fully lie and game the benchmark - do you even care? As an OpenAI customer, all I care about is that the product works.
I code with o1 for hours every day, so I am very excited for o3 to be released via API. And if they trained on private datasets, I honestly don’t care. I just want to get a better coding partner until I’m irrelevant.
Final thought - why are these contractors owed the right to know where the funding came from? I would definitely be proud to know I had contributed to the advancement of the field of AI if I were included in this group.
Gaming benchmarks has a lot of utility for OpenAI whether their product works or not.
Many people compare models based on benchmarks. So if OpenAI can appear better than Anthropic, Google, or Meta by gaming benchmarks, it's absolutely in their interest to do so, especially if their product is only slightly behind, because evaluating model quality is very tricky business these days.
In particular, if there is a new benchmark, it's doubly in their interest to game it, because they know that other providers will start using and optimizing performance towards that benchmark, in order to "beat" OpenAI and win market share.
On a personal level, their model is getting beat handily by Claude Sonnet 3.5 right now. It doesn't seem to show in the benchmarks. I wonder why?
This is a company that is shedding its coat of ethics and scientific rigor so as to be as unencumbered as possible in its footrace to the dollar.
I used to think this, but using o1 quite a bit lately has convinced me otherwise. It’s been 1-shotting the fairly non-trivial coding problems I throw at it and is good about outputting large, complete code blocks. By contrast, Claude immediately starts nagging you about hitting usage limits after a few back-and-forths, and it has some kind of hack in place that abbreviates code once conversations get too long, even when explicitly instructed not to. I would imagine that Anthropic can produce a good test-time compute model as well, but until they have something publicly available, OpenAI has stolen back the lead.
"Their model" here is referring to 4o as o1 is unviable for many production usecases due to latency.
> On a personal level, their model is getting beat handily by Claude Sonnet 3.5 right now. It doesn't seem to show in the benchmarks. I wonder why?
I do use Sonnet 3.5 personally, but this "beat handily" doesn't show up on LLM arena. Does OpenAI game that too?
I think “getting beat handily” is an HN bubble concept. It depends on what you’re using it for, but I personally prefer 4o for coding. In enterprise usage, I think 4o is smoking 3.5 Sonnet, but that’s just my perception from the folks I talk to.
I don't think that's true, you'll get the same sentiment ("Sonnet 3.5 is much better than GPT4/GPT4o [for coding]") pretty uniformly across Reddit/HN/Lobsters. I would strongly agree with it in my own testing, although o1 might be much better (I'm too poor to give it a fair shake.)
> In enterprise usage, I think 4o is smoking 3.5 Sonnet
True. I'm not sure how many enterprise solutions have given their users an opportunity to test Claude vs. GPT. Most people just use whatever LLM API their software integrates with.
This just isn't accurate, on the overwhelming majority of real-world tasks (>90%) 3.5 Sonnet beats 4o. FWIW I've spoken with a friend who's at OpenAI and they fully agree in private.
Yes, it looks all but certain that OpenAI gamed this particular benchmark.
Otherwise, they would not have had a contract that prohibited revealing OpenAI's involvement with the project until after the o3 announcements were made and the market had time to react. There is no reason to have such a specific agreement unless you plan to use the backdoor access to beat the benchmark: otherwise, OpenAI would not have known in advance that o3 would perform well! In fact, if there had been proper blinding in place (which Epoch's heads confirmed was not the case), there would have been no reason for secrecy at all.
Google, xAI and Anthropic's test-time compute experiments were really underwhelming: if OpenAI has secret access to benchmarks, that explains why their performance is so different.
> Do people actually think OpenAI is gaming benchmarks?
I was blown away by the ChatGPT release and have generally admired OpenAI; however, I wouldn't put it past them.
At this point their entire marketing strategy seems to be vague posting on X/Twitter and keeping the models hyped so that investors always feel there is something around the corner.
And I don't think they need to do that. Most investors will be throwing money at them either way, but maybe when you are looking to raise _billions_ that's not enough.
> Do people actually think OpenAI is gaming benchmarks?
Yes, they 100% do. So do their main competitors. All of them do.
> Do people actually think OpenAI is gaming benchmarks?
Yes, there's no reason not to do it, only upsides when you try to sell it to enterprises and governments.
Well, I certainly won't object if OpenAI's marketing were based on testimonials from their fanboy customers instead of rigged benchmark scores %)
Your flagrant disregard for ethics and focus on utilitarian aspects are so extreme that, in my view, only a few people would agree with you.