Comment by mlsu
6 months ago
Gaming benchmarks has a lot of utility for OpenAI whether or not their product works.
Many people compare models based on benchmarks. So if OpenAI can appear better than Anthropic, Google, or Meta by gaming benchmarks, it's absolutely in their interest to do so, especially if their product is only slightly behind, because evaluating model quality is very tricky business these days.
In particular, if there is a new benchmark, it's doubly in their interest to game it, because they know that other providers will start using and optimizing performance towards that benchmark, in order to "beat" OpenAI and win market share.
On a personal level, their model is getting beat handily by Claude Sonnet 3.5 right now. It doesn't seem to show in the benchmarks. I wonder why?
This is a company shedding its coat of ethics and scientific rigor so as to be as unencumbered as possible in its footrace to the dollar.
I used to think this, but using o1 quite a bit lately has convinced me otherwise. It's been 1-shotting the fairly non-trivial coding problems I throw at it and is good about outputting large, complete code blocks. By contrast, Claude immediately starts nagging you about hitting usage limits after a few back-and-forths, and has some kind of hack in place to start abbreviating code when conversations get too long, even when explicitly instructed not to. I would imagine that Anthropic can produce a good test-time compute model as well, but until they have something publicly available, OpenAI has stolen back the lead.
"Their model" here is referring to 4o as o1 is unviable for many production usecases due to latency.
> On a personal level, their model is getting beat handily by Claude Sonnet 3.5 right now. It doesn't seem to show in the benchmarks. I wonder why?
I do use Sonnet 3.5 personally, but this "beat handily" doesn't show up on LLM Arena. Does OpenAI game that too?
I think "getting beat handily" is an HN bubble concept. It depends on what you're using it for, but I personally prefer 4o for coding. In enterprise usage, I think 4o is smoking 3.5 Sonnet, but that's just my perception from the folks I talk to.
I don't think that's true, you'll get the same sentiment ("Sonnet 3.5 is much better than GPT4/GPT4o [for coding]") pretty uniformly across Reddit/HN/Lobsters. I would strongly agree with it in my own testing, although o1 might be much better (I'm too poor to give it a fair shake.)
> In enterprise usage, I think 4o is smoking 3.5 Sonnet
True. I'm not sure how many enterprise solutions have given their users an opportunity to test Claude vs. GPT. Most people just use whatever LLM API their software integrates with.
This just isn't accurate: on the overwhelming majority of real-world tasks (>90%), 3.5 Sonnet beats 4o. FWIW, I've spoken with a friend who's at OpenAI and they fully agree in private.