Comment by saithound
6 months ago
> Because the models have continually matched the quality they claim.
That's very far from true.
"Yes, I know that the HuggingFace arena and coding assistant leaderboards both say that OpenAI's new model is really good, but in practice you should use Claude Sonnet instead" was a meme for good reason, as was "I know the benchmarks show that 4o is just as capable as ChatGPT4 but based on our internal evals it seems much worse". The latter to the extent that they had to use dark UI patterns to hide ChatGPT-4 from their users, because they kept using it, and it cost OpenAI much more than 4o.
OpenAI regularly messes with benchmarks to keep the investor money flowing. Slightly varying the wording of benchmark problems causes a 30% drop in o1's accuracy. That doesn't mean "LLMs don't work", but it does mean you have to be very sceptical of OpenAI's benchmark results when comparing them against other AI labs', and this has been the case for a long time.
The FrontierMath case just shows that they are willing to go much further with their dishonesty than most people thought.