Comment by fumblebee
3 days ago
If indeed, as the new benchmarks suggest, this is the new "top dog" of models, why is the launch feeling a little flat?
For comparison, the Claude 4 Hacker News post received > 2k upvotes: https://news.ycombinator.com/item?id=44063703
Upvotes are a lagging indicator. Despite all the leaderboard scores presented, no one actually knows how good a model is until they use it for a while. When Claude 4 got ~2k upvotes, it was because everyone had realized that Claude 3.7 was such a good model in practice - it had little to do with the actual performance of 4.
Other AI companies post a 5-minute article to read. This is a 50-minute-long video; many won't bother to watch it.
Because the benchmarks are likely gamed. Also, Grok had an extremely negative news cycle right before this, so the average bloke is skeptical that the smartest AI in the world thinks the last name Steinberg means someone is a shadowy, evil, cabal-type figure. Even though the two aren't totally related, most people aren't deep enough in the weeds to know that.
It's a shame this model is performing so well, because I can't in good conscience pay money to Elon Musk. I'll just have to wait for the other labs to do their thing.
I think it's a shame that your emotions get in your way so much. It's an illusion to think you can assess Elon at his true worth; it's like an AI hallucinating due to lack of context.
You misspelled "principles".
Psychopath.
I'm not sure there's any benchmark score that'd make me use a model that suddenly starts talking about racist conspiracy theories unprompted. Doubly so for anything intended for production use.
Nobody believes Elon anymore.
Hm, impartial benchmarks are independent of Elon's claims?
Impartial benchmarks are great, unless (1) you have so many to choose from that you can game them (which is still true even if the benchmark makers themselves are absolutely beyond reproach), or (2) there's a difference between what you're testing and what you care about.
Goodhart's Law means (2) is approximately always true.
As it happens, we also have a lot of AI benchmarks to choose from.
Unfortunately this means every model basically has a vibe score right now, as the real independent tests are rapidly saturated into the "ooh shiny" region of the graph. Even the people working on e.g. the ARC-AGI benchmark don't think their own test is the last word.
Likely they trained on the test sets. Grok 3 had similarly remarkable benchmark scores but fell flat in real use.
"impartial" how? Do you have the training data, are you auditing to make sure they're not few-shotting the benchmarks?
The latest independent benchmark results consistently output "HEIL HITLER!"
[dead]
[flagged]
You can use a “formula” and make Excel write offensive stuff too.
Nobody would claim an Excel spreadsheet is anything close to intelligent, though.
[flagged]
Maligning any viewpoint that differs from yours as coming from indoctrinated people following “marching orders”, rather than addressing the substance of the critique, constitutes a “poisoning the well” fallacy.
The substance being?
[flagged]
Probably more like: Claude was slightly better than GPT-xx when the IDE integrations first got widely adopted (this was also the time when there was another scandal about Altman/OpenAI on the front page of HN every other week), so most programmers preferred Claude. It then got into a virtuous cycle: Claude received the most coding-related user queries, became the best coding model among the SOTA models, and that led to the current situation.