Comment by fumblebee
3 days ago
If indeed, as the new benchmarks suggest, this is the new "top dog" of models, why is the launch feeling a little flat?
For comparison, the Claude 4 Hacker News post received > 2k upvotes: https://news.ycombinator.com/item?id=44063703
Upvotes are a lagging indicator. Despite all the leaderboard scores presented, no one actually knows how good a model is until they use it for a while. When Claude 4 got ~2k upvotes, it was because everyone had realized that Claude 3.7 was such a good model in practice - it had little to do with the actual performance of 4.
Other AI companies post a 5-minute article to read. This is a 50-minute-long video; many won't bother to watch it.
Because the benchmarks are likely gamed. Also, Grok had an extremely negative news cycle right before this, so the average bloke is skeptical that the smartest AI in the world thinks the last name Steinberg means someone is a shadowy, evil, cabal-type figure. Even though the two aren't totally related, most people aren't deep enough in the weeds to know that.
It's a shame this model is performing so well, because I can't in good conscience pay money to Elon Musk. I'll just have to wait for the other labs to do their thing.
I think it's a shame that your emotions get in your way so much. It's an illusion to think you can assess Elon at his true worth; it's like an AI hallucinating due to lack of context.
You misspelled "principles".
Psychopath.
I'm not sure there's any benchmark score that'd make me use a model that suddenly starts talking about racist conspiracy theories unprompted. Doubly so for anything intended for production use.
Nobody believes Elon anymore.
Hm, impartial benchmarks are independent of Elon's claims?
Impartial benchmarks are great, unless (1) you have so many to choose from that you can game them (which is still true even if the benchmark makers themselves are absolutely beyond reproach), or (2) there's a difference between what you're testing and what you care about.
Goodhart's Law means (2) is approximately always true.
As it happens, we also have a lot of AI benchmarks to choose from.
Unfortunately this means every model basically has a vibe score right now, as the real independent tests are rapidly saturated into the "ooh shiny" region of the graph. Even the people working on e.g. the ARC-AGI benchmark don't think their own test is the last word.
Likely they trained on the test sets. Grok 3 had similarly remarkable benchmark scores but fell flat in real use.
"impartial" how? Do you have the training data, are you auditing to make sure they're not few-shotting the benchmarks?
The latest independent benchmark results consistently output "HEIL HITLER!"
[dead]
[flagged]
You can use a “formula” and make Excel write offensive stuff too.
Nobody would claim an Excel spreadsheet is anything close to intelligent, though.
[flagged]
Maligning any viewpoint that differs from yours as coming from indoctrinated people following “marching orders”, rather than addressing the substance of the critique, constitutes a “poisoning the well” fallacy.
The substance being?
[flagged]
Probably more like: Claude was slightly better than GPT-xx when the IDE integrations first got widely adopted (this was also the time when there was another scandal about Altman/OpenAI on the front page of HN every other week), so most programmers preferred Claude. It then got into a virtuous cycle: Claude received the most coding-related user queries, became the best coding model among the SOTA models, and that led to the current situation.