Comment by Ocha

3 days ago

Nobody believes Elon anymore.

7 comments

Ocha

Hm, impartial benchmarks are independent of Elon's claims?

ben_w 3 days ago
Impartial benchmarks are great, unless (1) you have so many to choose from that you can game them (which is still true even if the benchmark makers themselves are absolutely beyond reproach), or (2) there's a difference between what you're testing and what you care about.
Goodhart's Law means (2) is approximately always true.
As it happens, we also have a lot of AI benchmarks to choose from, so (1) applies as well.
Unfortunately this means every model basically has a vibe score right now, as the real independent tests are rapidly saturated into the "ooh shiny" region of the graph. Even the people working on e.g. the ARC-AGI benchmark don't think their own test is the last word.
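
Point (1) above can be illustrated with a small simulation. This is only a sketch under stated assumptions: two hypothetical models with identical true skill, per-benchmark scores modelled as independent Gaussian noise, and a vendor who reports only the benchmark with the largest lead. The function name `best_of_k_gap` and all parameters are made up for illustration and do not describe any real benchmark suite.

```python
import random


def best_of_k_gap(num_benchmarks, trials=2000, noise=1.0, seed=0):
    """Average of the largest apparent lead of model A over model B
    across `num_benchmarks` benchmarks, when both models have the same
    true skill and every score is pure noise (an assumption)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        # Score gap on each benchmark: A's noisy score minus B's noisy score.
        gaps = [
            rng.gauss(0, noise) - rng.gauss(0, noise)
            for _ in range(num_benchmarks)
        ]
        # Cherry-picking: report only the benchmark where A looks best.
        total += max(gaps)
    return total / trials


single = best_of_k_gap(1)   # one fixed benchmark, no choice
many = best_of_k_gap(20)    # pick the best of 20 benchmarks

print(f"average reported lead with 1 benchmark:   {single:.2f}")
print(f"average reported lead with 20 benchmarks: {many:.2f}")
```

With a single fixed benchmark the average lead hovers around zero, while picking the best of many produces a sizeable apparent lead even though neither the models nor the benchmark makers did anything wrong.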
- irthomasthomas 3 days ago
  
  It's also possible they trained on test.
irthomasthomas 3 days ago

Likely they trained on test. Grok 3 had similarly remarkable benchmark scores but fell flat in real use.
bigyabai 3 days ago

"Impartial" how? Do you have the training data? Are you auditing to make sure they're not few-shotting the benchmarks?
DonHopkins 3 days ago

The latest independent benchmark results consistently output "HEIL HITLER!"