
Comment by slacktivism123

2 months ago

Fascinating case showing how LLM promoters will happily take "verified" benchmarks at their word.

It's easy to publish "$NEWMODEL received an X% bump in SWE-Bench Verified!!!!".

Proper research means interrogating the traces, like these researchers did (the Gist shows Claude 4 Sonnet): https://gist.github.com/jacobkahn/bd77c69d34040a9e9b10d56baa...

Commentary: https://x.com/bwasti/status/1963288443452051582, https://x.com/tmkadamcz/status/1963996138044096969

The best benchmark is the community vibe in the weeks following a release.

Claude benchmarks poorly but vibes well. Gemini benchmarks well and vibes well. Grok benchmarks well but vibes poorly.

(Yes, I know this is gushing with anecdotes; the vibes are simply the approximate shade of gray that emerges from countless black-and-white remarks.)

  • > The best benchmark is the community vibe in the weeks following a release.

    True, just be careful which community you use as a vibe check. Most of the mainstream/big ones around AI and LLMs have influence campaigns run against them and are made up of giant hive-minds that all think alike; you need to carefully assess whether anything you're reading is true, and votes tend to make it even worse.

Yes, often you see huge gains on some benchmark, and then the model is run through Aider's polyglot benchmark and doesn't even hit 60%.