Comment by fzzzy
1 day ago
Even if it is a joke, having a consistent methodology is useful. I did something similar for about a year with my own private benchmark of reasoning-type questions, which I applied to each new open model that came out. Run it once and you get a random sample of performance. Got unlucky, or got lucky? So what - that's the experimental protocol. Running things a bunch of times and cherry-picking the best ones adds human bias and complicates the steps.
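To make the cherry-picking point concrete, here is a minimal simulation sketch (Python, with made-up numbers; TRUE_SCORE, NOISE, and N_RUNS are hypothetical parameters, not anything from the benchmark itself). A single run gives an unbiased estimate of a model's typical performance, while reporting the best of N runs is systematically inflated:

    import random

    random.seed(0)

    TRUE_SCORE = 0.60   # assumed underlying pass rate of the model
    NOISE = 0.10        # assumed run-to-run variation
    N_RUNS = 10         # how many runs a cherry-picker gets to choose from
    TRIALS = 10_000     # Monte Carlo repetitions

    def one_run():
        """One benchmark run: the true score plus symmetric noise."""
        return TRUE_SCORE + random.uniform(-NOISE, NOISE)

    single = [one_run() for _ in range(TRIALS)]
    best_of_n = [max(one_run() for _ in range(N_RUNS)) for _ in range(TRIALS)]

    print(f"single-run mean:    {sum(single) / TRIALS:.3f}")     # close to TRUE_SCORE
    print(f"best-of-{N_RUNS} mean:     {sum(best_of_n) / TRIALS:.3f}")  # biased upward

Under this toy noise model the single-run average lands near the true score, while the best-of-N average sits noticeably above it - which is the bias the run-it-once protocol avoids.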
It wasn't until I put these slides together that I realized quite how well my joke benchmark correlates with actual model performance - the "better" models genuinely do appear to draw better pelicans and I don't really understand why!
How did the pelicans from the point releases of V3 and R1 (R1-0528) compare to those from the original versions of the models?
LLMs also have a 'g factor' https://www.sciencedirect.com/science/article/pii/S016028962...
Well, the most likely single random sample would be a “representative” one :)
Until they start targeting this benchmark.
Right, that was the closing joke for the talk.
I imagine the straightforward reason is that the “better” models are in fact significantly smarter in some tangible way, somehow.
I just don't get the fuss from the pro-LLM people who don't want anyone to shame their LLMs...
People expect LLMs to say "correct" stuff on the first attempt, not the 10,000th attempt.
Yet these same people are perfectly OK with cherry-picked success stories on YouTube and in advertisements, while being extremely vehement about this simple experiment...
...well maybe these people rode the LLM hype-train too early, and are desperate to defend LLMs lest their investment go poof?
obligatory hype-graph classic: https://upload.wikimedia.org/wikipedia/commons/thumb/9/94/Ga...