Comment by nathan_phoenix

1 day ago

My biggest gripe is that he's comparing probabilistic models (LLMs) by a single sample.

You wouldn't compare different random number generators by taking one sample from each and then concluding that generator 5 generates the highest numbers...

Would be nicer to run the comparison with 10 images (or more) for each LLM and then average.
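Something along these lines would do it - a minimal sketch, where generate and judge stand in for whatever model call and scoring step you'd actually plug in:

    import statistics
    from typing import Callable

    def average_score(generate: Callable[[str], str],
                      judge: Callable[[str], float],
                      prompt: str,
                      n_samples: int = 10) -> float:
        # Run the same prompt n_samples times and average the judge's scores,
        # instead of trusting a single draw from the model.
        scores = [judge(generate(prompt)) for _ in range(n_samples)]
        return statistics.mean(scores)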

It might not be 100% clear from the writing but this benchmark is mainly intended as a joke - I built a talk around it because it's a great way to make the last six months of model releases a lot more entertaining.

I've been considering an expanded version of this where each model outputs ten images, then a vision model helps pick the "best" of those to represent that model in a further competition with other models.

(Then I would also expand the judging panel to three vision LLMs from different model families which vote on each round... partly because it will be interesting to track cases where the judges disagree.)

I'm not sure if it's worth me doing that though since the whole "benchmark" is pretty silly. I'm on the fence.
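If I did build it, the plumbing would be small - roughly this, as a sketch (pick_best and the judge callables are placeholders for real vision-model calls, not anything I've actually written):

    from collections import Counter
    from typing import Callable, Sequence

    def best_of_n(images: Sequence[str],
                  pick_best: Callable[[Sequence[str]], int]) -> str:
        # A vision model picks the strongest of one model's attempts;
        # that image goes forward to represent the model.
        return images[pick_best(images)]

    def run_round(image_a: str, image_b: str,
                  judges: Sequence[Callable[[str, str], str]]) -> str:
        # Each judge returns "a" or "b"; the majority wins the round.
        votes = Counter(judge(image_a, image_b) for judge in judges)
        return votes.most_common(1)[0][0]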

  • I'd say definitely do not do that. That would make the benchmark look more serious while still being problematic for knowledge cutoff reasons. Your prompt has become popular even outside your blog, so the odds of some SVG pelicans on bicycles making it into the training data have been going up and up.

    Karpathy used it as an example in a recent interview: https://www.msn.com/en-in/health/other/ai-expert-asks-grok-3...

    • Yeah, this is the problem with benchmarks where the questions/problems are public. They're valuable for some months, until it bleeds into the training set. I'm certain a lot of the "improvements" we're seeing are just benchmarks leaking into the training set.

    • I’d say it doesn’t really matter. There is no universally good benchmark and really they should only be used to answer very specific questions which may or may not be relevant to you.

      Also, as the old saying goes, the only thing worse than using benchmarks is not using benchmarks.

    • Yeah, Simon needs to release a new benchmark under a pen name, like Stephen King did with Richard Bachman.

  • Even if it is a joke, having a consistent methodology is useful. I did it for about a year with my own private benchmark of reasoning-type questions that I always applied to each new open model that came out. Run it once and you get a random sample of performance. Got unlucky, or got lucky? So what - that's the experimental protocol. Running things a bunch of times and cherry-picking the best ones adds human bias and complicates the steps.

    • It wasn't until I put these slides together that I realized quite how well my joke benchmark correlates with actual model performance - the "better" models genuinely do appear to draw better pelicans and I don't really understand why!

  • Joke or not, it still correlates much better with my own subjective experiences of the models than LM Arena!

  • Very nice talk, accessible to the general public and to AI agents alike.

    Any concerns that open “AI celebrity talks” like yours could be used in contexts that would allow LLMs to optimize their market share in ways we can't imagine yet?

    Your talk might influence the funding of AI startups.

    #butterflyEffect

    • I welcome a VC-funded pelican … anything! Clippy 2.0, maybe?

      Simon, hope you are comfortable in your new role of AI Celebrity.

And by a sample that has become increasingly well known as a benchmark. Newer training data will contain more articles like this one, which naturally improves an LLM's ability to estimate what counts as a good "pelican on a bike".

  • Would it though? There really aren't that many valid answers to that question online. When this is talked about, we get more broken samples than reasonable ones. I feel like any talk about this actually sabotages future training a bit.

    I actually don't think I've seen a single correct svg drawing for that prompt.

  • So what you really need to do is clone this blog post, find and replace pelican with any other noun, run all the tests, and publish that.

    Call it wikipediaslop.org

You are right, but the companies making these models invest a lot of effort in marketing them as anything but probabilistic, i.e. making people think that these models work deterministically, like humans.

In that case we'd expect a human with perfect drawing skills and perfect knowledge about bikes and birds to output such a simple drawing correctly 100% of the time.

In any case, even if a model is probabilistic, if it had correctly learned the relevant knowledge you'd expect the output to be perfect, because producing it would lower the model's loss. These outputs clearly indicate flawed knowledge.

My biggest gripe is that he outsourced evaluation of the pelicans to another LLM.

I get that it was way easier to do, and that it cost pennies and took no time. But I would have loved it if he'd tried alternative methods of judging and seen what the results were.

Other ways:

* wisdom of the crowds (have people vote on it)

* wisdom of the experts (send the pelican images to a few dozen artists or ornithologists)

* wisdom of the LLMs (use more than one LLM)

It would have been neat to see what the human consensus was and whether it differed from the LLM consensus.
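Tallying each pool separately is only a few lines - something like this sketch, where the vote lists are made-up placeholders rather than real data:

    from collections import Counter

    def top_pick(votes: list[str]) -> str:
        # Whichever image gets the most votes from a given judging pool.
        return Counter(votes).most_common(1)[0][0]

    # Placeholder votes: one entry per judge, naming their preferred image.
    crowd_votes  = ["pelican_a.svg", "pelican_b.svg", "pelican_a.svg"]
    expert_votes = ["pelican_b.svg", "pelican_b.svg", "pelican_a.svg"]
    llm_votes    = ["pelican_b.svg", "pelican_b.svg", "pelican_b.svg"]

    for pool, votes in [("crowd", crowd_votes),
                        ("experts", expert_votes),
                        ("LLMs", llm_votes)]:
        print(pool, "->", top_pick(votes))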

Anyway, great talk!

  • It would have been interesting to see if the LLM that Claude judged worst would have attempted to justify itself...

I think you mean non-deterministic, instead of probabilistic.

And there is no reason that these models need to be non-deterministic.

  • A deterministic algorithm can still be unpredictable in a sense. In the extreme case, a procedural generator (like in Minecraft) is deterministic given a seed, but you will still have trouble predicting what you get if you change the seed, because internally it uses a (pseudo-)random number generator.

    So there’s still the question of how controllable the LLM really is. If you change a prompt slightly, how unpredictable is the change? That can’t be tested with one prompt.
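    A toy version of the distinction, using nothing but Python's standard PRNG:

        import random

        def first_value(seed: int) -> float:
            # Fully deterministic: the same seed always yields the same number.
            return random.Random(seed).random()

        print(first_value(42) == first_value(42))  # True, run after run
        print(first_value(42), first_value(43))    # adjacent seeds, unrelated-looking outputs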

  • > I think you mean non-deterministic, instead of probabilistic.

    My thoughts too. It's more accurate to label LLMs as non-deterministic instead of "probabilistic".