Comment by Jimmc414
15 hours ago
We are getting to the point that it's not unreasonable to think that "Generate an SVG of a pelican riding a bicycle" could be included in some training data. It would be a great way to ensure an initial thumbs up from a prominent reviewer. It's a good benchmark, but it seems like a good idea to also include a random or unannounced similar test to catch any benchmaxxing.
I wrote about that possibility here: https://simonwillison.net/2025/Nov/13/training-for-pelicans-...
[flagged]
Whatever you think Jimmc414's _concerns_ are (they merely state a possibility), Simon enumerates a number of concerns in the linked article and then addresses them. So I'm not sure why you think this is so.
Condescending and disrespectful to whom? Everybody, wholesale? That doesn't seem reasonable. Please elaborate.
It would be easy to out models that train on the bike pelican, because they would probably suck at the kayaking bumblebee.
So far though, the models good at bike pelican are also good at kayak bumblebee, or whatever other strange combo you can come up with.
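A minimal sketch of how such a randomized spot-check could work, assuming Simon Willison's `llm` Python library is installed and configured; the word lists and model name here are illustrative, not part of any actual benchmark:

    import random

    import llm  # https://llm.datasette.io/ - assumes the library is installed

    # Illustrative word lists - the point is that the exact combination is
    # drawn at test time, so it is unlikely to be in anyone's training data.
    ANIMALS = ["pelican", "bumblebee", "walrus", "heron", "gecko"]
    ACTIVITIES = ["riding a bicycle", "paddling a kayak",
                  "pushing a wheelbarrow", "riding a unicycle"]

    def random_svg_prompt(rng):
        """Build an unannounced pelican-style prompt from random parts."""
        return f"Generate an SVG of a {rng.choice(ANIMALS)} {rng.choice(ACTIVITIES)}"

    def run_spot_check(model_id, seed=None):
        """Ask the model to draw the random scene and return its raw response."""
        rng = random.Random(seed)
        model = llm.get_model(model_id)  # model name is illustrative
        response = model.prompt(random_svg_prompt(rng))
        return response.text()

    if __name__ == "__main__":
        print(run_spot_check("gpt-4o-mini"))  # render the output to judge it

Because the combination is sampled at run time, a lab can't benchmaxx it the way it could a single fixed prompt; a model that aces the bike pelican but flubs the kayak bumblebee would stand out immediately.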
So if they are trying to benchmaxx by making SVG generation stronger, that's not really a miss, is it?
That depends on whether "SVG generation" is a particularly useful LLM/coding-model skill outside of benchmarking. I.e., if they make it stronger using parameters that might otherwise have gone to "Rust type system awareness" or some such, it could be a net loss outside the benchmarks.
I assume all of the models also have variations on "how many 'r's in strawberry".
> We are getting to the point that it's not unreasonable to think that "Generate an SVG of a pelican riding a bicycle" could be included in some training data.
I may be stupid, but _why_ is this prompt used as a benchmark? I mean, pelicans _can't_ ride a bicycle, so why is it important for "AI" to show that they can (at least visually)?
The "wine glass problem"[0] - and probably others - seems to me to be a lot more relevant...?
[0] https://medium.com/@joe.richardson.iii/the-curious-case-of-t...
The fact that pelicans can't ride bicycles is pretty much the point of the benchmark! Asking an LLM to draw something that's physically impossible means it can't just "get it right" - seeing how different models (especially at different sizes) handle the problem is surprisingly interesting.
Honestly though, the benchmark was originally meant to be a stupid joke.
I only started taking it slightly more seriously about six months ago, when I noticed that the quality of the pelican drawings really did correspond quite closely to how generally good the underlying models were.
If a model draws a really good picture of a pelican riding a bicycle there's a solid chance it will be great at all sorts of other things. I wish I could explain why that was!
If you start here and scroll through and look at the progression of pelican on bicycle images it's honestly spooky how well they match the vibes of the models they represent: https://simonwillison.net/2025/Jun/6/six-months-in-llms/#ai-...
So ever since then I've continued to get models to draw pelicans. I certainly wouldn't suggest anyone take serious decisions on model usage based on my stupid benchmark, but it's a fun first-day initial impression thing and it appears to be a useful signal for which models are worth diving into in more detail.
> If a model draws a really good picture of a pelican riding a bicycle there's a solid chance it will be great at all sorts of other things.
Why?
If I hired a worker who was really good at drawing pelicans riding a bike, that wouldn't tell me anything about their other qualities?!
It's not necessarily the best benchmark, but it's a popular one, probably because it's funny.
Yes it's like the wine glass thing.
Also it's kind of got depth. Does it draw the pelican and the bicycle? Can the pelican reach the pedals? How?
I can imagine a really good AI finding a funny or creative or realistic way for the pelican to reach the pedals.
A slightly worse AI will do an OK job, maybe just making the bike small or the legs too long.
An OK AI will draw a pelican on top of a bicycle and just call it a day.
It's not as binary as the wine glass example.
> It's not necessarily the best benchmark, but it's a popular one, probably because it's funny.
> Yes it's like the wine glass thing.
No, it's not!
That's part of my point; the wine glass scenario is a _realistic_ scenario. The pelican riding a bike is not. That's a _huge_ difference. Why should we measure intelligence (...) against something unrealistic rather than something realistic?
I just don't get it.
If this had any substance then it could be criticized, which is what they're trying to avoid.
How? There's no way for you to verify whether they put synthetic data for that into the training set or not.