Comment by simonw

2 months ago

  llm install llm-mistral
  llm mistral refresh
  llm -m mistral/devstral-2512 "Generate an SVG of a pelican riding a bicycle"

https://tools.simonwillison.net/svg-render#%3Csvg%20xmlns%3D...

Pretty good for a 123B model!

(That said I'm not 100% certain I guessed the correct model ID, I asked Mistral here: https://x.com/simonw/status/1998435424847675429)

99 comments

simonw

Jimmc414 2 months ago

We are getting to the point that its not unreasonable to think that "Generate an SVG of a pelican riding a bicycle" could be included in some training data. It would be a great way to ensure an initial thumbs up from a prominent reviewer. It's a good benchmark but it seems like it would be a good idea to include an additional random or unannounced similar test to catch any benchmaxxing.

simonw 2 months ago
I wrote about that possibility here: https://simonwillison.net/2025/Nov/13/training-for-pelicans-...
- armcat 2 months ago
  
  Hi Simon! Love your work! Our of curiosity - how many pelican-cycling samples do you produce. Curious about the variance here. Thanks!
  
  4 replies →
- jgalt212 2 months ago
  
  Aiden is perhaps misinformed. From a Bing search performed just now.
  > Yes, I am familiar with the "pelican riding a bicycle" SVG generation test. It is a benchmark for evaluating the ability of AI models, particularly large language models (LLMs) and multi-modal systems, to generate original, high-quality SVG vector graphics based on a deliberately unusual and complex prompt. The benchmark was popularized by Simon Willison, who selected the prompt because:
  
  3 replies →
- th0ma5 2 months ago
  
  [flagged]
  
  27 replies →
thatwasunusual 2 months ago
> We are getting to the point that its not unreasonable to think that "Generate an SVG of a pelican riding a bicycle" could be included in some training data.
I may be stupid, but _why_ is this prompt used as a benchmark? I mean, pelicans _can't_ ride a bicycle, so why is it important for "AI" to show that they can (at least visually)?
The "wine glass problem"[0] - and probably others - seems to me to be a lot more relevant...?
[0] https://medium.com/@joe.richardson.iii/the-curious-case-of-t...
- simonw 2 months ago
  
  The fact that pelicans can't ride bicycles is pretty much the point of the benchmark! Asking an LLM to draw something that's physically impossible means it can't just "get it right" - seeing how different models (especially at different sizes) handle the problem is surprisingly interesting.
  Honestly though, the benchmark was originally meant to be a stupid joke.
  I only started taking it slightly more seriously about six months ago, when I noticed that the quality of the pelican drawings really did correspond quite closely to how generally good the underlying models were.
  If a model draws a really good picture of a pelican riding a bicycle there's a solid chance it will be great at all sorts of other things. I wish I could explain why that was!
  If you start here and scroll through and look at the progression of pelican on bicycle images it's honestly spooky how well they match the vibes of the models they represent: https://simonwillison.net/2025/Jun/6/six-months-in-llms/#ai-...
  So ever since then I've continue to get models to draw pelicans. I certainly wouldn't suggest anyone take serious decisions on model usage based on my stupid benchmark, but it's a fun first-day initial impression thing and it appears to be a useful signal for which models are worth diving into in more detail.
  
  7 replies →
- theshrike79 2 months ago
  
  It's just a variant of the wine glass - something that doesn't exist in the source material as-is. I have a few of my own I don't share publicly.
  Basically in my niche I _know_ there are no original pictures of specific situations and my prompts test whether the LLM is "creative" enough to combine multiple sources into one that matches my prompt.
  I think of if like this: there are three things I want in the picture (more actually, but for the example assume 3). All three are really far from each other in relevance, in the very corner of an equilateral triangle (in the vector space of the LLM's "brain"). What I'm asking it to do is in the middle of all three things.
  Every model so far tends to veer towards one or two of the points more than others because it can't figure out how to combine them all into one properly.
- wisty 2 months ago
  
  It's not nessessarily the best benchmark, it's a popular one, probably because it's funny.
  Yes it's like the wine glass thing.
  Also it's kind of got depth. Does it draw the pelican and the bicycle? Can the penguin reach the peddles? How?
  I can imagine a really good AI finding a funny or creative or realistic way for the penguin to reach the peddles.
  An slightly worse AI will do an OK job, maybe just making the bike small or the legs too long.
  An OK AI will draw a penguin on top of a bicycle and just call it a day.
  It's not as binary as the wine glass example.
  
  4 replies →
Workaccount2 2 months ago
It would be easy to out models that train on the bike pelican, because they would probably suck at the kayaking bumblebee.
So far though, the models good at bike pelican are also good at kayak bumblebee, or whatever other strange combo you can come up with.
So if they are trying to benchmaxx by making SVG generation stronger, that's not really a miss, is it?
- majormajor 2 months ago
  
  That depends on if "SVG generation" is a particularly useful LLM/coding model skill outside of benchmarking. I.e., if they make that stronger with some params that otherwise may have been used for "rust type system awareness" or somesuch, it might be a net loss outside of the benchmarks.
th0ma5 2 months ago
If this had any substance then it could be criticized, which is what they're trying to avoid.
- Etheryte 2 months ago
  
  How? There's no way for you to verify if they put synthetic data for that into the dataset or not.
0cf8612b2e1e 2 months ago
I assume all of the models also have variations on, “how many ‘r’s in strawberry”.
- theshrike79 2 months ago
  
  The easiest way to fix these is give the model an environment to run code.
  Any model can easily one-shot a python script that can count the occurrence of any letter anywhere and return the result.
  It's just a tooling issue. You really can't "train" an LLM to do it because tokenisation and ... stuff.
  
  2 replies →

baq 2 months ago

but can it recreate the spacejam 1996 website? https://www.spacejam.com/1996/jam.html

aschobel 2 months ago

in case folks are missing the context
https://news.ycombinator.com/item?id=46183294
lagniappe 2 months ago
That is not a meaningful metric given that we don't live in 1996 and neither do our web standards.
- tarsinge 2 months ago
  
  In what year was it meaningful to have pelicans riding bicycles?
  
  5 replies →
- utopiah 2 months ago
  
  > neither do our web standards
  I'd be curious about that actually, feel like W3C specifications (I don't mean browser support of them) rarely deprecate and precisely try to keep the Web running.
- baq 2 months ago
  
  Yes, now please prepare an email template which renders fine in outlook using modern web standards. Write it up if you succeed, front page of HN guaranteed!
- tomashubelbauer 2 months ago
  
  The parent comment is a reference to a different story that was on the HN home page yesterday where someone attempted that with Claude.
  
  2 replies →

cpursley 2 months ago

Skipped the bicycle entirely and upgraded to a sweet motorcycle :)

aorth 2 months ago
Looks like a Cybertruck actually!
- BudaDude 2 months ago
  
  I was thinking a Warthog
  https://www.halopedia.org/Warthog
lubujackson 2 months ago
The Batman motorcycle!
- troyvit 2 months ago
  
  I'm Pelicanman </raspy voice>
  
  1 reply →

willahmad 2 months ago

I think this benchmark could be slightly misleading to assess coding model. But still very good result.

Yes, SVG is code, but not in a sense of executable with verifiable inputs and outputs.

jstummbillig 2 months ago
I love that we are earnestly contemplating the merits of the pelican benchmark. What a timeline.
- andrepd 2 months ago
  
  It's not even halfway up the list of inane things of the AI hype cycle.
hdjrudni 2 months ago

But it does have a verifiable output, no more or less than HTML+CSS. Not sure what you mean by "input" -- it's not a function that takes in parameters if that's what you're getting at, but not every app does.

iberator 2 months ago

Where did you get llm tool from?!

fauigerzigerk 2 months ago
He made it: https://github.com/simonw/llm
- techsystems 2 months ago
  
  Cool! I can't find it on the read me, but can it run Qwen locally?
  
  1 reply →

lacoolj 2 months ago

How did you run a 123B model locally? Or did you do this on a GPU host somewhere? If so, what spec was it?

simonw 2 months ago

I haven't run the 123B one locally yet. I used Mistral's own API models for this.

felixg3 2 months ago

Is it really an svg if it’s just embedded base64 of a jpg

joombaga 2 months ago

You were seeing the base64 image tag output at the bottom. The SVG input is at the top.

samgutentag 2 months ago

"Generate an SVG of a pelican riding a bicycle" is the new "but can it run Crysis"

breedmesmn 2 months ago

[flagged]