Comment by diggan
1 day ago
Let's give it a try, if you're willing to be the experiment subject :)
The prompt is "Generate an SVG of a pelican riding a bicycle" and you're supposed to write it by hand, so no graphical editor. The specification is here: https://www.w3.org/TR/SVG2/
I'm fairly certain I'd lose interest in getting it right before I got something better than most of those.
> The colors use traditional bicycle brown (#8B4513) and a classic blue for the pelican (#4169E1) with gold accents for the beak (#FFD700).
The output pelican is indeed blue. I can't fathom where the idea that this is "classic", or suitable for a pelican, could have come from.
My guess would be that it doesn't see the web colors (CSS color hexes) as proper hex triplets; because of tokenization, it could be seeing something dumb like '#8B', '451', '3' instead. I think the same issue happens with several special characters in a row too.
No, it's understanding the colors properly. The SVG that the LLM created does use #4169E1 for the pelican color, and the LLM correctly describes this color as blue. The problem is that pelicans should not be blue.
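For anyone who wants to check the hex values themselves, here is a minimal Python sketch that decodes the three colors quoted above into RGB. The helper names `hex_to_rgb` and `dominant_channel` are made up for this illustration, and "largest channel" is only a crude heuristic, not a real color-naming method.

```python
def hex_to_rgb(color: str) -> tuple[int, int, int]:
    """Convert a '#RRGGBB' string into an (r, g, b) tuple of 0-255 ints."""
    digits = color.lstrip("#")
    if len(digits) != 6:
        raise ValueError(f"expected #RRGGBB, got {color!r}")
    # Each channel is two hex digits: RR, GG, BB.
    return tuple(int(digits[i:i + 2], 16) for i in (0, 2, 4))


def dominant_channel(color: str) -> str:
    """Return the name of the largest channel (crude heuristic only)."""
    r, g, b = hex_to_rgb(color)
    return max((("red", r), ("green", g), ("blue", b)), key=lambda pair: pair[1])[0]


# The three colors from the model's own description of its SVG.
for part, color in [("bicycle", "#8B4513"), ("pelican", "#4169E1"), ("beak", "#FFD700")]:
    print(part, color, hex_to_rgb(color), dominant_channel(color))
```

The pelican color decodes to (65, 105, 225), so the blue channel clearly dominates, which matches the point that the description is accurate and the choice of color is the odd part.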
Qwen3, at least, tokenizes each character of "#8B4513" separately.
Did the testing prompt for LLMs include a clause forbidding the use of any tools? If not, why are you adding it here?
The way I run the pelican on a bicycle benchmark is to use this exact prompt, "Generate an SVG of a pelican riding a bicycle", and execute it via the model's API with all default settings, not via their user-facing interface.
Currently none of the model APIs enable tools unless you ask them to, so this method excludes the use of additional tools.
The models being put through the "Pelican" test don't use a GUI to create SVGs (either via "tools" or anything else). They're all Text Generation models, so they exclusively use text for creating the graphics.
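To make the "it's all just text" point concrete, here is a small Python sketch: an SVG is an ordinary string of XML, so it can be written by hand and checked with nothing but the standard library. The shapes below are hypothetical placeholders made up for this example (two wheels, a body, a beak), not any model's actual output; only the three hex colors come from the thread.

```python
import xml.etree.ElementTree as ET

# A hand-written SVG: plain text, no graphical editor involved.
# The geometry is an illustrative placeholder, not real benchmark output.
svg_text = """<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 100 60">
  <circle cx="25" cy="45" r="12" fill="none" stroke="#8B4513" stroke-width="2"/>
  <circle cx="75" cy="45" r="12" fill="none" stroke="#8B4513" stroke-width="2"/>
  <ellipse cx="50" cy="25" rx="14" ry="9" fill="#4169E1"/>
  <polygon points="64,22 82,26 64,28" fill="#FFD700"/>
</svg>"""

# Parsing succeeds only if the text is well-formed XML.
root = ET.fromstring(svg_text)

# Collect the solid fill colors of the top-level shapes.
fills = [el.get("fill") for el in root if el.get("fill") not in (None, "none")]
print(len(root), "shapes, fills:", fills)
```

Parsing only proves the text is well-formed XML; whether it looks anything like a pelican on a bicycle still has to be judged by eye.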
There are 31 posts listed under "pelican-riding-a-bicycle" in case you wanna inspect the methodology even closer: https://simonwillison.net/tags/pelican-riding-a-bicycle/