Comment by simonw
2 days ago
Pretty great pelican: https://simonwillison.net/2026/Feb/19/gemini-31-pro/ - took over 5 minutes though, but I think that's because they're having performance teething problems on launch day.
It's an excellent demonstration of the main issue I have with the Gemini family of models: they always go "above and beyond" and do a lot of extra stuff, even if I explicitly prompt against it. In this case, most of the SVG ends up consisting not just of a bike and a pelican, but also clouds, a sun, a hat on the pelican, and much more.
Exactly the same thing happens when coding: it's almost impossible to get Gemini to not do "helpful" drive-by-refactors, and it keeps adding code comments no matter what I say. A very frustrating experience overall.
> it's almost impossible to get Gemini to not do "helpful" drive-by-refactors
Just asking "Explain what this service does?" turns into
[No response for three minutes...]
+729 -522
It's also so aggressive about taking out debug log statements and in-progress code. I'll ask it to fill in a new function somewhere else, and it will remove all of the half-written code from the piece I'm currently working on.
What. You don't have yours ask for edit approval?
If you had to ask, it obviously needs to refactor the code for clarity, so the next person doesn't need to ask.
"I don't know what did it, but here's what it does now"
I've seen Kimi do this a ton as well, so insufferable.
Would be really interesting to see an "Eager McBeaver" bench around this concept. When doing real work, a model's ability to stay within the bounds of a given task has almost become more important than its raw capabilities now that every frontier model is so dang good.
Every one of these models is so great at propelling the ship forward, that I increasingly care more and more about which models are the easiest to steer in the direction I actually want to go.
Being TOO steerable is another issue, though.
Codex is steerable to a fault, and will gladly "monkey's paw" your requests.
Claude Opus will ignore your instructions and do what it thinks is "right" and just barrel forward.
Both are bad, and both paper over the actual issue: these models don't really have the ability to selectively choose their behavior per situation (i.e. ask for follow-up where needed, ignore the user where needed, follow instructions where needed). Behavior is largely global.
For sure. I imagine it'd be pretty difficult to evaluate the "correct" amount of steerability. You'd probably have to measure the delta in eagerness on the same task between highly specified prompts and more open-ended ones. Probably not dissimilar from how artificialanalysis.ai does their "omniscience index".
I have the same issue. Even when I ask it to do code-reviews and very explicitly tell it not to change files, it will occasionally just start "fixing" things.
I find Copilot leans the other way. It'll myopically focus its work on the exact function I point it at, even when it's clear that adding a new helper would be a logical abstraction to share behaviour with the function right beside it.
Overall, I think it's probably better that it stays focused and allows me to prompt it with "hey, go ahead and refactor these two functions" rather than the other way around. At the same time, the ideal would really be for it to proactively ask, or even pitch the refactor as a colleague would: "based on what I see of this function, it would make most sense to XYZ; do you think that makes sense? <sure go ahead> <no just keep it a minimal change>"
Or perhaps even better, simply pursue both changes in parallel and present them as A/B options for the human reviewer to select between.
> it's almost impossible to get Gemini to not do "helpful" drive-by-refactors
This has not been my experience. I do Elixir primarily, and Gemini has helped build some really cool products and pull off massive refactors along the way. It would even pick up security issues and potential optimizations as it went.
What HAS been a constant issue, though, is that the model will randomly not respond at all and some random error will occur, which is embarrassing for a company like Google given the infrastructure they own.
Out of curiosity, do you have any public projects (with public source code) you've made exclusively with Gemini, so one could take a look? I've tried a bunch of times to use Gemini to at least finish something small but I always end up sufficiently frustrated to abort it as the instruction-following seems so bad.
Asking LLMs to "not do the thing" often results in them tripping up and generating output that includes the "thing", since those are simply the tokens that enter the input. I always try to rephrase the query so that all my instructions have only "positive" forms: "do only this", "do it only in that way", "do it only for the parameters requested", etc. I can't say whether it helps much, but it's possible.
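To make the rephrasing concrete, an invented before/after pair (my examples, not the commenter's):

```
Negative form (risky):  "Don't refactor anything outside this function."
Positive form:          "Change only this function; leave every other line exactly as it is."
```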
Which is how it works with people as well
I was using Gemini Antigravity in opencode a few weeks ago, before they started banning everyone for that, and I got into the habit of writing "do x, then wait for instructions".
That helped quite a bit, but it would still go off on its own from time to time.
This matches my experience using Gemini CLI to code. It would also frequently get stuck in loops. It was so bad compared to Codex that I feel like I must have been doing something fundamentally wrong.
> it's almost impossible to get Gemini to not do "helpful" drive-by-refactors
Not like human programmers. I would never do this and have never struggled with it in the past, no...
A fairer comparison would be against other models, which are typically better at instruction-following. You say "don't change anything not explicitly mentioned" or "don't add any new code comments" and they tend to follow that.
Every time I have tried using `gemini-cli` it just thinks endlessly and never actually gives a response.
Do you have Personalization Instructions set up for your LLM models?
You can make their responses fairly dry/brief.
I'm mostly using them via my own harnesses, so I have full control of the system prompts and so on. And no matter what I try, Gemini keeps "helpfully" adding code comments every now and then. With every other model, "- Don't add code comments" tends to be enough, but with Gemini I'm not sure how I could stop the comments from eventually appearing.
I'd love to hear some examples!
True: whenever I ask Gemini to help me with a prompt for generating an image of XYZ, it just goes ahead and generates the image.
What's crazy is you've influenced them to spend real effort ensuring their model is good at generating animated svgs of animals operating vehicles.
The most absurd benchmaxxing.
https://x.com/jeffdean/status/2024525132266688757?s=46&t=ZjF...
I like how they also did a frog on a penny-farthing and a giraffe driving a tiny car and an ostrich on roller skates and a turtle kickflipping a skateboard and a dachshund driving a stretch limousine.
Ok Google what are some other examples like a pelican riding a bicycle
Reminds me of Andor: Luthen positively reinforcing the Emperor wasting his time.
Animated SVG is huge. People in different professions worry to different degrees about being replaced by ML, but this one is huge with regard to digital art.
Yeah, complex SVGs are so much more bandwidth-, computation-, and energy-efficient than raster images, up to a point! But in general use we are not at that point, and there's so much more we can do with it.
I've been meaning to let coding agents take a stab at using the Lottie library https://github.com/airbnb/lottie-web to supercharge the user experience without needing to make it a full-time job.
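For context, "animated SVG" can be as lightweight as SMIL animation attributes baked into the markup itself, with no JavaScript or raster frames. A minimal sketch (the filename and shapes are invented for illustration):

```python
# A circle that slides back and forth using SMIL's <animate> element.
# SMIL is part of the SVG spec, so modern browsers play this directly.
svg = """<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 100 100">
  <circle cx="20" cy="50" r="10" fill="steelblue">
    <animate attributeName="cx" values="20;80;20"
             dur="2s" repeatCount="indefinite"/>
  </circle>
</svg>"""

# Write it out; the whole looping animation is a few hundred bytes,
# versus kilobytes per frame for an equivalent raster animation.
with open("sliding_dot.svg", "w") as f:
    f.write(svg)
```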
Can't wait until they finally get to real world CAD
There's a CAD example in that same thread: https://x.com/JeffDean/status/2024528776856817813
I know this isn’t necessarily “real world CAD” but Claude Code is not too shabby at OpenSCAD.
He's svg-mogging
So let's put things we're interested in in the benchmarks.
I'm not against pelicans!
I think the reason the pelican example is great is that it's bizarre enough that it's unlikely to appear in the training data as one unified picture.
If we picked something more common, like, say, a hot dog with toppings, then training contamination would be much harder to control.
You don't have to benchmax everything, just the benchmarks in the right social circles
It is funny to think that Jeff Dean personally worked to optimize the pelican-riding-a-bike benchmark.
Does anyone understand why LLMs have gotten so good at this? Their ability to generate accurate SVG shapes seems to greatly outshine what I would expect, given their mediocre spatial understanding in other contexts.
A few thoughts:
- One thing to be aware of is that LLMs can be much smarter than their ability to articulate that intelligence in words. For example, GPT-3.5 Turbo was beastly at chess (1800 Elo?) when prompted to complete PGN transcripts, but if you asked it questions in chat, its knowledge was abysmal. LLMs don't generalize as well as humans, and sometimes they can have the ability to do tasks without the ability to articulate things that feel essential to those tasks (like answering whether the bicycle is facing left or right).
- Secondly, what has made AI labs so bullish on future progress over the past few years is seeing how little work it takes to get their results. Often, if an LLM sucks at something, that's because no one worked on it (not always, of course). If you directly train a skill, you can see giant leaps in ability with fairly small effort. Big leaps in SVG creation could be coming from relatively small targeted efforts where none existed before.
We’re literally at the point where trillions of dollars have been invested in these things and the surrounding harnesses and architecture, and they still can’t do economically useful work on their own. You’re way too bullish here.
My best guess is that the labs put a lot of work into HTML and CSS spatial stuff because web frontend is such an important application of the models, and those improvements leaked through to SVG as well.
All models have improved, but from my understanding, Gemini is the main one that was specifically trained on photos/video/etc in addition to text. Other models like earlier chatgpt builds would use plugins to handle anything beyond text, such as using a plugin to convert an image into text so that chatgpt could "see" it.
Gemini was multimodal from the start, and is naturally better at doing tasks that involve pictures/videos/3d spatial logic/etc.
The newer chatgpt models are also now multimodal, which has probably helped with their svg art as well, but I think Gemini still has an edge here
> Does anyone understand why LLMs have gotten so good at this?
Added more IF/THEN/ELSE conditions.
More wires and jumpers on the breadboard.
Models are soon going to start benchmaxxing generating SVGs of pelicans on bikes
That’s Simon’s goal. “All I’ve ever wanted from life is a genuinely great SVG vector illustration of a pelican riding a bicycle. My dastardly multi-year plan is to trick multiple AI labs into investing vast resources to cheat at my benchmark until I get one.”
https://simonwillison.net/2025/Nov/13/training-for-pelicans-...
So once that's achieved, I wonder how well it deals with unexpected variations. E.g.
"Give me an illustration of a bicycle riding by a pelican"
"Give me an illustration of a bicycle riding over a pelican"
"Give me an illustration of a bicycle riding under a flying pelican"
So on and so forth. Or will it start to look like the Studio C sketch about Lobster Bisque: https://youtu.be/A2KCGQhVRTE
Soon? I'd be willing to bet it's been included in the training set at least 6 months by now. Not so obvious so it generates always perfect pelicans on bikes, but sufficiently for the "minibench" to be less useful today than in the past.
If only there were some way to test it, like swapping the two nouns in the sentence. Alas.
Simon's been doing this exact test for nearly 18 months now; if vendors want to benchmaxx it, they've had more than enough time to do so already.
Exactly. As far as I'm concerned, the benchmark is useless. It's way too easy and rewarding to train on it.
Forget the paperclip maximizer - AGI will turn the whole world into pelicans on bikes.
It seems they trained the model to output good SVGs.
In their blog post[1], the first use case they mention is SVG generation. Thus, it might not be an indicator at all anymore.
[1] https://blog.google/innovation-and-ai/models-and-research/ge...
Did you stop using the more detailed prompt? I think you described it here: https://simonwillison.net/2025/Nov/18/gemini-3/
It seems to be having capacity problems right now but I'll run that as soon as I can get it to work.
Pretty solid: https://gist.github.com/simonw/f5c893203621a7631ff178d9093a8...
Less pretty and more practical, it's really good at outputting circuit designs as SVG schematics.
https://www.svgviewer.dev/s/dEdbH8Sw
I don't know what of this is the prompt and what was the output, but that's a pretty bad schematic (for both aesthetic and circuit-design reasons).
The prompts were doing the design (reference voltage, hysteresis, output stage, all the maths), and then the SVG came from asking the model to take all that plus the current BOM and make an SVG schematic of it. In the past, models would just output totally incoherent messes of lines and shapes.
I did a larger circuit too that this is part of, but it's not really for sharing online.
Yes but you concede it is a schematic.
That's pretty amazing for an LLM, but as an EE, if my intern did this I would sigh inwardly and pull up some existing schematics for some brief guidance on symbol layout.
At this point, the pelican benchmark has become so widely used that there must be high-quality pelicans in the dataset, I presume. What about generating an okapi on a bicycle instead?
Loads of examples here https://x.com/jeffdean/status/2024525132266688757
Or, even more challenging, an okapi on a recumbent?!
Ugh, the gears and chain don't mesh and there's no sprocket on the rear hub
But seriously, I can't believe LLMs are able to one-shot a pelican on a bicycle this well. I wouldn't have guessed this was going to emerge as a capability from LLMs 6 years ago. I see why it does now, but... It still amazes me that they're so good at some things.
Is this capability “emergent”, or do AI firms specifically target SVG generation in order to improve it? How would we be able to tell?
I asked myself the same thing as I typed that comment, and I'm not sure what the answer is. I don't think models are specifically trained on this (though of course they're trained on how to generate SVGs in general), but I'm prepared to be wrong.
I have a feeling the most 'emergent' aspect was that LLMs have generally been able to produce coherent SVG for quite a while, likely without specific training at first. Since then I suspect there has been more tailored training because improvements have been so dramatic. Of course it makes sense that text-based images using very distinct structure and properties could be manipulated reasonably well by a text-based language model, but it's still fascinating to me just how well it can work.
Perhaps what's most incredible about it is how versatile human language is, even when it lacks so many dimensions as bits on a machine. Yet it's still cool that we can resurrect those bits at rest and transmogrify them back into coherent projections of photons from a screen.
I don't think LLMs are AGI or about to completely flip the world upside down or whatever, but it seems undeniably magical when you break it down.
Google specifically boast about their SVG performance in the announcement post: https://blog.google/innovation-and-ai/models-and-research/ge...
You can try any combination of animal on vehicle to confirm that they likely didn't target pelicans directly though.
Next time you host a party, have people try to draw a bicycle on your whiteboard (you have a whiteboard in your house, right? You should, anyway...)
Human adults are generally quite bad at drawing them, unless they spend a lot of time actually thinking about bicycles as objects.
They are, and it is very funny.
https://www.behance.net/gallery/35437979/Velocipedia
What’s your point? Yes, humans fail sometimes, as do AI models. Are you trying to imply that, in light of this, AI is now as capable as human beings? If so, that conclusion doesn’t follow logically.
And the left leg is straight while the right leg is bent.
EDIT: And the chain should pass behind the seat stay.
Cost per task has increased 4.2x, but their ARC-AGI-2 score went from 33.6% to 77.1%.
Cost per task is still significantly lower than Opus. Even Opus 4.5
https://arcprize.org/leaderboard
What is that, a snack in the basket?
"integrating a bicycle basket, complete with a fish for the pelican... also ensuring the basket is on top of the bike, and that the fish is correctly positioned with its head up... basket is orange, with a fish inside for fun."
How thoughtful of the AI to include a snack. Truly a "thanks for all the fish".
A pelican already has an integrated snack-holder, though. It wouldn't need to put it in the basket.
A fish for the road
The number of snacks in the basket is a random variable with a Poisson distribution.
Another great benchmark would be to convert a raster image of a logo into SVG. I've yet to find a good tool for this that produces accurate smooth lines.
What do you think this particular prompt is evaluating for?
The more popular these particular evals are, the more likely the model will be trained for them.
See https://simonwillison.net/2025/Nov/13/training-for-pelicans-...
Great pelican but what’s up with that fish in the basket?
It's a pelican. What do you expect a pelican to have in his bike's basket?
It's a pretty funny and coherent touch!
> What do you expect a pelican to have in his bike's basket?
Probably stuff it can't fit in its gullet, or doesn't want there (think trash). I wouldn't expect a pelican to stash fish there, that's for sure.
Yeah, why only _one_ fish?
It's obvious that the pelican is riding long distance; no way a single fish is sufficiently energy-dense for more than a few miles.
Can't the model do basic math???
Where else are cycling pelicans meant to keep their fish?
I get it, I just meant the fish is poorly done, when I'd have guessed it would be a relatively simple part. Maybe the black dot eye is misplaced, idk.
Wonder when we'll get something other than a side view.
Another Jeff Dean post about this model shows it writing programs that generate CAD objects. I suspect if you ask it to, it will create a CAD pelican on a CAD bicycle and even make joints so you can turn the pedals.
That would be especially challenging for vector output. I tried just now on ChatGPT 5.2 to jump straight to an image, with this prompt:
"make me a cartoon image of a pelican riding a bicycle, but make it from a front 3/4 view, that is riding toward the viewer."
The result was basically a head-on view, but I expect if you then put that back in and said, "take this image and vectorize it as an SVG" you'd have a much better time than trying to one-shot the SVG directly from a description.
... but of course, if that's so, then what's preventing the model from being smart enough to identify this workflow and follow it on its own to get the task completed?
Is there something in your prompt about hats? Why is the pelican always wearing a hat recently?!
At this point, I think maybe they're training on all of the previous pelicans, and one of them decided to put a hat on it?
Disclaimer: this is an unsubstantiated claim that I made up.
Not even animated? This is 2026.
Jeff Dean just posted an animated version: https://x.com/JeffDean/status/2024525132266688757
One underrated thing about the recent frontier models, IMO, is that they are obviating the need for image gen as a standalone thing. Opus 4.6 (and apparently 3.1 Pro as well) doesn't have the ability to generate images but it is so good at making SVG that it basically doesn't matter at this point. And the benefit of SVG is that it can be animated and interactive.
I find this fascinating because it literally just happened in the past few months. Up until ~summer of 2025, the SVG these models made was consistently buggy and crude. By December of 2025, I was able to get results like this from Opus 4.5 (Henry James: the RPG, made almost entirely with SVG): https://the-ambassadors.vercel.app
And now it looks like Gemini 3.1 Pro has vaulted past it.
That Ostrich Tho
You think they are able to see their output and iterate on it? Or is it pure token generation?
How about STL files for 3d printing pelicans!
Harder: the bike must work
Hardest: the pelican must work
I used the AI studio link and tried running it with the temperature set to 1.75: https://jsbin.com/locodaqovu/edit?html,output
I hope we keep beating this dead horse some more, I'm still not tired of it.