Comment by simonw
2 days ago
Pretty great pelican: https://simonwillison.net/2026/Feb/19/gemini-31-pro/ - took over 5 minutes though, but I think that's because they're having performance teething problems on launch day.
It's an excellent demonstration of the main issue I have with the Gemini family of models: they always go "above and beyond" and do a lot of extra stuff, even if I explicitly prompt against it. In this case, most of the SVG ends up consisting not just of a bike and a pelican, but also clouds, a sun, a hat on the pelican, and much more.
Exactly the same thing happens when coding: it's almost impossible to get Gemini to not do "helpful" drive-by-refactors, and it keeps adding code comments no matter what I say. A very frustrating experience overall.
> it's almost impossible to get Gemini to not do "helpful" drive-by-refactors
Just asking "Explain what this service does?" turns into
[No response for three minutes...]
+729 -522
It's also so aggressive about taking out debug log statements and in-progress code. I'll ask it to fill in a new function somewhere else, and it will remove all of the half-written code from the piece I'm currently working on.
What. You don't have yours ask for edit approval?
If you had to ask, it obviously needs to refactor the code for clarity, so the next person doesn't need to ask.
"I don't know what did it, but here's what it does now"
I've seen Kimi do this a ton as well, so insufferable.
Would be really interesting to see an "Eager McBeaver" bench around this concept. When doing real work, a model's ability to stay within the bounds of a given task has almost become more important than its raw capabilities now that every frontier model is so dang good.
Every one of these models is so great at propelling the ship forward, that I increasingly care more and more about which models are the easiest to steer in the direction I actually want to go.
Being TOO steerable is another issue, though.
Codex is steerable to a fault, and will gladly "monkey's paw" your requests.
Claude Opus will ignore your instructions and do what it thinks is "right" and just barrel forward.
Both are bad, and both paper over the actual issue: these models don't really have the ability to selectively choose their behavior per situation (i.e. ask for follow-up where needed, ignore the user where needed, follow instructions where needed). Behavior is largely global.
For sure. I imagine it'd be pretty difficult to evaluate the "correct" amount of steerability. You'd probably have to measure the delta in eagerness on the same task between highly specified prompts and more open-ended ones. Probably not dissimilar from how artificialanalysis.ai does their "omniscience index".
I have the same issue. Even when I ask it to do code-reviews and very explicitly tell it not to change files, it will occasionally just start "fixing" things.
I find Copilot leans the other way. It'll myopically focus its work on the exact function I point it at, even when it's clear that adding a new helper would be a logical abstraction to share behaviour with the function right beside it.
Overall, I think it's probably better that it stays focused and allows me to prompt it with "hey, go ahead and refactor these two functions" rather than the other way around. At the same time, the ideal would really be for it to proactively ask, or even pitch the refactor as a colleague would: "based on what I see of this function, it would make most sense to XYZ; do you think that makes sense? <sure go ahead> <no just keep it a minimal change>"
Or perhaps even better, simply pursue both changes in parallel and present them as A/B options for the human reviewer to select between.
> it's almost impossible to get Gemini to not do "helpful" drive-by-refactors
This has not been my experience. I do Elixir primarily, and Gemini has helped build some really cool products and pull off massive refactors along the way. It would even pick up security issues and potential optimizations as it went.
What HAS been a constant issue, though, is that the model will randomly not respond at all and some random error will occur, which is embarrassing for a company like Google given the infrastructure they own.
Out of curiosity, do you have any public projects (with public source code) you've made exclusively with Gemini, so one could take a look? I've tried a bunch of times to use Gemini to at least finish something small but I always end up sufficiently frustrated to abort it as the instruction-following seems so bad.
Asking LLMs to "not do the thing" often results in them tripping up and generating output that includes the "thing", since those are simply the tokens that enter the input. I always try to rephrase the query so that all my instructions have only "positive" forms: "do only this", "do it only in that way", "do it only for the parameters requested", etc. I can't say whether it helps much, but it's possible.
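To make the rephrasing concrete, an invented before/after pair (my examples, not the commenter's):

```
Negative form (risky):  "Don't refactor anything outside this function."
Positive form:          "Change only this function; leave every other line exactly as it is."
```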
Which is how it works with people as well
I was using Gemini Antigravity in opencode a few weeks ago, before they started banning everyone for that, and I got into the habit of writing "do x, then wait for instructions".
That helped quite a bit, but it would still go off on its own from time to time.
This matches my experience using Gemini CLI to code. It would also frequently get stuck in loops. It was so bad compared to Codex that I feel like I must have been doing something fundamentally wrong.
> it's almost impossible to get Gemini to not do "helpful" drive-by-refactors
Not like human programmers. I would never do this and have never struggled with it in the past, no...
A fairer comparison would be against other models, which are typically better at instruction-following. You say "don't change anything not explicitly mentioned" or "don't add any new code comments" and they tend to follow that.
Every time I have tried using `gemini-cli` it just thinks endlessly and never actually gives a response.
Do you have Personalization Instructions set up for your LLM models?
You can make their responses fairly dry/brief.
I'm mostly using them via my own harnesses, so I have full control of the system prompts and so on. And no matter what I try, Gemini keeps "helpfully" adding code comments every now and then. With every other model, "- Don't add code comments" tends to be enough, but with Gemini I'm not sure how I could stop the comments from eventually appearing.
I'd love to hear some examples!
True: whenever I ask Gemini to help me with a prompt for generating an image of XYZ, it just goes ahead and generates the image.
What's crazy is you've influenced them to spend real effort ensuring their model is good at generating animated svgs of animals operating vehicles.
The most absurd benchmaxxing.
https://x.com/jeffdean/status/2024525132266688757?s=46&t=ZjF...
I like how they also did a frog on a penny-farthing and a giraffe driving a tiny car and an ostrich on roller skates and a turtle kickflipping a skateboard and a dachshund driving a stretch limousine.
Ok Google what are some other examples like a pelican riding a bicycle
Reminds me of Andor: Luthen positively reinforcing the Emperor wasting his time.
Animated SVG is huge. People in different professions worry to different degrees about being replaced by ML, but this one is huge with regard to digital art.
Yeah, complex SVGs are so much more bandwidth-, computation-, and energy-efficient than raster images, up to a point! But in general use we are not at that point, and there's so much more we can do with it.
I've been meaning to let coding agents take a stab at using the Lottie library https://github.com/airbnb/lottie-web to supercharge the user experience without needing to make it a full-time job.
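For context, "animated SVG" can be as lightweight as SMIL animation attributes baked into the markup itself, with no JavaScript or raster frames. A minimal sketch (the filename and shapes are invented for illustration):

```python
# A circle that slides back and forth using SMIL's <animate> element.
# SMIL is part of the SVG spec, so modern browsers play this directly.
svg = """<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 100 100">
  <circle cx="20" cy="50" r="10" fill="steelblue">
    <animate attributeName="cx" values="20;80;20"
             dur="2s" repeatCount="indefinite"/>
  </circle>
</svg>"""

# Write it out; the whole looping animation is a few hundred bytes,
# versus kilobytes per frame for an equivalent raster animation.
with open("sliding_dot.svg", "w") as f:
    f.write(svg)
```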
Can't wait until they finally get to real world CAD
There's a CAD example in that same thread: https://x.com/JeffDean/status/2024528776856817813
I know this isn’t necessarily “real world CAD” but Claude Code is not too shabby at OpenSCAD.
He's svg-mogging
So let's put things we're interested in in the benchmarks.
I'm not against pelicans!
I think the reason the pelican example is great is that it's bizarre enough that it's unlikely to appear in the training data as one unified picture.
If we picked something more common, like, say, a hot dog with toppings, then training contamination would be much harder to control.
You don't have to benchmax everything, just the benchmarks in the right social circles
It is funny to think that Jeff Dean personally worked to optimize the pelican-riding-a-bike benchmark.
Does anyone understand why LLMs have gotten so good at this? Their ability to generate accurate SVG shapes seems to greatly outshine what I would expect, given their mediocre spatial understanding in other contexts.
A few thoughts:
- One thing to be aware of is that LLMs can be much smarter than their ability to articulate that intelligence in words. For example, GPT-3.5 Turbo was beastly at chess (1800 Elo?) when prompted to complete PGN transcripts, but if you asked it questions in chat, its knowledge was abysmal. LLMs don't generalize as well as humans, and sometimes they can have the ability to do tasks without the ability to articulate things that feel essential to those tasks (like answering whether the bicycle is facing left or right).
- Secondly, what has made AI labs so bullish on future progress over the past few years is seeing how little work it takes to get their results. Often, if an LLM sucks at something, that's because no one worked on it (not always, of course). If you directly train a skill, you can see giant leaps in ability with fairly small effort. Big leaps in SVG creation could be coming from relatively small targeted efforts where none existed before.
We’re literally at the point where trillions of dollars have been invested in these things and the surrounding harnesses and architecture, and they still can’t do economically useful work on their own. You’re way too bullish here.
My best guess is that the labs put a lot of work into HTML and CSS spatial stuff because web frontend is such an important application of the models, and those improvements leaked through to SVG as well.
All models have improved, but from my understanding, Gemini is the main one that was specifically trained on photos/video/etc in addition to text. Other models like earlier chatgpt builds would use plugins to handle anything beyond text, such as using a plugin to convert an image into text so that chatgpt could "see" it.
Gemini was multimodal from the start, and is naturally better at doing tasks that involve pictures/videos/3d spatial logic/etc.
The newer chatgpt models are also now multimodal, which has probably helped with their svg art as well, but I think Gemini still has an edge here
> Does anyone understand why LLMs have gotten so good at this?
Added more IF/THEN/ELSE conditions.
More wires and jumpers on the breadboard.
Models are soon going to start benchmaxxing generating SVGs of pelicans on bikes
That’s Simon’s goal. “All I’ve ever wanted from life is a genuinely great SVG vector illustration of a pelican riding a bicycle. My dastardly multi-year plan is to trick multiple AI labs into investing vast resources to cheat at my benchmark until I get one.”
https://simonwillison.net/2025/Nov/13/training-for-pelicans-...
So once that's achieved, I wonder how well it deals with unexpected variations. E.g.
"Give me an illustration of a bicycle riding by a pelican"
"Give me an illustration of a bicycle riding over a pelican"
"Give me an illustration of a bicycle riding under a flying pelican"
So on and so forth. Or will it start to look like the Studio C sketch about Lobster Bisque: https://youtu.be/A2KCGQhVRTE
Soon? I'd be willing to bet it's been included in the training set at least 6 months by now. Not so obvious so it generates always perfect pelicans on bikes, but sufficiently for the "minibench" to be less useful today than in the past.
If only there were some way to test it, like swapping the two nouns in the sentence. Alas.
Simon's been doing this exact test for nearly 18 months now; if vendors want to benchmaxx it, they've had more than enough time to do so already.
Exactly. As far as I'm concerned, the benchmark is useless. It's way too easy and rewarding to train on it.
Forget the paperclip maximizer - AGI will turn the whole world into pelicans on bikes.
It seems they trained the model to output good SVGs.
In their blog post[1], the first use case they mention is SVG generation. Thus, it might not be an indicator at all anymore.
[1] https://blog.google/innovation-and-ai/models-and-research/ge...
Did you stop using the more detailed prompt? I think you described it here: https://simonwillison.net/2025/Nov/18/gemini-3/
It seems to be having capacity problems right now but I'll run that as soon as I can get it to work.
Pretty solid: https://gist.github.com/simonw/f5c893203621a7631ff178d9093a8...
Less pretty and more practical, it's really good at outputting circuit designs as SVG schematics.
https://www.svgviewer.dev/s/dEdbH8Sw
I don't know what of this is the prompt and what was the output, but that's a pretty bad schematic (for both aesthetic and circuit-design reasons).
The prompts were doing the design (reference voltage, hysteresis, output stage, all the maths), and then the SVG came from asking the model to take all that plus the current BOM and make an SVG schematic of it. In the past, models would just output totally incoherent messes of lines and shapes.
I did a larger circuit too that this is part of, but it's not really for sharing online.
Yes but you concede it is a schematic.
That's pretty amazing for an LLM, but as an EE, if my intern did this I would sigh inwardly and pull up some existing schematics for some brief guidance on symbol layout.
At this point, the pelican benchmark has become so widely used that there must be high-quality pelicans in the dataset, I presume. What about generating an okapi on a bicycle instead?
Loads of examples here https://x.com/jeffdean/status/2024525132266688757
Or, even more challenging, an okapi on a recumbent?!
Ugh, the gears and chain don't mesh and there's no sprocket on the rear hub
But seriously, I can't believe LLMs are able to one-shot a pelican on a bicycle this well. I wouldn't have guessed this was going to emerge as a capability from LLMs 6 years ago. I see why it does now, but... It still amazes me that they're so good at some things.
Is this capability “emergent”, or do AI firms specifically target SVG generation in order to improve it? How would we be able to tell?
I asked myself the same thing as I typed that comment, and I'm not sure what the answer is. I don't think models are specifically trained on this (though of course they're trained on how to generate SVGs in general), but I'm prepared to be wrong.
I have a feeling the most 'emergent' aspect was that LLMs have generally been able to produce coherent SVG for quite a while, likely without specific training at first. Since then I suspect there has been more tailored training because improvements have been so dramatic. Of course it makes sense that text-based images using very distinct structure and properties could be manipulated reasonably well by a text-based language model, but it's still fascinating to me just how well it can work.
Perhaps what's most incredible about it is how versatile human language is, even when it lacks so many dimensions as bits on a machine. Yet it's still cool that we can resurrect those bits at rest and transmogrify them back into coherent projections of photons from a screen.
I don't think LLMs are AGI or about to completely flip the world upside down or whatever, but it seems undeniably magical when you break it down.
Google specifically boast about their SVG performance in the announcement post: https://blog.google/innovation-and-ai/models-and-research/ge...
You can try any combination of animal on vehicle to confirm that they likely didn't target pelicans directly though.
Next time you host a party, have people try to draw a bicycle on your whiteboard (you have a whiteboard in your house, right? You should, anyway...)
Human adults are generally quite bad at drawing them, unless they spend a lot of time actually thinking about bicycles as objects.
They are, and it is very funny.
https://www.behance.net/gallery/35437979/Velocipedia
What’s your point? Yes, humans fail sometimes, as do AI models. Are you trying to imply that, in light of this, AI is now as capable as human beings? If so, that conclusion doesn’t follow logically.
And the left leg is straight while the right leg is bent.
EDIT: And the chain should pass behind the seat stay.
Cost per task has increased 4.2x, but their ARC-AGI-2 score went from 33.6% to 77.1%.
Cost per task is still significantly lower than Opus. Even Opus 4.5
https://arcprize.org/leaderboard
What is that, a snack in the basket?
"integrating a bicycle basket, complete with a fish for the pelican... also ensuring the basket is on top of the bike, and that the fish is correctly positioned with its head up... basket is orange, with a fish inside for fun."
How thoughtful of the AI to include a snack. Truly a "thanks for all the fish".
A pelican already has an integrated snack-holder, though. It wouldn't need to put it in the basket.
A fish for the road
The number of snacks in the basket is a random variable with a Poisson distribution.
Another great benchmark would be to convert a raster image of a logo into SVG. I've yet to find a good tool for this that produces accurate smooth lines.
What do you think this particular prompt is evaluating for?
The more popular these particular evals are, the more likely the model will be trained for them.
See https://simonwillison.net/2025/Nov/13/training-for-pelicans-...
Great pelican but what’s up with that fish in the basket?
It's a pelican. What do you expect a pelican to have in his bike's basket?
It's a pretty funny and coherent touch!
> What do you expect a pelican to have in his bike's basket?
Probably stuff it can't fit in its gullet, or doesn't want there (think trash). I wouldn't expect a pelican to stash fish there, that's for sure.
Yeah, why only _one_ fish?
It's obvious that the pelican is riding long distance; no way a single fish is sufficiently energy-dense for more than a few miles.
Can't the model do basic math???
Where else are cycling pelicans meant to keep their fish?
I get it, I just meant the fish is poorly done, when I'd have guessed it would be a relatively simple part. Maybe the black dot eye is misplaced, idk.
Wonder when we'll get something other than a side view.
Another Jeff Dean post about this model shows it writing programs that generate CAD objects. I suspect if you ask it to, it will create a CAD pelican on a CAD bicycle and even make joints so you can turn the pedals.
That would be especially challenging for vector output. I tried just now on ChatGPT 5.2 to jump straight to an image, with this prompt:
"make me a cartoon image of a pelican riding a bicycle, but make it from a front 3/4 view, that is riding toward the viewer."
The result was basically a head-on view, but I expect if you then put that back in and said, "take this image and vectorize it as an SVG" you'd have a much better time than trying to one-shot the SVG directly from a description.
... but of course, if that's so, then what's preventing the model from being smart enough to identify this workflow and follow it on its own to get the task completed?
Is there something in your prompt about hats? Why is the pelican always wearing a hat recently?!
At this point, I think maybe they're training on all of the previous pelicans, and one of them decided to put a hat on it?
Disclaimer: this is an unsubstantiated claim that I made up.
Not even animated? This is 2026.
Jeff Dean just posted an animated version: https://x.com/JeffDean/status/2024525132266688757
One underrated thing about the recent frontier models, IMO, is that they are obviating the need for image gen as a standalone thing. Opus 4.6 (and apparently 3.1 Pro as well) doesn't have the ability to generate images but it is so good at making SVG that it basically doesn't matter at this point. And the benefit of SVG is that it can be animated and interactive.
I find this fascinating because it literally just happened in the past few months. Up until ~summer of 2025, the SVG these models made was consistently buggy and crude. By December of 2025, I was able to get results like this from Opus 4.5 (Henry James: the RPG, made almost entirely with SVG): https://the-ambassadors.vercel.app
And now it looks like Gemini 3.1 Pro has vaulted past it.
That Ostrich Tho
You think they are able to see their output and iterate on it? Or is it pure token generation?
How about STL files for 3d printing pelicans!
Harder: the bike must work
Hardest: the pelican must work
I used the AI studio link and tried running it with the temperature set to 1.75: https://jsbin.com/locodaqovu/edit?html,output
I hope we keep beating this dead horse some more, I'm still not tired of it.