Comment by simonw

2 days ago

Pretty great pelican: https://simonwillison.net/2026/Feb/19/gemini-31-pro/ - it took over 5 minutes, though I think that's because they're having performance teething problems on launch day.

It's an excellent demonstration of the main issue I have with the Gemini family of models: they always go "above and beyond" and do a lot of extra stuff, even if I explicitly prompt against it. In this case, most of the SVG ends up consisting not just of a bike and a pelican but of clouds, a sun, a hat on the pelican, and so much more.

Exactly the same thing happens when you code: it's almost impossible to get Gemini to not do "helpful" drive-by-refactors, and it keeps adding code comments no matter what I say. A very frustrating experience overall.

  • > it's almost impossible to get Gemini to not do "helpful" drive-by-refactors

    Just asking "Explain what this service does?" turns into

    [No response for three minutes...]

    +729 -522

  • Would be really interesting to see an "Eager McBeaver" bench around this concept. When doing real work, a model's ability to stay within the bounds of a given task has almost become more important than its raw capabilities now that every frontier model is so dang good.

    Every one of these models is so great at propelling the ship forward that I increasingly care about which models are the easiest to steer in the direction I actually want to go.

    • Being TOO steerable is another issue, though.

      Codex is steerable to a fault, and will gladly "monkey's paw" your requests.

      Claude Opus will ignore your instructions and do what it thinks is "right" and just barrel forward.

      Both are bad, and both paper over the actual issue: these models don't really have the ability to selectively choose their behavior per situation (i.e. ask for follow-up where needed, ignore the user where needed, follow instructions where needed). Behavior is largely global.

      2 replies →

    • For sure. I imagine it'd be pretty difficult to evaluate the "correct" amount of steerability. You'd probably just have to measure the delta in eagerness on the same task between a highly specified prompt and a more open-ended one. Probably not dissimilar from how artificialanalysis.ai does their "omniscience index".

  • I have the same issue. Even when I ask it to do code reviews and very explicitly tell it not to change files, it will occasionally just start "fixing" things.

    • I find Copilot leans the other way. It'll myopically focus its work in the exact function I point it at, even when it's clear that adding a new helper would be a logical abstraction to share behaviour with the function right beside it.

      Overall, I think it's probably better that it stay focused, and allow me to prompt it with "hey, go ahead and refactor these two functions" rather than the other way around. At the same time, really the ideal would be to have it proactively ask, or even pitch the refactor as a colleague would, like "based on what I see of this function, it would make most sense to XYZ, do you think that makes sense? <sure go ahead> <no just keep it a minimal change>"

      Or perhaps even better, simply pursue both changes in parallel and present them as A/B options for the human reviewer to select between.

  • > it's almost impossible to get Gemini to not do "helpful" drive-by-refactors

    This has not been my experience. I primarily write Elixir, and Gemini has helped me build some really cool products and pull off massive refactors. It would even pick up security issues and potential optimizations along the way.

    What HAS been a constant issue, though, is that the model will randomly not respond at all and some random error will occur, which is embarrassing for a company like Google given the infrastructure they own.

    • Out of curiosity, do you have any public projects (with public source code) you've made exclusively with Gemini, so one could take a look? I've tried a bunch of times to use Gemini to finish at least something small, but I always end up frustrated enough to abort, as the instruction-following seems so bad.

      1 reply →

  • Asking LLMs to "not do the thing" often results in them tripping and generating output that includes that "thing", since those are exactly the tokens that will enter the input. I always try to rephrase queries so that all my instructions take only "positive" forms: "do only this", "do it only in that way", "do it only for the parameters requested", etc. I can't say whether it helps much, but it is possible.
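
    A minimal sketch of what that rewrite looks like in practice (the prompt wording here is hypothetical, just to illustrate the idea):

        # Negative form: the forbidden tokens still land in the model's input.
        negative = "Don't add code comments. Don't refactor unrelated code."

        # Positive form: state only the permitted behavior.
        positive = ("Output the requested function as bare code, "
                    "and modify only the lines named in the task.")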

  • I was using Gemini Antigravity in opencode a few weeks ago, before they started banning everyone for that, and I got into the habit of writing "do x, then wait for instructions".

    That helped quite a bit, but it would still go off on its own from time to time.

  • This matches my experience using Gemini CLI to code. It would also frequently get stuck in loops. It was so bad compared to Codex that I feel like I must have been doing something fundamentally wrong.

  • > it's almost impossible to get Gemini to not do "helpful" drive-by-refactors

    Not like human programmers. I would never do this and have never struggled with it in the past, no...

    • A fairer comparison would be against other models, which are typically better at instruction following. You say "don't change anything not explicitly mentioned" or "don't add any new code comments" and they tend to follow that.

  • Every time I have tried using `gemini-cli` it just thinks endlessly and never actually gives a response.

  • Do you have Personalization Instructions set up for your LLM models?

    You can make their responses fairly dry/brief.

    • I'm mostly using them via my own harnesses, so I have full control of the system prompts and so on. And no matter what I try, Gemini keeps "helpfully" adding code comments every now and then. With every other model, "- Don't add code comments" tends to be enough, but with Gemini I'm not sure how I could stop the comments from eventually appearing.

      5 replies →

  • True: whenever I ask Gemini to help me with a prompt for generating an image of XYZ, it generates the image.

What's crazy is you've influenced them to spend real effort ensuring their model is good at generating animated SVGs of animals operating vehicles.

The most absurd benchmaxxing.

https://x.com/jeffdean/status/2024525132266688757?s=46&t=ZjF...

  • I like how they also did a frog on a penny-farthing and a giraffe driving a tiny car and an ostrich on roller skates and a turtle kickflipping a skateboard and a dachshund driving a stretch limousine.

  • Animated SVG is huge. People in different professions worry to different degrees about being replaced by ML, but for digital art this is a big deal.

    • Yeah, complex SVGs are so much more bandwidth-, computation- and energy-efficient than raster images - up to a point! But in general use we are not at that point, and there's so much more we can do with them.

      I've been meaning to let coding agents take a stab at using the lottie library https://github.com/airbnb/lottie-web to supercharge the user experience without needing to make it a full-time job.

  • So let's put things we're interested in in the benchmarks.

    I'm not against pelicans!

    • I think the reason the pelican example is great is that it's bizarre enough that it's unlikely to appear in the training data as one unified picture.

      If we picked something more common - say, a hot dog with toppings - then training contamination would be much harder to control.

      2 replies →

  • You don't have to benchmaxx everything, just the benchmarks in the right social circles.

  • It is funny to think that Jeff Dean personally worked to optimize the pelican-riding-a-bike benchmark.

Does anyone understand why LLMs have gotten so good at this? Their ability to generate accurate SVG shapes seems to greatly outshine what I would expect, given their mediocre spatial understanding in other contexts.

  • A few thoughts:

    - One thing to be aware of is that LLMs can be much smarter than their ability to articulate that intelligence in words. For example, GPT-3.5 Turbo was beastly at chess (1800 Elo?) when prompted to complete PGN transcripts, but if you asked it questions in chat, its knowledge was abysmal. LLMs don't generalize as well as humans, and sometimes they can do tasks without being able to articulate things that feel essential to those tasks (like answering whether the bicycle is facing left or right).

    - Secondly, what has made AI labs so bullish on future progress over the past few years is seeing how little work it takes to get results. Often, if an LLM sucks at something, that's because no one worked on it (not always, of course). If you directly train a skill, you can see giant leaps in ability with fairly small effort. Big leaps in SVG creation could be coming from relatively small targeted efforts where none existed before.

    • We’re literally at the point where trillions of dollars have been invested in these things and the surrounding harnesses and architecture, and they still can’t do economically useful work on their own. You’re way too bullish here.

      1 reply →

  • My best guess is that the labs put a lot of work into HTML and CSS spatial stuff because web frontend is such an important application of the models, and those improvements leaked through to SVG as well.

  • All models have improved, but from my understanding, Gemini is the main one that was specifically trained on photos/video/etc. in addition to text. Other models, like earlier ChatGPT builds, would use plugins to handle anything beyond text, such as a plugin to convert an image into text so that ChatGPT could "see" it.

    Gemini was multimodal from the start, and is naturally better at doing tasks that involve pictures/videos/3d spatial logic/etc.

    The newer ChatGPT models are also now multimodal, which has probably helped with their SVG art as well, but I think Gemini still has an edge here.

Models are soon going to start benchmaxxing the generation of SVGs of pelicans on bikes.

  • That’s Simon’s goal. “All I’ve ever wanted from life is a genuinely great SVG vector illustration of a pelican riding a bicycle. My dastardly multi-year plan is to trick multiple AI labs into investing vast resources to cheat at my benchmark until I get one.”

    https://simonwillison.net/2025/Nov/13/training-for-pelicans-...

    • So once that's achieved, I wonder how well it deals with unexpected variations. E.g.

      "Give me an illustration of a bicycle riding by a pelican"

      "Give me an illustration of a bicycle riding over a pelican"

      "Give me an illustration of a bicycle riding under a flying pelican"

      So on and so forth. Or will it start to look like the Studio C sketch about Lobster Bisque: https://youtu.be/A2KCGQhVRTE

  • Soon? I'd be willing to bet it's been included in the training set for at least 6 months by now. Not so obviously that it always generates perfect pelicans on bikes, but enough for the "minibench" to be less useful today than in the past.

    • If only there were some way to test it, like swapping the two nouns in the sentence. Alas.

  • Simon's been doing this exact test for nearly 18 months now; if vendors want to benchmaxx it, they've had more than enough time to do so already.

  • Forget the paperclip maximizer - AGI will turn the whole world into pelicans on bikes.

Less pretty and more practical, it's really good at outputting circuit designs as SVG schematics.

https://www.svgviewer.dev/s/dEdbH8Sw

  • I don't know which part of this was the prompt and which was the output, but that's a pretty bad schematic (for both aesthetic and circuit-design reasons).

    • The prompts were doing the design - reference voltage, hysteresis, output stage, all the maths - and then the SVG came from asking the model to take all that plus the current BOM and make an SVG schematic of it. In the past, models would just output totally incoherent messes of lines and shapes.

      I did a larger circuit too that this is part of, but it's not really for sharing online.

  • That's pretty amazing for an LLM, but as an EE, if my intern did this I would sigh inwardly and pull up some existing schematics for some brief guidance on symbol layout.

Ugh, the gears and chain don't mesh and there's no sprocket on the rear hub

But seriously, I can't believe LLMs are able to one-shot a pelican on a bicycle this well. Six years ago I wouldn't have guessed this would emerge as an LLM capability. I see why it does now, but... it still amazes me that they're so good at some things.

  • Is this capability “emergent”, or do AI firms specifically target SVG generation in order to improve it? How would we be able to tell?

    • I asked myself the same thing as I typed that comment, and I'm not sure what the answer is. I don't think models are specifically trained on this (though of course they're trained on how to generate SVGs in general), but I'm prepared to be wrong.

      I have a feeling the most 'emergent' aspect was that LLMs have generally been able to produce coherent SVG for quite a while, likely without specific training at first. Since then I suspect there has been more tailored training because improvements have been so dramatic. Of course it makes sense that text-based images using very distinct structure and properties could be manipulated reasonably well by a text-based language model, but it's still fascinating to me just how well it can work.

      Perhaps what's most incredible about it is how versatile human language is, even though, as bits on a machine, it lacks so many dimensions. Yet it's still cool that we can resurrect those bits at rest and transmogrify them back into coherent projections of photons from a screen.

      I don't think LLMs are AGI or about to completely flip the world upside down or whatever, but it seems undeniably magical when you break it down.

  • Next time you host a party, have people try to draw a bicycle on your whiteboard (you have a whiteboard in your house, right? You should, anyway...)

    Human adults are generally quite bad at drawing them, unless they spend a lot of time actually thinking about bicycles as objects.

  • And the left leg is straight while the right leg is bent.

    EDIT: And the chain should pass behind the seat stay.

What is that, a snack in the basket?

  • "integrating a bicycle basket, complete with a fish for the pelican... also ensuring the basket is on top of the bike, and that the fish is correctly positioned with its head up... basket is orange, with a fish inside for fun."

    how thoughtful of the ai to include a snack. truly a "thanks for all the fish"

  • The number of snacks in the basket is a random variable with a Poisson distribution.

Another great benchmark would be to convert a raster image of a logo into SVG. I've yet to find a good tool for this that produces accurate smooth lines.

Great pelican but what’s up with that fish in the basket?

  • It's a pelican. What do you expect a pelican to have in his bike's basket?

    It's a pretty funny and coherent touch!

    • > What do you expect a pelican to have in his bike's basket?

      Probably stuff it cannot fit in the gullet, or doesn't want there (think trash). I wouldn't expect a pelican to stash fish there, that's for sure.

      3 replies →

  • Yeah, why only _one_ fish?

    It's obvious the pelican is riding long distance; no way a single fish is sufficiently energy-dense for more than a few miles.

    Can't the model do basic math???

  • Where else are cycling pelicans meant to keep their fish?

    • I get it, I just meant the fish is poorly done, when I'd have guessed it would be a relatively simple part. Maybe the black-dot eye is misplaced, idk.

Wonder when we'll get something other than a side view.

  • Another Jeff Dean post about this model shows it writing programs that generate CAD objects. I suspect if you ask it to, it will create a CAD pelican on a CAD bicycle and even make joints so you can turn the pedals.

  • That would be especially challenging for vector output. I tried just now on ChatGPT 5.2 to jump straight to an image, with this prompt:

    "make me a cartoon image of a pelican riding a bicycle, but make it from a front 3/4 view, that is riding toward the viewer."

    The result was basically a head-on view, but I expect if you then put that back in and said, "take this image and vectorize it as an SVG" you'd have a much better time than trying to one-shot the SVG directly from a description.

    ... but of course, if that's so, then what's preventing the model from being smart enough to identify this workflow and follow it on its own to get the task completed?
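
    For what it's worth, a rough sketch of scripting that two-step workflow with the OpenAI Python client (the model names are placeholders, and I haven't verified this end to end):

        from openai import OpenAI

        client = OpenAI()

        # Step 1: generate a raster image from the description.
        img = client.images.generate(
            model="gpt-image-1",  # placeholder: any image-generation model
            prompt="a cartoon pelican riding a bicycle, front 3/4 view, "
                   "riding toward the viewer",
        )
        image_b64 = img.data[0].b64_json

        # Step 2: hand the raster back and ask for an SVG tracing.
        resp = client.chat.completions.create(
            model="gpt-4o",  # placeholder: any vision-capable model
            messages=[{
                "role": "user",
                "content": [
                    {"type": "text",
                     "text": "Take this image and vectorize it as an SVG."},
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
                ],
            }],
        )
        print(resp.choices[0].message.content)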

Is there something in your prompt about hats? Why is the pelican always wearing a hat recently?!

  • At this point, I think maybe they're training on all of the previous pelicans, and one of them decided to put a hat on it?

    Disclaimer: This is an unsubstantiated claim that I made up.

Not even animated? This is 2026.

  • Jeff Dean just posted an animated version: https://x.com/JeffDean/status/2024525132266688757

    • One underrated thing about the recent frontier models, IMO, is that they're obviating the need for image gen as a standalone capability. Opus 4.6 (and apparently 3.1 Pro as well) can't generate images, but it's so good at making SVG that it basically doesn't matter at this point. And the benefit of SVG is that it can be animated and interactive.

      I find this fascinating because it literally just happened in the past few months. Up until ~summer of 2025, the SVG these models made was consistently buggy and crude. By December of 2025, I was able to get results like this from Opus 4.5 (Henry James: the RPG, made almost entirely with SVG): https://the-ambassadors.vercel.app

      And now it looks like Gemini 3.1 Pro has vaulted past it.

      4 replies →

You think they are able to see their output and iterate on it? Or is it pure token generation?