Comment by simonw

7 days ago

The pelican riding a bicycle is excellent. I think it's the best I've seen.

https://simonwillison.net/2026/Feb/12/gemini-3-deep-think/

58 comments

simonw

So, you've said multiple times in the past that you're not concerned about AI labs training for this specific test because if they did, it would be so obviously incongruous that you'd easily spot the manipulation and call them out.

Which tbh has never really sat right with me, seemingly placing way too much confidence in your ability to differentiate organic vs. manipulated output in a way I don't think any human could be expected to.

To me, this example is an extremely neat and professional SVG and so far ahead it almost seems too good to be true. But like with every previous model, you don't seem to have the slightest amount of skepticism in your review. I don't think I truly believe Google cheated here, but it's so good it does therefore make me question whether there could ever be an example of a pelican SVG in the future that actually could trigger your BS detector?

I know you say it's just a fun/dumb benchmark that's not super important, but you're easily in the top 3 most well known AI "influencers" whose opinion/reviews about model releases carry a lot of weight, providing a lot of incentive with trillions of dollars flying around. Are you still not at all concerned by the amount of attention this benchmark receives now/your risk of unwittingly being manipulated?

simonw 7 days ago
The other SVGs I tried from my private collection of prompts were all similarly impressive.
- buttered_toast 7 days ago
  
  Is there a way you can showcase a few of these?
  

tasuki 7 days ago

Tbh they'd have to be absolutely useless at benchmarkmaxxing if they didn't include your pelican riding a bicycle...

steve_adams_86 7 days ago

We've reached PGI

zozbot234 7 days ago

This benchmark outcome is actually really impressive given the difficulty of this task. It shows that this particular model manages to "think" coherently and maintain useful information in its context for what has to be an insane overall amount of tokens, likely across parallel "thinking" chains. It likely also has access to SVG-rendering tools and can "see" and iterate on the result via multimodal input.

mikestaas 7 days ago

Wow. I wonder how it would do with pure CSS a la https://diana-adrianne.com/

ramesh31 7 days ago

>"The pelican riding a bicycle is excellent. I think it's the best I've seen. https://simonwillison.net/2026/Feb/12/gemini-3-deep-think/"

Yeah this is nuts. First real step-change we've seen since Claude 3.5 in '24.

nickthegreek 7 days ago

I routinely check out the pelicans you post and I do agree, this is the best yet. It seemed to me that the wings/arms were such a big hangup for these generators.

enraged_camel 7 days ago

Is there a list of these for each model that you've catalogued somewhere?

simonw 7 days ago

At the moment that's mostly my tag page here but I really need to formalize it: https://simonwillison.net/tags/pelican-riding-a-bicycle/

Manabu-eo 7 days ago

How likely is it that this problem is already in the training set by now?

simonw 7 days ago
If anyone trains a model on https://simonwillison.net/tags/pelican-riding-a-bicycle/ they're going to get some VERY weird looking pelicans.
- suddenlybananas 7 days ago
  
  Why would they train on that? Why not just hire someone to make a few examples.
  
throwup238 7 days ago
For every combination of animal and vehicle? Very unlikely.
The beauty of this benchmark is that it takes all of two seconds to come up with your own unique one. A seahorse on a unicycle. A platypus flying a glider. A man’o’war piloting a Portuguese man of war. Whatever you want.
- recursive 7 days ago
  
  No, not every combination. The question is about the specific combination of a pelican on a bicycle. It might be easy to come up with another test, but we're looking at the results from a particular one here.
  
verdverm 7 days ago

I've heard it posited that the reason the frontier companies are frontier is that they have custom data and evals. This is what I would do too.
zarzavat 7 days ago

You can always ask for a tyrannosaurus driving a tank.

throwup238 7 days ago

The reflection of the sun in the water is completely wrong. LLMs are still useless. (/s)

margalabargala 7 days ago
It's not, actually. Look up some photos of the sun setting over the ocean. Here's an example:
https://stockcake.com/i/sunset-over-ocean_1317824_81961
- throwup238 7 days ago
  
  That’s only if the sun is above the horizon entirely.
  

dfdsf2 7 days ago

Highly disagree.

I was expecting something more realistic... the true test of what you are doing is how representative the thing is in relation to the real world. E.g. does the pelican look like a pelican as it exists in reality? This cartoon stuff is cute but doesn't pass muster in my view.

If it doesn't relate to the real world, then it most likely will have no real effect on the real economy. Pure and simple.

chriswarbo 7 days ago

I disagree. The task asks for an SVG, which is a vector format associated with line drawings, clipart and cartoons. I think it's good that models are picking up on that context.
In contrast, the only "realistic" SVGs I've seen are created using tools like potrace, and look terrible.
I also think the prompt itself, of a pelican on a bicycle, is unrealistic and cartoonish, so making a cartoon is a good way to solve the task.
peaseagee 7 days ago

The request is for an SVG, generally _not_ the format for photorealistic images. If you want to start your own benchmark, feel free to ask for a photorealistic JPEG or PNG of a pelican riding a bicycle. Could be interesting to compare and contrast, honestly.

saberience 7 days ago

Do you have to still keep trying to bang on about this relentlessly?

It was sort of humorous for the maybe first 2 iterations, now it's tacky, cheesy, and just relentless self-promotion.

Again, like I said before, it's also a terrible benchmark.

jeanloolz 7 days ago

I'll agree to disagree. In any thread about a new model, I personally expect the pelican comment to be out there. It's informative, ritualistic and frankly fun. Your comment, however, is a little harsh. Why mad?
odiroot 7 days ago

It's HN's Carthago delenda est moment.
simonw 7 days ago

It being a terrible benchmark is the bit.
Davidzheng 7 days ago

Eh, I find it more of a not very informative but lighthearted commentary.

deron12 7 days ago

It's worth noting that you mean excellent in terms of prior AI output. I'm pretty sure this wouldn't be considered excellent from a "human made art" perspective. In other words, it's still got a ways to go!

Edit: someone needs to explain why this comment is getting downvoted, because I don't understand. Did someone's ego get hurt, or what?

gs17 7 days ago
It depends. If you meant a human coding an SVG "manually" the same way, I'd still say this is excellent (minus the reflection issue). If you meant a human using a proper vector editor, then yeah.
- fvdessen 7 days ago
  
  Maybe you're a pro vector artist, but I couldn't create such a cool one myself in Illustrator tbh.
dfdsf2 7 days ago

Indeed. And when you factor in the amount invested... yeah, it looks less impressive. The question is how much more money needs to be invested to get this thing closer to reality? And not just in this instance, but in any instance, e.g. a seahorse on a bike.