
Comment by simonw

5 days ago

That's definitely not the case here. The new o3-pro is slow - it took two minutes just to draw me an SVG of a pelican riding a bicycle. o3-preview was much faster than that.

https://simonwillison.net/2025/Jun/10/o3-pro/

Do you think a cycling pelican is still a valid cursory benchmark? By now surely discussions about it are in the training set.

There are quite a few on Google Image search.

On the other hand they still seem to struggle!

Wow! The pelican benchmark is now saturated.

  • Not until I can count the feathers, ask for a front view of the same pelican, then ask for it to be animated, all still using SVG.

  • I wonder how much of that is because it's getting more and more included in training data.

    We now need to start using walruses riding rickshaws.

Would you say this is the best cycling pelican to date? I don't remember any of the others looking better than this.

Of course by now it'll be in-distribution. Time for a new benchmark...

  • I love that we are in the timeline where we are somewhat seriously evaluating probably superhuman intelligence by its ability to draw an SVG of a cycling pelican.

    • I still remember my jaw hitting the floor when the first DALL-E paper came out, with the baby daikon radish walking a dog. How the actual fuck...? Now we're probably all too jaded to fully appreciate the next advance of that magnitude, whatever that turns out to be.

      E.g., the pelicans all look pretty cruddy, including this one, but the fact that they are being delivered in SVG is a bigger deal than the quality of the artwork itself, IMHO. This isn't a diffusion model, it's an autoregressive transformer imitating one. The wonder isn't that it's done badly, it's that it's happening at all.


    • I don't love that this is the conversation, and that when these models bake these silly scenarios in from training data, everyone goes "see, pelican bike! superhuman intelligence!"

      The point is never the pelican. The point is that if a thing has information about pelicans, and has information about bicycles, then why can't it combine those ideas? Is it because it's not intelligent?


This made me think of the 'draw a bike' experiment, where people were asked to draw a bike from memory and were surprisingly bad at recreating how the parts fit together in a sensible manner:

https://road.cc/content/blog/90885-science-cycology-can-you-...

ChatGPT seems to perform better than most, but with notable missing elements (where's the chain or the handlebars?). I'm not sure whether those are due to a lack of understanding or artistic liberties taken by the model.

Well, that might be more of a function of how long they let it 'reason' than anything intrinsic to the model?