Comment by simonw

5 hours ago

The bicycle frame is a bit wonky but the pelican itself is great: https://gist.github.com/simonw/a6806ce41b4c721e240a4548ecdbe...

Would love to find out they're overfitting for pelican drawings.

One aspect of this is that apparently most people can't draw a bicycle much better than this: they get the elements of the frame wrong, mess up the geometry, etc.

If we do get paperclipped, I hope it is of the "cycling pelican" variety. Thanks for your important contribution to alignment, Simon!

Do you find that word choices like "generate" (as opposed to "create", "author", "write" etc.) influence the model's success?

Also, is it bad that I almost immediately noticed that both of the pelican's legs are on the same side of the bicycle, but I had to look up an image on Wikipedia to confirm that they shouldn't have long necks?

Also, have you tried iterating prompts on this test to see if you can get more realistic results? (How much does it help to make them look up reference images first?)

  • I've stuck with "Generate an SVG of a pelican riding a bicycle" because it's the same prompt I've been using for over a year now and I want results that are sort-of comparable to each other.

    I think when I first tried this I iterated a few times to get to something that reliably output SVG, but honestly I didn't keep the notes I should have.
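
    For anyone wanting to reproduce the run, a minimal harness looks roughly like this (a sketch only, using the Anthropic Python client; the model name and the way the SVG is extracted are assumptions, not the actual benchmark code):

    ```python
    # Minimal sketch: send the benchmark prompt once and save whatever SVG comes back.
    # Assumptions: model name, max_tokens, and pulling the SVG out with a regex.
    import re

    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    PROMPT = "Generate an SVG of a pelican riding a bicycle"

    message = client.messages.create(
        model="claude-opus-4-5",  # hypothetical choice; use whichever model you're testing
        max_tokens=4096,
        messages=[{"role": "user", "content": PROMPT}],
    )

    reply = message.content[0].text

    # Models often wrap the SVG in prose or a code fence, so grab the <svg>...</svg> span.
    match = re.search(r"<svg.*?</svg>", reply, re.DOTALL)
    if match:
        with open("pelican.svg", "w") as f:
            f.write(match.group(0))
    else:
        print("No SVG found in the response:\n", reply)
    ```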

Isn't there a point at which it trains itself on these various outputs, or someone somewhere draws one and feeds it into the model so as to pass this benchmark?

This benchmark inspired me to have codex/claude build a DnD battlemap tool with SVGs.

They got surprisingly far, but I did need to iterate a few times to have it build tools that would check for things like: don't put walls on roads or water.
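
A sketch of the kind of check I mean, assuming the battlemap SVG lays tiles out as <rect> elements with a class of "wall", "road", or "water" (that schema and the file name are made up purely for illustration, not what the actual tool does):

```python
# Hypothetical validator: flag wall tiles that sit on road or water tiles.
# Assumes the SVG declares the standard SVG namespace and places tiles as
# <rect class="wall|road|water" x="..." y="..."> on a shared grid.
import xml.etree.ElementTree as ET

SVG_NS = "{http://www.w3.org/2000/svg}"


def tile_positions(root, cls):
    """Collect (x, y) positions of rects whose class attribute matches cls."""
    return {
        (rect.get("x", "0"), rect.get("y", "0"))
        for rect in root.iter(f"{SVG_NS}rect")
        if rect.get("class") == cls
    }


def check_walls(svg_path):
    root = ET.parse(svg_path).getroot()
    walls = tile_positions(root, "wall")
    blocked = tile_positions(root, "road") | tile_positions(root, "water")
    overlaps = walls & blocked
    for x, y in sorted(overlaps):
        print(f"wall overlaps a road/water tile at ({x}, {y})")
    return not overlaps


if __name__ == "__main__":
    check_walls("battlemap.svg")  # hypothetical file name
```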

What I think might be the next obstacle is self-knowledge. The new agents seem to have picked up ever more vocabulary about their context and compaction, etc.

As a next benchmark, you could try having one agent use a coding agent (via tmux) to build you a pelican.

There's no way they actually work on training this.

  • I suspect they're training on this.

    I asked Opus 4.6 for a pelican riding a recumbent bicycle and got this.

    https://i.imgur.com/UvlEBs8.png

    • It would be way, way better if they were benchmaxxing this. The pelican in the image (both images) has arms. Pelicans don't have arms, and a pelican riding a bike would use its wings.

    • Interesting that it seems better. Maybe adding a highly specific yet unusual qualifier focuses the model's attention?

  • The people who work at Anthropic are aware of simonw and his test, and people aren't unthinking data-driven machines. However valid his test is or isn't, a better score on it is convincing. If it gets, say, 1,000 people to use Claude Code over Codex, how much would that be worth to Anthropic?

    $200 * 1,000 = $200k/month.

    I'm not saying they are, but to say with such certainty that they aren't, when money is on the line, seems like a questionable conclusion; unless you have some insider knowledge you'd like to share with the rest of the class.

I suppose the pelican must now be specifically trained for, since it's a well-known benchmark.

Can it draw a different bird on a bike?

[flagged]

  • I'll bite. The benchmark is actually pretty good. It shows in an extremely comprehensible way how far LLMs have come. Someone not in the know has a hard time understanding what 65.4% means on "Terminal-Bench 2.0". Comparing some crappy pelicans on bicycles is a lot easier.

    • It ceases to be a useful benchmark of general ability when you post it publicly for them to train against.

  • The field is advancing so fast that it's hard to do real science, as there will be a new SOTA by the time you're ready to publish results. I think this is a combination of that and people having a laugh.

    Would you mind sharing which benchmarks you think are useful measures for multimodal reasoning?

    • A benchmark only tests what the benchmark is doing; the goal is to make that task correlate with actually valuable things. Graphics benchmarks are a good example: it's extremely hard to know what you will get in a game by looking at 3DMark scores, and it varies by a lot. Making an SVG of a single thing doesn't help much unless that ability applies to all SVG tasks.