
Comment by throwaw12

10 hours ago

I feel like this time it really is in the training set, because the result is too good to be true.

Can you run your other tests and see the difference?

It went pretty wild with "Generate an SVG of a NORTH VIRGINIA OPOSSUM ON AN E-SCOOTER":

https://gist.github.com/simonw/95735fe5e76e6fdf1753e6dcce360...

  • Compared to your test with GLM 5.1, this one does look off:

    https://xcancel.com/simonw/status/2041646779553476801

    • Yeah GLM 5.1 did an outstanding job on the possum - better than Opus 4.7 or GPT-5.4 and I think better than Gemini 3.1 Pro too.

      But GLM 5.1 is a 1.51TB model, while the Qwen 3.6 I used here was 17GB - roughly 1/88 the size.


    • Hoping this doesn't turn into a pelican-SVG back-and-forth: yesterday's GPT Image 2 thread ended up as three screenfuls of "I tried the prompt too" replies, with nothing about the model itself until you scrolled past them. I appreciate the testing, and I know this sounds like fun police, but there's a pattern where a well-known commenter plus a one-off vibe test plus 1:1 sub-threads eats the whole discussion. The fact that it's fun makes it hard to push back on without looking picky.


I think at this point we can safely file the pelican test under Goodhart's law: once it became a target, it stopped being a useful measure.

If I were them I'd run such requests through a diffusion model, and then try to distill an SVG out of that.

If they cook these in, I wonder what else was cooked in there to make it look good.

  • Everything is benchmaxxed. Whack-a-mole training on known tests accounts for at least as much of what gets added to models as genuinely general training advances.