
Comment by throwaw12

10 hours ago

I feel like this time it really is in the training set, because the result is too good to be true.

Can you run your other tests and see the difference?

It went pretty wild with "Generate an SVG of a NORTH VIRGINIA OPOSSUM ON AN E-SCOOTER":

https://gist.github.com/simonw/95735fe5e76e6fdf1753e6dcce360...

  • Compared to your test with GLM 5.1, this one does look off:

    https://xcancel.com/simonw/status/2041646779553476801

    • Yeah GLM 5.1 did an outstanding job on the possum - better than Opus 4.7 or GPT-5.4 and I think better than Gemini 3.1 Pro too.

      But GLM 5.1 is a 1.51TB model, while the Qwen 3.6 I used here was 17GB - roughly 1/88 the size.


    • Hoping this doesn't turn into a pelican-SVG back-and-forth: yesterday's GPT Image 2 thread ended up as three screenfuls of "I tried the prompt too" replies, with nothing about the model itself until you scrolled past them. I appreciate the testing, and I know this sounds like fun police, but there's a pattern where a well-known commenter plus a one-off vibe test plus 1:1 sub-threads eats the whole discussion. The fact that it's fun makes it hard to push back on without looking picky.


I think at this point we can safely file the pelican test under Goodhart's law: once it became a target, it stopped being a useful measure.

If I were them I'd run such requests through a diffusion model, and then try to distill an SVG out of that.

If they cook these in, I wonder what else was cooked in there to make it look good.

  • Everything is benchmaxxed. Whack-a-mole training on known tests accounts for at least as much of what gets added to models as genuinely general training advances.