Comment by simonw
7 hours ago
The pelican is excellent for a 16.8GB quantized local model: https://simonwillison.net/2026/Apr/22/qwen36-27b/
I ran it on an M5 Pro with 128GB of RAM, but it only needs ~20GB of that. I expect it will run OK on a 32GB machine.
Performance numbers:
Reading: 20 tokens, 0.4s, 54.32 tokens/s
Generation: 4,444 tokens, 2min 53s, 25.57 tokens/s
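As a quick sanity check on the figures above (a sketch of my own; the helper name is made up, not from any tool used in the original run), the reported durations and throughputs agree once rounding is accounted for:

```python
def implied_duration(tokens: int, tokens_per_sec: float) -> float:
    """Seconds implied by a token count and a reported throughput."""
    return tokens / tokens_per_sec

# Reading: 20 tokens at 54.32 tokens/s -> ~0.37 s (reported as 0.4 s)
print(round(implied_duration(20, 54.32), 2))

# Generation: 4,444 tokens at 25.57 tokens/s -> ~173.8 s,
# i.e. about 2 min 53.8 s (reported as 2 min 53 s)
print(round(implied_duration(4444, 25.57), 1))
```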
I like it better than the pelican I got from Opus 4.7 the other day: https://simonwillison.net/2026/Apr/16/qwen-beats-opus/
I feel like this time it is indeed in the training set, because it is too good to be true.
Can you run your other tests and see the difference?
It went pretty wild with "Generate an SVG of a NORTH VIRGINIA OPOSSUM ON AN E-SCOOTER":
https://gist.github.com/simonw/95735fe5e76e6fdf1753e6dcce360...
Compared to your test with GLM 5.1, this one does indeed look off:
https://xcancel.com/simonw/status/2041646779553476801
I think at this point we can safely put the pelican test in the category of Goodhart's law.
If I were them I'd run such requests through a diffusion model, and then try to distill an SVG out of that.
If they cook these in, I wonder what else was cooked in to make it look good.
Everything is benchmaxxed. Whack-a-mole training is at least as representative of what is getting added to models as more general training advances.
at what point do model providers optimize for the "pelican riding a bicycle" test so they place well on Simon's influential benchmark? :-)
They almost certainly are, even if unknowingly, because HN and all blogs get piped continuously into all models' training corpus.
See https://simonwillison.net/2025/Nov/13/training-for-pelicans-...
Why is the assumption that they trained for a pelican on a bicycle, rather than running RL for all kinds of 'generate an SVG' tasks?
I don’t think I’ve ever heard you say “excellent” for the pelican test. It does look excellent indeed!
The trend went toward MoE models for a while, and this time around it’s a dense model again. I wonder if closed models are also following this pattern: MoE for the faster ones and dense for the pro models.
These are the stupidest things to cleave to.
Metrics and toy examples can be gamed. Rather than these silly examples, how does it feel?
Can you replace Claude Code Opus or Codex with this?
Does it feel >80% as good on "real world" tasks you do on a day-to-day basis?