Comment by simonw
4 hours ago
The pelican is excellent for a 16.8GB quantized local model: https://simonwillison.net/2026/Apr/22/qwen36-27b/
I ran it on an M5 Pro with 128GB of RAM, but it only needs ~20GB of that. I expect it will run OK on a 32GB machine.
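For anyone who wants to reproduce this, here's a minimal sketch of one way to run a quantized model locally on Apple silicon with mlx-lm. Both the library choice and the model ID are assumptions on my part, not necessarily the exact setup behind these numbers:

    # Minimal sketch: run a quantized model locally via mlx-lm on Apple silicon.
    # The model ID is a hypothetical placeholder; substitute whichever quant
    # you actually downloaded.
    from mlx_lm import load, generate

    model, tokenizer = load("mlx-community/Qwen3-27B-4bit")  # hypothetical ID
    generate(
        model,
        tokenizer,
        prompt="Generate an SVG of a pelican riding a bicycle",
        max_tokens=4096,
        verbose=True,  # prints token counts and tokens/s after generation
    )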
Performance numbers:
Reading: 20 tokens, 0.4s, 54.32 tokens/s
Generation: 4,444 tokens, 2min 53s, 25.57 tokens/s
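Those rates are just token count divided by wall time; a quick back-of-envelope check against the rounded timings above lands close to the reported figures:

    # Back-of-envelope check of the reported rates from the rounded wall times;
    # the small gaps vs. the tool's figures come from the rounding.
    reading_tps = 20 / 0.4                 # ~50.0 vs. reported 54.32 tokens/s
    generation_tps = 4444 / (2 * 60 + 53)  # ~25.7 vs. reported 25.57 tokens/s
    print(f"reading: {reading_tps:.1f} tok/s, generation: {generation_tps:.1f} tok/s")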
I like it better than the pelican I got from Opus 4.7 the other day: https://simonwillison.net/2026/Apr/16/qwen-beats-opus/
I feel like this time it really is in the training set, because it's too good to be true.
Can you run your other tests and see the difference?
It went pretty wild with "Generate an SVG of a NORTH VIRGINIA OPOSSUM ON AN E-SCOOTER":
https://gist.github.com/simonw/95735fe5e76e6fdf1753e6dcce360...
Compared to your test with GLM 5.1, this one does indeed look off:
https://xcancel.com/simonw/status/2041646779553476801
If they cook these in, I wonder what else was cooked in there to make it look good.
Everything is benchmaxxed. Whack-a-mole training against specific tests is at least as representative of what's getting added to models as genuine general advances.
at what point do model providers optimize for the "pelican riding a bicycle" test so they place well on Simon's influential benchmark? :-)
They almost certainly are, even if unknowingly, because HN and every blog get piped continuously into every model's training corpus.
See https://simonwillison.net/2025/Nov/13/training-for-pelicans-...
Metrics and toy examples can be gamed. Rather than these silly examples, how does it feel?
Can you replace Claude Code Opus or Codex with this?
Does it feel >80% as good on "real world" tasks you do on a day-to-day basis?
These are the stupidest things to cleave to.