Comment by simonw
4 hours ago
The pelican is excellent for a 16.8GB quantized local model: https://simonwillison.net/2026/Apr/22/qwen36-27b/
I ran it on an M5 Pro with 128GB of RAM, but it only needs ~20GB of that. I expect it will run OK on a 32GB machine.
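For anyone who wants to reproduce this, here's a minimal sketch of one way to run a quantized model locally on Apple silicon with mlx-lm. Both the library choice and the model ID are assumptions on my part, not necessarily the exact setup behind these numbers:

    # Minimal sketch: run a quantized model locally via mlx-lm on Apple silicon.
    # The model ID is a hypothetical placeholder; substitute whichever quant
    # you actually downloaded.
    from mlx_lm import load, generate

    model, tokenizer = load("mlx-community/Qwen3-27B-4bit")  # hypothetical ID
    generate(
        model,
        tokenizer,
        prompt="Generate an SVG of a pelican riding a bicycle",
        max_tokens=4096,
        verbose=True,  # prints token counts and tokens/s after generation
    )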
Performance numbers:
Reading: 20 tokens, 0.4s, 54.32 tokens/s
Generation: 4,444 tokens, 2min 53s, 25.57 tokens/s
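Those rates are just token count divided by wall time; a quick back-of-envelope check against the rounded timings above lands close to the reported figures:

    # Back-of-envelope check of the reported rates from the rounded wall times;
    # the small gaps vs. the tool's figures come from the rounding.
    reading_tps = 20 / 0.4                 # ~50.0 vs. reported 54.32 tokens/s
    generation_tps = 4444 / (2 * 60 + 53)  # ~25.7 vs. reported 25.57 tokens/s
    print(f"reading: {reading_tps:.1f} tok/s, generation: {generation_tps:.1f} tok/s")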
I like it better than the pelican I got from Opus 4.7 the other day: https://simonwillison.net/2026/Apr/16/qwen-beats-opus/
I feel like this time it really is in the training set, because it's too good to be true.
Can you run your other tests and see the difference?
It went pretty wild with "Generate an SVG of a NORTH VIRGINIA OPOSSUM ON AN E-SCOOTER":
https://gist.github.com/simonw/95735fe5e76e6fdf1753e6dcce360...
Compared to your test with GLM 5.1, this one does indeed look off:
https://xcancel.com/simonw/status/2041646779553476801
If they cook these in, I wonder what else was cooked in there to make it look good.
Everything is benchmaxxed. Whack-a-mole training against specific tests is at least as representative of what's getting added to models as genuine general advances.
at what point do model providers optimize for the "pelican riding a bicycle" test so they place well on Simon's influential benchmark? :-)
They almost certainly are, even if unknowingly, because HN and every blog get piped continuously into every model's training corpus.
See https://simonwillison.net/2025/Nov/13/training-for-pelicans-...
Metrics and toy examples can be gamed. Rather than these silly examples, how does it feel?
Can you replace Claude Code Opus or Codex with this?
Does it feel >80% as good on "real world" tasks you do on a day-to-day basis?
These are the stupidest things to cleave to.