Comment by simonw
7 hours ago
The pelican is excellent for a 16.8GB quantized local model: https://simonwillison.net/2026/Apr/22/qwen36-27b/
I ran it on an M5 Pro with 128GB of RAM, but it only needs ~20GB of that. I expect it will run OK on a 32GB machine.
Performance numbers:
Reading: 20 tokens, 0.4s, 54.32 tokens/s
Generation: 4,444 tokens, 2min 53s, 25.57 tokens/s
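As a quick sanity check on the figures above (a sketch of my own; the helper name is made up, not from any tool used in the original run), the reported durations and throughputs agree once rounding is accounted for:

```python
def implied_duration(tokens: int, tokens_per_sec: float) -> float:
    """Seconds implied by a token count and a reported throughput."""
    return tokens / tokens_per_sec

# Reading: 20 tokens at 54.32 tokens/s -> ~0.37 s (reported as 0.4 s)
print(round(implied_duration(20, 54.32), 2))

# Generation: 4,444 tokens at 25.57 tokens/s -> ~173.8 s,
# i.e. about 2 min 53.8 s (reported as 2 min 53 s)
print(round(implied_duration(4444, 25.57), 1))
```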
I like it better than the pelican I got from Opus 4.7 the other day: https://simonwillison.net/2026/Apr/16/qwen-beats-opus/
I feel like this time it is indeed in the training set, because it is too good to be true.
Can you run your other tests and see the difference?
It went pretty wild with "Generate an SVG of a NORTH VIRGINIA OPOSSUM ON AN E-SCOOTER":
https://gist.github.com/simonw/95735fe5e76e6fdf1753e6dcce360...
Compared to your test with GLM 5.1, this one does indeed look off:
https://xcancel.com/simonw/status/2041646779553476801
I think at this point we can safely put the pelican test in the category of Goodhart's law.
If I were them I'd run such requests through a diffusion model, and then try to distill an SVG out of that.
If they cook these in, I wonder what else was cooked in to make it look good.
Everything is benchmaxxed. Whack-a-mole training is at least as representative of what is getting added to models as more general training advances.
at what point do model providers optimize for the "pelican riding a bicycle" test so they place well on Simon's influential benchmark? :-)
They almost certainly are, even if unknowingly, because HN and all blogs get piped continuously into all models' training corpus.
See https://simonwillison.net/2025/Nov/13/training-for-pelicans-...
Why is the assumption that they trained for a pelican on a bicycle, rather than running RL for all kinds of 'generate an SVG' tasks?
I don’t think I’ve ever heard you say “excellent” for the pelican test. It does look excellent indeed!
The trend went toward MoE models for a while, and this time around it’s a dense model again. I wonder if closed models are also following this pattern: MoE for the faster ones and dense for the pro models.
These are the stupidest things to cleave to.
Metrics and toy examples can be gamed. Rather than these silly examples, how does it feel?
Can you replace Claude Code Opus or Codex with this?
Does it feel >80% as good on "real world" tasks you do on a day-to-day basis?