Comment by consumer451
2 days ago
It's related to the history of Simon Willison[0] having used this as a benchmark on many models.[1]
I believe this model's output is noticeably superior... but yeah, people do tend to get hyperbolic when new stuff happens in their domain of interest.
[0] https://news.ycombinator.com/user?id=simonw
[1] https://www.google.com/search?q=simon+willison+pelican+ridin...
And nowadays it's a better-known benchmark, so data scientists can overfit their models to it even more, and LLMs are already famous for overfitting. So I wouldn't trust any results on this specific test nowadays.
> I believe this model's output is noticeably superior
Sure, but at the same time Qwen3-30B-A3-2507 is also doing much better than most older models, even the bigger and more capable ones, so I don't know how much is due to actual progress and how much is a new version of benchmaxxing.