Comment by consumer451

2 days ago

It's related to the history of Simon Willison[0] having used this as a benchmark on many models.[1]

I believe this model's output is noticeably superior... but yeah, people do tend to get hyperbolic when new stuff happens in their domain of interest.

[0] https://news.ycombinator.com/user?id=simonw

[1] https://www.google.com/search?q=simon+willison+pelican+ridin...

> I believe this model's output is noticeably superior

Sure, but at the same time Qwen3-30B-A3-2507 is also doing much better than most older models, even bigger and more capable ones, so I don't know how much is due to actual progress and how much is a new version of benchmaxxing.

And nowadays it's a better-known benchmark, so data scientists can overfit their models to it even more, and LLMs are already famous for overfitting. So I wouldn’t trust any results on this specific test nowadays.