Comment by Bjorkbat
4 days ago
Somewhat related, but lately I’ve been feeling what can best be described as “benchmark fatigue”.
The latest models can score something like 70% on SWE-bench Verified, and yet it’s difficult to say what tangible impact this has had on actual software development. Likewise, they absolutely crush humans at competitive programming but are unreliable software engineers on their own.
What does it really mean that an LLM got gold on this year’s IMO? What if it means pretty much nothing at all beyond the simple fact that this LLM is very, very good at IMO-style problems?
As far as I can tell, the actual advancement here is in the methodology used to create a model tuned for this problem domain, and in how efficient that method is. In theory, that would make it easier to build other problem-domain-specific models.
That a highly tuned model designed to solve IMO problems can solve IMO problems is impressive, maybe, but yeah, it doesn’t really signal much utility beyond that.