Comment by comex
25 days ago
Probably the biggest thing that serious predictions are relying on is the METR graph:
https://metr.org/blog/2025-03-19-measuring-ai-ability-to-com...
It shows a remarkably consistent curve of AI completing increasingly difficult coding tasks over time. In fact, the trend is exponential: the X axis is calendar time and the Y axis is task difficulty, measured by how long a human would take to perform the task. The current value at an 80% success rate is only 45 minutes, but if it continues to follow the exponential curve, it will take only 3 years and change to reach a full 40-hour human work week's worth of work. The 50% success rate graph is also interesting: it's similarly exponential and is currently at 6 hours.
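A back-of-the-envelope sketch of that extrapolation, assuming a fixed doubling time for task length (the ~7-month figure here is an assumption chosen to match the "3 years and change" estimate, not a number taken from METR's data):

```python
import math

# Inputs from the comment above; the doubling time is an assumed parameter.
current_minutes = 45.0        # task length at 80% success rate today
target_minutes = 40 * 60.0    # a full 40-hour human work week
doubling_months = 7.0         # assumed doubling time of the trend

doublings_needed = math.log2(target_minutes / current_minutes)
months_needed = doublings_needed * doubling_months

print(f"{doublings_needed:.1f} doublings ~= {months_needed / 12:.1f} years")
```

With these numbers it comes out to roughly 5.7 doublings, i.e. a bit over 3 years; a slightly faster or slower doubling time moves that by many months either way, which is why the "and change" matters.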
Of course, progress could fall off as LLMs hit various scaling limits or as the nature of the difficulty changes. But I for one predicted that progress would fall off before, and was wrong. (And there is nothing saying that progress can't speed up.)
On the other hand, I do find it a little suspicious that so many eggs are in the one basket of METR, prediction-wise.
> It shows a remarkably consistent curve for AI completing increasingly difficult coding tasks over time.
I'm not convinced that "long" is equivalent to "difficult". Traditional computers can also solve tasks that would take extremely long for humans, but that doesn't make them intelligent.
This is not to say that this is useless, quite the opposite! Traditional computers have shown that being able to shorten the time needed for certain tasks is extremely valuable, and AI has shown this can be extended to other (but not necessarily all) tasks as well.
That's true, but length is a good proxy for three of the biggest difficulties faced by LLMs when coding:
1. Ability to take large amounts of information into consideration, specifically large codebases (longer tasks usually involve larger codebases). LLMs struggle with this due to context window limitations.
2. Ability to make and execute on long-term plans. Also related to context window limitations, as well as what for a human would be called "executive functioning skills".
3. Consistency. If you have an x% chance to get stuck on each step of a multi-step task, then the more steps, the higher the failure rate. This is true for both LLMs and humans, but LLMs tend to have more random failures, both due to hallucinations and due to being worse at recovering if their initial attempt fails (they can have a hard time remembering what they're supposed to do differently).
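Point 3 can be made concrete with a little compounding arithmetic (the 2% per-step failure rate below is purely illustrative, not a measured number for any model):

```python
# If each step succeeds independently with probability p, the chance of
# getting through n steps without a failure is p**n, which decays fast.
per_step_success = 0.98  # illustrative 2% chance of getting stuck per step

for steps in (1, 10, 50, 100):
    p_all = per_step_success ** steps
    print(f"{steps:3d} steps -> {p_all:.0%} chance of completing all of them")
```

Even a seemingly reliable 98% per step leaves well under a 50% chance of finishing a 50-step task, which is why longer tasks punish small per-step unreliability so hard.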
These difficulties seem to generalize beyond coding to almost any kind of knowledge work. A system that could solve them all would be, if not AGI, at least a heck of a lot closer.
Wouldn’t actual “AGI” require an ~80-year timeframe? ;) After all, most humans are able to achieve the task of “survival” over that period.
Very interesting thought! TY for sharing