Comment by gm678

15 hours ago

I don't know what the Y-axis is supposed to be on that Wharton AI capabilities graph, but I am not really convinced that Opus 4.6 has more than double the intelligence/capability/whatever of GPT 5.1 Max.

IIRC that graph tracks capability as the time it takes a human to solve a task (i.e. the model can now handle tasks that usually take a human ~8h). Depending on which tasks you look at, that could be a reasonable finding: I could see Opus 4.6 handling tasks that take humans ~8h and that 5.1 couldn't previously handle, with 5.1 topping out at, say, ~4h tasks. It is a bit arbitrary, but I think this is what they're tracking.

  • "It is a bit arbitrary, but I think this is what they're tracking."

    I don't know whether they can get the numbers right this way, but it seems like a far more useful metric than theoretical capability.

  • Without knowing more about their methodology, it seems like a lot of the recent improvement has come from the AI itself spending more time on the task.

    At first the models turned a 5-minute task into a 5-second task (by 5 seconds I mean a very short amount of time, not precisely 5 seconds). Then they turned a 15-minute task into a 5-second task.

    Opus 4.6 completes 8-hour tasks all the time, but (at least in my experience) it isn't spitting the answer out in 5 seconds anymore. It's using chain of thought and tools, and the time to completion is measured in minutes or maybe hours.

    In my experiments with local LLMs, a substantial part of the gap between frontier and local (for everyday use) is in tooling and infrastructure.

    That is why I am sympathetic to the idea that we are leveling off. But to bring in the airspeed example from the article, I don't think we've reached the equivalent of the ramjet yet. I suspect the coming years will bring new architectures, new hardware, and new ways to get even more capable models.

    • It measures the ability to complete (at a given success rate) a task with a known human baseline completion time. I.e., they set the task to human volunteers and timed how long they took to complete it; a rough sketch of the fitting step is below.
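      As a minimal sketch of that fitting step (not METR's actual code; the data here is made up for illustration), assuming each task has a human baseline time and a binary model pass/fail result:

      ```python
      # Hypothetical illustration of a "50% time horizon" metric:
      # fit success probability against log(human task time), then solve
      # for the time at which predicted success crosses 50%.
      import numpy as np
      from scipy.optimize import curve_fit

      # Made-up (human_minutes, model_succeeded) results
      human_minutes = np.array([1, 4, 15, 60, 240, 480, 960], dtype=float)
      succeeded     = np.array([1, 1, 1, 0, 1, 0, 0], dtype=float)

      def p_success(log_t, steepness, midpoint):
          # Success probability falls off as tasks take humans longer
          return 1.0 / (1.0 + np.exp(steepness * (log_t - midpoint)))

      (steepness, midpoint), _ = curve_fit(
          p_success, np.log(human_minutes), succeeded, p0=[1.0, np.log(100)]
      )
      # p_success == 0.5 exactly at log_t == midpoint
      print(f"50% time horizon ~ {np.exp(midpoint):.0f} human-minutes")
      ```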

Check out Re-Bench and HCAST.

The tasks are obviously all of the form "go do this, and if you get the following output, you pass." Setting up a web server apparently takes a human 15 minutes, which is news to me, since I'm able to search for https://gist.github.com/willurd/5720255, find the Python one-liner, and copy it within about ten seconds.
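For reference, the one-liner in question is presumably Python's stdlib static file server (the gist collects this along with many variants in other languages):

```
# Serve the current directory over HTTP on port 8000
python3 -m http.server 8000
```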

Anyway, this is cool, but it does not mean Claude can perform every human task that takes less than 8 hours and is within its physical capabilities.

> more than double the intelligence/capability/whatever

I'm curious what people really mean when they say this. Intelligence is famously hard to define, let alone measure; it certainly doesn't scale linearly; it correlates only loosely with real-world qualities that are easy to measure; and so on. Are you referring to coding ability, or...?

According to this article, whenever someone games a benchmark into an upward-trending chart on some y-axis, it's YOUR responsibility to prove how and why that trend can't continue indefinitely.

🙄

  • Seems to me that the default is "I don't know what's going to happen" and if you're making a confident prediction, bring evidence.

    Scott makes a Lindy-effect argument, which is plausible, but don't let that fool you: we still don't know what's going to happen.