
Comment by nopinsight

4 days ago

Research by METR suggests that the time horizon of software tasks frontier LLMs can complete, measured by how long those tasks take human engineers, is growing exponentially, doubling roughly every 7 months. o3 is above the trend line.

https://x.com/METR_Evals/status/1912594122176958939
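
A minimal sketch of what that trend implies, assuming a clean 7-month doubling time; the 1-hour starting horizon here is a made-up placeholder for illustration, not a METR figure:

    # Exponential task-horizon growth with a ~7-month doubling time.
    # start_hours=1.0 is an assumed placeholder, not METR data.
    def horizon_hours(months, start_hours=1.0, doubling_months=7.0):
        return start_hours * 2 ** (months / doubling_months)

    for m in (0, 7, 14, 28):
        print(f"{m:>2} months: ~{horizon_hours(m):.0f}h tasks")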

---

The AlexNet paper, which kickstarted the deep learning era in 2012, beat the 2nd-best ImageNet entry by 11 percentage points. Many published AI papers at the time advanced SOTA by just a couple of percentage points.

o3 high scores about 9 points ahead of o1 high on livebench.ai, and there are also quite a few testimonials about the difference between them.

Yes, AlexNet made major strides in other respects as well, but it has been just 7 months since o1-preview, the first publicly available reasoning model and a seminal advance beyond previous LLMs.

It seems some people have become desensitized to how rapidly things are moving in AI, despite its largely unprecedented pace of progress.

Ref:

- https://proceedings.neurips.cc/paper_files/paper/2012/file/c...

- https://livebench.ai/#/

AlexNet improved the ImageNet error rate by 11 points on a ~25% baseline: 100*11/25 = 44% relative reduction.

From o1 to o3, the LiveBench error rate went from 28 to 19: 100*9/28 ≈ 32% relative reduction.
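
As a quick check of that arithmetic (plugging in the figures quoted above, not independently verified numbers):

    # Relative error reduction from the figures quoted above.
    def relative_reduction(before, after):
        return 100 * (before - after) / before

    print(relative_reduction(25, 14))  # AlexNet era: 11 points off a ~25% baseline -> 44.0
    print(relative_reduction(28, 19))  # o1 -> o3 on LiveBench -> ~32.1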

But these are meaningless comparisons because it’s typically harder to improve already good results.