For a fair definition of "able", yes. Those models had no ability to search emails.
What's special about it is that it required no handholding; that is new.
Is this because the models improved, or because the tooling around models improved (both visible and not visible to the end user)?
My impression is that the base models have not improved dramatically in the last 6 months, and that incremental improvements in those models are becoming extremely expensive.
Resist getting your news from Brooklyn journalists. :)
Tooling has improved, and the models have. The combo is pretty powerful.
https://aider.chat/docs/leaderboards/ will give you a flavor of the last six months of improvements. François Chollet (ARC-AGI: https://arcprize.org/leaderboard) has gone from "no current architecture will ever beat ARC" to "o3 has beaten ARC, and now we have designed ARC-AGI-2".
At the same time, we have the first really useful 1M-token context model with reasonably good skills across the whole context window (Gemini 2.5 Pro), and that opens up a different category of work altogether. Reasoning models were also launched to the world in the last six months — another significant dimension of improvement.
TLDR: Massive, massive increase in quality for coding models. And o3 is, to my mind, past the bar people had in mind for "generally intelligent" in, say, 2018 — o3 alone is a huge improvement launched in the last six months. You can now tell o3 something like: "Research the X library and architect a custom extension to it that interfaces with my weird garage door opener; after writing the architecture, implement the extension in (node/python/go)" — and come back in 20 minutes to something that almost certainly compiles and likely largely interfaces properly, leaving only touch-up work to be done.