Comment by DanMcInerney
6 months ago
These articles kill me. The reason LLMs (or next-gen AI architectures) are inevitably going to take over the world in one way or another is simple: recursive self-improvement.
Three years ago they could barely write a coherent poem; today they're performing at at least graduate-student level across most tasks. As of today, AI is writing a significant chunk of the code around itself. Once AI is consistently coding above senior-engineer level, it will reach a tipping point where it can improve itself faster than the best human expert can. That's core technological recursive self-improvement, but we have another avenue of recursive self-improvement as well: agentic recursive self-improvement.
First there were LLMs, then LLMs with tool usage, then we abstracted the tool usage into MCP servers. Next we will create agents that autodiscover remote MCP servers, then agents that can autodiscover tools as well as write their own.
The final stage is generalized agents, similar to Claude Code, that can find remote MCP servers, perform a task, analyze their first run to figure out how to improve the process, and then write their own tools to complete the task faster than they did before. Agentic recursive self-improvement. As an agent engineer, I suspect this pattern will become viable in about two years.
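For concreteness, here is a rough Python skeleton of the loop I'm describing. Every function in it is a hypothetical placeholder made up for illustration; nothing here maps to an existing framework.

```python
from dataclasses import dataclass


@dataclass
class Transcript:
    """Hypothetical record of one attempt at a task."""
    result: str
    needs_improvement: bool


def discover_mcp_servers() -> list:
    # Placeholder: the imagined agent would enumerate remote MCP servers
    # and the tools they expose. Here it just starts with an empty toolbox.
    return []


def run_task(task: str, toolbox: list) -> Transcript:
    # Placeholder for an LLM-driven attempt at the task using the toolbox.
    return Transcript(result=f"attempted: {task}", needs_improvement=len(toolbox) == 0)


def analyze_run(transcript: Transcript) -> Transcript:
    # Placeholder for the agent critiquing its own run to find bottlenecks.
    return transcript


def write_tool(critique: Transcript) -> str:
    # Placeholder for the agent authoring a new tool that removes a bottleneck.
    return "generated-tool"


def self_improving_agent(task: str, max_rounds: int = 3) -> str:
    toolbox = discover_mcp_servers()           # 1. autodiscover MCP servers/tools
    transcript = run_task(task, toolbox)       # 2. first attempt at the task
    for _ in range(max_rounds):
        critique = analyze_run(transcript)     # 3. review the run for weak spots
        if not critique.needs_improvement:
            break
        toolbox.append(write_tool(critique))   # 4. write a purpose-built tool
        transcript = run_task(task, toolbox)   # 5. retry with the expanded toolbox
    return transcript.result


print(self_improving_agent("book a restaurant"))
```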
> recursive self-improvement.
What LLM is recursively self-improving?
I thought that, to date, all LLM improvements have come from the hundreds of billions of dollars of investment and the millions of software-engineer hours spent on better training and optimizations.
And, my understanding is, there are "mixed" findings on whether LLMs assisting those software engineers help or hurt their performance.
> they're performing at at least graduate student level across most tasks
I strongly disagree with this characterization. I have yet to find an application that can reliably execute this prompt:
"Find 90 minutes on my calendar in the next four weeks and book a table at my favorite Thai restaurant for two, outside if available."
Forget "graduate-level work," that's stuff I actually want to engage with. What many people really need help with is just basic administrative assistance, and LLMs are way too unpredictable for those use cases.
I've found that they struggle with understanding time and dates, and are sometimes weird about numbers. I asked Grok to guess the likelihood of something happening, and it gave me percentages for that day, the next day, the next week, and so on. Good enough. But the next day it was still predicting a 5-10% chance of the thing happening the previous day. I had to explain to it that the percentage for yesterday should now be 0%, since it was in the past.
In another example, I asked it to turn one of its bullet-point answers into a conversational summary that I could convert into an audio file to listen to later. It kicked out something that came to about 6 minutes of audio, so I asked it to expand on the details and give me something around 20 minutes long. It kicked out text that made about 7 minutes. So I explained that its output was X words and only lasted 7 minutes, so I needed about 3X words. It kicked out about half that, but claimed it was giving me 3X words, or 20 minutes.
It's little stuff like that that makes me think that, no matter how useful these tools might be for some things, we're a long way from being able to just hand them tasks and expect them to be done as reliably as a fairly dim human intern would do them. If an intern kept coming back with half the job I asked for, I'd assume he was being lazy and let him go, but these things are just dumb in certain odd ways.
This is similar to many experiences I've had with LLM tools as well; the more complex and/or multi-step the task, the less reliable they become. This is why I object to the "graduate-level" label that Sam Altman et al. use. It fundamentally misrepresents the skill pyramid that makes a researcher (or any knowledge worker) effective. If a researcher can't reliably manage a to-do list, they can't be left unsupervised with any critical tasks, despite the impressive amount of information they can bring to bear and the efficiency with which they can search the web.
That's fine; I get a lot of value out of AI tooling between ChatGPT, Cursor, Claude+MCP, and even Apple Intelligence. But I have yet to use an agent that consistently comes close to the capabilities AI optimists claim.
This is absolutely doable right now. Just hook Claude Code up to your calendar MCP server and any one of these restaurant/web-browser MCP servers and it'll do this for you (see the sketch after the links below).
https://apify.com/canadesk/opentable/api/mcp
https://github.com/BrowserMCP/mcp
https://github.com/samwang0723/mcp-booking
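In Claude Code you'd register those servers in its MCP configuration rather than writing any code, but if you want to poke at what one of them exposes first, here is a minimal sketch using the official `mcp` Python SDK (`pip install mcp`). The `npx` package name is my assumption from the BrowserMCP README, so check that repo for the current invocation.

```python
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client


async def main() -> None:
    # Launch the BrowserMCP server over stdio (package name assumed from its README).
    server = StdioServerParameters(command="npx", args=["@browsermcp/mcp@latest"])
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            # Ask the server which tools it exposes (navigate, click, etc.).
            tools = await session.list_tools()
            for tool in tools.tools:
                print(f"{tool.name}: {tool.description}")


if __name__ == "__main__":
    asyncio.run(main())
```

Once the calendar and booking servers are registered the same way, the agent can chain their tools to handle the scheduling-plus-reservation task end to end.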
How reliable are the results? I can expect a human with graduate-level execution to get this right almost 100% of the time and adapt to unforeseen extenuating circumstances.
OpenAI Operator can do that task easily, assuming you've configured it with your calendar and Yelp login.
That's great to hear - do you know what success rate it might have? I've used scheduled tasks in ChatGPT and they fail regularly enough to fall into the "toy" category for me. But if Operator is operating significantly above that threshold, that would be remarkable and I'd gladly eat my words.
Recursive self-improvement is not inevitable.
Well... I guess we'll see.
!remindme