Comment by datpuz
1 day ago
Can't think of anything an LLM is good enough at to let them do on their own in a loop for more than a few iterations before I need to rein it back in.
That's why in practice you need more than this simple loop!
Still very much a WIP, but I am experimenting with simple sequence-based workflows that are designed to frequently reset the conversation [2] (a rough sketch of the idea follows the links below).
This goes well with the Microsoft paper "LLMs Get Lost In Multi-Turn Conversation" that was published Friday [1].
- [1]: https://arxiv.org/abs/2505.06120
- [2]: https://github.com/hbbio/nanoagent/blob/main/src/workflow.ts
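For anyone curious what "frequently reset the conversation" looks like in practice, here is a minimal sketch of the idea. It is not the actual nanoagent API; `callLLM` and the message shape are placeholder assumptions:

```ts
// Illustrative sketch only, not the actual nanoagent API. callLLM is an
// assumed helper that sends a message list and returns the model's reply.
type Msg = { role: "system" | "user"; content: string };

async function runWorkflow(
  steps: string[],
  callLLM: (messages: Msg[]) => Promise<string>,
): Promise<string[]> {
  const outputs: string[] = [];
  let carry = ""; // only a short summary survives between steps

  for (const instruction of steps) {
    // Fresh conversation for every step: system prompt + carried summary + task.
    const messages: Msg[] = [
      { role: "system", content: "You are a coding agent. Be concise." },
      { role: "user", content: `Summary of previous steps:\n${carry}` },
      { role: "user", content: instruction },
    ];
    const output = await callLLM(messages);
    outputs.push(output);

    // Keep a short carry instead of the full history, so the next step
    // starts from a clean, small context.
    carry = output.slice(0, 500);
  }
  return outputs;
}
```

The point is that each step starts from a nearly empty context, which is exactly the failure mode the multi-turn paper describes.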
They're extremely good at burning through budgets, and get even better when unattended
Maximising paperclip production too.
Is that really true? I thought there were free models and $200 all-you-can-eat plans.
These tools require API calls, which usually aren't priced like the consumer plans.
I've read that you can very quickly blow through the budget on the $200/mo plans too.
They've written most of the recent iterations of X11 bindings for Ruby, including a complete, working example of a systray for me.
They also added the first pass of multi-monitor support for my WM while I was using it (I restarted it repeatedly while Claude Code worked, in the same X session that the terminal it was working in was running in).
You do need to rein them back in, sure, but they can often go multiple iterations before they're ready to make changes to your files, once you've approved safe tool uses etc.
How do they read the screen?
Agents? Doubt.
You can doubt it all you want - it doesn't make it any less true.
The main problem with agents is that they aren't reflecting on their own performance and pausing their own execution to ask a human for help aggressively enough. Agents can run on for 20+ iterations in many cases successfully, but also will need hand holding after every iteration in some cases.
They're a lot like a human in that regard, but we haven't been building that reflection and self-awareness into them so far, so it's like a junior that doesn't realize when they're out of their depth and should get help.
I think they are capable of doing it, but it requires prompting.
I constantly have to instruct them:

- Go step by step; don't skip ahead until we're done with a step
- Don't make assumptions; if you're unsure, ask questions to clarify
And they mostly do this.
But this needs to be default behavior!
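For what it's worth, a rough sketch of shipping that as default behavior rather than re-prompting it every session (sendToModel is a stand-in for whatever client you actually use):

```ts
// Hypothetical sketch: bake the "step by step / ask when unsure" rules into
// the default system prompt instead of re-typing them every session.
const DEFAULT_SYSTEM_PROMPT = [
  "You are a coding agent working through tasks one step at a time.",
  "Go step by step; do not move on until the current step is confirmed done.",
  "Do not make assumptions; if anything is unclear, ask a clarifying question",
  "and wait for the answer before continuing.",
].join("\n");

// Assumed client signature; swap in whatever SDK you actually use.
declare function sendToModel(
  messages: { role: "system" | "user"; content: string }[],
): Promise<string>;

async function ask(task: string): Promise<string> {
  return sendToModel([
    { role: "system", content: DEFAULT_SYSTEM_PROMPT },
    { role: "user", content: task },
  ]);
}
```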
I'm surprised that, unless prompted, LLMs never seem to ask follow-up questions as a smart coworker might.
Is there value in adding an overseer LLM that measures the progress made every n steps and, if it's too low, stops and calls out to a human?
I don't think you need an overseer for this: you can just have the agent self-assess at each step whether it's making material progress or is caught in a loop, and if it's caught in a loop, pause and emit a prompt for help from a human. This would probably require a bit of tuning, and the agents need to be set up with a blocking "ask for help" function, but it's totally doable.
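Something like this, as a rough sketch (runIteration and selfAssess are hypothetical LLM-backed helpers; askHuman just blocks on stdin):

```ts
import * as readline from "node:readline/promises";

// Hypothetical sketch of an agent loop that self-assesses every iteration
// and blocks on a human when it judges itself stuck.
interface Assessment {
  madeProgress: boolean;
  stuck: boolean;
  question?: string;
}

// Assumed LLM-backed helpers, not a real library API.
declare function runIteration(task: string, notes: string[]): Promise<string>;
declare function selfAssess(task: string, notes: string[]): Promise<Assessment>;

async function askHuman(question: string): Promise<string> {
  const rl = readline.createInterface({ input: process.stdin, output: process.stdout });
  const answer = await rl.question(`Agent needs help: ${question}\n> `);
  rl.close();
  return answer;
}

async function agentLoop(task: string, maxIters = 20): Promise<void> {
  const notes: string[] = [];
  for (let i = 0; i < maxIters; i++) {
    notes.push(await runIteration(task, notes));

    const a = await selfAssess(task, notes);
    if (a.stuck || !a.madeProgress) {
      // Pause execution and block until a human responds.
      const guidance = await askHuman(a.question ?? "I seem to be stuck. Any guidance?");
      notes.push(`Human guidance: ${guidance}`);
    }
  }
}
```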
Bruh, we're inventing robot PMs for our robot developers now? We're so fucked
Yes, it works really well. We do something like that at NonBioS.ai (longer post below). The agent self-reflects on whether it is stuck or confused and calls out to the human for help.
And how does it effectively measure progress?
The hope is that the ground truth from calling out to tools (like compilers or test runs) will eventually be enough to keep them on track.
Just as humans and human organisations also tend to drift unless anchored in reality.
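A rough sketch of what anchoring the loop on that kind of ground truth could look like (proposeFix is a hypothetical LLM-backed helper; only the test command is real):

```ts
import { execSync } from "node:child_process";

// Hypothetical sketch: the test suite, not the model's own opinion, decides
// whether the loop is done. proposeFix is an assumed LLM-backed helper.
declare function proposeFix(failureOutput: string): Promise<void>;

function runTests(): { ok: boolean; output: string } {
  try {
    const output = execSync("npm test", { encoding: "utf8" });
    return { ok: true, output };
  } catch (err: any) {
    return { ok: false, output: String(err.stdout ?? err) };
  }
}

async function fixUntilGreen(maxIters = 10): Promise<boolean> {
  for (let i = 0; i < maxIters; i++) {
    const result = runTests();
    if (result.ok) return true;       // ground truth says we're done
    await proposeFix(result.output);  // otherwise feed the failure back in
  }
  return false;                       // still red: time for a human
}
```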
I built android-use [1] using an LLM. It is pretty good at self-healing thanks to the "loop": it constantly checks whether the current step actually made progress or regressed, then determines the next step (roughly the shape sketched below). And the thing is, nothing is explicitly coded; it's just a nudge in the prompts.
1. clickclickclick - A framework to let local LLMs control your android phone (https://github.com/BandarLabs/clickclickclick)
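Not the actual clickclickclick code, but the shape of the loop is roughly this (every helper here is a hypothetical stand-in):

```ts
// Rough sketch of the progress/regress loop, not the actual clickclickclick
// implementation. Every helper here is a hypothetical stand-in.
type Verdict = "progress" | "regress" | "done";

declare function captureScreen(): Promise<string>; // e.g. screenshot or UI dump
declare function judgeStep(goal: string, before: string, after: string): Promise<Verdict>;
declare function nextAction(goal: string, screen: string): Promise<() => Promise<void>>;

async function driveDevice(goal: string, maxSteps = 30): Promise<boolean> {
  let before = await captureScreen();
  for (let i = 0; i < maxSteps; i++) {
    const act = await nextAction(goal, before);
    await act();

    const after = await captureScreen();
    const verdict = await judgeStep(goal, before, after);
    if (verdict === "done") return true;
    // On regress, keep the old screen as the baseline so the model can see
    // that its last action didn't help; on progress, advance the baseline.
    if (verdict === "progress") before = after;
  }
  return false;
}
```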
You don't have to. Most of the appeal is automatically applying fixes like "touch file; make" after spotting a trivial mistake. Just let it have at it.
Definitely true currently, which is why there's so much focus on using them to write real code that humans have to actually commit and put their names on.
Longer term, I don't think this holds due to the nature of capitalism.
If given a choice between paying for an LLM to do something that's mostly correct versus paying for a human developer, businesses are going to choose the former, even if it results in accelerated enshittification. It's all in service of reducing headcount and taking control of the means of production away from workers.