Comment by datpuz
1 day ago
Can't think of anything an LLM is good enough at to let them do on their own in a loop for more than a few iterations before I need to rein it back in.
That's why in practice you need more than this simple loop!
Still very much a WIP, but I am experimenting with simple sequence-based workflows that are designed to frequently reset the conversation [2] (a rough sketch of the idea follows the links below).
This goes well with the Microsoft paper "LLMs Get Lost In Multi-Turn Conversation" that was published Friday [1].
- [1]: https://arxiv.org/abs/2505.06120
- [2]: https://github.com/hbbio/nanoagent/blob/main/src/workflow.ts
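For anyone curious what "frequently reset the conversation" looks like in practice, here is a minimal sketch of the idea. It is not the actual nanoagent API; `callLLM` and the message shape are placeholder assumptions:

```ts
// Illustrative sketch only, not the actual nanoagent API. callLLM is an
// assumed helper that sends a message list and returns the model's reply.
type Msg = { role: "system" | "user"; content: string };

async function runWorkflow(
  steps: string[],
  callLLM: (messages: Msg[]) => Promise<string>,
): Promise<string[]> {
  const outputs: string[] = [];
  let carry = ""; // only a short summary survives between steps

  for (const instruction of steps) {
    // Fresh conversation for every step: system prompt + carried summary + task.
    const messages: Msg[] = [
      { role: "system", content: "You are a coding agent. Be concise." },
      { role: "user", content: `Summary of previous steps:\n${carry}` },
      { role: "user", content: instruction },
    ];
    const output = await callLLM(messages);
    outputs.push(output);

    // Keep a short carry instead of the full history, so the next step
    // starts from a clean, small context.
    carry = output.slice(0, 500);
  }
  return outputs;
}
```

The point is that each step starts from a nearly empty context, which is exactly the failure mode the multi-turn paper describes.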
They're extremely good at burning through budgets, and get even better when unattended
Maximising paperclip production too.
Is that really true? I thought there were free models and $200 all-you-can-eat plans.
These tools require API calls, which usually aren't priced like the consumer plans.
I've read that you can very quickly blow through the budget on the $200/mo plans too.
They've written most of the recent iterations of X11 bindings for Ruby, including a complete, working example of a systray for me.
They also added the first pass of multi-monitor support for my WM while I was using it (I restarted it repeatedly while Claude Code worked, in the same X session that the terminal it was working in was running in).
You do need to rein them back in, sure, but they can often go multiple iterations before they're ready to make changes to your files, once you've approved safe tool uses etc.
How do they read the screen?
Agents? Doubt.
You can doubt it all you want - it doesn't make it any less true.
The main problem with agents is that they aren't reflecting on their own performance and pausing their own execution to ask a human for help aggressively enough. Agents can run on for 20+ iterations in many cases successfully, but also will need hand holding after every iteration in some cases.
They're a lot like a human in that regard, but we haven't been building that reflection and self-awareness into them so far, so it's like a junior that doesn't realize when they're out of their depth and should get help.
I think they are capable of doing it, but it requires prompting.
I constantly have to instruct them:

- Go step by step; don't skip ahead until we're done with a step
- Don't make assumptions; if you're unsure, ask questions to clarify
And they mostly do this.
But this needs to be default behavior!
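For what it's worth, a rough sketch of shipping that as default behavior rather than re-prompting it every session (sendToModel is a stand-in for whatever client you actually use):

```ts
// Hypothetical sketch: bake the "step by step / ask when unsure" rules into
// the default system prompt instead of re-typing them every session.
const DEFAULT_SYSTEM_PROMPT = [
  "You are a coding agent working through tasks one step at a time.",
  "Go step by step; do not move on until the current step is confirmed done.",
  "Do not make assumptions; if anything is unclear, ask a clarifying question",
  "and wait for the answer before continuing.",
].join("\n");

// Assumed client signature; swap in whatever SDK you actually use.
declare function sendToModel(
  messages: { role: "system" | "user"; content: string }[],
): Promise<string>;

async function ask(task: string): Promise<string> {
  return sendToModel([
    { role: "system", content: DEFAULT_SYSTEM_PROMPT },
    { role: "user", content: task },
  ]);
}
```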
I'm surprised that, unless prompted, LLMs never seem to ask follow-up questions as a smart coworker might.
Is there value in adding an overseer LLM that measures the progress made every n steps and, if it's too low, stops and calls out to a human?
I don't think you need an overseer for this: you can just have the agent self-assess at each step whether it's making material progress or is caught in a loop, and if it's caught in a loop, pause and emit a prompt for help from a human. This would probably require a bit of tuning, and the agents need to be set up with a blocking "ask for help" function, but it's totally doable.
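Something like this, as a rough sketch (runIteration and selfAssess are hypothetical LLM-backed helpers; askHuman just blocks on stdin):

```ts
import * as readline from "node:readline/promises";

// Hypothetical sketch of an agent loop that self-assesses every iteration
// and blocks on a human when it judges itself stuck.
interface Assessment {
  madeProgress: boolean;
  stuck: boolean;
  question?: string;
}

// Assumed LLM-backed helpers, not a real library API.
declare function runIteration(task: string, notes: string[]): Promise<string>;
declare function selfAssess(task: string, notes: string[]): Promise<Assessment>;

async function askHuman(question: string): Promise<string> {
  const rl = readline.createInterface({ input: process.stdin, output: process.stdout });
  const answer = await rl.question(`Agent needs help: ${question}\n> `);
  rl.close();
  return answer;
}

async function agentLoop(task: string, maxIters = 20): Promise<void> {
  const notes: string[] = [];
  for (let i = 0; i < maxIters; i++) {
    notes.push(await runIteration(task, notes));

    const a = await selfAssess(task, notes);
    if (a.stuck || !a.madeProgress) {
      // Pause execution and block until a human responds.
      const guidance = await askHuman(a.question ?? "I seem to be stuck. Any guidance?");
      notes.push(`Human guidance: ${guidance}`);
    }
  }
}
```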
Bruh, we're inventing robot PMs for our robot developers now? We're so fucked
Yes, it works really well. We do something like that at NonBioS.ai (longer post below). The agent self-reflects on whether it is stuck or confused and calls out to the human for help.
And how does it effectively measure progress?
The hope is that the ground truth from calling out to tools (like compilers or test runs) will eventually be enough to keep them on track.
Just as humans and human organisations also tend to drift unless anchored in reality.
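A rough sketch of what anchoring the loop on that kind of ground truth could look like (proposeFix is a hypothetical LLM-backed helper; only the test command is real):

```ts
import { execSync } from "node:child_process";

// Hypothetical sketch: the test suite, not the model's own opinion, decides
// whether the loop is done. proposeFix is an assumed LLM-backed helper.
declare function proposeFix(failureOutput: string): Promise<void>;

function runTests(): { ok: boolean; output: string } {
  try {
    const output = execSync("npm test", { encoding: "utf8" });
    return { ok: true, output };
  } catch (err: any) {
    return { ok: false, output: String(err.stdout ?? err) };
  }
}

async function fixUntilGreen(maxIters = 10): Promise<boolean> {
  for (let i = 0; i < maxIters; i++) {
    const result = runTests();
    if (result.ok) return true;       // ground truth says we're done
    await proposeFix(result.output);  // otherwise feed the failure back in
  }
  return false;                       // still red: time for a human
}
```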
I built android-use [1] using an LLM. It is pretty good at self-healing thanks to the "loop": it constantly checks whether the current step actually made progress or regressed, then determines the next step (roughly the shape sketched below). And the thing is, nothing is explicitly coded; it's just a nudge in the prompts.
1. clickclickclick - A framework to let local LLMs control your android phone (https://github.com/BandarLabs/clickclickclick)
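Not the actual clickclickclick code, but the shape of the loop is roughly this (every helper here is a hypothetical stand-in):

```ts
// Rough sketch of the progress/regress loop, not the actual clickclickclick
// implementation. Every helper here is a hypothetical stand-in.
type Verdict = "progress" | "regress" | "done";

declare function captureScreen(): Promise<string>; // e.g. screenshot or UI dump
declare function judgeStep(goal: string, before: string, after: string): Promise<Verdict>;
declare function nextAction(goal: string, screen: string): Promise<() => Promise<void>>;

async function driveDevice(goal: string, maxSteps = 30): Promise<boolean> {
  let before = await captureScreen();
  for (let i = 0; i < maxSteps; i++) {
    const act = await nextAction(goal, before);
    await act();

    const after = await captureScreen();
    const verdict = await judgeStep(goal, before, after);
    if (verdict === "done") return true;
    // On regress, keep the old screen as the baseline so the model can see
    // that its last action didn't help; on progress, advance the baseline.
    if (verdict === "progress") before = after;
  }
  return false;
}
```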
You don't have to. Most of the appeal is automatically applying fixes like "touch file; make" after spotting a trivial mistake. Just let it have at it.
Definitely true currently, which is why there's so much focus on using them to write real code that humans have to actually commit and put their names on.
Longer term, I don't think this holds due to the nature of capitalism.
If given a choice between paying for an LLM to do something that's mostly correct versus paying for a human developer, businesses are going to choose the former, even if it results in accelerated enshittification. It's all in service of reducing headcount and taking control of the means of production away from workers.