Comment by davedx
1 day ago
One thing that surprised me a little is that there doesn't seem to be an "ask for help" escape hatch in it - it would work away for literally days on a task where any human would admit they were stuck?
One of the more important features of agents is supposedly that they can stop and ask for human input when necessary. It seems it does do this for "hard stops" - like when it needed a human to set up API keys in their cloud console - but for "soft stops" it wouldn't.
By contrast, a human dev would probably throw in the towel after a couple of hours and ask a senior dev for guidance. The chat interface definitely supports that with this system, but apparently the agent will churn away in a sort of "infinite thinking loop". (This matches my limited experience with other agentic systems too.)
LLMs can create infinite worlds out of the error messages they're receiving. They probably need some outside signal to stop and re-assess. I don't think LLMs have any ability to notice on their own that they're lost in a world of their own making. They'll just keep creating new, less and less coherent context for themselves.
If you correct an LLM-based agent coder, you are always right. Often, if you give it advice, it pretends to understand you, then goes on to do something different from what it said it was going to do. Likewise, it will outright lie to you, telling you it did things it didn't do. (In my experience.)
So when people say these things are like junior developers, they really mean that they’re like the worst _stereotype_ of junior developers, then?
For sure - but if I'm paying for a tool like Devin then I'd expect the infrastructure around it to do things like stop it if it looks like that has happened.
What you often see with agentic systems is that there's an agent whose role is to "orchestrate", and that's the kind of thing the orchestrator would do: every 10 minutes or so, check the output and elapsed time and decide if the "developer" agent needs a reality check.
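A minimal sketch of that pattern, assuming a hypothetical `developer_agent` interface (`is_done`, `transcript`, `inject`, and `pause` are all made up here, and a real system would wire these to an actual agent API):

```python
import time

CHECK_INTERVAL = 600      # seconds: "every 10 minutes or so"
MAX_ELAPSED = 4 * 3600    # pause and escalate after 4 hours

def looks_stuck(transcript: str) -> bool:
    # Crude heuristic: the last several log lines are near-duplicates,
    # i.e. the agent keeps retrying the same thing.
    lines = transcript.strip().splitlines()
    return len(lines) >= 6 and len(set(lines[-6:])) <= 2

def orchestrate(developer_agent) -> str:
    start = time.time()
    while not developer_agent.is_done():
        time.sleep(CHECK_INTERVAL)
        if time.time() - start > MAX_ELAPSED:
            developer_agent.pause()
            return "escalate_to_human"
        if looks_stuck(developer_agent.transcript()):
            developer_agent.inject(
                "Stop. Summarize what you've tried so far and "
                "propose a different approach before continuing.")
    return "done"
```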
How would it decide if it needs a reality check? Would the thing checking have the same limitations?
You can maybe have a supervisor AI agent trigger a retry / new approach.
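As a rough sketch of that idea (the `agent` and `judge` objects and all of their methods are invented for illustration):

```python
def supervise(agent, judge, task, max_attempts=3):
    # Hypothetical supervisor loop: a second model reviews each attempt
    # and either accepts it, asks for a fresh approach, or gives up.
    hint = ""
    for _ in range(max_attempts):
        result = agent.run(task, hint=hint)
        verdict = judge.review(task, result)  # e.g. "ok", "retry", "stuck"
        if verdict == "ok":
            return result
        if verdict == "stuck":
            break
        hint = judge.suggest_new_approach(task, result)
    raise RuntimeError("Supervisor gave up - a human should take over.")
```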
They need impatience!
I think training it to do that would be the hard part.
- Stopping is probably the easy part
- I assume this happens during the RLHF phase
- Does the model simply stop, or does it ask a question? (see the sketch after this list)
- You need a good response or interaction depending on the query, so probably sets or decision trees of them, or even something agentic? (chicken-and-egg problem?)
- This happens tens of thousands of times; having humans do it, especially with coding, is probably not realistic
- Incumbents like M$ with Copilot may have an advantage in crafting such a dataset
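On the mechanics (setting the training question aside), "stop and ask" can be exposed as just another tool the model is allowed to call - a sketch with an invented tool-calling interface (`model.next_action` and the `action` shape are assumptions):

```python
def run_tests(path: str) -> str:
    # Stand-in for a real test runner.
    return f"pretend we ran the tests under {path}"

TOOLS = {"run_tests": run_tests}

def agent_step(model, history: list) -> list:
    # One turn of a hypothetical agent loop where asking a human is a
    # first-class action, not a failure mode.
    action = model.next_action(history, tools=[*TOOLS, "ask_human"])
    if action.name == "ask_human":
        answer = input(f"Agent asks: {action.question}\n> ")
        history.append(("human", answer))
    else:
        tool = TOOLS[action.name]
        history.append(("tool", tool(**action.args)))
    return history
```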
> One thing that surprised me a little is that there doesn't seem to be an "ask for help" escape hatch in it - it would work away for literally days on a task where any human would admit they were stuck?
You are overestimating the sophistication of their platform and infrastructure. Everyone was talking about Cursor (or maybe it was astroturfing?), but once I checked it out, it was not far from avante on Neovim.
Cursor isn't designed to do long-running tasks. As someone mentioned in another comment, it's closer to a function call than a process like Devin.
It will only do one task at a time, and only the task it's asked to do.
...for now.
They are pushing in this direction with the Composer Agent mode, which can carry out a sequence of multi-file changes without you having to specify the files. It's pretty decent. If you're feeling brave, there is also a beta "YOLO" mode that will auto-approve these changes and run console commands.
Devin does ask for help when it can't do something. I think it once asked me how to use a testing suite it had trouble running.
The problem is it really, really hates asking for help when it has a skill issue; it would rather run in circles than admit it just can't do something.
So they perfectly nailed the junior engineer. It’s just that that isn’t what people are looking for.
Maybe. It's pretty weird and I'm still thinking about it.
You can't just throw junior engineers under the bus when they're clearly stuck on an issue - or at least it takes some effort. Instead, you coach them and hope they eventually improve.
Devin does look like a junior engineer, but I've learned to just click "Terminate Session" the moment I spot it doing something hopeless. I've managed to get some real work out of it without much effort on my side (just check what it's doing every 10-15 minutes and type a few lines or restart the session).
If they had built that in from the beginning, people would have said "every other task it asks me for help - how is it a developer if I have to assist it all the time?"
But now that you're okay with that, I think it's the right time to add that feature.
You can set a "max work time" before it pauses so it wont go for days endlessly spending your credits. By default its set to 10 credits.
So I'm not sure how the author got it to go for days.
There should be an energy coefficient for problems: you only get a set amount of energy to spend per issue, and when the energy runs out, a human must help.
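A toy version of that idea, with arbitrary units and costs:

```python
class NeedsHuman(Exception):
    """Raised when an issue's energy budget is exhausted."""

class EnergyBudget:
    # Toy per-issue budget: every agent action has a cost, and pricier
    # costs for signals of flailing (failed runs, retries) make hopeless
    # loops drain the budget quickly.
    def __init__(self, total: float):
        self.remaining = total

    def spend(self, cost: float) -> None:
        self.remaining -= cost
        if self.remaining <= 0:
            raise NeedsHuman("Out of energy for this issue - escalate.")

budget = EnergyBudget(total=100.0)
budget.spend(5.0)     # an ordinary edit
budget.spend(30.0)    # a failed test run costs more
```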