Comment by metalliqaz
6 days ago
Can someone explain to me what this means?
> People coding with LLMs today use agents. Agents get to poke around your codebase on their own. They author files directly. They run tools. They compile code, run tests, and iterate on the results. ...
Is this what people are really doing? Who is just turning AI loose to modify things as it sees fit? If I'm not directing the work, how does it even know what to do?
I've been subjected to forced LLM integration from management, and there are no "Agents" anywhere that I've seen.
Is anyone here doing this that can explain it?
I cut several paragraphs from this explaining how agents work, which I wrote anticipating this exact comment. I'm very happy to have brought you to this moment of understanding --- it's a big one. The answer is "yes, that's exactly what people are doing": "turning LLMs loose" (really, giving them some fixed number of tool calls, some of which might require human approval) to do stuff on real systems. This is exactly what Cursor is about.
I think it's really hard to oversell how important agents are.
We have an intuition for LLMs as a function blob -> blob (really, token -> token, but whatever), and the limitations of such a function, ping-ponging around in its own state space, like a billion monkeys writing plays.
But you can also go blob -> json, and json -> tool-call -> blob. The json->tool interaction isn't stochastic; it's simple systems code (the LLM could indeed screw up the JSON, since that process is stochastic --- but it doesn't matter, because the agent isn't stochastic and won't accept it, and the LLM will just do it over). The json->tool-call->blob process is entirely fixed systems code --- and simple code, at that.
Doing this grounds the code generation process. It has a directed stochastic structure, and a closed loop.
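To make that concrete, here's a minimal sketch of that loop in Python. `call_llm` and the two tools in the table are hypothetical stand-ins, not anything from a specific product; the point is that everything outside `call_llm` is plain, deterministic code.

```python
import json
import subprocess

# The fixed, non-stochastic side: the agent will only ever run these.
TOOLS = {
    "read_file": lambda args: open(args["path"]).read(),
    "run_tests": lambda args: subprocess.run(
        ["make", "test"], capture_output=True, text=True
    ).stdout,
}

def step(prompt, call_llm, max_retries=3):
    for _ in range(max_retries):
        reply = call_llm(prompt)           # stochastic: blob -> (hopefully) json
        try:
            call = json.loads(reply)
            tool = TOOLS[call["tool"]]     # unknown tool? reject and retry
        except (json.JSONDecodeError, KeyError, TypeError):
            prompt += "\n\nThat was not a valid tool call. Try again."
            continue
        return tool(call.get("args", {}))  # deterministic: json -> tool-call -> blob
    raise RuntimeError("LLM never produced a valid tool call")
```

The LLM never touches the system directly; it can only nominate an entry from `TOOLS`, and anything it emits that doesn't parse just gets bounced back at it.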
I'm sorry but this doesn't explain anything. Whatever it is you have in your mind, I'm afraid it's not coming across on the page. There is zero chance that I'm going to let an AI start running arbitrary commands on my PC, let alone anything that resembles a commit.
What is an actual, real-world example?
It works something like this: an "agent" is a small program that takes a prompt as input, say "//fix ISSUE-0451".
The agent code runs a regex that recognizes this prompt as a reference to a JIRA issue, and runs a small curl with predefined credentials to download the bug description.
It then assembles a larger text prompt such as "you will act as a master coder to understand and fix the following issue as faithfully as you can: {JIRA bug description inserted here}. You will do so in the context of the following code: {contents of 20 files retrieved from Github based on Metadata in the JIRA ticket}. Your answer must be in the format of a Git patch diff that can be applied to one of these files".
This prompt, with the JIRA bug description and code from your Github filled in, will get sent to some LLM chosen by some heuristic built into the agent - say it sends it to ChatGPT.
Then the agent will take the response from ChatGPT and try to parse it as a Git patch. If it respects git patch syntax, the agent will apply it to the repo and run something like `make build test`. If that runs without errors, it will open a PR in your Github and finally output the link to that PR for you to review.
If any of the steps fails, the agent will generate a new prompt for the LLM and try again, for some fixed number of iterations. It may also try a different LLM or try to generate various follow-ups to the LLM (say, it will send a new prompt in the same "conversation" like "compilation failed with the following issue: {output from make build}. Please fix this and generate a new patch."). If there is no success after some number of tries, it will give up and output error information.
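A condensed, purely illustrative sketch of that loop; `fetch_issue`, `call_llm`, and `open_pr` are made-up helpers standing in for the JIRA download, the model API, and the GitHub PR call.

```python
import pathlib
import subprocess

def run(cmd):
    return subprocess.run(cmd, capture_output=True, text=True)

def fix_issue(issue_key, fetch_issue, call_llm, open_pr, max_attempts=5):
    prompt = (
        "You will act as a master coder and fix the following issue:\n"
        f"{fetch_issue(issue_key)}\n"
        "Answer only with a git patch that applies to this repository."
    )
    for _ in range(max_attempts):
        patch = pathlib.Path("llm.patch")
        patch.write_text(call_llm(prompt))                 # the stochastic step
        if run(["git", "apply", "--check", str(patch)]).returncode != 0:
            prompt += "\n\nThat was not a valid patch. Send a corrected one."
            continue
        run(["git", "apply", str(patch)])
        build = run(["make", "build", "test"])
        if build.returncode == 0:
            return open_pr(issue_key)                      # hand back to a human
        run(["git", "checkout", "--", "."])                # roll back the bad patch
        prompt += f"\n\nThe build failed with:\n{build.stdout}\nFix it and resend the patch."
    raise RuntimeError(f"gave up on {issue_key} after {max_attempts} attempts")
```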
You can imagine many complications to this workflow - the agent may interrogate the LLM for more intermediate steps, it may ask the LLM to generate test code or even to generate calls to other services that the agent will then execute with whatever credentials it has.
It's a byzantine concept with lots of jerry-rigging that apparently actually works for some use cases. To me it has always seemed far too much work to get started before finding out if there is any actual benefit for the codebases I work on, so I can't say I have any experience with how well these things work and how much they end up costing.
The commands aren't arbitrary. They're particular --- you write the descriptions of the tools it's allowed to use, and it can only invoke those commands.
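For a sense of what "descriptions of the tools" looks like in practice, here's an illustrative allow-list in the JSON-schema style several model APIs use; the exact field names vary by vendor, so treat this as an example rather than a spec.

```python
# A short allow-list the agent hands to the model along with the prompt.
ALLOWED_TOOLS = [
    {
        "name": "run_tests",
        "description": "Run the project's test suite and return its output.",
        "parameters": {"type": "object", "properties": {}, "required": []},
    },
    {
        "name": "read_file",
        "description": "Read one file from the repository.",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
]
# Anything the model asks for that isn't in this list simply never runs.
```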
I've been interested in playing with this since reading the article, but I think I'll only have it run things in a dedicated VM. If it seems better than other LLM use, I'll gradually rely on it more, but likely keep its actions confined to the VM.
> There is zero chance that I'm going to let an AI start running arbitrary commands on my PC
The interface prompts you when the agent wants to run a command, like "The AI wants to run 'cargo add anyhow', is that ok?"
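Under the hood that gate is nothing exotic. A toy version (names made up; real tools like Cursor or Zed put this in their UI) looks roughly like:

```python
import subprocess

def run_with_approval(cmd):
    # Ask the human before anything touches the machine.
    answer = input(f"The AI wants to run {cmd!r}. Is that ok? [y/N] ")
    if answer.strip().lower() != "y":
        return "user declined; the command was not run"
    return subprocess.run(cmd, shell=True, capture_output=True, text=True).stdout
```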
They're not arbitrary, far from it. You have a very constrained set of tools each agent can use. An agent has a "job", if you will.
Maybe the agent feeds your PR to the LLM to generate some feedback, and posts the text to the PR as a comment. Maybe it can also run the linters and use that as input to the feedback.
But at the end of the day, all it's really doing is posting text in a GitHub comment. At worst it's useless feedback. And while I personally don't have much AI in my workflow today, when a bunch of smart people tell me the feedback can be useful, I can't help but be curious!
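For concreteness, a sketch of that kind of review bot; `get_pr_diff`, `call_llm`, and `post_comment` are invented wrappers standing in for the GitHub and model APIs, and the only side effect is a comment on the PR.

```python
import subprocess

def review_pr(pr_number, get_pr_diff, call_llm, post_comment):
    diff = get_pr_diff(pr_number)
    # Optionally fold linter output into the prompt as extra signal.
    lint = subprocess.run(["make", "lint"], capture_output=True, text=True)
    prompt = (
        "Review this pull request diff and point out bugs or style problems.\n"
        f"Linter output:\n{lint.stdout}\n\nDiff:\n{diff}"
    )
    post_comment(pr_number, call_llm(prompt))  # worst case: a useless comment
```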
> Is this what people are really doing?
Some people are, and some people are not. This is where some of the disconnect is coming from.
> Who is just turning AI loose to modify things as it sees fit?
With source control, why not? If it does something egregiously wrong, you can easily throw it away and get back to a previous state.
> If I'm not directing the work, how does it even know what to do?
You're directing the work, but at a higher level of abstraction.
> You're directing the work, but at a higher level of abstraction.
The article likens this to a Makefile. I gotta say, why not just use a Makefile and save the CO2?
Being kind of like a Makefile does not mean that they're equivalent. They're different tools, good for different things. That they happen to both be higher level than source code doesn't mean that they're substitutes.
This is how I work:
I use Cursor by asking it exactly what I want and how I want it. By default, Cursor has access to the files I open, and it can reference other files using grep or by running specific commands. It can edit files.
It performs well in a fairly large codebase, mainly because I don’t let it write everything. I carefully designed the architecture and chose the patterns I wanted to follow. I also wrote a significant portion of the initial codebase myself and created detailed style guides for my teammates.
As a result, Cursor (or rather the models you select, since Cursor is just a router for commercial models) handles small, focused tasks quite well. I also review every piece of code it generates. It's particularly good at writing tests, which saves me time.
Zed has a great four minute demo showing how it works: https://zed.dev/agentic
I personally have my Zed set up so the agent has to request every command be manually reviewed and approved before running.
I run Cursor in a mode that starts up shell processes, runs linters and tests on its own, updates multiple files, runs the linter and tests again, fixes failures, and so on. It auto-stops at 20 iterations through the feedback loop.
Depending on the task it works really well.
This example seems to keep coming up. Why do you need an AI to run linters? I have found that linters actually add very little value for an experienced programmer, and actually get in the way when I am in the middle of active development. I have to say I'm having a hard time visualizing the amazing revolution that is alluded to by the author.
Static errors are caught by linters before runtime errors are caught by a test suite. When you have an LLM in a feedback loop (otherwise known as an agent), the iterative calls to the LLM include the requests to and responses from linters and test suites, which can reassure the user, who typically follows along with the entire process, that the agent is writing better code than it otherwise would.
You're missing the point. The main thing the AI does is generate code based on a natural-language description of a problem. The linters and tests and so on exist to guide this process.
The initial AI-based workflows were "input a prompt into ChatGPT's web UI, copy the output into your editor of choice, run your normal build processes; if it works, great, if not, copy the output back to ChatGPT, get new code, rinse and repeat".
The "agent" stuff is trying to automate this loop. So as a human, you still write more or less the same prompt, but now the agent code automates that loop of generating code with an LLM and running regular tools on it and sending those tools' output back to the LLM until they succeed for you. So, instead of getting code that may not even be in the right programming language as you do from an LLM, you get code that is 100% guaranteed to run and passes your unit tests and any style constraints you may have imposed in your code base, all without extra manual interaction (or you get some kind of error if the problem is too hard for the LLM).
I let an agent upgrade some old C code that wouldn’t compile and had 100’s of warnings. It was running builds on its own, looking at new errors, etc. It even wrote some tests! I could’ve done this myself but it was a hobby project and tedious work. I was impressed.
You are giving it instructions, but it's running a while loop with a list of tools, and it can poke around in your codebase until it thinks it's done whatever you asked for.
See Claude Code, Windsurf, Amp, Kilo Code, Roo, etc.
I might describe a change I need made, it does it, and then I might say "Now the tests are failing. Can you fix them?" and so on.
Sometimes it works great; sometimes you find yourself arguing with the computer.