Comment by rtpg
15 days ago
I understand the pitch here ("it finds bugs! it's basically all upside because worst case there's no output anyways"), but I'm finding some of these agents to be ... uhhh... kind of aggressive at trying to find the solution, and they end up missing the forest for the trees. And there's some "oh you should fix this" stuff which, while sometimes not _wrong_, is completely beside the point.
The end result is these robots bikeshedding. When paired with junior engineers who look at this output and decide to act on it, it just generates busywork. It doesn't help that everyone and their dog wants to automatically run their agent against PRs now.
I'm trying to use these when I find myself in a canonical situation where they should work, and in many cases I'm not getting the value everyone else seems to get. It's very much the "explaining a thing to a junior engineer takes more time than doing it myself" experience, except at least the junior is a person.
When models start to forage around in the weeds, it's a good idea to restart the session and add more information to the prompt about what it should ignore or assume. For example, in ML projects Claude gets very worried that datasets aren't available or are somehow to blame. Usually if you tell it outright where you suspect the bug is (even if you're unsure), it will focus on that. Or make it give you a list of concerns and ask you which are valid.
I've found that having local clones of large library repos (or telling it to look in the environment for packages) is far more effective than relying on built-in knowledge or lousy web search. It can also use ast-grep on those. For some reason the agent frameworks are still terrible about looking up references in a sane way (where in an IDE you would simply go to declaration).
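Roughly, the pattern looks like this; a minimal sketch, assuming ast-grep is installed and the reference clones live under a known path like ~/refs (the path, pattern, and helper name are illustrative, not something any agent framework provides):

```python
# Sketch: structural search over a locally cloned library repo with ast-grep,
# instead of trusting the model's built-in knowledge or web search.
import subprocess
from pathlib import Path

def find_declaration(pattern: str, lang: str, repo: str) -> str:
    """Run ast-grep over a locally cloned repo and return any matches."""
    repo_path = Path.home() / "refs" / repo  # known path the agent is told about
    result = subprocess.run(
        ["ast-grep", "run", "--pattern", pattern, "--lang", lang, str(repo_path)],
        capture_output=True,
        text=True,
        check=False,  # don't raise just because ast-grep exits non-zero
    )
    return result.stdout

if __name__ == "__main__":
    # e.g. find function definitions in a vendored Rust crate
    print(find_declaration("fn $NAME($$$ARGS)", "rust", "some-crate"))
```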
Context7 MCP is the one I keep enabled for all sessions. Then there are MCPs that give LSP access to the models as well as tools like Crush[0] that have LSPs built in.
[0] https://github.com/charmbracelet/crush
Yeah, I do the same: clone reference repos into known paths and tell it to look there if it's unsure.
Codex mostly handles this by itself. I've had it go searching in my cargo cache for Rust source files, and even when I used a crate via git instead of crates.io, it went ahead and cloned the repo to /tmp to inspect it properly. Claude Code seems less likely to do that unless you prompt it to; Codex has done it on its own so far.
Sometimes you hit a wall where something is simply outside of the LLM's ability to handle, and it's best to give up and do it yourself. Knowing when to give up may be the hardest part of coding with LLMs.
Notably, these walls are never where I expect them to be—despite my best efforts, I can't find any sort of pattern. LLMs can find really tricky bugs and get completely stuck on relatively simple ones.
Doing it yourself is how you build and maintain the muscles to do it yourself. If you only do it yourself when the LLM fails, how will you maintain those muscles?
I agree, and I can actively feel myself slipping (and perhaps more critically, not learning new skills I would otherwise have been forced to learn). It's a big problem, but somewhat orthogonal to "what is the quickest way to solve the task currently in front of me."
By moving up a level in the abstraction stack, similar to moving from Assembly to C++ to Python (to LLM). There's speed in delegation (and the checking is beneficial too).
If the LLM is able to handle it why do you need to maintain those specific skills?
Sure, I agree with the "levels of automation" thought process. But I'm basically experiencing this from the start.
If at the first step I'm already dealing with a robot in the weeds, I will have to spend time getting it out of the weeds, all for uncertain results afterwards.
Now sometimes things are hard and tricky, and you might still save time... but just on an emotional level, it's unsatisfying.
Communication with a person is more difficult and the feedback loop is much, much longer. I can almost instantly tell whether Claude has understood the mission or digested context correctly.
I would say a lot of people are only posting their positive experiences. Stating negative things about AI is mildly career-dangerous at the moment, whereas the opposite looks good. I found the results from using it on a complicated code base are similar to yours, but it is very good at slapping things on until it works.
If you're not watching it like a hawk it will solve a problem in a way that is inconsistent and, importantly, not integrated into the system. Which makes sense: it's been trained to generate code, and it will.
> I understand the pitch here ("it finds bugs! it's basically all upside because worst case there's no output anyways"), but I'm finding some of these agents to be ... uhhh... kind of aggressive at trying to find the solution, and they end up missing the forest for the trees. And there's some "oh you should fix this" stuff which, while sometimes not _wrong_, is completely beside the point.
How long/big do your system/developer/user prompts end up being typically?
The times people seem to be getting "less than ideal" responses from LLMs tend to be when they're not spending enough time setting up a general prompt they can reuse, describing exactly what they want and do not want.
So in your case, you need to steer it to do less outside of what you've told it. Adding things like "Don't do anything outside of what I've just told you" or "Focus only on the things inside <step>", for example, would fix those particular problems, as long as you're not using models that are less good at following instructions (some of Google's models are borderline impossible to stop from adding comments all over the place, as one example).
So prompt it not to care about solutions, only about finding the root cause, and you'll find that you can mostly avoid the annoying parts by either prescribing what you'd want instead or just straight up telling it not to do those things.
Then you iterate on this reusable prompt across projects, and it builds up so that eventually, 99% of the time, the models do exactly what you expect.
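As a concrete (hypothetical) illustration, the reusable part can be as simple as a preamble you prepend to every task; the wording and the build_prompt helper below are just a sketch, not any particular tool's API:

```python
# A hypothetical reusable preamble, prepended to every task prompt and refined
# over time as you notice behaviours you want to rule out.
SCOPE_PREAMBLE = """\
Focus only on finding the root cause of the issue described below.
Do not propose refactors, style fixes, or improvements outside that scope.
Do not change anything beyond the files I name.
If you are unsure, list your top concerns and ask me which to pursue before acting.
"""

def build_prompt(task: str) -> str:
    """Combine the reusable preamble with a per-task description."""
    return SCOPE_PREAMBLE + "\n" + task

if __name__ == "__main__":
    print(build_prompt("The nightly export job silently drops rows; find out why."))
```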
Just ask it to prioritize the top ones for your review. Yes, they can bikeshed, but because they don’t have egos, they don’t stick to it.
Alternatively, if it is in an area with good test coverage, let it go fix the minor stuff.
I don't like their fixes, so now I'm dealing with imperfect fixes to problems I don't care about. Tedium
> except at least the junior is a person.
+1 Juniors can learn over time.
Ok, fair critique.
EXCEPT…
What did you have for AI three years ago? Jack fucking shit is what.
Why is “wow that’s cool, I wonder what it’ll turn into” a forbidden phrase, but “there are clearly no experts on this topic but let me take a crack at it!!” important for everyone to comment on?
One word: Standby. Maybe that’s two words.
With all due respect, "wow this is cool, I wonder what it'll turn into" is basically the mandatory baseline stance to take. I'm lucky that's where I'm still basically at, because anyone in a technical position who shows even mild reticence beyond that is likely to be unable to hold a job in the face of their bosses' frothing enthusiasm about these technologies.
Is it that bad out there? Yeah, I don't think I could last in a job that tries to force these tools into my workflow.
Careful there, ChatGPT was initially released November 30, 2022, which was just about 3 years ago, and there were coding assistants before that.
If you find yourself saying the same thing every year and adding 1 to the total...
So you feed the output into another LLM call to re-evaluate and assess, until the number of actual reports is small enough to be manageable. Will this result in false negatives? Almost certainly. But what does come out the other end has a higher prior of being relevant, and you just review what you can.
Again, worst case all you wasted was your time, and now you've bounded that.
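A minimal sketch of what that triage loop could look like, assuming some call_llm function for whichever model API you use (the function, thresholds, and prompt wording are all placeholders):

```python
from typing import Callable, List

def triage(findings: List[str], call_llm: Callable[[str], str],
           max_reports: int = 10, max_rounds: int = 3) -> List[str]:
    """Repeatedly ask a second model pass to keep only likely-real issues."""
    for _ in range(max_rounds):
        if len(findings) <= max_reports:
            break  # already small enough to review by hand
        prompt = (
            "Review the following bug reports. Keep only the ones that look like "
            "real, actionable defects; drop style nits and speculation. "
            "Return one report per line.\n\n" + "\n".join(findings)
        )
        findings = [line for line in call_llm(prompt).splitlines() if line.strip()]
    return findings
```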