Comment by simonw

14 days ago

Using coding agents to track down the root cause of bugs like this works really well:

> Three out of three one-shot debugging hits with no help is extremely impressive. Importantly, there is no need to trust the LLM or review its output when its job is just saving me an hour or two by telling me where the bug is, for me to reason about it and fix it.

The approach described here could also be a good way for LLM-skeptics to start exploring how these tools can help them without feeling like they're cheating, ripping off the work of everyone whose code was used to train the model, or taking away the most fun part of their job (writing code).

Have the coding agents do the work of digging around hunting down those frustratingly difficult bugs - don't have them write code on your behalf.

I understand the pitch here ("it finds bugs! it's basically all upside because worst case there's no output anyways"), but I'm finding some of these agents to be ... uhhh... kind of aggressive at trying to find the solution and end up missing the forest for the trees. And there's some "oh you should fix this" stuff which, while sometimes isn't _wrong_, is completely beside the point.

The end result is these robots doing bikeshedding. When paired with junior engineers looking at this output and deciding to act on it, it just generates busywork. It doesn't help that everyone and their dog wants to automatically run their agent against PRs now.

I'm trying to use these to some extent when I find myself in a canonical situation where they should work, and I'm not getting the value everyone else seems to get in many cases. It's very much the "explaining a thing to a junior engineer takes more time than doing it myself" situation, except at least the junior is a person.

  • When models start to forage around in the weeds, it's a good idea to restart the session and add more information to the prompt about what it should ignore or assume. For example, in ML projects Claude gets very worried that datasets aren't available or are perhaps responsible for the bug. Usually if you tell it where you suspect the bug to be (or assert it outright, even if you're unsure) it will focus on that. Or, make it give you a list of concerns and ask you which are valid.

    I've found that having local clones of large library repos (or telling it to look in the environment for packages) is far more effective than relying on built-in knowledge or lousy web search. It can also use ast-grep on those. For some reason the agent frameworks are still terrible about looking up references in a sane way (where in an IDE you would simply go to declaration).

    • Yeah, I do the same: cloning reference repos into known paths and telling it to look there if unsure.

      Codex mostly handles this by itself: I've had it go searching my cargo cache for Rust source files, and when I used a crate via git instead of crates.io, it even cloned the repo to /tmp to inspect it properly. Claude Code seems less likely to do that unless you prompt it to; Codex has done it by itself so far.

  • Sometimes you hit a wall where something is simply outside of the LLM's ability to handle, and it's best to give up and do it yourself. Knowing when to give up may be the hardest part of coding with LLMs.

    Notably, these walls are never where I expect them to be—despite my best efforts, I can't find any sort of pattern. LLMs can find really tricky bugs and get completely stuck on relatively simple ones.

    • Doing it yourself is how you build and maintain the muscles to do it yourself. If you only do it yourself when the LLM fails, how will you maintain those muscles?

    • Sure, I agree with the "levels of automation" thought process. But I'm basically experiencing this from the start.

      If at the first step I'm already dealing with a robot in the weeds, I will have to spend time getting it out of the weeds, all for uncertain results afterwards.

      Now sometimes things are hard and tricky, and you might still save time... but just on an emotional level, it's unsatisfying.

  • Communication with a person is more difficult and the feedback loop is much, much longer. I can almost instantly tell whether Claude has understood the mission or digested context correctly.

  • I would say a lot of people are only posting their positive experiences. Stating negative things about AI is mildly career-dangerous at the moment, whereas the opposite looks good. I found the results from using it on a complicated code base are similar to yours, but it is very good at slapping things on until it works.

    If you're not watching it like a hawk it will solve a problem in a way that is inconsistent and, importantly, not integrated into the system. Which makes sense, it's been trained to generate code, and it will.

  • > I understand the pitch here ("it finds bugs! it's basically all upside because worst case there's no output anyways"), but I'm finding some of these agents to be ... uhhh... kind of aggressive at trying to find the solution and end up missing the forest for the trees. And there's some "oh you should fix this" stuff which, while sometimes isn't _wrong_, is completely beside the point.

    How long/big do your system/developer/user prompts end up being typically?

    The times people seem to be getting "less than ideal" responses from LLMs tend to be when they're not spending enough time setting up a general prompt they can reuse, describing exactly what they want and do not want.

    So in your case, you need to steer it to do less outside of what you've told it. Adding things like "Don't do anything outside of what I've just told you" or "Focus only on the things inside <step>", for example, would fix those particular problems, as long as you're not using models that are less good at following instructions (some of Google's models are borderline impossible to prevent from adding comments all over the place, as one example).

    So prompt it to not care about solutions and only care about finding the root cause, and you'll find that you can mostly avoid the annoying parts by either prescribing what you'd want instead or just straight up telling it not to do those things.

    Then you iterate on this reusable prompt across projects, and it builds up so that eventually 99% of the time the models do exactly what you expect.

  • Just ask it to prioritize the top ones for your review. Yes, they can bikeshed, but because they don’t have egos, they don’t stick to it.

    Alternatively, if it is in an area with good test coverage, let it go fix the minor stuff.

    • I don't like their fixes, so now I'm dealing with imperfect fixes to problems I don't care about. Tedium

  • Ok, fair critique.

    EXCEPT…

    What did you have for AI three years ago? Jack fucking shit is what.

    Why is “wow that’s cool, I wonder what it’ll turn into” a forbidden phrase, but “there are clearly no experts on this topic but let me take a crack at it!!” important for everyone to comment on?

    One word: Standby. Maybe that’s two words.

    • With all due respect, "wow this is cool, I wonder what it'll turn into" is basically the mandatory baseline stance to take. I'm lucky that's where I'm still basically at, because anyone in a technical position who shows even mild reticence beyond that is likely to be unable to hold a job in the face of their bosses' frothing enthusiastic optimism about these technologies.

    • Careful there, ChatGPT was initially released November 30, 2022, which was just about 3 years ago, and there were coding assistants before that.

      If you find yourself saying the same thing every year and adding 1 to the total...

  • So you feed the output into another LLM call to re-evaluate and assess, until the number of actual reports is small enough to be manageable (rough sketch of the loop below). Will this result in false negatives? Almost certainly. But what does come out the other end of it has a higher prior for being relevant, and you just review what you can.

    Again, worst case all you wasted was your time, and now you've bounded that.
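
    A rough sketch of that triage loop, just for its shape (in Go; callModel is a hypothetical stand-in for whatever agent or API call you actually use):

      package main

      import (
          "fmt"
          "strings"
      )

      // callModel is a hypothetical placeholder for a real LLM call: it takes a
      // prompt and returns a filtered list of findings, one per line.
      func callModel(prompt string) []string {
          // A real implementation would call an API; stubbed out here.
          return nil
      }

      // triage keeps re-feeding the findings to the model until the list is
      // small enough to review by hand, bounding the time spent either way.
      func triage(findings []string, maxReports int) []string {
          for len(findings) > maxReports {
              prompt := "Re-evaluate these bug reports and keep only the ones " +
                  "that are plausible and actionable, one per line:\n" +
                  strings.Join(findings, "\n")
              filtered := callModel(prompt)
              if len(filtered) == 0 || len(filtered) >= len(findings) {
                  break // the model cannot (or will not) shrink the list; stop here
              }
              findings = filtered
          }
          return findings
      }

      func main() {
          reports := []string{"possible nil deref in parser", "style nit in README"}
          fmt.Println(triage(reports, 1))
      }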

They're quite good at algorithm bugs, a lot less good at concurrency bugs, IME. That's still very valuable; it's just where I've seen the limits so far.

They're also better at making tests for algorithmic things than for concurrency situations, but can get pretty close. They just usually don't have great out-of-the-box ideas for "how to ensure these two different things run in the desired order."
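
For concreteness, the kind of ordering scaffold I usually end up suggesting myself looks something like this (a minimal Go sketch; the test and variable names are made up):

    package ordering_test

    import (
        "sync"
        "testing"
    )

    // TestWriteHappensBeforeRead pins the interleaving down explicitly: the
    // reader goroutine blocks on a channel the writer closes, so the
    // "write before read" ordering is guaranteed rather than hoped for.
    func TestWriteHappensBeforeRead(t *testing.T) {
        var (
            mu    sync.Mutex
            value int
        )
        writeDone := make(chan struct{})

        var wg sync.WaitGroup
        wg.Add(2)

        go func() { // writer
            defer wg.Done()
            mu.Lock()
            value = 42
            mu.Unlock()
            close(writeDone) // signal: the write has happened
        }()

        go func() { // reader
            defer wg.Done()
            <-writeDone // wait for the writer before touching value
            mu.Lock()
            got := value
            mu.Unlock()
            if got != 42 {
                t.Errorf("read %d, want 42", got)
            }
        }()

        wg.Wait()
    }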

Everything that I dislike about generating non-greenfield code with LLMs isn't relevant to the "make tests" or "debug something" usage. (Weird/bad choices about when to duplicate code vs refactor things, lack of awareness around desired "shape" of codebase for long-term maintainability, limited depth of search for impact/related existing stuff sometimes, running off the rails and doing almost-but-not-quite stuff that ends up entirely the wrong thing.)

  • Well if you know it's wrong, tell it, and why. I don't get the expectation of one-shotting everything 100% of the time. It's no different than bouncing ideas off a colleague.

    • I don't care about one-shotting; the stuff it's bad at debugging is the stuff where even when you tell it "that's not it" it just makes up another plausible-but-wrong idea.

      For code modifications in a large codebase the problem with multi-shot is that it doesn't take too many iterations before I've spent more time on it. At least for tasks where I'm trying to be lazy or save time.

    • It's painfully apparent when you've reached the limitations of an LLM to solve a problem it's ill-suited for (like a concurrency bug), because it will just keep spitting out nonsense, eventually going in circles or totally off the rails.

    • The weak points raised by the parent comment are specifically examples where the problem exists outside the model's "peripheral vision" from its context window and, speaking from personal experience, aren't as simple as adding a line to the CLAUDE.md saying "do this / don't do this".

      I agree that the popular "one shot at all costs / end the chat at the first whiff of a mistake" advice is much too reductive, but unlike with a colleague, after putting all that effort into developing a shared mental model of the desired outcome, you hit the max context and all that nuanced understanding instantly evaporates. You then have to hope the lossy compression into text instructions will actually steer it where you want next time, but from experience that is unfortunately far from certain.

    • Except it's not a colleague; it's not capable of ideation, it's taking your words and generating new ones based on them. Which can maybe be useful sometimes, but it's not really the same as bouncing ideas off a colleague.

I’ve been pretty impressed with LLMs at (to me) greenfield hobby projects, but not so much at work in a huge codebase.

After reading one of your blog posts recommending it, I decided to specifically give them a try as bug hunters/codebase explainers instead, and I've been blown away. Several hard-to-spot production bugs tracked down in two weeks or so, each of which would have taken me at least a few focused hours to find.

One of my favorite ways to use LLM agents for coding is to have them write extensive documentation on whatever I'm about to dig in coding on. Pretty low stakes if the LLM makes a few mistakes. It's perhaps even a better place to start for skeptics.

  • I am not so sure. Good documentation is hard; MDN and PostgreSQL are excellent examples of docs done well and of how valuable really well-written content can be for a project.

    LLMs can generate content but not really write; out of the box they tend to be quite verbose and generate a lot of pro forma content. Perhaps with the right kind of prompts, a lot of editing, and reviews you can get them to be good, but at that point it is almost the same as writing it yourself.

    It is a hard choice between lower-quality documentation (AI slop?) and leaving things lightly documented or fully undocumented. The uncanny valley of precision in documentation may be acceptable in some contexts, but it can be dangerous in others, and it is harder to differentiate because the depth of a doc means nothing now.

    Over time we find ourselves skipping LLM-generated documentation just like any other AI slop. The value placed on reading documentation erodes, finding good documentation becomes harder (like finding good online content today), and documentation as a whole gets devalued.

    • Sure, but LLMs tend to be better at navigating around documentation (or source code, when no documentation exists). In agentic mode, they can get me to the right part of the documentation (or the right part of the source code, especially in unfamiliar codebases) much quicker than I could do it myself without help.

      And I find that even the auto-generated stuff tends to sit at least a bit higher in level of abstraction than staring at the code itself, and works more like a "SparkNotes" version of the code, so that when you dig in yourself you have an outline/roadmap.

  • Same. Initially surprised by how good it was. Now I routinely do this on every new codebase. And this isn't JavaScript todo apps: it's large, complex distributed applications written in Rust.

  • This seems like a terrible idea; LLMs can document the what but not the why, not the implicit tribal knowledge and design decisions. Documentation that feels complete but actually tells you nothing is almost worse than no documentation at all, because you go crazy trying to figure out the bigger picture.

    • Have you tried it? It's absurdly useful.

      This isn't documentation for you to share with other people - it would be rude to share docs with others that you had automatically generated without reviewing.

      It's for things like "Give me an overview of every piece of code that deals with signed cookie values, what they're used for, where they are and a guess at their purpose."

      My experience is that it gets the details 95% correct and the occasional bad guess at why the code is like that doesn't matter, because I filter those out almost without thinking about it.

  • Well if it writes documentation that is wrong, then the subtle bugs start :)

    • Or even worse, it makes confident statements about the overarching architecture/design where every detail is correct but they aren't the right pieces, and because you forgot to add "Reject the prompt outright if the premise is incorrect", the LLM tries its hardest to just move forward, even when things are completely wrong.

      Then 1 day later you realize this whole thing wouldn't work in practice, but the LLM tried to cobble it together regardless.

      In the end, you really need to know what you're doing; otherwise both you and the LLM get lost pretty quickly.

I'm a bit of an LLM hater because they're overhyped. But in these situations they can be pretty nice if you can quickly evaluate correctness. If evaluating correctness is harder than searching on your own, then they're a net negative. I've found with my debugging it's really hard to know which will be the case. And as it's my responsibility to build a "Do I give the LLM a shot?" heuristic, that's very frustrating.

> start exploring how these tools can help them without feeling like they're [...] ripping off the work of everyone whose code was used to train the model

But you literally still are. If you weren't, it should be trivially easy to create these models without using huge swathes of non-public-domain code. Right?

  • It feels less like you're ripping off work if the model is helping you understand your own code as opposed to writing new code from scratch - even though the models were built in exactly the same way.

    If someone scraped every photo on the internet (along with their captions) and used the data to create a model that was used purely for accessibility purposes - to build tools which describe images to people with visual impairments - many people would be OK with that, where they might be justifiably upset at the same scraped data being used to create an image-generation model that competes with the artists whose work it was trained on.

    Similarly, many people were OK with Google scraping the entire internet for 20+ years to build a search engine that helps users find their content, but are unhappy about an identical scrape being used to train a generative AI model.

    • You're right that feelings are the key to convincing people, but your comparison is wrong.

      Search engines help website owners; they don't hurt them. Whether the goal of a website is to inform people, build reputation, or make money, search engines help with that. (Unless they output an excerpt so large that visiting your website is no longer necessary. There have been lawsuits about that.)

      LLMs take other people's work and regurgitate a mixed/mangled version (verbatim or not does not matter) without crediting/compensating the original authors, in a form which cannot easily be traced back to any individual authors even if you actively try.

      ---

      LLMs perform no work (creative or otherwise), do no original research, and have no taste - in fact they have no anchor to the real world except the training data. Literally everything they output is based on the training data, which took possibly quadrillions of hours of _human work_ and is now being resold without compensating the people who did it.

      Human time and natural resources are the only things with inherent value and now human time is being devalued and stolen.

I'm only an "AI sceptic" in the sense that I think today's LLMs cannot regularly and substantially reduce my workload, not because they aren't able to perform interesting programming tasks (they are!), but because they don't do so reliably, and for a regular and substantial reduction in effort, I think a tool needs to be reliable and therefore trustworthy.

Now, this story is a perfect use case, because Filippo Valsorda put very little effort into communicating with the agent. If it worked - great; if it didn't - no harm done. And it worked!

The thing is that I already know that these tools are capable of truly amazing feats, and this is, no doubt, one of them. But it's been a while since I had a bug in a single-file library implementing a well-known algorithm, so it still doesn't amount to a regular and substantial increase in productivity for me, but "only" to yet another amazing feat by LLMs (something I'm not sceptical of).

Next time I have such a situation, I'll definitely use an LLM to debug it, because I enjoy seeing such results first-hand (plus, it would be real help). But I'm not sure that it supports the claim that these tools can today offer a regular and substantial productivity boost.

I know this is not an argument against LLMs being useful for increasing productivity, but of all the tasks in my job as a software developer, hunting for and fixing obscure bugs is actually one of the most intellectually rewarding. I would miss that if it were to be taken over by a machine.

Also, hunting for bugs is often a very good way to get intimately familiar with the architecture of a system which you don't know well, and furthermore it improves your mental model of the cause of bugs, making you a better programmer in the future. I can spot a possible race condition or unsafe alien call at a glance. I can quickly identify a leaky abstraction, and spot mutable state that could be made immutable. All of this because I have spent time fixing bugs that were due to these mistakes. If you don't fix other people's bugs yourself, I fear you will also end up relying on an LLM to make judgements about your own code to make sure that it is bug-free.

  • > hunting for and fixing obscure bugs is actually one of the most intellectually rewarding. I would miss that if it were to be taken over by a machine.

    That's fascinating to me. It's the thing I literally hate the most.

    When I'm writing new code, I feel like I'm delivering value. When I'm fixing bugs, I feel like it's a frustrating waste of time caused by badly written code in the first place, making it a necessary evil. (Even when I was the one who wrote the original code.)

> Have the coding agents do the work of digging around hunting down those frustratingly difficult bugs - don't have them write code on your behalf.

Why? Bug hunting is more challenging and cognitively intensive than writing code.

  • Bug hunting tends to be interpolation, which LLMs are really good at. Writing code is often some extrapolation (or interpolating at a much more abstract level).

    • Reversed version: prompting up fresh code tends to be translation, which LLMs are really good at. Bug hunting is often some logical reasoning (or translating business needs at a much more abstract level).

  • Sometimes it's the end of the day and you've been crunching for hours already and you hit one gnarly bug and you just want to go and make a cup of tea and come back to some useful hints as to the resolution.

  • Why as in “why should it work” or “why should we let them do it”?

    For the latter, the good news is that you’re free to use LLMs for debugging or completely ignore them.

  • Because it's easy to automate.

    "this should return X, it returns Y, find out why"

    With enough tooling LLMs can pretty easily figure out the reason eventually.

This is no different than when LLMs write code. In both scenarios they often turn into bullshit factories that are capable, willing, and happy to write pages and pages of intricate, convincing-sounding explanations for bugs that don't exist, wasting everyone's time and testing my patience.

  • That's not my experience at all. When I ask them to track down the root cause of a bug, about 80% of the time they reply with a few sentences correctly identifying the source of the bug.

    1/5 times they get it wrong and I might waste a minute or two confirming what they missed. I can live with those odds.

    • I'm assuming you delegate for most of your bugs? I only ask when I'm stumped and at that point it's very prone to generating false positives.

I have tested the AI SAST tools that were hyped after a curl article, running them on several C code bases, and they found nothing.

Which low level code base have you tried this latest tool on? Official Anthropic commercials do not count.

  • You're posting this comment on a thread attached to an article where Filippo Valsorda - a noted cryptography expert - used these tools to track down gnarly bugs in Go cryptography code.