
Comment by pron

6 days ago

Test-driven development helps protect against wrong code, but it's not code I'm interested in, and it's not wrong code that I'm afraid of (I mean, that's table stakes). What I need is something that would help me generate understanding and do so reliably (even if the performance is poor). I can't exercise high-level knowledge efficiently if my only reliable input is code. Once you have to work at the code level at every step, there's no raising of the level of thought. The problem for me isn't that the agent might generate code that doesn't pass the test suite, but that it cannot reliably tell me what I need to know about the code. There's nothing I can reliably offload to the machine other than typing. That could still be useful, but it's not necessarily a game-changer.

Writing code in Java or Python as opposed to Assembly also raises the level of abstract thought. Not as much as we hope AI will someday be able to, but at least it does the job reliably enough. Imagine how useful Java or Python would be if they emitted the wrong machine instructions 10% of the time. If there's no trust in anything, then the offloading of effort is drastically diminished.

In my experience with Claude Code and Sonnet, it is absolutely possible to have architectural and design-oriented conversations about the work, at an entirely different and higher level than using a (formerly) high-level programming language. I have been able to learn new systems and frameworks far faster with Claude than with any previous system I have used. It definitely does require close attention to detect mistakes it does not realize it is making, but that is where the skill comes in. I find its being right 80% of the time and wrong 20% of the time a hugely acceptable tradeoff, when it allows me to go radically faster because it can do that 80% much more quickly than I could. Especially when it comes to learning new code bases and exploring new repos I have cloned -- it can read code superhumanly quickly and explain it to me in depth.

It is certainly a hugely different style of interaction, but it helps to think of it as a conversation, or more precisely, a series of small, targeted conversations, each aimed at researching a specific issue or solving a specific problem.

  • Indeed, I successfully use LLMs for research, and they're an improvement because old-school search isn't very reliable either.

    But as to the 80-20 tradeoff on other tasks, the problem isn't that the tool is wrong 20% of the time, but that it's not trustworthy 100% of the time. I have to check the work. Maybe that's still valuable, but just how valuable that is depends on many factors, some of which are very domain-dependent and others are completely subjective. We're talking about replacing one style with another that is much better in some respects and much worse in others. If, on the whole, it was better in almost all cases, that would be one thing (and make the investment safer), but reports suggest it isn't.

    I've yet to try an LLM to learn a new codebase, and I have no doubt it will help a lot, but while that is undoubtedly a very expensive task, it's also not a very frequent one. It could maybe save me a week per year, amortised. That's not nothing (and I will certainly give it a try next time I need to learn a new codebase), but it's also not a game-changer.

Without meaning to sound flippant or dismissive, I think you're overthinking it. By the sounds of it, agents aren't offering what you say you need. What they _are_ offering is the boilerplate, the research, the planning, etc. All the stuff that's ancillary. You could quite fairly say that it's in the pursuit of this stuff that details and ideas emerge, and I would agree, but sometimes you don't need ideas. You need solutions that are run-of-the-mill and boring.

  • I'm well aware that LLMs are more than capable of successfully performing straightforward, boring tasks 90% of the time. The problem is that there's a small but significant portion of the time where I think a problem is simple and straightforward, but it turns out not to be once you get into the weeds. If I can't trust the tool to tell me whether we're in the 90% problem or the 10% problem, then I have to carefully review everything.

    I'm used to working with tools, such as SMT solvers, that may fail to perform a task but don't lie about their success or failure (see the sketch at the end of this comment). Automation that doesn't either succeed or reliably report failure is not really automation.

    Again, I'm not saying that the work done by the LLM is useless, but the tradeoffs it requires make it dramatically different from how both tools and humans usually operate.
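    To make the contrast concrete, here is a minimal sketch using Z3's Python bindings (assuming the z3-solver package; the constraint is just an arbitrary toy example). The point is that the solver's answer is explicitly sat, unsat, or unknown, so "I couldn't do it" is a first-class, reliable result rather than a confidently wrong one:

        from z3 import Solver, Int, sat, unsat  # pip install z3-solver

        x = Int('x')
        s = Solver()
        s.add(x > 0, x < 0)        # a deliberately unsatisfiable toy constraint

        result = s.check()         # returns exactly one of: sat, unsat, unknown
        if result == sat:
            print("model:", s.model())
        elif result == unsat:
            print("proved unsatisfiable")
        else:
            print("solver gave up (unknown)")   # an honest failure, not a wrong answer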