Comment by wild_egg

2 days ago

After a certain experience level, though, I think most of us get to the point of knowing when that difference in quality actually matters.

Some seniors love to bikeshed PRs all day because they could do it better, but generally that activity has zero actual value. Sometimes it matters; often it doesn't.

Stop with the "I could do this better by hand" and ask "is it worth the extra 4 hours to do this by hand, or is this actually good enough to meet the goals?"

LLM-generated code is technical debt. If you are still working on the codebase the next day, it will bite you. It might be as simple as an inconvenient interface or a bunch of duplicated functions that could just be imported, but eventually you are going to have to pay it down.
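
As a hypothetical illustration of the duplication problem (module and function names invented for the example), this is the shape it usually takes:

    # utils/dates.py: a helper that already exists in the codebase
    from datetime import datetime, timezone

    def to_iso(ts: float) -> str:
        """Render a Unix timestamp as an ISO-8601 UTC string."""
        return datetime.fromtimestamp(ts, tz=timezone.utc).isoformat()

    # reports/export.py: LLM-generated duplicate; it could simply have done
    # `from utils.dates import to_iso` instead of rewriting the helper
    def format_timestamp(ts: float) -> str:
        # Subtly different: produces a naive (timezone-less) string,
        # so its output silently disagrees with to_iso
        return datetime.utcfromtimestamp(ts).isoformat()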

  • All code is technical debt though. We can't spend infinite hours finding the absolute minimum of technical debt introduced for a change, so it is a matter of finding the right balance. That balance is highly dependent on a huge number of factors: how core the system is, what the system is used for, what stage of development the system is in, etc.

    • I spend about half my day working on LLM-generated code and half my day working on non-LLM-generated code, some written by senior devs, some written by juniors.

      The LLM-generated code is by far the worst technical debt. And a fair bit of that time is spent debugging subtle issues where it doesn't quite do what was prompted.

  • Untested, undocumented LLM code is technical debt, but if you write specs and tests it's actually the opposite: you can go beyond technical debt and regenerate your code as you like. You just need the tests to be so good they guarantee the behavior you care about, and that is easier in our age of AI coding agents.
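
    A minimal sketch of what I mean, with a hypothetical `slugify` function standing in for the generated code: the tests pin the behavior, so the implementation behind them can be thrown away and regenerated at will.

        import pytest

        # Hypothetical module: regenerate its implementation freely;
        # these tests are the contract that any version must satisfy.
        from myproject.text import slugify

        @pytest.mark.parametrize("raw, expected", [
            ("Hello, World!", "hello-world"),
            ("  spaces   everywhere  ", "spaces-everywhere"),
            ("already-slugged", "already-slugged"),
        ])
        def test_slugify_pins_behavior(raw, expected):
            assert slugify(raw) == expected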

    • > but if you do specs and tests it's actually the opposite, you can go beyond technical debt and regenerate your code as you like.

      Having to write all the specs and tests just right so you can regenerate the code until you get the desired output sounds like an expensive version of the infinite monkey theorem, but with LLMs instead of monkeys.

    • ... so you hand-write the specs and tests?

      I use LLMs to generate tests as well, but sometimes the tests are also buggy. As any competent dev knows, writing high-quality tests generally takes more time than writing the original code.
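
      For instance (hypothetical code), a generated test can pass while verifying nothing about the real behavior:

          from unittest.mock import patch

          # A buggy "test" of the kind an LLM sometimes produces: it patches
          # the very function under test, so the assertion only checks the
          # mock and can never fail, even if billing is completely broken.
          @patch("billing.calculate_invoice_total")  # hypothetical module
          def test_invoice_total(mock_total):
              mock_total.return_value = 100
              assert mock_total(order_id=1) == 100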

  • In your comment, replace “LLM” with “Human SWE” and the statement will still be correct in the vast majority of situations :)

"actually good enough to meet the goals?"

There's "okay for now" and then there's "this is so crap that if we set our bar this low we'll be knee deep in tech debt in a month".

A lot of LLM output in the specific areas _I_ work in is firmly in that latter category and many times just doesn't work.

  • So I can tell you don’t use these tools, or at least not much, because at the speed of development with them you’ll be knee-deep in tech debt in a day, not a month. But as a corollary, you can have the same agentic coding tools undergo the equivalent of weeks of addressing tech debt the next day. Granted, I think this applies to greenfield, AI-first projects that work this way from the get-go and with few humans in the loop (human-to-human communication definitely becomes the rate-limiting step). But I imagine that’s not the nature of your work.

    • Yes, if I went hard on something greenfield, I'm sure I'd be knee-deep in tech debt in less than a day.

      That being said, given the quality of code these things produce, I just don't see that ever ceasing to be the case. These things require a lot of supervision, and at some point you are spending more time asking for revisions than you would spend just writing it yourself.

      There's a world of difference between an MVP, which, in the right domain, you can get done much faster now, and a finished product.

    • I think you missed your parent post's phrase "in the specific areas _I_ work in" ... LLMs are a lot better at CRUD and boilerplate than at novel hardware interfaces and a bunch of other domains.

  • I mean, there's also, "this looks fine but if I actually had written this code I would've naturally spent more time on it which would have led me to anticipate the future of this code just a little bit more and I will only feel that awkwardness when I come back to this code in two weeks, and then we'll do it all over again". It's a spectrum.

    • Right.

      And greenfield code is some of the most enjoyable to write, yet apparently we should let the robots do the thing we enjoy most and reserve the most miserable tasks for humans, since the robots appear to be unable to do those.

      I have yet to see an LLM or coding agent that can be prompted with "Please fix subtle bugs" or "Please retire this technical debt as described in issue #6712."

I think this is an opportunity for that bell curve/enlightenment meme. Of course, as you get a little past junior, you often get hung up on the best way of doing things, forgetting that best is the enemy of good enough. But truly senior devs know the difference. And those are the ones who, by and large, still think LLMs are bad at generating code where quality (reliability, sustainability, security, etc.) matters. Everyone admits that LLMs are good for low-stakes code.

Perhaps writing code by hand will be considered a micro optimisation in the future.

Just like writing assembly is today.

Now, sometimes that's 4 hours, but I've had plenty of times where I'm "racing" people using LLMs and I basically get the coding done before them. Once I debugged an issue before the robot was done `ls`-ing the codebase!

The shape of the problem is super important in considering the results here.

  • You have the upper hand with familiarity of the code base. Any "domain expert" also necessarily has a head start knowing which parts of a bespoke complex system need adjustment when making changes.

    On the other hand, a highly skilled worker who just joined the team won't have any of that tribal knowledge. There is a significant lag time getting ramped up, no matter how intelligent they are, simply due to sheer scale (and complexity doesn't help).

    A general purpose model is more like the latter than the former. It would be interesting to compare how a model fine tuned on the specific shape of your code base and problem domain performs.

  • People usually talk about how they're better than LLMs in the domains they're experts in and with codebases they know.

    What about the large number of other cases? Don't you ever face situations in which an LLM can greatly help (and outrace) you?

    • Yeah, totally: for unknown codebases it can help kick you off in the right direction (though it can send you down a totally wrong path as well... ironically, projects with good docs tend to be the ones where I've found LLMs worse at this).

      But well... when working with coworkers on known projects, it's a different story, right?

      My stance is that these tools are, of course, useful, but humans can most definitely be faster than the current iteration of these tools at a good number of tasks, and some forms of debugging are like that for me. The ones I've tried have been too prone to meandering and to trying too many "top results on Google"-style fixes.

      But hey, maybe I'm just holding it wrong! Just seems like some of my coworkers are too.