
Comment by the_harpia_io

8 days ago

the quadratic curve makes sense but honestly what kills us more is the review cost - AI generates code fast but then you're stuck reading every line because it might've missed some edge case or broken something three layers deep. we burn more time auditing AI output than we save on writing it, and that compounds. the API costs are predictable at least

> AI generates code fast but then you're stuck reading every line because it might've missed some edge case or broken something three layers deep

If the abstraction the code uses is "right", there will be hardly any edge cases, or anything to break three layers deep.

Even though I am clearly an AI-hater, for this very specific problem I don't see the root cause in these AI models, but in the programmers who don't care about code quality and thus brutally reject code that is not of exceptional quality.

  • > programmers who don't care about code quality and thus brutally reject code that is not of exceptional quality.

    Is there a typo here? If they don't care about code why would they reject code based on quality?

    • > Is there a typo here?

      Indeed an accidental omission by me:

      programmers who don't care about code quality and thus don't brutally reject code that is not of exceptional quality.

  • I mean in theory yes, good abstractions solve a lot - but in practice you're rarely starting from a clean slate. you're integrating with third-party APIs that have weird edge cases, working with legacy code that wasn't designed for what you're doing now, dealing with requirements that change mid-implementation. even with great abstractions the real world bleeds through. and AI doesn't know which abstractions are 'right' for your specific context, it just pattern-matches what looks reasonable. so you end up reviewing not just for bugs but to make sure it's not subtly incompatible with your architecture

    • Good abstractions only get you easy wins for some percentage of the desirable tasks. They never guarantee 100% edge-case coverage unless the problem is trivial.

      Choosing wrong means huge tech debt. Choosing right just means most of your code will be happy path, and a few parts will need escape hatches. Not because of the abstraction, but because the target problem shifts uncontrollably, and because the problems you are solving typically require multiple abstractions, which are going to meet at the edges in the best case.


> then you're stuck reading every line because it might've missed some edge case or broken something

This is what tests are for. Humans famously write crap code. They read it and assume they know what's going on, but actually they don't. Then they modify a line of code that looks like it should work, and it breaks 10 things. Tests are there to catch when it breaks so you can go back and fix it.

Agents are supposed to run tests as part of their coding loops, modifying the code until the tests pass. Of course reward hacking means the AI might modify the test to 'just pass' to get around this. So the tests need to be protected from the AI (in their own repo, a commit/merge filter, or whatever you want) and curated by humans. Initial creation by the AI based on user stories, but test modifications go through a PR process and are scrutinized. You should have many kinds of tests (unit, integration, end-to-end, regression, etc), and you can have different levels of scrutiny (maybe the AI can modify unit tests on the fly, and in PRs you only look at the test modifications to ensure they're sane). You can also have a different agent with a different prompt do a pre-review to focus only on looking for reward hacks.
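The "protect the tests from the AI" idea above can be sketched as a pre-merge check. This is only an illustration: the path layout and policy are made up, and in CI the file list would come from something like `git diff --name-only main`.

```python
# Hypothetical pre-merge gate: reject agent diffs that touch protected
# test paths, so the agent can't make tests pass by editing them.
# The directory names and policy here are illustrative, not a real tool.

PROTECTED = ("tests/integration/", "tests/e2e/", "tests/regression/")
AGENT_EDITABLE = ("tests/unit/",)  # the agent may adjust unit tests on the fly

def violations(changed_files: list[str]) -> list[str]:
    """Return the protected test files present in the diff."""
    return [
        f for f in changed_files
        if f.startswith(PROTECTED) and not f.startswith(AGENT_EDITABLE)
    ]

# In CI you would feed this the output of `git diff --name-only main`
# and fail the merge if the returned list is non-empty.
print(violations(["src/app.py", "tests/unit/test_a.py"]))     # []
print(violations(["tests/e2e/test_login.py", "src/app.py"]))  # ['tests/e2e/test_login.py']
```

Different scrutiny levels then fall out of the two tuples: unit tests are freely editable, everything else requires a human-reviewed PR.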

  • Tests are not free; over-proliferation of AI-touched tests is itself a problem, similar to the problem of duplicative and verbose AI-generated code.

    And tests are inherently imperfect, they may not test the perfect layer, so they break when they shouldn't, and they certainly don't capture every premise.

    I'm on board with the tactics you suggest, but they are only incrementally helpful. What we really need is AI that removes duplicative code and unnecessary tests.

  • agree tests help but they only catch what you test for - and honestly a lot of codebases have patchy coverage at best. the bigger issue is when the AI misunderstands the task itself, like implementing the wrong thing correctly. tests won't catch that if they're based on the same misunderstanding. the reward hacking point is real though, seen that where it just makes tests pass by changing the test

> AI generates code fast but then you're stuck reading every line because it might've missed some edge case or broken something three layers deep

I imagine that in the future this will be tackled with a heavily test-driven approach and tight regulation of what the agent can and cannot touch. So frequent small PRs over big ones. Limit folder access to only those folders that need changing. Let it build the project. If it doesn't build, no PR submissions allowed. If a single test fails, no PR submissions allowed. And the tests will likely be the first if not the main focus in LLM PRs.
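A rough sketch of that gating policy, assuming a `make`-based project (the targets and the allowed folder are placeholders, not a real tool's API):

```python
# "No green, no PR": the agent's change is only eligible for submission
# if it stayed inside its allowed folder, the project builds, and the
# full test suite passes. Commands and paths are hypothetical.
import subprocess

ALLOWED_DIRS = ("src/payments/",)  # the only folder this task may touch

def within_sandbox(changed_files: list[str]) -> bool:
    """The agent's diff may only touch allow-listed folders."""
    return all(f.startswith(ALLOWED_DIRS) for f in changed_files)

def pr_allowed(changed_files: list[str]) -> bool:
    if not within_sandbox(changed_files):
        return False  # touched files outside its sandbox: no PR
    if subprocess.run(["make", "build"]).returncode != 0:
        return False  # doesn't build: no PR
    # A single failing test blocks submission.
    return subprocess.run(["make", "test"]).returncode == 0
```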

I use the term "LLM" and not "AI" because I notice that people have started attributing LLM related issues (like ripping off copyrighted material, excessive usage of natural resources, etc) to AI in general which is damaging for the future of AI.

  • yeah test-driven constraints help a lot - we've been moving that direction too, basically treating the agent like a junior dev who needs guard rails. the build+test gates catch the obvious stuff. but the trickier part is when tests pass but the code still isn't what you wanted - like it works but takes a fundamentally wrong approach, or adds unnecessary complexity. those are harder to catch with automation. re: LLM vs AI terminology - fair point, though I think the ship has sailed on general usage. most people just say AI to mean 'the thing that writes code' regardless of what's under the hood

  • > I use the term "LLM" and not "AI" because I notice that people have started attributing LLM related issues (like ripping off copyrighted material, excessive usage of natural resources, etc) to AI in general which is damaging for the future of AI.

    I think you have that backwards.

    The resource and copyright concerns stem from any of these "AI" technologies which require a training phase. Which, to my knowledge, is all of them.

    LLMs are just the main targets because they are the most used. Diffusion models have the same concerns.

What surprises me is that this obvious inefficiency isn't competed out of the market. I.e., this is clearly a suboptimal use of time, and yet lots of companies do it and don't get outcompeted by the ones that don't.

  • I think the issue is everyone's stuck in the same boat - the alternative to using AI and spending time reviewing is just writing it yourself, which takes even longer. so even if it's not a net win, it's still better than nothing. plus a lot of companies aren't actually measuring the review overhead properly - they see 'AI wrote 500 lines in 2 minutes' and call it a productivity win without tracking the 3 hours spent debugging it later. the inefficiency doesn't get competed out because everyone has the same constraints and most aren't measuring it honestly

To eliminate this tax I break anything gen-AI does into the smallest chunks possible.

  • yeah that helps - smaller chunks means less surface area to audit and easier to spot when it goes wrong. trade-off is you spend more time on the prompting/task breakdown but at least you're not debugging a 500-line diff

    • Yea, I just get anxious when I am responsible for something I don't really "know".

      I haven't been a full-time professional software developer for a while, but I was one for years, and when someone noticed a problem with one of my apps, I could mentally walk through the code and pretty much know where to look before I even got to my desk.

      I can't imagine letting Gen-AI (that is flat out wrong ~30% of the time) write huge swathes of code that I am now responsible for.

      But maybe that's just a "me" thing. In this new economy words and activity have replaced value and productivity.


[dead]

  •   > You should have test coverage, type checking, and integration tests that catch the edge cases automatically.
    

    You should assume that if you are going to cover edge cases, your tests will be tens to hundreds of times as big as the code under test. It is the case for several database engines (MariaDB has 24M of C++ in the sql directory and 288M of tests in the mysql-test directory), and it was the case when I developed a VHDL/Verilog simulator. And not everything can be covered with type checking; many things, but not all.

    AMD's FPU had hundreds of millions of test cases, and formal modeling still caught several errors [1].

    [1] https://www.cs.utexas.edu/~moore/acl2/v6-2/INTERESTING-APPLI...

    SQLite used to have 1100 LOC of tests per one LOC of C code; the multiplier is smaller now, but it is still big.

  • That's a lovely idea but it's just not possible to have tests that are guaranteed to catch everything. Even if you can somehow cover every single corner case that might ever arise (which you can't), there's no way for a test to automatically distinguish between "this got 2x slower because we have to do more work and that's an acceptable tradeoff" and "this got 2x slower because the new code is poorly written."

  • the observability point hits hard - you're right that the review cost is really about not being able to tell what's happening across runs or track quality over time. the CI gate approach makes sense in theory but honestly most teams I've seen don't have the test coverage to make that work safely, so you end up needing manual review anyway. also the 'does this do what I asked' question is harder than it sounds because sometimes the AI builds the wrong thing correctly, and your existing tests don't catch that since they're testing different assumptions. but yeah, the lack of cost-per-task tracking and quality metrics is brutal - you're flying blind on whether you're actually saving time or just moving the work around

  • I'd absolutely want to review every single line of code made by a junior dev because their code quality is going to be atrocious. Just like with AI output.

    Sure, you can go ahead and just stick your head in the sand and pretend all that detail doesn't exist, look only at the tests and the very high level structure. But, 2 years later you have an absolutely unmaintainable mess where the only solution is to nuke it from orbit and start from scratch, because not even AI models are able to untangle it.

    I feel like there are really two camps of AI users: those who don't care about code quality and implementation, only intent. And those who care about both. And for the former camp, it's usually not because they are particularly pedantic personalities, but because they have to care about it. "Move fast and break things" webapps can easily be vibe coded without too much worry, but there are many systems which cannot. If you are personally responsible, in monetary and/or legal aspects, you cannot blame the AI for landing you in trouble, just as much as a carpenter cannot blame his hammer for doing a shit job.

  • > You shouldn't need to read every line. You should have test coverage, type checking, and integration tests that catch the edge cases automatically.

    Because tests are always perfect and catch every corner case, and even detect all the unusual behaviour they are not testing for? Seems unrealistic. But it explains the sharp rise of AI slop and self-inflicted harm.

I disagree. I used to spend most of my time writing code, fixing syntax, thinking through how to structure the code, looking up documentation on how to use a library.

Now I first discuss with an AI Agent or ChatGPT to write a thorough spec before handing it off to an agent to code it. I don’t read every line. Instead, I thoroughly test the outcome.

Bugs that the AI agent would write, I would have also written. An example is unexpected data that doesn't match expectations. Can't fault the AI for those bugs.

I also find that the AI writes more bug free code than I did. It handles cases that I wouldn’t have thought of. It used best practices more often than I did.

Maybe I was a bad dev before LLMs but I find myself producing better quality applications much quicker.

  • > Example is unexpected data that doesn’t match expectations. Can’t fault the AI for those bugs.

    I don't understand, how can you not fault AI for generating code that can't handle unexpected data gracefully? Expectations should be defined, input validated, and anything that's unexpected should be rejected. Resilience against poorly formatted or otherwise nonsensical input is a pretty basic requirement.

    I hope I severely misunderstood what you meant to say because we can't be having serious discussions about how amazing this technology is if we're silently dropping the standards to make it happen.

    • yeah you're spot on - the whole "can't fault AI for bugs" mindset is exactly the problem. like, if a junior dev shipped code that crashed on malformed input we'd send it back for proper validation, why would we accept worse from AI? I keep seeing this pattern where people lower their bar because the AI "mostly works" but then you get these silent failures or weird edge case explosions that are way harder to debug than if you'd just written defensive code from the start. honestly the scariest bugs aren't the ones that blow up in your face, it's the ones that slip through and corrupt data or expose something three deploys later

    •   I don't understand, how can you not fault AI for generating code that can't handle unexpected data gracefully?
      

      Because I, the spec writer, didn't think of it. I would have made the same mistake if I wrote the code.

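The standard argued for above ("expectations should be defined, input validated, and anything that's unexpected should be rejected") is cheap to state in code. A minimal sketch, with a made-up "order" schema purely for illustration:

```python
# Define expectations explicitly; reject anything that doesn't match.
# The field names and accepted currencies here are hypothetical.
def parse_order(raw: dict) -> dict:
    qty = raw.get("quantity")
    # isinstance(True, int) is True in Python, hence the extra bool guard.
    if not isinstance(qty, int) or isinstance(qty, bool) or qty <= 0:
        raise ValueError("quantity must be a positive integer")
    if raw.get("currency") not in {"USD", "EUR", "GBP"}:
        raise ValueError(f"unsupported currency: {raw.get('currency')!r}")
    # Only explicitly expected fields survive; extras are dropped.
    return {"quantity": qty, "currency": raw["currency"]}
```

Whether a human or an AI wrote the handler, code that silently accepts malformed input shouldn't pass review.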

  • > Now I first discuss with an AI Agent or ChatGPT to write a thorough spec before handing it off to an agent to code it. I don’t read every line. Instead, I thoroughly test the outcome.

    This is likely the future.

    That being said: "I used to spend most of my time writing code, fixing syntax, thinking through how to structure the code, looking up documentation on how to use a library.".

    If you are spending a lot of time fixing syntax, have you looked into linters? If you are spending too much time thinking about how to structure the code, how about spending some days coming up with some general conventions or simply use existing ones.

    If you are getting so much productivity from LLMs, it is worth checking if you were simply unproductive relative to your average dev in the first place. If that's the case, you might want to think, what is going to happen to your productivity gains when everyone else jumps on the LLM train. LLMs might be covering for your unproductivity at the code level, but you might still be dropping the ball in non-code areas. That's the higher level pattern I would be thinking about.

    • I was a good dev but I did not love the code itself. I loved the outcome. Other devs would have done better on leetcode and they would have produced better code syntax than me.

      I’ve always been more of a product/business person who saw code as a way to get to the end goal.

      That elite coder who hates talking to business people and who cares more about the code than the business? Not me. I’m the opposite.

      Hence, LLMs have been far better for me in terms of productivity.


  • You have way more trust in test suites than I do. How complex is the code you’re working with? In my line of work most serious bugs surface in complex interactions between different subsystems that are really hard to catch in a test suite. Additionally in my experience the bugs AI produces are completely alien. You can have perfect code for large functions and then somewhere in the middle absolutely nonsensical mistakes. Reviewing AI code is really hard because you can’t use your normal intuitions and really have to check everything meticulously.