> AI generates code fast but then you're stuck reading every line because it might've missed some edge case or broken something three layers deep
If the abstraction that the code uses is "right", there will be hardly any edge cases, and nothing to break three layers deep.
Even though I am clearly an AI-hater, for this very specific problem I don't see the root cause in these AI models, but in the programmers who don't care about code quality and thus brutally reject code that is not of exceptional quality.
> programmers who don't care about code quality and thus brutally reject code that is not of exceptional quality.
Is there a typo here? If they don't care about code why would they reject code based on quality?
> Is there a typo here?
Indeed an accidental omission by me:
programmers who don't care about code quality and thus don't brutally reject code that is not of exceptional quality.
Good abstractions only get you easy wins for some percentage of the desirable tasks. They never guarantee 100% edge-case coverage unless the problem is trivial.
Choosing wrong means huge tech debt. Choosing right just means most of your code will be happy path, and a little will need escape hatches. Not because of the abstraction, but because the target problem shifts uncontrollably, and because the problems you are solving typically require multiple abstractions, which are going to meet at the edges in the best case.
> then you're stuck reading every line because it might've missed some edge case or broken something
This is what tests are for. Humans famously write crap code. They read it and assume they know what's going on, but actually they don't. Then they modify a line of code that looks like it should work, and it breaks 10 things. Tests are there to catch when it breaks so you can go back and fix it.
Agents are supposed to run tests as part of their coding loops, modifying the code until the tests pass. Of course reward hacking means the AI might modify the test to 'just pass' to get around this. So the tests need to be protected from the AI (in their own repo, a commit/merge filter, or whatever you want) and curated by humans. Initial creation by the AI based on user stories, but test modifications go through a PR process and are scrutinized. You should have many kinds of tests (unit, integration, end-to-end, regression, etc), and you can have different levels of scrutiny (maybe the AI can modify unit tests on the fly, and in PRs you only look at the test modifications to ensure they're sane). You can also have a different agent with a different prompt do a pre-review to focus only on looking for reward hacks.
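The "protect the tests from the agent" idea above can be made concrete with a tiny merge-filter sketch. Everything here (the path patterns, function names, and the human-approval flag) is an illustrative assumption, not anything from the thread:

```python
# Hypothetical merge-filter sketch: flag agent changes that touch protected
# test files so a human must sign off on them before the merge proceeds.
from fnmatch import fnmatch

# Assumed layout: curated tests live under these directories.
PROTECTED_PATTERNS = ["tests/*", "e2e/*"]

def protected_changes(changed_paths):
    """Return the subset of changed paths that fall under protected test dirs."""
    return [
        p for p in changed_paths
        if any(fnmatch(p, pat) for pat in PROTECTED_PATTERNS)
    ]

def gate(changed_paths, human_approved=False):
    """Block the merge when protected tests changed without human sign-off."""
    touched = protected_changes(changed_paths)
    if touched and not human_approved:
        return False, touched  # reject: tests modified, no human review yet
    return True, touched
```

A real setup would run something like this in CI or a server-side hook and route any flagged paths into the scrutinized PR process the comment describes.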
Tests are not free; over-proliferation of AI-touched tests is itself a problem, similar to the problem of duplicative and verbose AI-generated code.
And tests are inherently imperfect: they may not test the right layer, so they break when they shouldn't, and they certainly don't capture every premise.
I'm on board with the tactics you suggest, but they are only incrementally helpful. What we really need is AI that removes duplicative code and unnecessary tests.
> AI generates code fast but then you're stuck reading every line because it might've missed some edge case or broken something three layers deep
I imagine that in the future this will be tackled with a heavily test-driven approach and tight regulation of what the agent can and cannot touch. So frequent small PRs over big ones. Limit folder access to only those folders that need changing. Let it build the project. If it doesn't build, no PR submissions allowed. If a single test fails, no PR submissions allowed. And the tests will likely be the first, if not the main, focus in LLM PRs.
I use the term "LLM" and not "AI" because I notice that people have started attributing LLM related issues (like ripping off copyrighted material, excessive usage of natural resources, etc) to AI in general which is damaging for the future of AI.
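The gating rules described above (allowlisted folders only, build must succeed, no failing tests, otherwise no PR) might be sketched roughly like this. The `make` commands and the folder list are hypothetical assumptions:

```python
# Hypothetical PR-gate sketch for an LLM agent: the PR is only submittable
# when only allowlisted folders were touched, the project builds, and every
# test passes.
import subprocess

# Assumed scope for this task: the agent may only modify these folders.
ALLOWED_DIRS = ("src/billing/",)

def dirs_ok(changed_paths):
    """True when every changed file lives inside an allowlisted folder."""
    return all(p.startswith(ALLOWED_DIRS) for p in changed_paths)

def build_and_tests_pass(build_cmd=("make", "build"), test_cmd=("make", "test")):
    """Run the (assumed) build and test commands; both must exit 0."""
    build_ok = subprocess.run(build_cmd).returncode == 0
    # A single failing test blocks the PR, same as a broken build.
    return build_ok and subprocess.run(test_cmd).returncode == 0

def may_submit_pr(changed_paths):
    """Any out-of-scope edit, build failure, or failing test blocks the PR."""
    return dirs_ok(changed_paths) and build_and_tests_pass()
```

The point of the sketch is that the gate is enforced outside the agent's reach, so the agent cannot negotiate its way past a red build.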
> I use the term "LLM" and not "AI" because I notice that people have started attributing LLM related issues (like ripping off copyrighted material, excessive usage of natural resources, etc) to AI in general which is damaging for the future of AI.
I think you have that backwards.
The resource and copyright concerns stem from any of these "AI" technologies which require a training phase. Which, to my knowledge, is all of them.
LLMs are just the main targets because they are the most used. Diffusion models have the same concerns.
What surprises me is that this obvious inefficiency isn't competed out of the market. Ie this is clearly such a suboptimal use of time and yet lots of companies do it and don't get competed out by other ones that don't do this
Short-term thinking gets faster, more competitive results than long-term thinking does.
To eliminate this tax I break anything gen-AI does into the smallest chunks possible.
Yea, I just get anxious when I am responsible for something I don't really "know".
I haven't been a full-time professional software developer for a while, but I was one for years, and when someone noticed a problem with one of my apps, I could mentally walk through the code and pretty much know where to look before I even got to my desk.
I can't imagine letting Gen-AI (that is flat out wrong ~30% of the time) write huge swathes of code that I am now responsible for.
But maybe that's just a "me" thing. In this new economy words and activity have replaced value and productivity.
> You should have test coverage, type checking, and integration tests that catch the edge cases automatically.
You should assume that if you are going to cover edge cases, your tests will be tens to hundreds of times as big as the code under test. It is the case for several database engines (MariaDB has 24M of C++ in its sql directory and 288M of tests in its mysql-test directory), and it was the case when I developed a VHDL/Verilog simulator. And not everything can be covered with type checking; many things can, but not all.
AMD had hundreds of millions of test cases for its FPU, and formal modeling still caught several errors [1].
[1] https://www.cs.utexas.edu/~moore/acl2/v6-2/INTERESTING-APPLI...
SQLite used to have 1100 LOC of tests per one LOC of C code; the multiplier is smaller now, but it is still big.
That's a lovely idea but it's just not possible to have tests that are guaranteed to catch everything. Even if you can somehow cover every single corner case that might ever arise (which you can't), there's no way for a test to automatically distinguish between "this got 2x slower because we have to do more work and that's an acceptable tradeoff" and "this got 2x slower because the new code is poorly written."
As far as I know, SQLite has such tests, and probably others do as well.
I'd absolutely want to review every single line of code made by a junior dev because their code quality is going to be atrocious. Just like with AI output.
Sure, you can go ahead and just stick your head in the sand and pretend all that detail doesn't exist, look only at the tests and the very high level structure. But, 2 years later you have an absolutely unmaintainable mess where the only solution is to nuke it from orbit and start from scratch, because not even AI models are able to untangle it.
I feel like there are really two camps of AI users: those who don't care about code quality and implementation, only intent, and those who care about both. And for the latter camp, it's usually not because they are particularly pedantic personalities, but because they have to care about it. "Move fast and break things" webapps can easily be vibe coded without too much worry, but there are many systems which cannot. If you are personally responsible, in monetary and/or legal terms, you cannot blame the AI for landing you in trouble, just as a carpenter cannot blame his hammer for doing a shit job.
> You shouldn't need to read every line. You should have test coverage, type checking, and integration tests that catch the edge cases automatically.
Because tests are always perfect, catch every corner case, and even detect all unusual behaviour they are not testing for? Seems unrealistic. But it explains the sharp rise of AI slop and self-inflicted harm.
I disagree. I used to spend most of my time writing code, fixing syntax, thinking through how to structure the code, looking up documentation on how to use a library.
Now I first discuss with an AI Agent or ChatGPT to write a thorough spec before handing it off to an agent to code it. I don’t read every line. Instead, I thoroughly test the outcome.
Bugs that the AI agent would write, I would have also written. Example is unexpected data that doesn’t match expectations. Can’t fault the AI for those bugs.
I also find that the AI writes more bug-free code than I did. It handles cases that I wouldn’t have thought of. It uses best practices more often than I did.
Maybe I was a bad dev before LLMs but I find myself producing better quality applications much quicker.
> Example is unexpected data that doesn’t match expectations. Can’t fault the AI for those bugs.
I don't understand: how can you not fault AI for generating code that can't handle unexpected data gracefully? Expectations should be defined, input validated, and anything unexpected rejected. Resilience against poorly formatted or otherwise nonsensical input is a pretty basic requirement.
I hope I severely misunderstood what you meant to say because we can't be having serious discussions about how amazing this technology is if we're silently dropping the standards to make it happen.
Because I, the spec writer, didn't think of it. I would have made the same mistake if I wrote the code.
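The validate-and-reject discipline argued for above (define expectations, validate input, reject everything else) can be sketched in a few lines. The field names and bounds here are illustrative assumptions:

```python
# Minimal validate-then-reject sketch: expectations are stated up front and
# anything outside them raises instead of flowing downstream as a surprise.

def validate_order(payload: dict) -> dict:
    """Return a normalized order, or raise ValueError for unexpected input."""
    if not isinstance(payload, dict):
        raise ValueError("payload must be an object")
    # Unknown fields are rejected, not silently ignored.
    unknown = set(payload) - {"item_id", "quantity"}
    if unknown:
        raise ValueError(f"unexpected fields: {sorted(unknown)}")
    try:
        item_id = str(payload["item_id"])
        quantity = int(payload["quantity"])
    except (KeyError, TypeError, ValueError) as exc:
        raise ValueError(f"malformed order: {exc}") from exc
    # Bounds are explicit; out-of-range data is an error, not a guess.
    if not 1 <= quantity <= 1000:
        raise ValueError("quantity out of range")
    return {"item_id": item_id, "quantity": quantity}
```

The design choice is that every failure mode is funneled into one exception type at the boundary, so callers handle "bad input" in exactly one place.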
> Now I first discuss with an AI Agent or ChatGPT to write a thorough spec before handing it off to an agent to code it. I don’t read every line. Instead, I thoroughly test the outcome.
This is likely the future.
That being said: "I used to spend most of my time writing code, fixing syntax, thinking through how to structure the code, looking up documentation on how to use a library."
If you are spending a lot of time fixing syntax, have you looked into linters? If you are spending too much time thinking about how to structure the code, how about spending a few days coming up with some general conventions, or simply using existing ones?
If you are getting so much productivity from LLMs, it is worth checking whether you were simply unproductive relative to the average dev in the first place. If that's the case, you might want to think about what is going to happen to your productivity gains when everyone else jumps on the LLM train. LLMs might be covering for your low productivity at the code level, but you might still be dropping the ball in non-code areas. That's the higher-level pattern I would be thinking about.
I was a good dev but I did not love the code itself. I loved the outcome. Other devs would have done better on leetcode and they would have produced better code syntax than me.
I’ve always been more of a product/business person who saw code as a way to get to the end goal.
That elite coder who hates talking to business people and who cares more about the code than the business? Not me. I’m the opposite.
Hence, LLMs have been far better for me in terms of productivity.
You have way more trust in test suites than I do. How complex is the code you’re working with? In my line of work most serious bugs surface in complex interactions between different subsystems that are really hard to catch in a test suite. Additionally in my experience the bugs AI produces are completely alien. You can have perfect code for large functions and then somewhere in the middle absolutely nonsensical mistakes. Reviewing AI code is really hard because you can’t use your normal intuitions and really have to check everything meticulously.
If they’re hard to catch with a comprehensive suite of tests, what makes you think you can catch them by hand-coding?