Comment by jszymborski
6 months ago
The argument that I've heard against LLMs for code is that they create bugs that, by design, are very difficult to spot.
The LLM has one job, to make code that looks plausible. That's it. There's no logic gone into writing that bit of code. So the bugs often won't be like those a programmer makes. Instead, they can introduce a whole new class of bug that's way harder to debug.
This is exactly what I wrote about when I wrote "Copilot Induced Crash" [0]
Funny story: when I first posted that and had a couple of thousand readers, I had many comments of the type "you should just read the code carefully on review", but _nobody_ pointed out the fact that the opening example (the so called "right code") had the exact same problem as described in the article, proving exactly what you just said: it's hard to spot problems that are caused by plausibility machines.
[0] https://www.bugsink.com/blog/copilot-induced-crash/
If it crashes, you are very lucky.
AI generated code will fuck up so many lives. The post office software in the UK did it without AI. I cannot imagine the way and the number of lives will be ruined since some consultancy vibe coded some government system. I might come to appreciate the German bureaucracy and backwardness.
My philosophy is to let the LLM either write the logic or write the tests - but not both. If you write the tests and it writes the logic and it passes all of your tests, then the LLM did its job. If there are bugs, there were bugs in your tests.
That rather depends on the type of bug and what kinds of tests you would write.
LLMs are way faster than me at writing tests. Just prompt for the kind of test you want.
Idk about you but I spend much more time thinking about what ways the code is likely to break and deciding what to test. Actually writing tests is usually straightforward and fast with any sane architecture with good separation of concerns.
I can and do use AI to help with test coverage but coverage is pointless if you don’t catch the interesting edge cases.
> My philosophy is to let the LLM either write the logic or write the tests - but not both. If you write the tests and it writes the logic and it passes all of your tests, then the LLM did its job. If there are bugs, there were bugs in your tests.
Maybe use one LLMs to write the code and a wildly different one to write the tests and yet another wildly different one to generate an English description of each test while doing critical review.
Disagree. You could write millions of tests for a function that simply sums two numbers, and it’s trivial to insert bugs while passing that test.
This is pretty nifty, going to try this out!
I don't agree. What I do agree on is to do it not only with one LLM.
Quality increases if I double check code with a second LLM (especially o4 mini is great for that)
Or double check tests the same way.
Maybe even write tests and code with different LLMs if that is your worry.
Yes, exactly - my (admittedly very limited!) experience has consistently generated well-written, working code that just doesn’t quite do what I asked. Often the results will be close to what I expect, and the coding errors do not necessarily jump out on a first line-by-line pass, so if I didn’t have a high degree of skepticism of the generated code in the first place, I could easily just run with it.
> working code that just doesn’t quite do what I asked
Code that doesn't do what you want isn't "working", bro.
Working exactly to spec is the code's only job.
It is a bit ambiguous I think, there is also the meaning of "the code compiles/runs without errors". But I also prefer the meaning of, "code that is working to the spec".
For me it's mostly about the efficiency of the code they write. This is because I work in energy where efficiency matters because our datasets are so ridicilously large and every interface to that data is so ridicilously bad. I'd argue that for 95% of the software out there it won't really matter if you use a list or a generator in Python to iterate over data. It probably should and maybe this will change with cloud costs continious increasing, but we do also live in a world where 4chan ran on some apache server running a 10k line php file from 2015...
Anyway, this is where AI's have been really bad for us. As well as sometimes "overengineering" their bug prevention in extremely inefficient ways. The flip-side of this is of course that a lot of human programmers would make the same mistakes.
I’ve had the opposite experience. Just tell it to optimise for speed and iterate and give feedback. I’ve had JS code optimised specifically for v8 using bitwise operations. It’s brilliant.
Example code or it's just a claim :)
1 reply →
>Instead, they can introduce a whole new class of bug that's way harder to debug
That sounds like a new opportunity for a startup that will collect hundreds of millions a of dollars, brag about how their new AI prototype is so smart that it scares them, and devliver nothing
> There's no logic gone into writing that bit of code.
What makes you say that? If LLMs didn't reason about things, they wouldn't be able to do one hundredth of what they do.
This is a misunderstanding. Modern LLMs are trained with RL to actually write good programs. They aren't just spewing tokens out.
No, YOU misunderstand. This isn't a thing RL can fix
It doesn't optimize "good programs". It interprets "humans interpretation of good programs." More accurately, "it optimizes what low paid over worked humans believe are good programs." Are you hiring your best and brightest to code review the LLMs?
Even if you do, it still optimizes tricking them. It will also optimize writing good programs, but you act like that's a well defined and measurable thing.
Those links mostly discuss the original RLHF used to train e.g. ChatGPT 3.5. Current paradigms are shifting towards RLVR (reinforcement learning with verifiable rewards), which absolutely can optimize good programs.
You can definitely still run into some of the problems eluded to in the first link. Think hacking unit tests, deception, etc -- but the bar is less "create a perfect RL environment" than "create an RL environment where solving the problem is easier than reward hacking." It might be possible to exploit a bug in the Lean 4 proof assistant to prove a mathematical statement, but I suspect it will usually be easier for an LLM to just write a correct proof. Current RL environments aren't as watertight as Lean 4, but there's certainly work to make them more watertight.
This is in no way a "solved" problem, but I do see it as a counter to your assertion that "This isn't a thing RL can fix." RL is powerful.
5 replies →
I don't know if any of this applies to the arguments in my article, but most of the point of it is that progress in code production from LLMs is not a consequence of better models (or fine tuning or whatever), but rather on a shift in how LLMs are used, in agent loops with access to ground truth about whether things compile and pass automatic acceptance. And I'm not claiming that closed-loop agents reliably produce mergeable code, only that they've broken through a threshold where they produce enough mergeable code that they significantly accelerate development.
10 replies →
This is just semantics. What's the difference between a "human interpretation of a good program" and a "good program" when we (humans) are the ones using it? If the model can write code that passes tests, and meets my requirements, then it's a good programmer. I would expect nothing more or less out of a human programmer.
17 replies →
"Good" is the context of LLMs means "plausible". Not "correct".
If you can't code then the distinction is lost on you, but in fact the "correct" part is why programmers get paid. If "plausible" were good enough then the profession of programmer wouldn't exist.
Not necessarily. If the RL objective is passing tests then in the context of LLMs it means "correct", or at least "correct based on the tests".
1 reply →
They are also trained with RL to write code to pass unit tests and Claude does have a big problem with trying to cheat the test or request pretty quickly after running into issues, making manual edit approval more important. It usually still tells what it is trying to do wrong so you can often find out from its summary before having to scan the diff.
This can happen, but in practice, given I'm reviewing every line anyway, it almost never bites me.