Comment by mindwok
6 days ago
This is a misunderstanding. Modern LLMs are trained with RL to actually write good programs. They aren't just spewing tokens out.
6 days ago
This is a misunderstanding. Modern LLMs are trained with RL to actually write good programs. They aren't just spewing tokens out.
No, YOU misunderstand. This isn't a thing RL can fix
It doesn't optimize "good programs". It interprets "humans interpretation of good programs." More accurately, "it optimizes what low paid over worked humans believe are good programs." Are you hiring your best and brightest to code review the LLMs?
Even if you do, it still optimizes tricking them. It will also optimize writing good programs, but you act like that's a well defined and measurable thing.
Those links mostly discuss the original RLHF used to train e.g. ChatGPT 3.5. Current paradigms are shifting towards RLVR (reinforcement learning with verifiable rewards), which absolutely can optimize good programs.
You can definitely still run into some of the problems eluded to in the first link. Think hacking unit tests, deception, etc -- but the bar is less "create a perfect RL environment" than "create an RL environment where solving the problem is easier than reward hacking." It might be possible to exploit a bug in the Lean 4 proof assistant to prove a mathematical statement, but I suspect it will usually be easier for an LLM to just write a correct proof. Current RL environments aren't as watertight as Lean 4, but there's certainly work to make them more watertight.
This is in no way a "solved" problem, but I do see it as a counter to your assertion that "This isn't a thing RL can fix." RL is powerful.
I think you've misunderstood. RL is great. Hell, RLHF has done a lot of good. I'm not saying LLM are useless.
But no, it's much more complex than you claim. RLVM can optimize for correct answers in the narrow domains where there are correct answers but it can't optimize good programs. There's a big difference.
You're right that Lean, Coq, and other ATPs can prove mathematical statements, but they also don't ensure that a program is good. There's frequently an infinite number of proofs that are correct, but most of those are terrible proofs.
This is the same problem all the coding benchmarks face. Even if the LLM isn't cheating, testing isn't enough. If it was we'd never do code review lol. I can pass a test with an algorithm that's O(n^3) despite there being an O(1) solution.
You're right that it makes it better, but it doesn't fix the underlying problem I'm discussing.
Not everything is verifiable.
Verifiability isn't enough.
If you'd like to prove me wrong in the former you're going to need to demonstrate that there are provably true statements to lots of things. I'm not expecting you to defy my namesake, nor will I ask you prove correctness and solve the related halting problem.
You can't prove an image is high fidelity. You can't prove a song sounds good. You can't prove a poem is a poem. You can't prove this sentence is English. The world is messy as fuck and most things are highly subjective.
But the problem isn't binary, it is continuous. I said we're using Justice Potter optimization, you can't even define what porn is. These definitions change over time, often rapidly!
You're forgetting about the tyrannical of metrics. Metrics are great, powerful tools that are incredibly useful. But if you think they're perfectly aligned with what you intend to measure then they become tools that work against you. Goodhart's Law. Metrics only work as guides. They're no different than any other powerful tool, if you use it wrong you get hurt.
If you really want to understand this I really encourage you to deep dive into this stuff. You need to get into the math. Into the weeds. You'll find a lot of help with metamathematics (i.e. my namesake), metaphysics (Ian Hacking is a good start), and such. It isn't enough to know the math, you need to know what the math means.
4 replies →
I don't know if any of this applies to the arguments in my article, but most of the point of it is that progress in code production from LLMs is not a consequence of better models (or fine tuning or whatever), but rather on a shift in how LLMs are used, in agent loops with access to ground truth about whether things compile and pass automatic acceptance. And I'm not claiming that closed-loop agents reliably produce mergeable code, only that they've broken through a threshold where they produce enough mergeable code that they significantly accelerate development.
> I don't know if any of this applies to the arguments in my article, but most of the point of it is that progress in code production from LLMs is not a consequence of better models (or fine tuning or whatever), but rather on a shift in how LLMs are used, in agent loops with access to ground truth about whether things compile and pass automatic acceptance.
I very strongly disagree with this and think this reflects a misunderstanding of model capabilities. This sort of agentic loop with access to ground truth model has been tried in one form or another ever since GPT-3 came out. For four years they didn't work. Models would very quickly veer into incoherence no matter what tooling you gave them.
Only in the last year or so have models gotten capable enough to maintain coherence over long enough time scales that these loops work. And future model releases will tighten up these loops even more and scale them out to longer time horizons.
This is all to say that progress in code production has been essentially driven by progress in model capabilities, and agent loops are a side effect of that rather than the main driving force.
1 reply →
There's the connection. You think you have ground truth. No such thing exists
7 replies →
This is just semantics. What's the difference between a "human interpretation of a good program" and a "good program" when we (humans) are the ones using it? If the model can write code that passes tests, and meets my requirements, then it's a good programmer. I would expect nothing more or less out of a human programmer.
> What's the difference between a "human interpretation of a good program" and a "good program" when we (humans) are the ones using it?
Correctness.
> and meets my requirements
It can't do that. "My requirements" wasn't part of the training set.
8 replies →
Is your grandma qualified to determine what is good code?
You think tests make code good? Oh my sweet summer child. TDD has been tried many times and each time it failed worse than the last.
7 replies →
"Good" is the context of LLMs means "plausible". Not "correct".
If you can't code then the distinction is lost on you, but in fact the "correct" part is why programmers get paid. If "plausible" were good enough then the profession of programmer wouldn't exist.
Not necessarily. If the RL objective is passing tests then in the context of LLMs it means "correct", or at least "correct based on the tests".
Unfortunately that doesn't solve the problem in any way. We don't have an Oracle machine for testing software.
If we did, we could autogenerate code even without an LLM.
They are also trained with RL to write code to pass unit tests and Claude does have a big problem with trying to cheat the test or request pretty quickly after running into issues, making manual edit approval more important. It usually still tells what it is trying to do wrong so you can often find out from its summary before having to scan the diff.