Comment by godelski

6 days ago

No, YOU misunderstand. This isn't a thing RL can fix.

  https://news.ycombinator.com/item?id=44163194

  https://news.ycombinator.com/item?id=44068943

It doesn't optimize "good programs". It optimizes "humans' interpretation of good programs". More accurately, it optimizes what low-paid, overworked humans believe are good programs. Are you hiring your best and brightest to code review the LLMs?

Even if you do, it still optimizes tricking them. It will also optimize writing good programs, but you act like that's a well-defined and measurable thing.

Those links mostly discuss the original RLHF used to train e.g. ChatGPT 3.5. Current paradigms are shifting towards RLVR (reinforcement learning with verifiable rewards), which absolutely can optimize good programs.

You can definitely still run into some of the problems alluded to in the first link. Think hacking unit tests, deception, etc. -- but the bar is less "create a perfect RL environment" than "create an RL environment where solving the problem is easier than reward hacking." It might be possible to exploit a bug in the Lean 4 proof assistant to prove a mathematical statement, but I suspect it will usually be easier for an LLM to just write a correct proof. Current RL environments aren't as watertight as Lean 4, but there's certainly work underway to make them more watertight.
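For concreteness, here's a minimal sketch of what a verifiable reward looks like in this setting (a hypothetical harness, assuming a pytest-style test suite that imports the candidate solution):

  import subprocess
  import sys

  def verifiable_reward(test_path: str) -> float:
      # 1.0 if the test suite passes, 0.0 otherwise. The signal comes from
      # executing code, not from a human rater's impression -- which also
      # means a model can earn it by gaming the tests (hard-coding expected
      # outputs, exploiting harness bugs) rather than solving the task.
      result = subprocess.run(
          [sys.executable, "-m", "pytest", "-q", test_path],
          capture_output=True,
          timeout=120,
      )
      return 1.0 if result.returncode == 0 else 0.0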

This is in no way a "solved" problem, but I do see it as a counter to your assertion that "This isn't a thing RL can fix." RL is powerful.

  •   > Current paradigms are shifting towards RLVR, which absolutely can optimize good programs
    

    I think you've misunderstood. RL is great. Hell, RLHF has done a lot of good. I'm not saying LLMs are useless.

    But no, it's much more complex than you claim. RLVR can optimize for correct answers in the narrow domains where correct answers exist, but it can't optimize good programs. There's a big difference.

    You're right that Lean, Coq, and other proof assistants can prove mathematical statements, but they also don't ensure that a program is good. There are frequently infinitely many correct proofs, but most of them are terrible proofs.

    This is the same problem all the coding benchmarks face. Even if the LLM isn't cheating, testing isn't enough. If it were, we'd never do code review lol. I can pass a test with an algorithm that's O(n^3) despite there being an O(1) solution.
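    A toy illustration (hypothetical functions; the test only checks the output):

      def sum_to_n_slow(n):
          # Passes any correctness test while doing O(n^3) pointless work.
          total = 0
          for i in range(1, n + 1):
              for _ in range(n):
                  for _ in range(n):
                      pass  # burn cycles, contribute nothing
              total += i
          return total

      def sum_to_n_fast(n):
          # The O(1) closed-form answer.
          return n * (n + 1) // 2

      # Same observable behavior, so a pass/fail reward can't tell them apart.
      assert sum_to_n_slow(100) == sum_to_n_fast(100) == 5050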

    You're right that it makes it better, but it doesn't fix the underlying problem I'm discussing.

    Not everything is verifiable.

    Verifiability isn't enough.

    If you'd like to prove me wrong on the former, you're going to need to demonstrate that there are provably true statements for lots of things. I'm not expecting you to defy my namesake, nor will I ask you to prove correctness and solve the related halting problem.

    You can't prove an image is high fidelity. You can't prove a song sounds good. You can't prove a poem is a poem. You can't prove this sentence is English. The world is messy as fuck and most things are highly subjective.

    But the problem isn't binary, it's continuous. I said we're using Justice Potter Stewart optimization ("I know it when I see it"): you can't even define what porn is. These definitions change over time, often rapidly!

    You're forgetting about the tyranny of metrics. Metrics are great, powerful tools that are incredibly useful. But if you think they're perfectly aligned with what you intend to measure, they become tools that work against you: Goodhart's Law. Metrics only work as guides. They're no different from any other powerful tool; use it wrong and you get hurt.

    If you really want to understand this, I encourage you to deep dive into this stuff. You need to get into the math. Into the weeds. You'll find a lot of help with metamathematics (i.e. my namesake), metaphysics (Ian Hacking is a good start), and such. It isn't enough to know the math; you need to know what the math means.

    • The question at hand was whether LLMs could be trained to write good code. I took this to mean "good code within the domain of software engineering," not "good code within the universe of possible programs." If you interpreted it to mean the latter, so be it -- though I'm skeptical of the usefulness of this interpretation.

      If the former, I still think that the vast majority of production software has metrics/unit tests that could be attached and subsequently hillclimbed via RL. Whether the resulting optimized programs would be considered "good" depends on your definition of "good." I suspect mine is more utilitarian than yours (as even after some thought I can't conceive of what a "terrible" proof might look like), but I am skeptical that your code review will prove to be a better measure of goodness than a broad suite of unit tests/verifiers/metrics -- which, to my original last point, are only getting more robust! And if these aren't enough, I suspect the addition of LLM-as-a-judge (potentially ensembles) checking for readability/maintainability/security vulnerabilities will eventually put code quality above that of what currently qualifies as "good" code.
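      Roughly what I have in mind, as a sketch (the weights and judge categories are made up; the judge scores would come from whatever ensemble of models you trust):

        def combined_reward(tests_passed_fraction, judge_scores):
            # Blend the verifiable signal (fraction of unit tests passed)
            # with softer scores from an ensemble of LLM judges rating
            # readability, maintainability, security, etc.
            soft = sum(judge_scores.values()) / max(len(judge_scores), 1)
            return 0.7 * tests_passed_fraction + 0.3 * soft

        combined_reward(1.0, {"readability": 0.8, "security": 0.6})  # -> 0.91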

      Your examples of tasks that can't easily be optimized (image fidelity, song quality, etc.) seem out of scope to me -- can you point to categories of extant software that could not be hillclimbed via RL? Or is this just a fundamental disagreement about what it means for software to be "good"? At any rate, I think we can agree that the original claim that "The LLM has one job, to make code that looks plausible. That's it. There's no logic gone into writing that bit of code" is wrong in the context of RL.

I don't know if any of this applies to the arguments in my article, but most of the point of it is that progress in code production from LLMs is not a consequence of better models (or fine tuning or whatever), but rather of a shift in how LLMs are used: in agent loops with access to ground truth about whether things compile and pass automated acceptance tests. And I'm not claiming that closed-loop agents reliably produce mergeable code, only that they've broken through a threshold where they produce enough mergeable code that they significantly accelerate development.

  • > I don't know if any of this applies to the arguments in my article, but most of the point of it is that progress in code production from LLMs is not a consequence of better models (or fine tuning or whatever), but rather of a shift in how LLMs are used: in agent loops with access to ground truth about whether things compile and pass automated acceptance tests.

    I very strongly disagree with this and think this reflects a misunderstanding of model capabilities. This sort of agentic loop with access to ground truth has been tried in one form or another ever since GPT-3 came out. For four years they didn't work. Models would very quickly veer into incoherence no matter what tooling you gave them.

    Only in the last year or so have models gotten capable enough to maintain coherence over long enough time scales that these loops work. And future model releases will tighten up these loops even more and scale them out to longer time horizons.

    This is all to say that progress in code production has been essentially driven by progress in model capabilities, and agent loops are a side effect of that rather than the main driving force.

    • Sure! Super happy to hear these kinds of objections because, while all the progress I'm personally perceiving is traceable to decisions different agent frameworks seem to be making, I'm totally open to the idea that model improvements have been instrumental in making these loops actually converge anywhere practical. I think near the core of my argument is simply the idea that we've crossed a threshold where current models plus these kinds of loops actually do work.

  •   > I don't know if any of this applies to the arguments
    
      > with access to ground truth
    

    There's the connection. You think you have ground truth. No such thing exists.

    • It's even simpler than what 'rfrey said. You're here using "ground truth" in some kind of grand epistemic sense, and I simply mean "whether the exit code from a program was 1 or 0".

      You can talk about how meaningful those exit codes and error messages are or aren't, but the point is that they are profoundly different from the information an LLM natively operates with, which is atomized weights predicting next tokens based on an abstract notion of what a correct line of code or an error message might look like. An LLM can (and will) lie to itself about what it is perceiving. An agent cannot; it's just 200 lines of Python, it literally can't.
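      Something in the spirit of those 200 lines, stripped down (this is a sketch, not any particular framework; ask_llm and the test file are placeholders):

        import subprocess
        import sys

        def agent_loop(task, ask_llm, max_iters=10):
            # ask_llm stands in for whatever model call you use. The loop
            # trusts only the exit code and the captured output, never the
            # model's own account of whether the code works.
            feedback = ""
            for _ in range(max_iters):
                candidate = ask_llm(task, feedback)
                with open("candidate.py", "w") as f:
                    f.write(candidate)
                result = subprocess.run(
                    [sys.executable, "-m", "pytest", "-q", "test_candidate.py"],
                    capture_output=True, text=True,
                )
                if result.returncode == 0:  # ground truth: exit code 0
                    return candidate
                feedback = result.stdout + result.stderr  # real errors, not a guess
            return None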

    • Yes it does. Ground truth is what 30 years of experience says constitutes mergeable code. Ground truth doesn't mean "perfect, provably correct code"; it means whatever your best benchmark for acceptable code is.

      In medical AI, where I'm currently working, "ground truth" is usually whatever human experts say about a medical image, and is rarely perfect. The goal is always to do better than whatever the current ground truth is.

This is just semantics. What's the difference between a "human interpretation of a good program" and a "good program" when we (humans) are the ones using it? If the model can write code that passes tests and meets my requirements, then it's a good programmer. I would expect nothing more or less out of a human programmer.

  • > What's the difference between a "human interpretation of a good program" and a "good program" when we (humans) are the ones using it?

    Correctness.

    > and meets my requirements

    It can't do that. "My requirements" wasn't part of the training set.

    • "Correctness" in what sense? It sounds like it's being expanded to an abstract academic definition here. For practical purposes, correct means whatever the person using it deems to be correct.

      > It can't do that. "My requirements" wasn't part of the training set.

      Neither are mine; the art of building these models is making them generalisable enough to tackle tasks that aren't in their dataset. They have proven, at least for some classes of tasks, that they can do exactly that.

  • Is your grandma qualified to determine what is good code?

      > If the model can write code that passes tests
    

    You think tests make code good? Oh my sweet summer child. TDD has been tried many times and each time it failed worse than the last.