← Back to context

Comment by tuhlatte

5 days ago

The question at hand was whether LLMs could be trained to write good code. I took this to mean "good code within the domain of software engineering," not "good code within the universe of possible programs." If you interpreted it to mean the latter, so be it -- though I'm skeptical of the usefulness of this interpretation.

If the former, I still think that the vast majority of production software has metrics/unit tests that could be attached and subsequently hillclimbed via RL. Whether the resulting optimized programs would be considered "good" depends on your definition of "good." I suspect mine is more utilitarian than yours (as even after some thought I can't conceive of what a "terrible" proof might look like), but I am skeptical that your code review will prove to be a better measure of goodness than a broad suite of unit tests/verifiers/metrics -- which, to my original last point, are only getting more robust! And if these aren't enough, I suspect the addition of LLM-as-a-judge (potentially ensembles) checking for readability/maintainability/security vulnerabilities will eventually put code quality above that of what currently qualifies as "good" code.

Your examples of tasks that can't easily be optimized (image fidelity, song quality, etc.) seem out of scope to me -- can you point to categories of extant software that could not be hillclimbed via RL? Or is this just a fundamental disagreement about what it means for software to be "good"? At any rate, I think we can agree that the original claim that "The LLM has one job, to make code that looks plausible. That's it. There's no logic gone into writing that bit of code" is wrong in the context of RL.

  > I took this to mean "good code within the domain of software engineering," not "good code within the universe of possible programs.

We both mean the same thing. The reasonable one. The only one that even kinda makes sense: good enough code

  > vast majority of production software has metrics/unit tests that could be attached and subsequently hillclimbed via RL

Yes, hill climbed. But that's different than "towards good"

Here's the difference[0]. You'll find another name for Goodhart's Law in any intro ML course. Which is why it is so baffling that 1) this is contentious 2) it is the status quo in research now

Your metrics are only useful if you understand them

Your measures are only as good as your attention

And it is important to distinguish metrics from measures. They are different things. Both are proxies

  > Your examples of tasks that can't easily be optimized (image fidelity, song quality, etc.) seem out of scope to me

Maybe you're unfamiliar with diffusion models?[1]

They are examples where it is hopefully clearer that these things are hard to define. If you have good programming skills you should be able to make the connection back to what this has to do with my point. If not, I'm actually fairly confident GPT will be able to do so. There's more than enough in its training data to do that.

[0] https://en.wikipedia.org/wiki/Goodhart%27s_law

[1] https://stability.ai/

  • Now I'm confused -- you're claiming you meant "good enough code" when your previous definition was such that even mathematical proofs could be "terrible"? That doesn't make sense to me. In software engineering, "good enough" has reasonably clear criteria: passes tests, performs adequately, follows conventions, etc. While these are imperfect proxies, they're sufficient for most real-world applications, and crucially -- measurable. And my claim is that they will be more than adequate to get LLMs to produce good code.

    And again, diffusion models aren't relevant here. The original comment was about LLMs producing buggy code -- not RL's general limitations in other domains. Diffusion models' tensors aren't written by hand.

    •   > Now I'm confused ... that even mathematical proofs could be "terrible"? That doesn't make sense to me.
      

      You know there's plenty of ways to prove things, right? Like there's not a single proof. Here's a few proofs for pi being irrational[0]. The list is not comprehensive.

      Take that like you do with code. They all generate the same final output. They're all correct. But is one better than another? Yes, yes it is. But which one that is depends on context.

        > and crucially -- measurable
      

      This is probably a point of contention. Measuring is far more difficult than people think. A lot of work goes into creating measurements and we get a nice ruler at the end. The problem isn't just that initial complexity, it is that every measure is a proxy. Even your meter stick doesn't measure a meter. What distinguishes the engineer from the hobbyist is the knowledge of alignment.

        How well does my measure align with what I intend to measure?
      

      That's a very hard problem. How often do you ask yourself that? I'm betting not enough. Frankly, most things aren't measurable.

      [0] https://proofwiki.org/wiki/Pi_is_Irrational#:~:text=Hence%20...