Comment by tptacek

6 days ago

I don't know if any of this applies to the arguments in my article, but most of the point of it is that progress in code production from LLMs is not a consequence of better models (or fine-tuning or whatever), but rather of a shift in how LLMs are used: in agent loops with access to ground truth about whether things compile and pass automatic acceptance. And I'm not claiming that closed-loop agents reliably produce mergeable code, only that they've broken through a threshold where they produce enough mergeable code that they significantly accelerate development.

> I don't know if any of this applies to the arguments in my article, but most of the point of it is that progress in code production from LLMs is not a consequence of better models (or fine-tuning or whatever), but rather of a shift in how LLMs are used: in agent loops with access to ground truth about whether things compile and pass automatic acceptance.

I very strongly disagree with this and think it reflects a misunderstanding of model capabilities. This sort of agentic loop with access to ground truth has been tried in one form or another ever since GPT-3 came out. For four years these loops didn't work: models would very quickly veer into incoherence no matter what tooling you gave them.

Only in the last year or so have models gotten capable enough to maintain coherence over long enough time scales that these loops work. And future model releases will tighten up these loops even more and scale them out to longer time horizons.

This is all to say that progress in code production has been essentially driven by progress in model capabilities, and agent loops are a side effect of that rather than the main driving force.

  • Sure! I'm super happy to hear these kinds of objections because, while all the progress I'm personally perceiving is traceable to decisions the different agent frameworks seem to be making, I'm totally open to the idea that model improvements have been instrumental in making these loops actually converge on anything practical. I think something near the core of my argument is simply the idea that we've crossed a threshold where current models plus these kinds of loops actually do work.

  > I don't know if any of this applies to the arguments

  > with access to ground truth

There's the connection. You think you have ground truth. No such thing exists.

  • It's even simpler than what 'rfrey said. You're here using "ground truth" in some kind of grand epistemic sense, and I simply mean "whether the exit code from a program was 1 or 0".

    You can talk about how meaningful those exit codes and error messages are or aren't, but the point is that they are profoundly different from the information an LLM natively operates with, which is atomized weights predicting next tokens based on what an abstract notion of a correct line of code or an error message might look like. An LLM can (and will) lie to itself about what it is perceiving. An agent cannot; it's just 200 lines of Python, it literally can't.
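
    For concreteness, here's a stripped-down sketch of the kind of loop I mean. run_model and apply_patch are stand-ins for the LLM call and the patch application, and pytest is just one example of an acceptance check; real frameworks are more elaborate, but the exit-code check is the "ground truth" in question:

        import subprocess

        def agent_loop(task, max_iters=10):
            # run_model() and apply_patch() are hypothetical stand-ins for the
            # LLM call and for writing its output into the working tree.
            feedback = ""
            for _ in range(max_iters):
                patch = run_model(task, feedback)
                apply_patch(patch)
                # The agent's "ground truth": did the acceptance command exit 0 or not?
                result = subprocess.run(["pytest", "-q"], capture_output=True, text=True)
                if result.returncode == 0:
                    return patch
                # Feed the real compiler/test output back to the model and retry.
                feedback = result.stdout + result.stderr
            return None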

    •   > You're here using "ground truth" in some kind of grand epistemic sense
      

      I used the word "ground truth" because you did!

        >> in agent loops with access to ground truth about whether things compile and pass automatic acceptance.
      

      Your critique about "my usage of ground truth" is the same critique I'm giving you about it! You really are doing a good job at making me feel like I'm going nuts...

        > the information an LLM natively operates with,
      

      And do you actually know what this is?

      I am an ML researcher, you know, and one of the ones who keeps saying "you should learn the math." There's a reason for this: it's really connected to what you're talking about here. Models are opaque, but they sure aren't black boxes.

      And it really sounds like you think the "thinking" tokens are representative of the internal processing. You're a daily HN user; I'm pretty sure you saw this one[0].

      I'm not saying anything OpenAI hasn't[1]. I just recognize that this applies to more than a very specific narrow case...

      [0] https://cdn.openai.com/pdf/34f2ada6-870f-4c26-9790-fd8def563...

  • Yes it does. Ground truth is what 30 years of experience says constitutes mergeable code. Ground truth doesn't mean "perfect, provably correct code"; it means whatever your best benchmark for acceptable code is.

    In medical AI, where I'm currently working, "ground truth" is usually whatever human experts say about a medical image, and is rarely perfect. The goal is always to do better than whatever the current ground truth is.

    • I understand why you interpreted my comment that way. That's my bad.

      But even when you take state-of-the-art knowledge as ground truth, aligning to it is incredibly hard. Medicine is a great example: you're trying to create a causal graph in a highly noisy environment. Ask 10 doctors and you'll get 12 diagnoses. The problem is that subtle things become incredibly important, which is exactly what makes measurement so fucking hard. There is no state of the art in a well-defined sense.

      The point is that in most domains this is how things are. Even in programming.

      Getting the right answer isn't enough.