
Comment by bigmadshoe

6 days ago

I agree with the main take in this article: the combination of agents + LLMs with large context windows + a large budget of tokens to iterate on problems can probably already yield some impressive results.
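
(Roughly what that loop looks like, as a minimal sketch: an agent with a token budget that proposes a patch, runs the tests, and feeds the failures back in until something passes or the budget runs out. `call_llm`, `apply_patch` and the `pytest` command here are hypothetical stand-ins, not any specific product's API.)

  # Sketch of the "agent + big context + iteration budget" loop.
  import subprocess

  def run_tests() -> tuple[bool, str]:
      # Guardrail: run the project's test suite (assumed to be pytest).
      proc = subprocess.run(["pytest", "-q"], capture_output=True, text=True)
      return proc.returncode == 0, proc.stdout + proc.stderr

  def agent_loop(task: str, call_llm, apply_patch, token_budget: int = 200_000):
      spent, feedback = 0, ""
      while spent < token_budget:
          prompt = f"{task}\n\nLast test output:\n{feedback}"
          patch, used = call_llm(prompt)   # hypothetical: returns (diff, tokens used)
          spent += used
          apply_patch(patch)               # hypothetical: writes the change to disk
          ok, feedback = run_tests()
          if ok:
              return True, spent           # guardrails held
      return False, spent                  # budget exhausted without passing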

I take serious issue with the "but you have no idea what the code is" rebuttal, since it - to me - glosses over the single largest issue with applying LLMs anywhere that important decisions will be made based on their outputs.

To quote from the article:

  People complain about LLM-generated code being “probabilistic”. No it isn’t.
  It’s code. It’s not Yacc output. It’s knowable. The LLM might be stochastic.
  But the LLM doesn’t matter. What matters is whether you can make sense of the
  result, and whether your guardrails hold.

  Reading other people’s code is part of the job. If you can’t metabolize the
  boring, repetitive code an LLM generates: skills issue! How are you handling
  the chaos human developers turn out on a deadline?

The problem here is that LLMs are optimized to make their outputs convincing. The issue is exactly "whether you can make sense of the result", as the author says: in other words, whether you're immune to being conned by a model output that sounds correct but is not. Sure, "reading other people’s code is part of the job", but the failure modes of junior engineers are easily detectable. The failure modes of LLMs are not.

EDIT: formatting

It's also funny how it requires a lot of iterations for the average task, and the user has to pay for the failures. No other product has this expectation: imagine a toaster that fully toasts bread only 20% of the time and leaves it half-toasted another 50% of the time.

> The problem here is that LLMs are optimized to make their outputs convincing.

That may be true for chat-aligned LLMs, but coding LLMs are now trained with RL and rewards for correctness. And there are efforts to apply this to the entire stack (e.g. better software glue, automatic guardrails, more extensive tool use, access to LSPs/debuggers/linters, etc.).
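
(To make "rewards for correctness" concrete, here is a hedged sketch of the kind of automated reward signal that replaces a human judge: run the candidate code against the tests and score pass/fail. The file names and the pytest invocation are assumptions for illustration, not a description of any lab's actual pipeline.)

  import pathlib, subprocess, tempfile

  def correctness_reward(candidate_code: str, test_code: str) -> float:
      """1.0 if the generated code passes the tests, 0.0 otherwise."""
      with tempfile.TemporaryDirectory() as tmp:
          d = pathlib.Path(tmp)
          (d / "solution.py").write_text(candidate_code)
          (d / "test_solution.py").write_text(test_code)
          # Run the tests in isolation; a timeout guards against runaway code.
          proc = subprocess.run(
              ["python", "-m", "pytest", "-q"],
              cwd=d, capture_output=True, text=True, timeout=60,
          )
      return 1.0 if proc.returncode == 0 else 0.0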

I think this is the critical point in a lot of these debates that seem to be very popular right now. A lot of people try something and get the wrong impression of what SotA is. Often the thing they tried is not the best way to do it (e.g. chatting in a web interface for coding), but they don't go the extra mile to actually try what would work best for them (e.g. coding IDEs, terminal agents, etc.).

  • Which "coding LLMs" are you referring to that are trained purely on verifiably correct synthetic data? To my understanding o3, gemini 2.5 pro, claude 3.7 sonnet, etc. are all still aligned to human preferences using a reward function learned from human feedback. Any time a notion of success/correctness is deferred to a human, the model will have a chance to "game" the system by becoming more convincing as well as more correct.

    Edit: thought I would include this below instead of in a separate comment:

    Also, whether the models are trained purely on synthetic data or not, they suffer from epistemological issues where they are unable to identify what they don't know. This means a very reasonable-looking piece of code might be spit out for some out-of-distribution prompt where the model doesn't generalize well.

    • > To my understanding o3, gemini 2.5 pro, claude 3.7 sonnet, etc. are all still aligned to human preferences using a reward function learned from human feedback.

      We don't know how the "thinking" models are trained at the big 3, but we know that open-source models have been trained with RL. There's no human in that loop: they are aligned based on rewards, and that process is automated.

      > Which "coding LLMs" are you referring to that are trained purely on verifiably correct synthetic data?

      The "thinking" ones (i.e. oN series, claudeThinking, gemini2.5 pro) and their open-source equivalents - qwq, R1, qwen3, some nemotrons, etc.

      From the DeepSeek paper on R1 we know the model was trained with GRPO, which is a form of RL (reinforcement learning). QwQ and the rest were likely trained in a similar way. (Before GRPO, another popular method was PPO. And I've seen work on unsupervised DPO, where the preference pairs are generated by having a model produce n rollouts, verifying them (i.e. running tests), and using that to guide pair creation.)
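
      (For concreteness, a minimal sketch of the group-relative step in GRPO, under the assumption of an automated 0/1 verifier reward per rollout, e.g. from running the tests: sample n rollouts for one prompt, score them, and normalize within the group. These advantages then weight the policy update; the rest of the training loop, clipping, KL penalty, etc. is omitted.)

        import statistics

        def group_relative_advantages(rewards: list[float]) -> list[float]:
            # Normalize each rollout's verified reward against its own group.
            mean = statistics.mean(rewards)
            std = statistics.pstdev(rewards) or 1.0  # all-equal group -> zero advantages
            return [(r - mean) / std for r in rewards]

        # e.g. 8 rollouts for one prompt, 3 of which passed the tests:
        print(group_relative_advantages([1, 0, 0, 1, 0, 1, 0, 0]))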
