Comment by D-Machine

2 months ago

This article is great. And the blog-article headline is interesting, but wrong: LLMs don't, as a rule, write plausible code either.

They just write code that is semantically similar to code clusters seen in their training data, and that hasn't been fenced off by RLHF / RLVR.

This isn't that hard to remember, and is a correct enough simplification of what generative LLMs actually do, without resorting to simplistic or incorrect metaphors.

Exactly. It’s also easy to find yourself in out-of-distribution territory. Just ask for some tree-sitter queries and watch Gemini 3, Opus 4.5, and GLM 5 hallucinate new directives.
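For a concrete instance: tree-sitter queries support only a small fixed set of predicates and directives (#eq?, #match?, #any-of?, #set!, and a few others), and models readily invent plausible-looking new ones. A minimal sketch, assuming the py-tree-sitter bindings (the Query API has shifted across versions, and #fuzzy-match? is a made-up directive of the kind models produce):

```python
# Minimal sketch using the py-tree-sitter bindings
# (pip install tree-sitter tree-sitter-python); the Query API
# differs slightly across versions of the package.
import tree_sitter_python as tspython
from tree_sitter import Language, Query

PY_LANGUAGE = Language(tspython.language())

# Valid: #match? is one of the standard predicates
# (#eq?, #not-eq?, #match?, #any-of?, #is?, #set!, ...).
valid = Query(
    PY_LANGUAGE,
    '(function_definition name: (identifier) @fn (#match? @fn "^test_"))',
)

# Hallucinated: #fuzzy-match? looks plausible but doesn't exist; depending
# on the binding it is rejected or silently ignored, and either way it
# never does what the model claims.
hallucinated = (
    '(function_definition name: (identifier) @fn (#fuzzy-match? @fn "test"))'
)
```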

  • I think this could be the key difference in how people are experiencing the tools. Using Claude in industries full of proprietary code is a totally different experience from writing some React components, or framework code in C#, PHP, or Java. It's shockingly good at the latter, but as you get into proprietary frameworks or newer problem domains it feels like AI in 2023 again, even with the benefit of agentic harnesses and context augmentations like memory.

    • You’ve hit the nail on the head.

      I characterise LLMs as black boxes filled with a dense pool of digital resources: with the correct prompt, you can draw out a mix of those resources to produce an output.

      But if the mix of resources you need isn’t there, it won’t work. And this isn’t limited to text; the same applies to video models. LLMs work better for prompts asking for material that is widely available on the internet.

  • I think in the long term, if an LLM can’t use a tool, people won’t stop using LLMs; they’ll stop using the tool.

    We are building everything right now with LLM agents as a primary user in mind, and one of our principles is “hallucination-driven development”: if LLMs regularly hallucinate an interface to your product, that is a desire path, and you should create that interface (see the sketch below).
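    To make the desire-path idea concrete, here is a hypothetical sketch (FastAPI, the route shape, and the logging are all illustrative assumptions, not a description of our actual stack): a catch-all registered after the real routes that tallies every interface agents invent.

    ```python
    # Hypothetical sketch of "hallucination-driven development": register a
    # catch-all route AFTER the real ones and count every interface that
    # agents invent, so recurring misses become a feature backlog.
    import logging
    from collections import Counter

    from fastapi import FastAPI, Request
    from fastapi.responses import JSONResponse

    app = FastAPI()
    desire_paths: Counter[str] = Counter()

    @app.api_route("/{path:path}", methods=["GET", "POST", "PUT", "DELETE"])
    async def catch_all(path: str, request: Request) -> JSONResponse:
        # Anything landing here asked for an interface we never built.
        endpoint = f"{request.method} /{path}"
        desire_paths[endpoint] += 1
        logging.info("hallucinated interface: %s (seen %d times)",
                     endpoint, desire_paths[endpoint])
        return JSONResponse(status_code=404,
                            content={"error": "not implemented (yet)"})
    ```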

IIRC, the most common code in the training data is Python, closely followed by web technologies (HTML, JS/TS, CSS). This corresponds to the most abundant developer populations; many of those developers have dedicated their entire careers to one technology.

We stubbornly use the same language to refer to all software development, regardless of the task being solved. This lets us all be a part of the same community, but is also a source of misunderstanding.

Some of us are prone to not thinking about things in terms of what they are, instead taking the shortcut of letting industry leaders tell us what we should think.

These guys consistently, in lockstep, talk about intelligent agents solving development tasks, predominantly using the same abstract language that gives us an illusion of unity. This is bound to make those of us solving the common problems believe that the whole industry is done.

> They just write code that is semantically similar to code clusters seen in their training data, and that hasn't been fenced off by RLHF / RLVR.

"Plausible" sounds like the right word to me. (It would be a mistake to digress into these features of LLMs in an article where it isn't needed.)

  • I agree - I took "plausible" here to mean plausible-looking, no different from similar-looking.

    The trouble, of course, is that similar/plausible isn't good enough unless the LLM has seen enough similar-but-different training samples to refine its notion of similarity to the point where it captures the differences that are critical in a given case.

    I'd rather just characterize it as a lack of reasoning, since "add more data" can't be the solution in a world full of infinite variety. You can keep playing whack-a-mole, adding more data to fix each failure, and I suppose it's an interesting experiment to see how far that will get you, but in the end the LLM is always going to be brittle and susceptible to stupid failure cases if it doesn't have the reasoning capability to fully analyze problems it was not trained on.