LLMs work best when the user defines their acceptance criteria first

2 months ago (blog.katanaquant.com)

Their default solution is to keep digging. It has a compounding effect of generating more and more code.

If they implement something with a not-so-great approach, they'll keep adding workarounds or redundant code every time they run into limitations later.

If you tell them the code is slow, they'll try to add optimized fast paths (more code), specialized routines (more code), custom data structures (even more code). And then add fractally more code to patch up all the problems that code has created.

If you complain it's buggy, you can have 10 bespoke tests for every bug. Plus a new mocking framework created every time the last one turns out to be unfit for purpose.

If you ask to unify the duplication, it'll say "No problem, here's a brand new metamock abstract adapter framework that has a superset of all feature sets, plus two new metamock drivers for the older and the newer code! Let me know if you want me to write tests for the new adapters."

  • This is why I'm confused when people say it isn't ready to replace most of the programmer workforce.

    • I love that you’re getting straightforward replies to this absolutely sick burn. The blade is so sharp that some people aren’t even feeling it.

    • LLM code is higher quality than any code I have seen in my 20 years in F500. So yeah, you need to "guide" it, and ensure that it will not bypass all the security guidance, for example... But at least you are in control, although the cognitive load is much higher as well than just "blind trust of what is delivered".

      But I can see the carnage with offshoring+LLM, or "most employees", including so-called software engineers + LLM.

      50 replies →

    • If you a) know what you are doing, b) know what an LLM is capable of doing, and c) can manage multiple LLM agents at a time, you can be unbelievably productive. Those skills, I think, are less common than people assume.

      You need to be technical, have good communication skills, have big picture vision, be organized, etc. If you are a staff level engineer, you basically feel like you don’t need anyone else.

      OTOH I have been seeing even fairly technical engineering managers struggle because they can’t get the LLMs to execute; they don’t know how to ask them what to do.

      3 replies →

    • For me, I'll do the engineering work of designing a system, then give it the specific designs and constraints. I'll let it plan out the implementation, then I give it notes if it varies in ways I didn't expect. Once we agree on a solution, that's when I set it free. The frontier models usually do a pretty good job with this workflow at this point.

      2 replies →

    • Really? Because this perfectly explains why it will never replace them: it needs an exact language listing everything required to function as you expect.

      You need code to get it to generate proper code.

      1 reply →

  • > If you ask to unify the duplication, it'll say "No problem, here's a brand new metamock abstract adapter framework that has a superset of all feature sets, plus two new metamock drivers for the older and the newer code! Let me know if you want me to write tests for the new adapters."

    Never mind the fact that it only migrated 3 out of 5 duplicated sections, and hasn’t deleted any now-dead code.

  • My sense is that the code generation is fast, but then you always need to spend several hours making sure the implementation is appropriate, correct, well tested, based on correct assumptions, and doesn't introduce technical debt.

    You need to do this when coding manually as well, but the speed at which AI tools can output bad code means it's so much more important.

    • Well, when you write it manually you are doing the review and sanity checking in real time. For some tasks, not all but definitely difficult tasks, the sanity checking is actually the whole task. The code was never the hard part, so I am much more interested in the evolution of AIs' real-world problem-solving skills than in code problems.

      I think programming is giving people a false impression of how intelligent the models are: programmers are meant to be smart, right, so being able to code means the AI must be super smart. But programmers also put a huge amount of their output online for free, unlike most disciplines, and it's all text based. When it comes to problem solving I still see them regularly confused by simple stuff, having to reset context to try and straighten it out. It's not a general-purpose human replacement just yet.

    • "Several hours"? How big are your change sets?

      If a human dropped a PR on me that took "several hours" to go through (10k+ lines or non-trivial changes), I'd jump in my car and drive to the office just to specifically slap them on the back of the head ffs.

      1 reply →

  • I'd highly recommend working top down, getting it to outline a sane architecture before it starts coding. Then if one of the modules starts getting fouled up, start with a clean sheet context (for that module) incorporating any cautions or lessons learned from the bad experience. LLMs are not yet good at working and reworking the same code, for the reasons you outline. But they are pretty good at a "Groundhog Day" approach of going through the implementation process over and over until they get it right.

    • +1 if you are vibe coding projects from scratch. If the architecture you specify doesn't make sense, the LLM will start struggling, and the only way out of its misery is mocking tests. The good thing is that a complete rewrite with proper architecture and lessons learned is now totally affordable.

      1 reply →

  • Not trying to be snarky, with all due respect... this is a skill issue.

    It's a tool. It's a wildly effective and capable tool. I don't know how or why I have such a wildly different experience than so many that describe their experiences in a similar manner... but... nearly every time I come to the same conclusion that the input determines the output.

    > If they implement something with a not-so-great approach, they'll keep adding workarounds or redundant code every time they run into limitations later.

    Yes, when the prompt/instructions are overly broad and there's no set of guardrails or guidelines that indicate how things should be done... this will happen. If you're not using planning mode, skill issue. You have to get all this stuff wrapped up and sorted before the implementation begins. If the implementation ends up being done in a "not-so-great" approach - that's on you.

    > If you tell them the code is slow

    Whew. Ok. You don't tell it the code is slow. Do you tell your coworker "Hey, your code is slow" and expect great results? You ask it to benchmark the code and then you ask it how it might be optimized. Then you discuss those options with it (this is where you do the part from the previous paragraph, where you direct the approach so it doesn't take a "not-so-great approach") until you get to a point where you like the approach and the model has shown it understands what's going on.

    Then you accept the plan and let the model start work. At this point you should have essentially directed the approach and ensured that it's not doing anything stupid. It will then just execute, it'll stay within the parameters/bounds of the plan you established (unless you take it off the rails with a bunch of open ended feedback like telling it that it's buggy instead of being specific about bugs and how you expect them to be resolved).

    > you can have 10 bespoke tests for every bug. Plus a new mocking framework created every time the last one turns out to be unfit for purpose.

    This is an area I will agree that the models are wildly inept. Someone needs to study what it is about tests and testing environments and mocking things that just makes these things go off the rails. The solution to this is the same as the solution to the issue of it keeping digging or chasing its tail in circles... Early in the prompt/conversation/message that sets the approach/intent/task you state your expectations for the final result. Define the output early, then describe/provide context/etc. The earlier in the prompt/conversation the "requirements" are set the more sticky they'll be.

    And this is exactly the same for the tests. Either write your own tests and have the models build the feature from the test, or have the model build the tests first as part of the planned output and then fill in the functionality from the pre-defined test. Be very specific about how your testing system/environment is set up, and any time you run into a testing-related issue have the model make a note about that and the solution in a TESTING.md document. In your AGENTS.md or CLAUDE.md or whatever, indicate that if the model is working with tests it should refer to the TESTING.md document for notes about the testing setup.

    Personally, I focus on the functionality, get things integrated and working to the point I'm ready to push it to a staging or production (yolo) environment and _then_ have the model analyze that working system/solution/feature/whatever and write tests. Generally my notes on the testing environment to the model are something along the lines of a paragraph describing the basic testing flow/process/framework in use and how I'd like things to work.

    The more you stick to convention the better off you'll be. And use planning mode.

    • > Whew. Ok. You don't tell it the code is slow. Do you tell your coworker "Hey, your code is slow" and expect great results?

      Yes? Why don't you?

      They are capable people who just didn't notice something. If I notice some telemetry and tell them "hey, this is slow", they are expected to understand the reason(s).

      8 replies →

    • My comment was a summary of the situation, not literal prompts I use. I absolutely realize the work needs to be adequately described and agents must be steered in the right direction. The results also vary greatly depending on the task and the model, so devs see different rates of success.

      On non-trivial tasks (like adding a new index type to a db engine, not oneshotting a landing page) I find that the time and effort required to guide an LLM and review its work can exceed the effort of implementing the code myself. Figuring out exactly what to do and how to do it is the hard part of the task. I don't find LLMs helpful in that phase - their assessments and plans are shallow and naive. They can create todo lists that seemingly check off every box, but miss the forest for the trees (and it's extra work for me to spot these problems).

      Sometimes the obvious algorithm isn't the right one, or it turns out that the requirements were wrong. When I implement it myself, I have all the details in my head, so I can discover dead-ends and immediately backtrack. But when an LLM is doing the implementation, it takes much more time to spot problems in the mountains of code, and even more effort to tell whether it's genuinely a wrong approach or merely poor execution.

      If I feed it what I know before solving the problem myself, I just won't know all the gotchas yet myself. I can research the problem and think about it really hard in detail to give bulletproof guidance, but that's just programming without the typing.

      And that's when the models actually behave sensibly. A lot of the time they go off the rails and I feel like a babysitter instructing them "no, don't eat the crayons!", and it's my skill issue for not knowing I must have "NO eating crayons" in AGENTS.md.

      1 reply →

    • Great answer, and the reason some people have bad experiences is actually patently clear: they don’t work with the AI as a partner, but as a slave. But even for them, AI is getting better at automatically entering planning mode, asking for clarification (what exactly is slow, can you elaborate?), saying some idea is actually bad (I got that a few times), and so on… essentially, the AI is starting to force people to work as a partner and give it proper information, not just tell them “it’s broken, fix it” like they used to do on StackOverflow.

    • It is not a tool. It is an oracle.

      It can be a tool, for specific niche problems: summarization, extraction, source-to-source translation -- if post-trained properly.

      But that isn't what y'all are doing, you're engaging in "replace all the meatsacks AGI ftw" nonsense.

      9 replies →

    • > Do you tell your coworker "Hey, your code is slow" and expect great results? You ask it to benchmark the code and then you ask it how it might be optimized.

      ...Really? I think 'hey we have a lot of customers reporting the app is laggy when they do X, could you take a look' is a very reasonable thing to tell your coworker who implemented X.

  • Don't let it deteriorate so far that it can't recover in one session.

    Perform regular sessions dedicated to cleaning up tech debt (including docs).

  • > If they implement something with a not-so-great approach, they'll keep adding workarounds or redundant code every time they run into limitations later.

    Are you using plan mode? I used to experience the pick-a-poor-approach-and-keep-digging issue, but with planning that seems to have gone away?

  • Maybe there should be an LLM trained on a corpus of code deletions and cleanups.

    • I'm guessing there's a very strong prior toward "just keep generating more tokens", as opposed to deleting code, that needs to be overcome. Maybe this is done already, but since every git project comes with its own history, you could take a notable open-source project (like LLVM) and then do RL training against each individual patch committed.

      2 replies →

    • I think this is in the training data since they use commit data from repos, but I imagine code deletions are rarer than they should be in the real data as well.

      2 replies →

  • I have no idea what I'm doing differently because I haven't experienced this since Opus 4.5. Even with Sonnet 4.5, providing explicit instructions along the lines of "reuse code where sensible, then run static analysis tools at the end and delete unused code it flags" worked really well.

    I always watch Opus work, and it is pretty good with "add code, re-read the module, realize some pre-existing code (either it wrote, or was already there) is no longer needed and delete it", even without my explicit prompts.

  • Yes, this is exactly the experience I have had with LLMs as a non-programmer trying to make code. When it gets too deep into the weeds I have to ask it to get back a few steps.

  • Yes, that’s my observation too. I have to be doubly careful the longer they run a task. They like to hack and patch stuff even when I tell them I don’t prefer it.

  • The solution is to know when to use an existing solution like SQLite and when to create your own. So the biggest problem with LLMs is that they don't push back or remind you about possible consequences (too often). But if they did, I would find it even more awkward... and this is one of the reasons I prefer Claude Code over Codex.

  • I use the restore checkpoint/fork conversation feature in GitHub Copilot heavily because of this. Most of the time it's better to just rewind than to salvage something that's gone off track.

  • I feel like there are two types of LLM users: those who understand its limitations, and those who ask it to solve a Millennium Problem on the first try.

  • I have run into this too. Some of it is because models lack the big picture; so-called agentic search (aka grep) is myopic.

  • The reason they're not intelligent is that they want to predict the next token, so verbosity is baked in.

  • Have you tried adding to your agents file: "Prefer solutions that reduce lines of code over adding lines of code"?

  • I wonder if the solution is to just ask it to refactor its code once it's working.

    • I do this all the time but then you end up with really over engineered code that has way more issues than before. Then you're back to prompting to fix a bunch of issues. If you didn't write the initial code sometimes it's difficult to know the best way to refactor it. The answer people will say is to prompt it to give you ideas. Well then you're back to it generating more and more code and every time it does a refactor it introduces more issues. These issues aren't obvious though. They're really hard to spot.

    • You can, and it might make things a bit better. The only real way I've found so far is to start going through file by file, picking it apart.

      I wouldn't be surprised if over half my prompts start with "Why ...?", usually followed by "Nope, ... instead".

      Maybe the occasional "Fuck that you idiot, throw the whole thing out"

This is my experience with how LLMs "draft" legal arguments: at first glance, it's plausible — but may be, and often is, invalid, unsound, and/or ill-advised.

The catch is that many judges lack the time, energy, or willingness to not only read the documents in detail, but also roll up their sleeves and dig into the arguments and cited authorities. (Some lack the skills, but those are extreme cases.) So the plausible argument (improperly and unfortunately) carries the day.

LLM use in litigation drafting is thus akin to insurgent/guerilla warfare: it takes little time, energy, or thinking to create, yet orders of magnitude more to analyze and refute. (It's a species of Brandolini's Law / The Bullshit Asymmetry Principle.) Thus justice suffers.

I imagine that this is analogous to the cognitive, technical, and "sub-optimal code" debt that LLM-produced code is generating and foisting upon future developers who will have to unravel it.

  • > LLM use in litigation drafting is thus akin to insurgent/guerilla warfare: it takes little time, energy, or thinking to create, yet orders of magnitude more to analyze and refute.

    The same goes for coding. I have coworkers who use it to generate entire PRs. They can crank out two thousand lines of code that includes tests "proving" that it works, but may or may not actually be nonsense, in minutes. And then some poor bastard like me has to spend half a day reviewing it.

    When code is written by a human that I know and trust, I can assume that they at least made reasonable, if not always correct, decisions. I can't assume that with AI, so I have to scrutinize every single line. And when it inevitably turns out that the AI has come up with some ass-backwards architecture, the burden is on me to understand it and explain why it's wrong and how to fix it to the "developer" who hasn't bothered to even read his own PR.

    I'm seriously considering proposing that if you use AI to generate a PR at my company, the story points get credited to the reviewer.

    • Evil voice: "I don't mind not getting credits for the story points. The story was AI-generated anyway."

    • If code smells like LLM, then you walk to said coworker and ask them to explain it for you. Play dumb if necessary.

      Or you use YOUR LLM to review the PR :D

      ...and wtf, you get "credited" story points for finishing tasks? That sounds completely insane.

      4 replies →

  • > This is my experience with how LLMs "draft" legal arguments: at first glance, it's plausible — but may be, and often is, invalid, unsound, and/or ill-advised.

    Correct, and this of course extends past just laws, into the whole scope of rules and regulations described in human languages. It will by its nature imply things that aren't explicitly stated nor can be derived with certainty, just because they're very plausible. And those implications can be wrong.

    Now I've had decent success with having LLMs then review these LLM-generated texts to flag such occurrences where things aren't directly supported by the source material. But human review is still necessary.

    The cases I've been dealing with are also based on relatively small sets of regulations compared to the scope of the law involved with many legal cases. So I imagine that in the domain you're working on, much more needs flagging.

  • >" justice suffers"

    Possible. It also suffers when the majority simply cannot afford proper representation.

  • "Reasoning" needs to go back to the drawing board.

    Reasoning tasks need to be converted into formal logic, calculated and computed like a standard evaluation, and then translated back into English or the language of choice.

    LLMs are being used to think when really they should be the interpret and render steps with something more deterministic in the middle.

    Translate -> Reason -> Store to Database. Rinse Repeat. Now the context can call from the database of facts.

  • As an attorney, I’m interested in this theory. Do you have any examples that illustrate the phenomenon you describe?

    • Sure, but which part: opposing counsel using LLMs; opposing counsel simply using bullshit asymmetry to befuddle (nothing new); or judges not always reading and looking deeply into the arguments and authorities (also nothing new)?

      If the first category, there have been plenty of examples that have even made their way onto the HN front page in the last half year or so. There have even been instances of judges using LLMs to draft orders containing confabulated authorities.

      1 reply →

This is a fascinating look into code generated by an LLM that is correct in one sense (passes tests) but doesn't meet requirements (painfully slow). It doesn't use is_ipk to identify primary keys, and uses fsync on every statement. The problem with larger projects like this, even if you are competent, is that there are just too many lines of code to read properly and understand in full. Bravo to the author for taking the time to read this project; most people never will (clearly including its author).
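
The fsync point generalizes beyond this project. As a rough illustration in Python's sqlite3 (not the Rust clone from the article), committing after every statement pays the durability cost per row, while batching the same inserts into one transaction pays it once:

  import os
  import sqlite3
  import tempfile
  import time

  def timed_insert(rows, commit_per_statement):
      # File-backed database so each commit actually pays the sync-to-disk cost.
      path = os.path.join(tempfile.mkdtemp(), "bench.db")
      conn = sqlite3.connect(path)
      conn.execute("CREATE TABLE t (x INTEGER)")
      start = time.perf_counter()
      for row in rows:
          conn.execute("INSERT INTO t (x) VALUES (?)", row)
          if commit_per_statement:
              conn.commit()   # durability work after every single statement
      conn.commit()           # one commit at the end for the batched case
      elapsed = time.perf_counter() - start
      conn.close()
      return elapsed

  rows = [(i,) for i in range(1_000)]
  print("commit per statement:", timed_insert(rows, True))
  print("single transaction:  ", timed_insert(rows, False))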

I find LLMs at present work best as autocomplete -

The chunks of code are small and can be carefully reviewed at the point of writing

Claude normally gets it right (though sometimes horribly wrong) - this is easier to catch in autocomplete

That way they mostly work as designed and the burden on humans is completely manageable, plus you end up with a good understanding of the code generated. They make mistakes I'd say 30% of the time or so when autocompleting, which is significant (mistakes not necessarily being bugs, but ugly code, slow code, duplicate code, or incorrect code).

Having the AI produce the majority of the code (in chats or with agents) takes lots of time to plan and babysit, and is harder to review, maintain and diagnose; it doesn't seem like much of a performance boost, unless you're producing code that is already in the training data and just want to ignore the licensing of the original code.

  • > This is a fascinating look into code generated by an LLM that is correct in one sense (passes tests) but doesn't meet requirements (painfully slow).

    Why isn't requirements testing automated? Benchmarking the speed isn't rocket science. At worst a nightly build should run a benchmark and log it so you can find any anomalies.
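
    Something like the sketch below is all a nightly job would take; the workload function and the 2-second budget are made-up placeholders:

      import datetime
      import json
      import statistics
      import time

      def run_benchmark(fn, iterations=20):
          # Time the operation a fixed number of times; the median is less noisy than a single run.
          samples = []
          for _ in range(iterations):
              start = time.perf_counter()
              fn()
              samples.append(time.perf_counter() - start)
          return statistics.median(samples)

      def insert_10k_rows():
          # Stand-in workload; replace with whatever operation the acceptance criterion covers.
          sum(i * i for i in range(10_000))

      if __name__ == "__main__":
          median_s = run_benchmark(insert_10k_rows)
          record = {"ts": datetime.datetime.utcnow().isoformat(), "median_s": median_s}
          with open("bench_history.jsonl", "a") as f:  # the nightly build appends one line per run
              f.write(json.dumps(record) + "\n")
          # Hard acceptance criterion: fail the job on a regression past the budget.
          assert median_s < 2.0, f"benchmark regression: {median_s:.3f}s exceeds 2.0s budget"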

Nitpick/question: the "LLM" is what you get via raw API call, correct?

If you are using an LLM via a harness like claude.ai, chatgpt.com, Claude Code, Windsurf, Cursor, Excel Claude plug-in, etc... then you are not using an LLM, you are using something more, correct?

An example I keep hearing is "LLMs have no memory/understanding of time so ___" - but, agents have various levels of memory.

I keep trying to explain this in meetings, and in rando comments. If I am not way off-base here, then what should the term, or terms, be? LLM-based agents?

  • > Nit pick/question: The LLM is what you get via raw API call, correct?

    You always need a harness of some kind to interact with an LLM. Normal web APIs (especially for hosted commercial systems) wrapped around LLMs are non-minimal harnesses, that have built in tools, interpretation of tool calls, application of what is exposed in local toolchains as “prompt templates” to transform the context structure in the API call into a prompt (in some cases even supporting managing some of the conversation state that is used to construct the prompt on the backend.)

    > If you are using an LLM via a harness like claude.ai, chatgpt.com, Claude Code, Windsurf, Cursor, Excel Claude plug-in, etc... then you are not using an LLM, you are using something more, correct?

    You are essentially always using something more than an LLM (unless “you” are the person writing the whole software stack, and the only thing you are consuming is the model weights, or arguably a truly minimal harness that just takes settings and a prompt that is not transformed in any way before tokenization, and returns the result after no transformations or filtering other than mapping back from tokens to text.)

    But, yes, if you are using an elaborate frontend of the type you enumerate (whether web or CLI or something else), you are probably using substantially more stuff on top of the LLM than if you are using the providers web API.

    • In meetings, I try to explain the roles of system prompts, agentic loops, tool calls, etc in the products I create, to the stakeholders.

      However, they just look at the whole thing as "the LLM," which carries specific baggage. If we could all spread the knowledge of what is actually going on to the wider public, it would make my meetings easier, and prevent many very smart folks who are not practitioners from saying inaccurate stuff.

      1 reply →

  • You're not off-base at all. The way I think about it:

    - LLM = the model itself (stateless, no tools, just text in/text out)

    - LLM + system prompt + conversation history = chatbot (what most people interact with via ChatGPT, Claude, etc.)

    - LLM + tools + memory + orchestration = agent (can take actions, persist state, use APIs)

    When someone says "LLMs have no memory" they're correct about the raw model, but Claude Code or Cursor are agents - they have context, tool access, and can maintain state across interactions.

    The industry seems to be settling on "agentic system" or just "agent" for that last category, and "chatbot" or "assistant" for the middle one. The confusion comes from product names (ChatGPT, Claude) blurring these boundaries - people say "LLM" when they mean the whole stack.
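
    A stripped-down sketch of that last layer, just to make the distinction concrete (call_llm and the tool functions are hypothetical placeholders, not any real provider API):

      def run_agent(task, call_llm, tools, max_steps=10):
          # "Memory": the growing conversation history carried across steps.
          history = [{"role": "user", "content": task}]
          for _ in range(max_steps):
              reply = call_llm(history)          # the raw model: text in, text out
              if reply.get("tool") in tools:     # orchestration: act on requested tool calls
                  result = tools[reply["tool"]](**reply.get("args", {}))
                  history.append({"role": "tool", "content": str(result)})
                  continue
              return reply["content"]            # no tool call -> final answer
          return "gave up after max_steps"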

  • I like to use the term "coding agents" for LLM harnesses that have the ability to directly execute code.

    This is an important distinction because if they can execute the code they can test it themselves and iterate on it until it works.

    The ChatGPT and Claude chatbot consumer apps do actually have this ability now so they technically class as "coding agents", but Claude Code and Codex CLI are more obvious examples as that's their key defining feature, not a hidden capability that many people haven't spotted yet.

> The vibes are not enough. Define what correct means. Then measure.

Pretty much. I've been advocating this for a while. For automation you need intent, and for comparison you need measurement. Blast radius/risk profile is also important to understand how much you need to cover upfront.

The author mentions evaluations, which in this context are often called AI evals [1], and one thing I'd love to see is those evals become a common language of actually provable user stories instead of there being a disconnect between different types of roles, e.g. a scientist, a business guy and a software developer.

The more we can speak a common language and easily write and maintain these no matter which background we have, the easier it'll be to collaborate and empower people and to move fast without losing control.

- [1] https://ai-evals.io/ (or the practical repo: https://github.com/Alexhans/eval-ception )

This article is great. And the blog-article headline is interesting, but wrong. LLMs don't, as a rule, write plausible code either.

They just write code that is (semantically) similar to code (clusters) seen in its training data, and which haven't been fenced off by RLHF / RLVR.

This isn't that hard to remember, and is a correct enough simplification of what generative LLMs actually do, without resorting to simplistic or incorrect metaphors.

  • Exactly. It’s also easy to find yourself in the out-of-distribution territory. Just ask for some tree-sitter queries and watch Gemini 3, Opus 4.5 and GLM 5 hallucinate new directives.

    • I think this could be the key difference in how people are experiencing the tools. Using Claude in industries full of proprietary code is a totally different experience to writing some React components, or framework code in C#, PHP or Java. It's shockingly good at the latter, but as you get into proprietary frameworks or newer problem domains it feels like AI in 2023 again, even with the benefit of the agentic harnesses and context augments like memory etc.

      1 reply →

    • I think in the long term, if an LLM can’t use a tool, people won’t stop using LLM’s, they’ll stop using the tool.

      We are building everything right now with LLM agents as a primary user in mind and one of our principles is “hallucination driven development”. If LLMs hallucinate an interface to your product regularly, that is a desire path and you should create that interface.

  • IIRC, most of the code in its training data is Python, closely followed by Web technologies (HTML, JS/TS, CSS). This corresponds to the most abundant developers. Many of them dedicated their entire careers to one technology.

    We stubbornly use the same language to refer to all software development, regardless of the task being solved. This lets us all be a part of the same community, but is also a source of misunderstanding.

    Some of us are prone to not thinking about things in terms of what they are, and taking the shortcut of looking at industry leaders to tell us what we should think.

    These guys consistently, in lockstep, talk about intelligent agents solving development tasks, predominantly using the same abstract language that gives us an illusion of unity. This is bound to make those of us solving the common problems believe that the industry is done.

  • > They just write code that is (semantically) similar to code (clusters) seen in its training data, and which haven't been fenced off by RLHF / RLVR.

    "Plausible" sounds like the right word to me. (It would be a mistake to digress into these features of LLMs in an article where it isn't needed.)

    • I agree - I took "plausible" here to mean plausible-looking, no different than similar-looking.

      The trouble of course is that similar/plausible isn't good enough unless the LLM has seen enough similar-but-different training samples to refine its notion of similarity to the point where it captures the differences that are critical in a given case.

      I'd rather just characterize it as a lack of reasoning, since "add more data" can't be the solution to a world full of infinite variety. You can keep playing whack a mole to add more data to fix each failure, and I suppose it's an interesting experiment to see how far that will get you, but in the end the LLM is always going to be brittle and susceptible to stupid failure cases if it doesn't have the reasoning capability to fully analyze problems it was not trained on.

Yes plausible text prediction is exactly what it is. However, I wonder if the author included benchmarking in their prompt. It's not exactly fair to keep hidden requirements.

  • Attributing these to "hidden requirements" is a slippery slope.

    My own experience using Claude Code and similar tools tells me that "hidden requirements" could include:

    * Make sure DESIGN.md is up to date

    * Write/update tests after changing source, and make sure they pass

    * Add integration test, not only unit tests that mock everything

    * Don't refactor code that is unrelated to the current task

    ...

    These are not even project/language specific instructions. They are usually considered common sense/good practice in software engineering, yet I sometimes had to almost beg coding agents to follow them. (You want to know how many times I have to emphasize don't use "any" in a TypeScript codebase?)

    People should just admit it's a limitation of these coding tools, and we can still have a meaningful discussion.

    • The training data is full of ‘any’ so you will keep getting ‘any’ because that is the code the models have seen.

      An interesting example of the training data overriding the context.

      1 reply →

    • Yeah I agree generally that the most banal things must be specified, but I do think that a single sentence in the prompt "Performance should be equivalent" would likely have yielded better results.

Ok, I’ll bite: how is that different from humans?

  • Human behaviour is goal-directed because humans have executive function. When you turn off executive function by going to sleep, your brain will spit out dreams. Dream logic is famous for being plausible but unhinged.

    I have the feeling that LLMs are effectively running on dream logic, and everything we've done to make them reason properly is insufficient to bring them up to human level.

    • Isn’t a modern LLM with thinking tokens fairly goal directed? But yes, we hallucinate in our sleep while LLMs will hallucinate details if the prompt isn’t grounded enough.

      9 replies →

    • It’s amazing how much you get wrong here, as LLM attention layers are stacked goal functions.

      What they lack is multi turn long walk goal functions — which is being solved to some degree by agents.

      1 reply →

    • LLMs are literally goal machines. It’s all they do. So it’s important that you input specific goals for them to work towards. It’s also why logically you want to break the problem into many small problems with concrete goals.

      2 replies →

    • A prompt for an LLM is also a goal direction and it'll produce code towards that goal. In the end, it's the human directing it, and the AI is a tool whose code needs review, same as it always has been.

      1 reply →

  • The volume is different. Someone submitted a PR this week that was 3800 lines of shell script. Most of it was crap and none of it should have been in shell script. He's submitting PRs with thousands of lines of code every day. He has no idea how any of it actually works, and it completely overwhelms my ability to review.

    Sure, he could have submitted an ill-considered 3800-line PR five years ago, but it would have taken him at least a week and there probably would have been opportunities to submit smaller chunks along the way or discuss the approach.

    • It’s harder when the person doing what you describe has the ability to have you fired. Power asymmetry + irresponsible AI use + no accountability = a recipe for a code base going right to hell in a few months.

      I think we’re going to see a lot of the systems we depend on fail a lot more often. You’d often see an ATM or flight status screen have a BSOD - I think we’re going to see that kind of thing everywhere soon.

  • Humans have a 'world model' beyond the syntax - for code, an idea of what the code should do and how it does it. Of course, some humans are better than others at this, they are recognized as good programmers.

  • What surprises me about the current development environment is the acceleration of technical debt. When I was developing my skills, the nagging feeling that I didn't quite understand the technology was a big dark cloud. I felt this cloud was technical debt. This was always what I was working against.

    I see current expectations that technical debt doesn't matter. The current tools embrace superficial understanding. These tools paper over the debt. There is no need for deeper understanding of the problem or solution. The tools take care of it behind the scenes.

  • It’s not. LLMs are just averaging their internet snapshot, after all.

    But people want an AI that is objective and right. HN is where people who know the distinction hang out, but that’s not what the layperson thinks they are getting when they use this miraculous, super-hyped tool that everybody is raving about.

    • The etiquette, even at the bigtech place I work, has changed so quickly. The idea that it would be _embarrassing_ to send a code review with obvious or even subtle errors is disappearing. More work is being put on the reviewer. Which might even be fine if we made the further change that _credit goes to the reviewer_. But if anything we're heading in the opposite direction, lines of code pumped out as the criterion of success. It's like a car company that touts how _much_ gas its cars use, not how little.

      1 reply →

    • By now, a few years after ChatGPT released, I don't think anyone is thinking AI is objective and right, all users have seen at least one instance of hallucination and simply being wrong.

      6 replies →

  • It's much easier to fire an employee who produces low-quality/low-effort work than to convince leadership to fire Claude.

    • You can fire employees who don't review generated code though, because ultimately it's their responsibility to own their code, whether they hand-wrote it or an LLM did.

      It seems to me that it's all a matter of company culture, as it has always been, not AI. Those that tolerate bad code will continue to tolerate it, at their peril.

It writes statistically represented code, which is why (unless instructed otherwise) everything defaults to enterprisey, OOP, "I installed 10 trendy dependencies, please hire me" type code.

Just a recent anecdote, I asked the newest Codex to create a UI element that would persist its value on change. I'm using Datastar and have the manual saved on-disk and linked from the AGENTS.md. It's a simple html element with an annotation, a new backend route, and updating a data model. And there are even examples of this elsewhere in the page/app.

I've asked it to do way harder things so I thought it'd easily one-shot this, but for some reason it absolutely ate it on this task. I tried to re-prompt it several times but it kept digging a hole for itself, adding more and more in-line javascript and backend code (and not even cleaning up the old code).

It's hard to appreciate how unintuitive the failure modes are. It can do things probably only a handful of specialists can do, but it can also critically fail on what is a straightforward junior programming task.

This maps directly to the shift happening in API design for agent-to-agent communication.

Traditional API contracts assume a human reads docs and writes code once. But when agents are calling agents, the "contract" needs to be machine-verifiable in real-time.

The pattern I've seen work: explicit acceptance criteria in API responses themselves. Not just status codes, but structured metadata: "This response meets JSON Schema v2.1, latency was 180ms, data freshness is 3 seconds."

Lets the calling agent programmatically verify "did I get what I paid for?" without human intervention. The measurement problem becomes the automation problem.

Similar to how distributed systems moved from "hope it works" to explicit SLOs and circuit breakers. Agents need that, but at the individual request level.
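
A minimal sketch of what that check could look like on the calling side; the field names and thresholds are illustrative, not an existing standard:

  from dataclasses import dataclass

  @dataclass
  class ResponseMeta:
      schema_version: str   # which contract version the payload claims to satisfy
      latency_ms: float     # measured server-side
      freshness_s: float    # age of the underlying data, in seconds

  def accept(meta, max_latency_ms=500.0, max_freshness_s=10.0):
      # The calling agent verifies "did I get what I paid for?" from the metadata
      # itself instead of trusting a bare 200 status code.
      if meta.schema_version != "v2.1":
          raise ValueError(f"wrong contract version: {meta.schema_version}")
      if meta.latency_ms > max_latency_ms:
          raise ValueError(f"too slow: {meta.latency_ms}ms")
      if meta.freshness_s > max_freshness_s:
          raise ValueError(f"stale data: {meta.freshness_s}s old")

  # A response carrying its own acceptance criteria, as described above.
  accept(ResponseMeta(schema_version="v2.1", latency_ms=180.0, freshness_s=3.0))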

  • Interesting, but couldn’t the agent be given access to tools that allow it to make those evaluations without having to modify the API responses? (Maybe I’m not visualizing “API” the same way you are.)

What's up with the (somewhat odd) title HN has gone with for this article? It's implying a very different article than the one I just read.

One thing I’ve noticed while working with data/AI workflows is that the “acceptance criteria first” idea applies even more strongly once you move beyond code generation into data pipelines and analytics.

LLMs can generate queries, transformations, or even Spark jobs that look reasonable but if the underlying data contracts, schema expectations, or evaluation criteria aren’t defined, you end up with something that looks correct but is semantically wrong.
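
A tiny example of the kind of data contract check I mean, in pandas; the column names and rules are hypothetical, but an LLM-generated transformation only counts as done once its output passes something like this:

  import pandas as pd

  # Hypothetical contract: columns, dtypes, and quality rules the output must satisfy.
  EXPECTED_SCHEMA = {"user_id": "int64", "event_ts": "datetime64[ns]", "amount": "float64"}

  def check_contract(df: pd.DataFrame) -> None:
      for col, dtype in EXPECTED_SCHEMA.items():
          assert col in df.columns, f"missing column: {col}"
          assert str(df[col].dtype) == dtype, f"{col}: expected {dtype}, got {df[col].dtype}"
      assert df["user_id"].notna().all(), "user_id must never be null"
      assert (df["amount"] >= 0).all(), "amount must be non-negative"

  check_contract(pd.DataFrame({
      "user_id": pd.Series([1, 2], dtype="int64"),
      "event_ts": pd.to_datetime(["2024-01-01", "2024-01-02"]),
      "amount": [9.99, 0.0],
  }))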

In practice, the teams that get the most value from AI-assisted development tend to have: clearly defined datasets, reproducible data pipelines, and well-defined outputs/metrics. Once those pieces are in place, AI becomes much more useful because it’s operating inside a structured system instead of guessing context. That’s also why there’s been a lot of interest lately in lakehouse-style platforms that combine data engineering, analytics, and AI workflows in one place (e.g. platforms like IOMETE).

When the data layer is structured and reproducible, AI tooling becomes far more reliable. Curious if others here have seen the same pattern when using LLMs for data engineering or analytics work.

Most humans also write plausible code.

  • LLMs piggyback on human knowledge encoded in all the texts they were trained on without understanding what they're doing.

    Humans would execute that code and validate it. From plausible it becomes: hey, it does this, and this is what I want. LLMs skip that part; they really have no understanding other than the statistical patterns they infer from their training, and they really don't need any for what they are.

    • Could we stop using vague terms like “understanding” when talking about LLMs and machine learning? You don't know what understanding is. You only know how it feels to understand something.

      It's better to describe what you can do that LLMs currently can't.

      2 replies →

    • LLMs can execute code and validate it too so the assertions you've made in your argument are incorrect.

      What a shame your human reasoning and "true understanding" led you astray here.

LLMs have no idea what "correct" means.

Anything they happen to get "correct" is the result of probability applied to their large training database.

Being wrong will always be not only possible but also likely any time you ask for something that is not well represented in its training data. The user has no way to know if this is the case, so they are basically flying blind and hoping for the best.

Relying on an LLM for anything "serious" is a liability issue waiting to happen.

  • Yes Transformer models are non-deterministic, but it is absolutely not true that they can't generalise (the equivalent of interpolation and extrapolation in linear regression, just with a lot more parameters and training).

    For example, let's try a simple experiment. I'll generate a random UUID:

    > uuidgen
    44cac250-2a76-41d2-bbed-f0513f2cbece

    Now it is extremely unlikely that such a UUID is in the training set.

    Now I'll use OpenCode with "Qwen3 Coder 480B A35B Instruct" with this prompt: "Generate a single Python file that prints out the following UUID: "44cac250-2a76-41d2-bbed-f0513f2cbece". Just generate one file."

    It generates a Python file containing 'print("44cac250-2a76-41d2-bbed-f0513f2cbece")'. Now this is a very simple task (with a 480B model), but it solves a problem that is not in the training data, because it is a generalisation over similar but different problems in the training data.

    Almost every programming task is, at some level of abstraction, and with different levels of complexity, an instance of solving a more general type of problem, where there will be multiple examples of different solutions to that same general type of problem in the training set. So you can get a very long way with Transformer model generalisations.

  • It’s a shame the bulk of that training data is likely 2010s blogspam that was poor quality to begin with.

    • But isn't that a reflection of reality?

      If you've made a significant investment in human capital, you're even more likely to protect it now and prevent posting valuable stuff on the web.

      2 replies →

  • Aye. I wish more conversations would be more of this nature - in that we should start with basic propositions - e.g. the thing does not 'know' or 'understand' what correct is.

  • This is about to change very soon. Unlike many other domains (such as greenfield scientific discovery), most coding problems for which we can write tests and benchmarks are "verifiable domains".

    This means an LLM can autogenerate millions of code problem prompts, attempt millions of solutions (both working and non-working), and from the working solutions, penalize answers that have poor performance. The resulting synthetic dataset can then be used as a finetuning dataset.

    There are now reinforcement finetuning techniques that have not been incorporated into the existing slate of LLMs that will enable finetuning them for both plausibility AND performance with a lot of gray area (like readability, conciseness, etc) in between.

    What we are observing now is just the tip of a very large iceberg.

    • Let's suppose whatever you say is true.

      If I'm the govt, I'd be foaming at the mouth - those projects that used to require enormous funding now will supposedly require much less.

      Hmmm, what to do? Oh I know. Let's invest in Digital ID-like projects. Fun.

      2 replies →

  • This is easily proven incorrect. Just go to ChatGPT and say something incorrect and ask it to verify. Why do people still believe this type of thing?

    • I did this yesterday and it was happy to provide me with an incorrect explanation. Not just that, but incorrect thermodynamic data supporting its claims, despite readily available published values to the contrary.

I'm using an LLM to write queries ATM. I have it write lots of tests, do some differential testing to get the code and the tests correct, and then have it optimize the query so that it can run on our backend (and optimization isn't really optional since we are processing a lot of rows in big tables). Without the tests this wouldn't work at all, and not just tests, we need pretty good coverage since if some edge case isn't covered, it likely will wash out during optimization (if the code is ever correct about it in the first place). I've had to add edge cases manually in the past, although my workflow has gotten better about this over time.
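
The differential-testing part is less exotic than it sounds; stripped down it's essentially the sketch below, where the two query runners and the test cases are placeholders for the real backend:

  def differential_check(run_reference, run_candidate, cases):
      # run_reference is the slow-but-trusted query, run_candidate the optimized
      # rewrite; both take the same parameters and return rows.
      for params in cases:
          expected = sorted(run_reference(params))
          actual = sorted(run_candidate(params))
          assert actual == expected, f"divergence for {params!r}"

  # Toy in-memory "queries" standing in for the real backend.
  data = [("a", 1), ("b", 2), ("a", 3)]
  reference = lambda key: [row for row in data if row[0] == key]
  candidate = lambda key: sorted(row for row in data if row[0] == key)
  differential_check(reference, candidate, cases=["a", "b", "missing"])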

I don't use a planner though, I have my own workflow setup to do this (since it requires context isolated agents to fix tests and fix code during differential testing). If the planner somehow added broad test coverage and a performance feedback loop (or even just very aggressive well known optimizations), it might work.

100%. I've found that you think you are smarter than the LLM and know what you want, but this is not the case. Give the LLM some leeway to come up with a solution based on what you are looking to achieve - give requirements, but don't ask it to produce the solution that you would have, because then the response is forced and it is lower quality.

You: Claude, do you know how to program?

Claude: No, but if you hum a few bars I can fake it!

Except "faking it" turns out to be good enough, especially if you can fake it at speed and get feedback as to whether it works. You can then just hillclimb your way to an acceptable solution.

This is a bit unfair - generating a bunch of code without giving the model data/tools and directing it to optimize, then comparing the result to the optimized work of thousands over decades.

Feels like an extremely high effort hit piece, even though I know it’s not.

I’ve found this to be critical for having any chance of getting agents to generate code that is actually usable.

The more frequently you can verify correctness in some automated way the more likely the overall solution will be correct.

I’ve found that with good enough acceptance criteria (both positive and negative) it’s usually sufficient for agents to complete one off tasks without a human making a lot of changes. Essentially, if you’re willing to give up maintainability and other related properties, this works fairly well.

I’ve yet to find agents good enough to generate code that needs to be maintained long term without a ton of human feedback or manual code changes.

In the last month I've done 4 months of work. My output is what a team of 4 would have produced pre-AI (5 with scrum master).

Just like you can't develop musical taste without writing and listening to a lot of music, you can't teach your gut how to architect good code without putting in the effort.

Want to learn how to 10x your coding? Read design patterns, read and write a lot of code by hand, review PRs, hit stumbling blocks and learn.

I noticed the other day how I review AI code in literally seconds. You just develop a knack for filtering out the noise and zooming in on the complex parts.

There are no shortcuts to developing skill and taste.

  • > I review AI code in literally seconds

    You've just settled for hackathon standards and told yourself it's okay because you're using AI.

    Everyone with experience should know that even thorough code reviews only catch stylistic issues, glaring errors, and the most obvious design deficiencies. The only time new code is truly thought about is as it's being written.

Excellent article. But to be fair, many of these effects disappear when the model is given strict invariants, constraints, and built-in checks that are applied not only at the beginning but at every stage of generation.

Producing the most plausible code is literally encoded into the cross entropy loss function and is fundamental to the pre-training. I suppose post training methods like RLVR are supposed to correct for this by optimizing correctness instead of plausibility, but there are probably many artifacts like these still lurking in the model's reasoning and outputs. To me it seems at least possible that the AI labs will find ways to improve the reward engineering to encourage better solutions in the coming years though.
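
Concretely, the pre-training objective is plain next-token maximum likelihood over the corpus, roughly

  \mathcal{L}(\theta) = -\sum_{t} \log p_\theta(x_t \mid x_{<t})

so the base model is rewarded for making code look like the code it has seen, not for passing any external correctness or performance check; RLVR-style post-training is a correction layered on top of that objective.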

I tried to make Claude Code, Sonnet 4.6, write a program that draws a fleur-de-lis.

No exaggeration it floundered for an hour before it started to look right.

It's really not good at tasks it has not seen before.

  • Considering that a fleur-de-lis involves somewhat intricate curves, I think I'd be pretty happy with myself if I could get that task done in an hour.

    Given a harness that allows the model to validate the result of its program visually, and given the models are capable of using this harness to self correct (which isn't yet consistently true), then you're in a situation where in that hour you are free to do some other work.

    A dishwasher might take 3 hours to do what a human could do in 30 minutes, but it's still very useful because the machine's labor is cheaper than human labor.

    • I didn't provide any constraints on how to draw it.

      TBH I would have just rendered a font glyph, or failing that, grabbed an image.

      Drawing it with vector graphics programmatically is very hard, but a decent programmer would and should push back on that.

      2 replies →

  • Even with well understood languages, if there isn't much in the public domain for the framework you're using it's not really that helpful. You know you're at the edges of its knowledge when you can see the exact forum posts you are looking at showing up verbatim in its responses.

    I think some industries with mostly proprietary code will be a bit disappointing to use AI within.

  • I got Opus 4.6 to one shot it, took 5-ish mins. "Write me a python program that outputs an svg of a fleur-de-lis. Use freely available images to double check your work."

    It basically just re-created the wikipedia article fleur-de-lis, which I'm not sure proves anything beyond "you have to know how to use LLMs"

    • Just for reference, Codex using GPT-5.4 and that exact prompt was a 4-shot that took ten minutes. The first result was a horrific caricature. After a slight rebuke ("That looks terrible. Read https://en.wikipedia.org/wiki/Fleur-de-lis for a better understanding of what it should look like."), it produced a very good result but it then took two more prompts about the right side of the image being clipped off before it got it right.

    • Same, I used Sonnet 4.6 with the prompt, "Write a simple program that displays a fleur-de-lis. Python is a good language for this." Took five or six minutes, but it wrote a nice Python Tk app that did exactly what it was supposed to.

  • I tried to use Codex to write a simple TCP to QUIC proxy. I intentionally kept the request fairly simple, take one TCP connection and map it to a QUIC connection. Gave a detailed spec, went through plan mode, clarified all the misunderstandings, let it write it in Python, had it research the API, had it write a detailed step by step roadmap... The result was a fucking mess.

    Beyond the fact that it was "correct" in the same way the author of the article talked about, there was absolutely bizarre shit in there. As an example, multiple times it tried to import modules that didn't exist. It noticed this when tests failed, and instead of figuring out the import problem it added a fucking try/except around the import and did some goofy Python shenanigans to make it "work".

  • Have you tried describing to Claude what it is? The more the detail the better the result. At some point it does become easier to just do it yourself.

The reference in the text to Anthropic’s “Towards Understanding Sycophancy in Language Models” is related to RLHF (reinforcement learning with human feedback).

Claude Code uses primarily different "pathways" in Anthropic LLMs that were not post-trained with RLHF, but rather with RLVR (reinforcement learning with verifiable rewards).

So, his point about code being produced to please the user isn't valid from where I am sitting.

I think there is one problem with defining acceptance criteria first: sometimes you don't know ahead of time what those criteria are. You need to poke around first to figure out what's possible and what matters. And sometimes the criteria are subjective, abstract, and cannot be formally specified.

Of course, this problem is more general than just improving the output of LLM coding tools.

  • Yeah it’s extremely helpful to clarify your thoughts before starting work with LLM agents.

    I find Claude Code style plan mode to be a bit restrictive for me personally, but I’ve found that creating a plan doc and then collaboratively iterating on it with an LLM to be helpful here.

    I don’t really find it much different than the scoping I’d need to do before handing off some work to a more junior engineer.

The difference for me recently:

Write a lambda that takes an S3 PUT event and inserts the rows of a comma separated file into a Postgres database.

Naive implementation: download the file from S3 and do a bulk insert - that's what Claude did at first, and it would have taken 20 minutes.

I had to tell it to use the AWS sql extension to Postgres that will load a file directly from S3 into a table. It took 20 seconds.
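
For reference, the direct-load version ends up roughly like the sketch below, assuming the RDS aws_s3 extension is what's meant here; the table name, region handling, and connection details are placeholders:

  import os
  import psycopg2

  def handler(event, context):
      # Triggered by the S3 PUT event; the record identifies the uploaded CSV.
      s3 = event["Records"][0]["s3"]
      bucket, key = s3["bucket"]["name"], s3["object"]["key"]

      conn = psycopg2.connect(os.environ["DATABASE_URL"])  # credentials assumed to come from env/secrets
      with conn, conn.cursor() as cur:
          # Server-side import: RDS pulls the object from S3 itself, so the Lambda
          # never downloads the file or loops over rows.
          cur.execute(
              """
              SELECT aws_s3.table_import_from_s3(
                  'staging_rows',                -- target table (placeholder name)
                  '',                            -- default: all columns in table order
                  '(FORMAT csv, HEADER true)',
                  aws_commons.create_s3_uri(%s, %s, %s)
              )
              """,
              (bucket, key, os.environ.get("AWS_REGION", "us-east-1")),
          )
      conn.close()
      return {"statusCode": 200, "body": f"imported s3://{bucket}/{key}"}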

I treat coding agents like junior developers.

  • Unlike junior developers, LLMs can take detailed instructions and produce outstanding results on the first shot a good number of times.

    • While I’m pro-LLM over junior developers, the other issue with LLMs is that even the most junior developer will learn your business context over time.

      In my case, in consulting (cloud + app dev), I just start the AGENTS.md file with a summary of the contract (the SOW), my architectural diagram and the transcript of my design review with the customer.

  • Did you ask it to research best practices for this method, have an adversarial performance-based agent review its approach, or search for performant examples of the task first? Relying on training data only will always get you subpar results. Using “What is the most performant way to load a CSV from S3 into PostgreSQL on RDS? Compare all viable and research approaches before recommending one.” gave me the extension as the top option.

    • I knew the best way. I was just surprised that Claude got it wrong. As soon as I told it to use the s3 extension, it knew to add the appropriate permissions, to update my sql unit script to enable the extension and how to write the code

      1 reply →

  • Same pattern in data engineering generally. LLMs default to the obvious row-by-row or download-then-insert approach and you have to steer them toward the efficient path (COPY, bulk loaders, server-side imports). Once you name the right primitive, they execute it correctly, permissions and all, as you found.

    The deeper issue is that "efficient ingest" depends heavily on context that's implicit in your setup: file sizes, partitioning, schema evolution expectations, downstream consumers. A Lambda doing direct S3-to-Postgres import is fine for small/occasional files, but if you're dealing with high-volume event-driven ingestion you'll hit connection pool pressure fast on RDS. At that point the conversation shifts to something like a queue buffer or moving toward a proper staging layer (S3 → Redshift/Snowflake/Databricks with native COPY or autoloader). The LLM won't surface that tradeoff unless you explicitly bring it up. It optimizes for the stated task, not for the unstated architectural constraints.

    • Also, with Redshift: split the file up before ingestion so the count matches the number of nodes, or combine a lot of small files into larger files before putting them into S3, and/or use an Athena CTAS command to combine a lot of small files into one big file.

      So in my other case, the whole thing was:

      Web crawler (internal customer website) using Playwright -> S3 -> SNS -> SQS -> Lambda (embed with Bedrock) -> S3 Vector Store.

      Similar to what you said, I ran into Bedrock embedding service limits. Once I told it that, it knew how to adjust the Lambda concurrency limits. Of course, I had to tell it to also adjust the SQS poller so messages wouldn't be backed up in flight and then go to the DLQ without ever being processed.
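      A hedged sketch of the knobs that tuning involves, as I understand it (the mapping UUID, queue URL, DLQ ARN, and numbers are placeholders): cap the SQS poller's concurrency so the Lambdas stay under the Bedrock quota, and size the visibility timeout and redrive policy so throttled messages get retried instead of expiring into the DLQ.

          import boto3

          lambda_client = boto3.client("lambda")
          sqs = boto3.client("sqs")

          # Cap how many Lambda invocations the SQS poller runs at once.
          lambda_client.update_event_source_mapping(
              UUID="esm-uuid-placeholder",
              ScalingConfig={"MaximumConcurrency": 5},
          )

          # Give messages enough in-flight time and retries before the DLQ.
          sqs.set_queue_attributes(
              QueueUrl="https://sqs.us-east-1.amazonaws.com/123456789012/embed-queue",
              Attributes={
                  "VisibilityTimeout": "900",  # seconds; > function timeout x retries
                  "RedrivePolicy": (
                      '{"deadLetterTargetArn":'
                      ' "arn:aws:sqs:us-east-1:123456789012:embed-dlq",'
                      ' "maxReceiveCount": "5"}'
                  ),
              },
          )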

      2 replies →

Does it work if you get the agent to throw away all of its actual implementation and start again from scratch, keeping all the learning and tests and feedback?

Gemini seems to try to get a lot of information upfront with questions and plans but people are famously bad at knowing what they want.

Maybe it should build a series of prototypes and spikes to check? If making code is cheap then why not?

  • This does work, but it requires prompts that instruct the agent to do it. It's also not perfect, though it is pretty good.

    What I've found when doing exactly this is that the cost of the initial code makes me hesitant to throw it away. A better workflow I've been using is to iterate on very detailed planning documents written in markdown instead (sometimes 50+ times for a complex app). It's really quite amazing how much that helps. It can lead to a design doc that is good enough that I can turn the agent loose on implementation and get decent results. Best results still come with guidance throughout, but I have never once regretted hammering out a very detailed planning document. I have many times regretted keeping code (or throwing code away).

There's also such a thing as being too ambitious. 99% of developers cannot rewrite SQLite in Rust even if they spent the rest of their lifetime doing it.

Expecting an AI to do a good job vibe-coding a SQLite clone over a few weekends just isn't realistic. Despite that, it's useful technology.

The following paragraph appears twice:

> Now 2 case studies are not proof. I hear you! When two projects from the same methodology show the same gap, the next step is to test whether similar effects appear in the broader population. The studies below use mixed methods to reduce our single-sample bias.

You can ask an LLM to write benchmarks and to make the code faster. It will find and fix simple performance issues - the low-hanging fruit. If you want it to do better, you can give it better tools and more guidance.

It's probably a good idea to improve your test suite first, to preserve correctness.

> Your LLM Doesn't Write Correct Code. It Writes Plausible Code.

I don't always write correct code, either. My code sure as hell is plausible but it might still contain subtle bugs every now and then.

In other words: 100% correctness was never the bar LLMs need to pass. They just need to come close enough.

The acceptance criteria point translates directly outside of coding too. Using Claude Code for sales and operational workflows, having acceptance criteria upfront (along with some manual checks along the way depending on the task) definitely helps the output.

This article hits on an important point not easily discerned from the title:

Sometimes good software is good due to a long history of hard-earned wins.

AI can help you get to an implementation faster. But it cannot magically summon up a battle-hardened solution. That requires going through some battles.

Great software takes time.

Uncle Bob made this concept clear to me when he introduced me to the idea that code itself IS the requirements specification. LLMs are the new intermediary, but the necessity of the word and the machine persists.

> SQLite is not primarily fast because it is written in C. Well.. that too, but it is fast because 26 years of profiling have identified which tradeoffs matter.

Someone (with deep pockets to bear the token costs) should let Claude run for 26 months to have it optimize its Rust code base iteratively towards equal benchmarks. Would be an interesting experiment.

The article points out the general issue when discussing LLMs: audience and subject matter. We mostly discuss interactions and results anecdotally. We really need much more data: more projects succeeding with LLMs or failing with them, or lingering in a state of ignorance, sunk-cost fallacy and suppressed resignation. I expect the latter will remain the standard case that we do not hear about, the part of the iceberg that is underwater, mostly existing within the corporate world or in private GitHubs; a case that is true with LLMs and without them.

In my experience, 'Senior Software Engineer' has NO general meaning. It's a title that has to be earned anew with each project/product, over and over again. The same goes for the claim: "Me, a Senior SWE, treat LLMs as Junior SWEs, and I am 10x more productive." Imagine me facepalming every time.

bad input -> bad output

idk what to say; just because it's Rust doesn't mean it's performant, or that you asked for it to be performant.

yes, llms can produce bad code, they can also produce good code, just like people

  • > yes, llms can produce bad code, they can also produce good code, just like people

    Over time, you develop a feel for which human coders tend to be consistently "good" or "bad". And you can eliminate the "bad".

    With an LLM, output quality is like a box of chocolates, you never know what you're going to get. It varies based on what you ask and what is in its training data --- which you have no way to examine in advance.

    You can't fire an LLM for producing bad code. If you could, you would have to fire them all because they all do it in an unpredictable manner.

Enterprise customers don't buy correct code, they buy plausible code.

  • They're not buying code.

    They are buying a service. As long as the service 'works' they do not care about the other stuff. But they will hold you liable when things go wrong.

    The only caveat is highly regulated stuff, where they actually care very much.

  • Enterprise customers don't buy plausible code, they buy the promise of plausible code as sold by the hucksters in the sales department.

I'm sure this is because they are pattern-matching masters: if you program them to find something, they are good at that. But you have to know what you're looking for.

These LLM prompting tip articles write themselves if you just take the last decade of project management articles and replace “IC” with “agent”.

I've noticed a key quality signal with LLM coding is an LOC growth rate that tapers off or even turns negative.
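One rough way to watch that signal (a sketch of mine, not from the comment) is to chart the net lines added minus deleted per commit from `git log --numstat`:

    import subprocess

    def net_loc_per_commit(repo: str = ".") -> list[int]:
        """Net (added - deleted) lines for each commit, newest first."""
        out = subprocess.run(
            ["git", "-C", repo, "log", "--numstat", "--pretty=format:@@"],
            capture_output=True, text=True, check=True,
        ).stdout
        deltas: list[int] = []
        for line in out.splitlines():
            if line == "@@":          # start of a new commit's stats
                deltas.append(0)
            elif line.strip():
                added, deleted, _path = line.split("\t", 2)
                if added != "-":      # "-" marks binary files
                    deltas[-1] += int(added) - int(deleted)
        return deltas                 # negative values mean the commit shrank the codebase

By that heuristic, a healthy trajectory shows the curve flattening, or dipping below zero as duplication gets consolidated.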

That's very impressive. Your LLM actually wrote correct code for a full relational database on the first try; sure, it takes 2.5 seconds to insert 100 rows, but it stores them correctly and SELECT is pretty fast. How many humans can do this without a week of debugging? I would suggest you install some profiling tools and ask it to find and address hotspots. SQLite had how long and how many people to get to where it is?

  • I could "write" this code the same way, it's easy

    Just copy and paste from an open source relational db repo

    Easy. And more accurate!

    • The actual task is usually to build something that looks like a dozen different open source repos combined, but taking just the necessary parts for the task at hand and adding glue / custom code for the exact thing being built. While I could do it, an LLM is much faster at it, and most importantly I would not enjoy the task.

Early LLMs would do better at a task if you prefixed the task with "You are an expert [task doer]"

This is becoming clear now?

I have had similar experiences, and I read about others' experiences like this over and over.

A powerful tool...

Bro, you are basically saying "oh, the LLM can't do X within 10 days when few people manage it over decades." Live a little, bro; applaud it and change the title to "it can do xyz" instead of piling criticism on criticism...

> I write this as a practitioner, not as a critic. After more than 10 years of professional dev work, I’ve spent the past 6 months integrating LLMs into my daily workflow across multiple projects. LLMs have made it possible for anyone with curiosity and ingenuity to bring their ideas to life quickly, and I really like that! But the number of screenshots of silently wrong output, confidently broken logic, and correct-looking code that fails under scrutiny I have amassed on my disk shows that things are not always as they seem.

Same experience, but the hype bros only need a shiny screengrab to proclaim the age of "gatekeeping" SWE is over and get their click fix from the unknowing masses.

Oftentimes, plausible code is good enough, which is why people keep using AI to generate code. This is a distinction without a difference.

  • No. Plausible code is syntactically correct BS disguised as a solution, hiding countless weird semantic behaviours, invariants and edge cases. It doesn't reflect a natural, common-sense thought process that a human may follow. It's a jumble of badly joined patterns with no integral sense of how they fit together in the larger conceptual picture.

    • Why do people keep insisting that LLMs don't follow a chain-of-reasoning process? Using the latest LLMs you can see exactly what they "think" and see the resultant output. Plausible code does not mean random code, as you seem to imply; it means... code that could work for this particular situation.

      11 replies →

But my AI didn't do what your AI did.

Cherry-picked AI fail for upvotes. Which you'll get plenty of here and on Reddit from those too lazy to go and take a look for themselves.

Using Codex or Claude to write and optimize high-performance code is a game changer. Try optimizing CUDA using nsys, for example. It'll blow your lazy little brain.

  • Yeah right. An LLM in the hands of a junior engineer produces a lot of code that looks like it was written by a junior. An LLM in the hands of a senior engineer produces code that looks like it was written by a senior. The difference is the quality of the prompt, as well as the human judgement to reject the LLM code and the follow-up prompts to tell the LLM what to write instead.

    • Lol what. The difference is that the senior... is a senior. Ask yourself what characteristics comprise a senior vs a junior...

      You're glossing over so much stuff. Moreover, how does the Junior grow and become the senior with those characteristics, if their starting point is LLMs?

      2 replies →

    • Prompting is just step 1. Creating and reviewing a plan is step 2. Step 0 was iterating and getting the right skills in place. Step 3 is a command/skill that decomposes the problem into small implementation steps, each with its dependencies and a way to verify/test that step. Step 4 is executing the implementation plan using sub-agents and ensuring validation/testing passes. Step 5 is a code review using Codex (since I use Claude for implementation).

    • I kind of agree. But I'd adjust that to say that in both cases you get good-looking code. In the hands of a junior you get crappy architecture decisions and a complete failure to manage complexity, which results in the inevitable Reddit "they degraded the model" post. In the hands of seniors you get well-managed complexity, targeted features, scalable high-performance architecture, and good base technology choices.

  • It’s easy to get AI to write bad code. Turns out you still need coding skills to get AI to write good code. But those who have figured it out can crank out working systems at a shocking pace.

    • Agreed 100%. I'd add that it's the knowledge of architecture and scaling that you got from writing all that good code, shipping it, and then having to scale it. It gives you the vocabulary and broad and deep knowledge base to innovate at lightning speeds and shocking levels of complexity.