Comment by miguelgrinberg

14 hours ago

> One thing I’ve noticed is that different people get wildly different results with LLMs, so I suspect there’s some element of how you’re talking to them that affects the results.

It's always easier to blame the prompt and convince yourself that you have some sort of talent in how you talk to LLMs that others don't.

In my experience the differences are mostly in how the code produced by the LLM is reviewed. Developers who have experience reviewing code are more likely to find problems immediately and complain they aren't getting great results without a lot of hand holding. And those who rarely or never reviewed code from other developers are invariably going to miss stuff and rate the output they get higher.

I dunno, I have extensive experience reviewing code, and I still review all the AI generated code I own, and I find nothing to complain about in the vast majority of cases. I think it is based on "holding it right."

For instance, I've commented before that I tend to decompose tasks intended for AI to a level where I already know the "shape" of the code in my head, as well as what the test cases should look like. So reviewing the generated code and tests for me is pretty quick because it's almost like reading a book I've already read before, and if something is wrong it jumps out quickly. And I find things jumping out more and more infrequently.

Note that decomposing tasks means I'm doing the design and architecture, which I still don't trust the AI to do... but over the years the scope of tasks has gone up from individual functions to entire modules.

In fact, I'm getting convinced vibe coding could work now, but it still requires a great deal of skill. You have to give it the right context and sophisticated validation mechanisms that help it self-correct as well as let you validate functionality very quickly with minimal looks at the code itself.

  • "Holding it right" has been one of my biggest problems. Many times I find the output affected by prompt poisoning, and I have to throw away the entire context.

This definitely is the case. I was talking to someone complaining about how LLMs don't work well.

They said it couldn't fix an issue it made.

I asked if they gave it any way to validate what it did.

They did not; some people really are saying "fix this" instead of saying "x fn is doing y when someone makes a request to it. Please attempt to fix x, and validate it by accessing the endpoint afterward and writing tests."

It's shocking that some people don't give it any real instruction or any way to check itself.

In addition, I get great results doing voice-to-text with very specific workflows: asking it to add a new feature, where I describe which functions I want changed, then reviewing as I go rather than waiting for the end.

  • > It's shocking that some people don't give it any real instruction or any way to check itself.

    It's not shocking. The tech world is telling them that "Claude will write all of their app easily" with zero instructions/guidelines so of course they're going to send prompts like that.

    • I think the implications of limited-to-no instructions vary quite a bit depending on what you're doing... CRUD APIs, sure... especially if you have a well-defined DB schema and API surface/approach. Anything that might get complex, less so.

      Two areas where I've really appreciated LLMs so far... one is being able to make web components that do one thing well, in encapsulation. I can bring one into my project and just use it; AI can scaffold a test/demo app that exercises the component with ease, and testing becomes pretty straightforward.

      The other for me has been in bridging rust to wasm and even FFI interfaces so I can use underlying systems from Deno/Bun/Node with relative ease... it's been pretty nice all around to say the least.

      That said, this all takes work... lots of design work up front for how things should function, whether it's a UI component or an API backend library. From there, you have to add in testing, and some iteration to discover and ensure there aren't behavioral bugs in place. And actually reviewing the code, especially the written test logic: LLMs tend to over-test in ways that are excessive or redundant a lot of the time, especially when a longer test function effectively re-tests underlying functionality that already has its own tests... cut those out.

      There's nothing "free" and it's not all that "easy" either, assuming you actually care about the final product. It's definitely work, but it's more about the outcome and creation than the grunt work. As a developer, you'll be expected to think a lot more, plan and oversee what's getting done as opposed to being able to just bang out your own simple boilerplate for weeks at a time.

    • It's surprising they don't learn better after their first hour or two of use. Or maybe they do know better but don't like the thing, so they deliberately give it rope to hang itself with, then blame overzealous marketing.

  • There are subtler versions of this too. I've been working on a TUI app for a couple of weeks, and having great success getting it to interactively test by sending tmux commands, but every once in a while it would just deliver code that didn't work. I finally realized it was because the capture tools I gave it didn't capture the cursor location, so it would, understandably, get confused about where it was and what was selected.

    I promptly went and fixed this before doing any more work, because I know if I was put in that situation I would refuse to do any more work until I could actually use the app properly. In general, if you wouldn't be able to solve a problem with the tools you give an LLM, it will probably do a bad job too.
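
    A concrete version of that fix might look like the sketch below (the session name and output format are assumptions, not the poster's actual setup): tmux can report the cursor position separately from the pane contents, so the capture tool handed to the agent can include both.

```shell
# What the agent "sees": the visible contents of the pane under test.
# "app" is a placeholder tmux session name.
tmux capture-pane -p -t app

# capture-pane alone omits the cursor, which is exactly the gap described
# above; report its position explicitly as well.
tmux display-message -p -t app 'cursor at col=#{cursor_x} row=#{cursor_y}'
```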

  • If you tell a human junior developer just "fix this" then they will spend a week on a wild-goose chase with nothing to show for it.

    At least the LLM will only take 5 minutes to tell you they don't know what to do.

    • Do they? I've never gotten a response that something was impossible, or stupid. LLMs are happy to verify that a no-op does nothing if they don't know how to fix something. They'd rather make something useless than really tackle a problem, if they can make tests green that way, or claim that something "works".

      And I've never asked Claude Code for something which is really impossible, or even really difficult.


    • An LLM might take 5 minutes, or 20 minutes, and still do the wrong thing. Rarely have I seen an LLM not "know what to do." A coworker told one to fix some unit tests; it churned away for a while, then changed a bunch of assert status == 200 checks to expect 500. Good news, tests pass now!

    • To be fair, that happening feels more like poor management and mentorship than "juniors are scatterbrained".

      Over time, you build up the right reflexes that avoid a one-week goose chase with them. Heck, since we're working with people, you don't just say "fix this"; you earmark time to make sure everyone is aligned on what needs to be done and what the plan is.

    • > At least the LLM will only take 5 minutes to tell you they don't know what to do.

      In my experience, the LLM will happily try the wrong thing over and over for hours. It rarely will say it doesn’t know.


  • Yeah, the more time I spend in planning and working through design/API documentation for how I want something to work, the better it does... Similarly for testing against your specifications, not the code: once you have a defined API surface and functional/unit tests for what you're trying to do, it's that much harder for the AI to actually mess things up. Even more interesting, IMO, is how well the agents work with Rust vs. other languages the more well-defined your specifications are.

  • > some people really are saying "fix this" instead of saying "x fn is doing y when someone makes a request to it. Please attempt to fix x and validate it by accessing the endpoint after and writing tests"

    This works about 85% of the time IME, in Claude Code. My normal workflow on most bugs is to just say “fix this” and paste the logs. The key is that I do it in plan mode, then thoroughly inspect and refine the plan before allowing it to proceed.

  • Untested Hypothesis: LLM instruction is usually an intelligence+communication-based skill. I find in my non-authoritative experience that users who give short form instructions are generally ill prepared for technical motivation (whether they're motivating LLMs or humans).

  • lol that is still “how you’re talking to them that affects the results” just more specific

  • Feeding the LLM a "copy as cURL" for its feedback loop instead of letting it manage the dev server was an unlock for me.
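
    One way that can look in practice (the URL, endpoint, and token below are placeholders, not from the comment): paste the request from DevTools' "Copy as cURL" into the agent's instructions and have it replay the request after each change.

```shell
# Replay the exact browser request after each change, instead of having
# the agent start and babysit the dev server. Everything here is a
# placeholder pasted from a hypothetical "Copy as cURL".
curl -sS 'http://localhost:3000/api/orders?limit=10' \
  -H 'Accept: application/json' \
  -H 'Authorization: Bearer dev-token'
```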

I have 30 years of experience delivering code and 10 years of leading architecture. My argument is that the only thing that matters is whether the entire implementation - code + architecture (your database, networking, the runtime that determines scaling, etc.) - meets the functional and non-functional requirements. Functional = does it meet the business requirements and UX; non-functional = scalability, security, performance, concurrency, etc.

I only carefully review the parts of the implementation that I know will "work on my machine but break once I put them in a real-world scenario". Even before AI I wasn't one of the people who got into geek wars over which GoF pattern you should have used.

Except for concurrency, where it's hard to have automated tests, I care more about the unit tests, or honestly the integration tests, and testing for scalability than about the code. Your login isn't slow because you chose to use a for loop instead of a while loop. I will have my agents run the appropriate tests after code changes.

I didn’t look at a line of code for my vibe coded admin UI authenticated with AWS cognito that at most will be used by less than a dozen people and whoever maintains it will probably also use a coding agent. I did review the functionality and UX.

Code, before AI, was always the grind between my architectural vision and its implementation.

  • Explain how fragility of implementation (spaghetti code, high coupling, low cohesion) fits into your worldview?

    • As human developers, I think we're struggling with "letting go" of the code. The code we write (or agents write) is really just an intermediate representation (IR) of the solution.

      For instance, GCC will inline functions, unroll loops, and apply myriad other optimizations that we don't inspect (and actually want!). But when we review the ASM that GCC generates, we are not concerned with the "spaghetti" and the "high coupling" and "low cohesion". We care that it works, and is correct for what it is supposed to do.

      Source code in a higher-level language is not really different anymore. Agents write the code, maybe we guide them on patterns and correct them when they are obviously wrong, but the code is just the work-item artifact that comes out of extensive specification, discussion, proposal review, and more review of the reviews.

      A well-guided, iterative process and problem/solution description should be able to generate an equivalent implementation whether a human is writing the code or an agent.


    • You did see the part about my unit, integration and scalability testing? The testing harness is what prevents the fragility.

      It doesn’t matter to AI whether the code is spaghetti code or not. What you said was only important when humans were maintaining the code.

      No human should ever be forced to look at the code behind my vibe-coded internal admin portal, which was created with straight Python (no frameworks), server-side rendered, producing the HTML and JS for the front end, all hosted in a single Lambda, including much of the backend API.

      I haven’t done web development since 2002 with Classic ASP besides some copy and paste feature work once in a blue moon.

      In my repos, post-AI, my Claude/agent files have summaries of the initial statement of work, the transcripts from the requirements sessions, my well-labeled design diagrams, the transcripts of my design review sessions where I explained the design to the client and answered questions, and a link to the Google NotebookLM project with all of the artifacts. I have separate md files for different implementation components.

      The NotebookLM project can be used for any future maintainers to ask questions about the project based on all of the artifacts.


    • In my experience, consulting companies typically have a bunch of low-to-medium skilled developers producing crap, so the situation with AI isn't much different. Some are better than others, of course.

    • Also developer UX, common antipatterns, etc

      This "the only thing that matters about code is whether it meets requirements" is such a tired take, and I can't imagine anyone seriously spouting it has ever had to maintain real software.


It's not skill in talking to an LLM; it's the user's skill and experience with the problem they're asking the LLM to solve. They work better for problems the prompter knows well, and poorly for problems the prompter doesn't really understand.

Try it yourself. Ask Claude for something you don't really understand. Then learn that thing, get a fresh instance of Claude, and try again; this time it will work much better, because your knowledge and experience will be naturally embedded in the prompt you write.

  • Not only your understanding of the how, but also your understanding of the goal.

    I often use AI successfully, but in a few cases I had, it was bad. That was when I didn't even know the end goal and regularly switched the fundamental assumptions that the LLM tried to build up.

    One case was a simulation where I wanted to see some specific property in the convergence behavior, but I had no idea how it would get there in the dynamics of the simulation or how it should behave when perturbed.

    So the LLM tried many fundamentally different approaches and when I had something that specifically did not work it immediately switched approaches.

    Next time I get to work on this (toy) problem, I will let it implement some of them, fully parametrized, and let me have a go with it. There is a concrete goal, and I can play around myself to see if my specific convergence criterion is even possible.

    • LLMs massively reduce the cost of "let's just try this". I think trying to migrate your entire repo is usually a fool's errand. Figure out a way to break the load-bearing part of the problem out into a sub-project, solve it there, iterate as much as you like. Claude can give you a test gui in one or two minutes, as often as you like. When you have it reliably working there, make Claude write up a detailed spec and bring that back to the main project.

    • Yup, same sort of experience. If I'm fishing for something based on vibes that I can't really visualize or explain, it's going to be a slog. That said, telling the LLM the nature of my dilemma up front, warning it that I'll be waffling, seems to help a little.

Also, Claude (and possibly others) sometimes decides to build everything around an obviously bad idea or shoddy architecture, then keeps doubling down into a mess of code. My realization is that I need to be the manager/architect: let it produce the plan, then review and adjust the architecture. Once you get good control of the architecture, there are way fewer bugs, and they're easier to fix. One final thing: hook up observability really early on, and force the LLM to throw all exceptions instead of using "safe fallbacks", which in practice mean "I will swallow everything, and you will need to look at all of the code every time there is a bug."

I review most of the code I get LLMs to write and actually I think the main challenge is finding the right chunk size for each task you ask it to do.

As I use it more I gain more intuition about the kinds of problems it can handle on its own, vs. those I need to break down into smaller pieces before setting it loose.

Without research and planning agents are mostly very expensive and slow to get things done, if they even can. However with the right initial breakdown and specification of the work they are incredibly fast.

You are overestimating the skill of code review. Some people have very specific ways of writing code and solving problems which are not aligned with what LLMs write, but that doesn't mean the LLM's code is wrong.

I know senior developers who are very attached to some nonsense patterns they think are much better than the alternatives. If they see code that doesn't follow them, they say it's trash.

Even so, you can guide the LLM to write the code as you like.

And you are wrong, it's a lot on how people write the prompt.

  • > you are overestimating the skill of code review.

    “You are overestimating the skill of [reading, comprehending, and critically assessing code of a non-guaranteed quality]” is an absurd statement if you properly expand out what “code review” means.

    I don’t care if you code review the CSS file for the Bojangles online menu web page, but you better be code reviewing the firmware for my dad’s pacemaker.

    This whole back and forth with LLM-generated code makes me think that the marginal utility of a lot of code the strong proponents write is <1¢. If I fuck up my code, it costs our partners $200/hr per false alert, which obliterates the profit margin of using our software in the first place.

    • By far, most of the code LLMs write is for crappy CRUD apps and web apps, not pacemakers and rockets.

      We can get enough reliability out of what LLMs produce there via guided integration tests and UX tests, along with code review, using other LLMs to review, and other strategies to prevent semantic and code drift.

      Do you know how many crappy WordPress, Drupal, and Joomla sites I have seen?

      Just that work alone can be automated away.

      But I've also worked in high-end and mission-critical delivery, with more formal verification, etc. That's just moving the goalposts on what AI can do; it will get there eventually.

      Last year you all here were arguing AI couldn't code; now everyone has moved the goalposts to formal, high-end, mission-critical ops. Yes, when money matters we humans are still needed, of course; no one is denying that. It's about the utility of the sole human developer against the onslaught of machine-aided coding.

      This profession is changing rapidly; people are stuck in denial.


I guess it's no coincidence that most of the people saying "LLMs are great for writing code" are non-developers...

I'm relatively forgiving of bugs that I kind of expect to happen... just from experience working with developers, a lot of the bugs I catch from LLMs are exactly the same as those I have seen from real people. The real difference is the turnaround time. I can stay relatively busy just watching what the LLM is doing while it's working... then taking a moment to review more thoroughly when it's done with the task I gave it.

Sometimes, I'll give it recursive instructions, such as "these tests are correct; please re-run the tests and correct the behavior until the tests pass as expected." Usually I'm more specific about the bugs, their nature, and how I think they should be fixed.

I do find that sometimes, when dealing with UI effects, the agent will go down a bit of a rabbit hole... I wanted an image zoom control, and the agent kept trying to do it all with CSS scaling, and the positioning was just broken. Eventually I told it that just using nested divs, scaling the img element itself, and using CSS positioning on the DOM for the positioning/overflow would be simpler, and it actually did it.

I've seen similar issues where the agent will start changing a broken test instead of understanding that the test is correct and the feature is broken... or tell me to change my API/instructions when I WANT it to function a certain way and it's the implementation that is wrong. It's kind of weird, like reasoning with a toddler sometimes.

> Developers who have experience reviewing code are more likely to find problems immediately and complain they aren't getting great results without a lot of hand holding

this makes me feel better about the amount of disdain I've been feeling toward the output from these LLMs. Sometimes it pops out exactly what I need, but I can never count on it not to go off the rails and require a lot of manual editing.

I think that entirely disregarding the fundamental operation of LLMs with dismissiveness is ungrounded. You are literally saying it isn’t a skill issue while pointing out a different skill issue.

It is absolutely, unequivocally, patently false to say that the input doesn’t affect the output, and if the input has impact, then it IS a skill.

I think that code review experience is a big driver of success with the llms, but my take away is somewhat different. If you’ve spent a lot of time reviewing other people’s code you realize the failures you see with llms are common failures full stop. Humans make them too.

I also think reviewable code, that is code specifically delivered in a manner that makes code review more straightforward was always valuable but now that the generation costs have lowered its relative value is much higher. So structuring your approach (including plans and prompts) to drive to easily reviewed code is a more valuable skill than before.

I will still take a glance every once in a while to satisfy my curiosity, but I have moved past trying to review code. I was happy with the results frequently enough that I do not find it to be necessary anymore. In my experience, the best predictor is the target programming language. I fail to get much usable code in certain languages, but in certain others it is as if I wrote it myself every time. For those struggling to get good results, try a different programming language. You might be surprised.

I thought I'd try to debunk your argument with a food analogy. I'm not sure I succeeded, though. Judge for yourself:

It's always easier to blame the ingredients and convince yourself that you have some sort of talent in how you cook that others don't.

In my experience the differences are mostly in how the dishes produced in the kitchen are tasted. Chefs who have experience tasting dishes critically are more likely to find problems immediately and complain they aren't getting great results without a lot of careful adjustments. And those who rarely or never tasted food from other cooks are invariably going to miss stuff and rate the dishes they get higher.

> complain they aren't getting great results without a lot of hand holding

This is what I don't understand - why would I "complain" about "hand holding"? Why wouldn't I just create a Claude skill or the equivalent that tells the agent to conform to my preferences?

I’ve done this many times, and haven’t run into any major issues.

> It's always easier to blame the prompt and convince yourself that you have some sort of talent in how you talk to LLMs that others don't.

Well, it's easily the simplest explanation, right?

Unfortunately it is impossible to ascertain what is what from what we read online. Everyone is different and uses the tools in a different way. People also use different tools and do different things with them. And each person's judgment can be wildly different, like you are saying here.

We can't trust the measurements that companies post either because truth isn't their first goal.

Just use it or don't, depending on how it works out, imo. I personally find it marginally on the positive side for coding.

It's also always easier to blame the LLM when the developer doesn't work with it right.

That seems to make sense. Any suggestions to improve this skill of reviewing code?

I think a number of us more junior programmers especially lack in this regard, and don't see a clear way of improving this skill beyond just using LLMs more and learning with time.

  • It's "easy". You just spend a couple of years reviewing PRs, working in a professional environment, getting feedback from your peers, and experiencing the consequences of your code.

    There is no shortcut unfortunately.

  • You improve this skill not by using LLMs more, but by getting more experienced as a programmer yourself. Spotting problems during review comes from experience: from having learned the lessons, knowing the codebase and the libraries used, etc.

  • Find another developer and pair/work together on a project. It doesn't need to be serious, but you should organize it like it is. So, a breakdown of tasks needed to accomplish the goal first. And then many pull requests into the source that can be peer reviewed.

It's always easier to blame the model and convince yourself that you have some sort of talent in reviewing LLM's work that others don't.

In my experience the differences are mostly in how the code produced by LLM is prompted and what context is given to the agent. Developers who have experience delegating their work are more likely to prevent downstream problems from happening immediately and complain their colleagues cannot prompt as efficiently without a lot of hand holding. And those who rarely or never delegated their work are invariably going to miss crucial context details and rate the output they get lower.

  • Partly true, but I think there's a real skill in catching subtle logic errors in generated code too not just prompting well. Both matter.

That's what I meant, though. I didn't mean "I say the right words", I meant "I don't give them a sentence and walk away".

In my experience the differences are mostly between the chair and the keyboard.

I asked Codex to scrape a bunch of restaurant guides I like, and make me an iPhone app which shows those restaurants on a map color coded based on if they're open, closed or closing/opening soon.

I'd never built an iOS app before, but it took me less than 10 minutes of screen time to get this pushed onto my phone.

The app works, does exactly what I want it to do and meaningfully improves my life on a daily basis.

The "AI can't build anything useful" crowd consists entirely of fools and liars.