Comment by ryandrake (2 days ago)

I've been trying to open my mind and "give AI a chance" lately. I spent all day yesterday struggling with Claude Code's utter incompetence. It behaves worse than any junior engineer I've ever worked with:

- It says it's done when its code does not even work, sometimes when it does not even compile.

- When asked to fix a bug, it confidently declares victory without actually having fixed the bug.

- It gets into this mode where, when it doesn't know what to do, it just tries random things over and over, each time confidently telling me "Perfect! I found the error!" and then waiting for the inevitable response from me: "No, you didn't. Revert that change".

- Only when you give it explicit, detailed commands ("modify fade_output to be -90") does it actually produce decent results, and by the time I've reached that level of detail, I might as well be writing the code myself.

To top it off, unlike the junior engineer, Claude never learns from its mistakes. It makes the same ones over and over and over, even if you include "don't make XYZ mistake" in the prompt. If I were an eng manager, Claude would be on a PIP.

Recently I've used Claude Code to build a couple TUIs that I've wanted for a long time but couldn't justify the time investment to write myself.

My experience is that I think of a new feature I want, I take a minute or so to explain it to Claude, press enter, and go off and do something else. When I come back in a few minutes, the desired feature has been implemented correctly with reasonable design choices. I'm not saying this happens most of the time, I'm saying it happens every time. Claude makes mistakes but corrects them before coming to rest. (Often my taste will differ from Claude's slightly, so I'll ask for some tweaks, but that's it.)

The takeaway I'm suggesting is that not everyone has the same experience when it comes to getting useful results from Claude. Presumably it depends on what you're asking for, how you ask, the size of the codebase, how the context is structured, etc.

  • It's great for demos; it's lousy for production code. The different cost of errors in these two use cases explains (almost) everything about the suitability of AI for various coding tasks. If you are the only one who will ever run it, it's a demo. If you expect others to use it, it's not.

    • As the name indicates, a demo is used for demonstration purposes. A personal tool is not a demo. I've seen a handful of folks assert this definition, and it seems like a very strange idea to me. But whatever.

      Implicit in your claim about the cost of errors is the idea that LLMs introduce errors at a higher rate than human developers. This depends on how you're using the LLMs and on how good the developers are. But I would agree that in most cases, a human saying "this is done" carries a lot more weight than an LLM saying it.

      Regardless, it is not good analysis to try to do something with an LLM, fail, and conclude that LLMs are stupid. The reality is that LLMs can be impressively and usefully effective on certain tasks in certain contexts, they can be very ineffective in others, and they are especially bad at knowing whether they've actually done something correctly.

Learning to use Claude Code (and similar coding agents) effectively takes quite a lot of work.

Did you have it creating and running automated tests as it worked?
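
For context, here's the kind of thing I mean: a standing instruction in CLAUDE.md plus a test command it can actually run. Something like this (the commands here are placeholders for whatever your project uses):

    ## Definition of done
    - After every change, run the tests: python -m pytest -x
    - A task is not complete until the code compiles and every test passes.
    - If a test fails, fix it and re-run before reporting back.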

  • > Learning to use Claude Code (and similar coding agents) effectively takes quite a lot of work.

    I've tried to put in the work. I can even get it working well for a while. But then all of a sudden it is like the model suffers a massive blow to the head and can't produce anything coherent anymore. Then it is back to the drawing board, trying all over again.

    It is exhausting. The promise of what it could be is really tempting fruit, but I am at the point where I can't find the value. The time I put into making it work is not being multiplied in return.

    > Did you have it creating and running automated tests as it worked?

    Yes. I work in a professional capacity. This is a necessity regardless of who (or what) is producing the product.

> - It says it's done when its code does not even work, sometimes when it does not even compile.

> - When asked to fix a bug, it confidently declares victory without actually having fixed the bug.

You need to give it ways to validate its work. A junior dev will also hand you code that doesn't compile, or a "fix" that doesn't actually fix the bug, if they never compile the code and test that the bug is truly gone.
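
For example, you can encode the bug as a failing test and tell the agent the job is only done when the test passes. A minimal, self-contained sketch in Python (the fade function and the -90 dB figure are invented for illustration, borrowing the fade_output example above):

    # test_fade.py -- regression test that encodes the bug report.
    # Everything here is a made-up stand-in, not anyone's real code.

    def fade_output_level(t: float) -> float:
        """Toy stand-in for a real fade curve; returns level in dB."""
        return -90.0 * min(max(t, 0.0), 1.0)

    def test_fade_reaches_floor():
        # Bug report: the fade should bottom out at -90 dB at the end.
        assert fade_output_level(1.0) == -90.0

    if __name__ == "__main__":
        test_fade_reaches_floor()
        print("ok: fade reaches -90 dB")

Now "done" means the test passes, which is machine-checkable, rather than the agent's own self-report.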

  • Believe me, I've tried that, too. Even after giving detailed instructions on how to validate its work, it often fails to do it, or it follows those instructions and still gets it wrong.

    Don't get me wrong: Claude seems to be very useful if it's on a well-trodden train track and never has to go off the tracks. But it struggles when its output is incorrect.

    The worst behavior is this "try things over and over" mode, which is also very common among junior developers and is one of the habits I try to break real humans of, too. I've gone so far as to put this into the root CLAUDE.md system prompt:

        --NEVER-- try fixes that you are not sure will work.

        --ALWAYS-- prove that something is expected to work and is the correct fix, before implementing it, and then verify the expected output after applying the fix.

    ...which is a fundamental thing I'd ask of a real software engineer, too. Problem is, as an LLM, it's just spitting out probabilistic sentences: it is always 100% confident of its next few words. Which makes it a poor investigator.
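
    A toy illustration of the point (plain Python, nothing to do with any real model's internals): generation is just repeated sampling from a next-token distribution, and nothing in that loop checks whether the claim being spelled out is actually true.

        import math, random

        # Toy next-token step: logits -> softmax -> sample one token.
        # Nothing here verifies the claim the tokens are spelling out.
        def sample_next_token(logits: dict[str, float]) -> str:
            z = max(logits.values())
            exps = {tok: math.exp(v - z) for tok, v in logits.items()}
            total = sum(exps.values())
            r, acc = random.random(), 0.0
            for tok, e in exps.items():
                acc += e / total
                if r < acc:
                    return tok
            return tok  # fallback for floating-point rounding

        # "Perfect! I found the <token>" -- the continuation is fluent
        # whether or not an error was actually found.
        print(sample_next_token({"error!": 2.0, "bug!": 1.5, "problem!": 0.5}))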