Comment by _0ffh

2 days ago

Thoughtful comment, and I'd like to add another angle: However meaningful it is to say the community is divided, I think that individuals are "divided" on the question as well.

I can speak for myself as an example (although n=1): I am incredibly open to machine learning and the advances it brings. On the other hand I am extremely conscious of the fact that the current LLMs do often write bad code, which becomes especially obvious once projects go much beyond "private toy" size. For me personally the consequence is that I try to make my projects even more modular, and the modules even more clearly delineated. With proper guidance, current LLMs work mostly fine when working on isolated modules. That doesn't mean they don't sometimes fail even there, but they also on rare occasions come up with surprisingly clever solutions when you let them loose on a code base, as long as the problem you want them to fix is mostly isolated in a couple of modules.
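To make "clearly delineated" a bit more concrete, here is a minimal sketch of what such a module boundary could look like. All names (`RateSource`, `FixedRates`, `convert`) are hypothetical and only there for illustration; the idea is just that callers depend on a small interface, so the implementation behind it can be handed to an LLM (or rewritten by hand) in isolation.

```python
from typing import Protocol


class RateSource(Protocol):
    """The entire surface that other modules are allowed to depend on."""

    def rate(self, currency: str) -> float:
        """Return how many units of `currency` one unit of the base currency buys."""
        ...


class FixedRates:
    """One interchangeable implementation; its internals can change freely."""

    def __init__(self, table: dict[str, float]) -> None:
        self._table = dict(table)  # copy, so callers cannot mutate our state

    def rate(self, currency: str) -> float:
        try:
            return self._table[currency]
        except KeyError:
            raise ValueError(f"unknown currency: {currency}") from None


def convert(amount: float, currency: str, rates: RateSource) -> float:
    """Caller code sees only the RateSource interface, never FixedRates itself."""
    return round(amount * rates.rate(currency), 2)
```

With a boundary like this, a task such as "add caching to the rate lookup" touches one class and nothing else, which is roughly the kind of isolated problem described above.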

So long story short, you can be all for LLMs and still be conscious of their shortcomings, and still think that just vibe coding applications you intend to let loose on unsuspecting customers is probably a bad idea, and possibly outright immoral. We have already seen more than enough examples of vibe coded applications dumping sensitive user data into the lap of anyone who is inclined to pick up a stick and prod them a little bit.

There’s also the uncomfortable reality that a lot of people spent a lot of time learning how to do things that anyone can prompt now…

We tend to ignore that reality but it hangs above all these discussions like a dirty cloud

  • That does not seem like it should be controversial. I have known for quite a long time that a motivated high school kid could learn to code well enough to be perfectly useful. Especially with popular languages like Python.

    This is just an object lesson, for everyone who thought that all software guys did was write code, that there is a little bit more to the job than that. Actual developers should already have known.

Yep, I use it every day. IMO it's a bigger long-term productivity lever for someone who knows good code than it is for someone who doesn't look at the code (at least for systems that are expected to have lots of users).

For those sorts of systems, comments like this just remind me of people pitching UML + codegen-or-outsourcing 20 years ago.

> Users don’t care whether the code was written by AI or by hand, or which framework you used. They care that the product works.

> I say this as someone who has spent more than 20 years honing their craft as a software engineer.

I've worked with outsourced code, and I've worked with similarly-messy/not-forward-thinking first-iteration startup-MVP code that found success and now needs to do more and be more reliable while being cheaper to operate. It ain't pretty.

Once you start modifying working code that people rely on, you quickly start to see that the code itself matters. There aren't enough tests in the world to get around this from a quality POV. And piles of copypasta turn into either inconsistent behaviors (bad from a user satisfaction POV) or a continual drag on velocity (bad from a product goals/competitiveness POV). And I've run the tools on those sorts of "found MVP, now need to iterate fast" codebases, both on the messy parts and the better parts.

Sure, Claude can chew through the pain points of "let me find the ten copypasta versions of this and try to update them all" and chase down all the weird stupid bugs/janky things that that requires because each of them evolved separately much faster than a human can. But it's not gonna be as fast as it was at writing the first version, and it's gonna get more and more annoying. You hit the point where your agent is churning for hours and running more and more tests and inadvertently breaking more things trying to make the change in all those places, and making MORE code changes to fix those (and those are just the caught broken things the tests cover!).

It's wayyyyy faster and less potentially painful to have it make updates to a well-factored module. "You don't have to read the code" is all well and good until you have a 10k+ line PR that's 10x larger than it needs to be because nobody read the code in the past, and you realize that most of the relevant test files also changed substantially and you don't have that much known-unchanged permutation coverage of the actual things users do... how comfortable are you pushing the "ship it" button then?
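A toy illustration of the difference, with made-up names (`normalize_email`, `register_user`, `find_user`): when a rule lives in one helper, changing it is a one-line diff rather than a hunt through every near-copy that evolved separately.

```python
from typing import Optional


def normalize_email(raw: str) -> str:
    """Single source of truth for how an email address is canonicalized."""
    return raw.strip().lower()


def register_user(users: dict[str, str], email: str, name: str) -> None:
    """Store a user under the canonical form of their email."""
    users[normalize_email(email)] = name


def find_user(users: dict[str, str], email: str) -> Optional[str]:
    """Look a user up using the same canonical form used at registration."""
    return users.get(normalize_email(email))
```

In the copypasta version, each of these functions would carry its own slightly different `strip()`/`lower()` logic, and a change to the rule would have to land in all of them at once.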

  • > You hit the point where your agent is churning for hours and running more and more tests and inadvertently breaking more things trying to make the change in all those places

    I've had this experience with vibe-coded applications written by Claude itself. For some reason, Claude doesn't seem to use any good practices unless I tell it to, and seems to require me to walk it through a decent design up front.

I read an article not long ago which I cannot find for the life of me now, but the thing in it that stuck with me was the idea that people (in the context, career software developers) were not either AI-pilled or anti-AI, but rather that people are both excited and afraid at the same time.

Yes, I’m in this camp. I’ve been pushing forward some personal projects lately using LLMs. At first, I was delighted at how productive I was, prompting. But over time a lot of cracks have started to show. Claude is good at programming “in the small”. It’s good at getting self contained, well scoped tasks done. But it’s bad at large scale system thinking. Over time, every project I’ve gotten Claude to write has become riddled with poor design choices layered on top of one another until even Claude struggles to make forward progress. And at that point, what do you do? You’ve gotta read all its code. Something I’ve learned I should just do from the start to save myself a lot of time later on.

It’s also strangely bad at correctness. You can ask it to write unit tests for a project. Unless you’re careful with your prompting, it will only write the unit tests that it knows will pass.
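As a rough sketch of what that looks like in practice (the function and test names here are hypothetical, not from any real project): the first test below is the kind that tends to get written unprompted, while the second is the kind you usually have to ask for explicitly.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching closed intervals given as (start, end) pairs."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps (or touches) the previous interval: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def test_happy_path():
    # Restates the obvious case the implementation was written around.
    assert merge_intervals([(1, 3), (2, 4)]) == [(1, 4)]


def test_edge_cases():
    # Cases chosen to try to break the code rather than confirm it.
    assert merge_intervals([]) == []                                # empty input
    assert merge_intervals([(5, 6), (1, 2)]) == [(1, 2), (5, 6)]    # unsorted input
    assert merge_intervals([(1, 10), (2, 3)]) == [(1, 10)]          # nested interval
    assert merge_intervals([(1, 2), (2, 3)]) == [(1, 3)]            # touching endpoints
```

Both tests pass here because the implementation is correct; the point is that only the second one would have caught, say, a version that forgot to sort or that used `end` instead of `max(...)` when extending.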

Generally, these models are amazing tools. But they cannot be trusted to make correct, maintainable software. At least not yet. Maybe in another year or two.

  • > Claude is good at programming “in the small”. It’s good at getting self contained, well scoped tasks done. But it’s bad at large scale system thinking. Over time, every project I’ve gotten Claude to write has become riddled with poor design choices layered on top of one another until even Claude struggles to make forward progress.

    What's really fun about these tools is that this is both true and false!

    If you ask these tools to reason about things in-the-large you often get very useful information back out of them.

    I've asked about refactors I was thinking about doing, and gotten accurate and useful information back, which has been AMAZINGLY helpful for avoiding "let me start doing this, and then realize 4 hours in that it's not gonna work as well as I hoped" traps.

    But it's a very attention-on-one-thing-at-a-time thing. IMO this is fairly inherent to the models, but people have been doing great work making the tooling around them smarter in terms of how to break up tasks ahead of time to compensate, so I'm not gonna say it won't get materially better.

    So if you prompt it to do a task in a certain way, especially in a "plan mode" type of usage, you can get a pretty solid recipe + execution of a properly-designed implementation of that task.

    But if you're not opinionated and checking in frequently, you're gonna get the sorta median-approach or random-luck-output-of-the-day decision. And so the human-in-the-loop point is unlikely to go away as long as the human has more context. Even if it's half-baked or not-fully-realized intuition about how the code is likely to evolve in the future that you don't put into every prompt.

    > It’s also strangely bad at correctness. You can ask it to write unit tests for a project. Unless you’re careful with your prompting, it will only write the unit tests that it knows will pass.

    My hunch is that this is the same fundamental problem. When its attention is fully on "produce the next string of code" the parts of the context that relate to the broader system goals are NOT being considered as much for the output. So you get things like this, even with latest Opus still, when dealing with hard-to-isolate-in-a-single-test bugs (esp when it comes to multi-service call sequences or concurrent code):

    - "we need to fix this bug across eight methods in three files"

    - "I found the spot! we need to do [blah blah blah]"

    - "great, implement that plan"

    - "I've done it!"

    - "wait a sec... you moved some of the sequencing around, but didn't actually fix the fundamental issue"

    - "you're right! i moved [xyz] into [func b] instead of [func a] since it needed to be called later, but actually it needs to be after [func c] since it depends on the output of func b!"

    When asked about correctness it's good enough at "reasoning"-style output to spot these issues, but when generating code it's in such a pure "predict plausible code sequences" mode that this can get lost.

    • Yeah I hear you with all of this.

      It's weird, working with LLMs. There are some things the LLMs are extremely good at doing autonomously. Like, reverse engineering, or reading documentation (and using that knowledge in other areas). There are things that they can do - but you need to explicitly prompt. Like, I've found Opus is quite good at optimising code. I've had a lot of success by asking it to write benchmarks (and do profiling), and use that data to improve the performance of some piece of code. That's often enough to get quite large performance improvements. You can get even further by showing it similar code others have written which is well optimised. It's very good at copying optimisation ideas from one project to another.

      But then there are very simple things it really struggles to do. Some kinds of correctness testing. Invariants. System design.

      Is it bad at that stuff, or do I just need to figure out how to prompt the LLM? To return to the topic of this thread, I think we're seeing a lot of different opinions on LLM generated code for 3 reasons:

      1. Some people aren't looking at Claude's output at all. Some people are looking at the code and it looks fine to them. And some people (with more experience writing software) are looking at the code and judging it poorly.

      2. We all prompt our LLMs very differently! It turns out that you get really different results based on how you prompt the machine. We're all figuring this out together. Some people have better instincts than others.

      3. We're working on different projects. Claude is comparatively much better at end-user facing software. It's great at making a standalone website. It's much less good at finding and fixing obscure bugs in large, established pieces of software. If you work in consulting, LLMs can already do a lot of your job. If you work on Chrome or Unreal or the Windows kernel, maybe not so much.


I'm a divided individual.

I've spent an obscene number of hours learning how to get reliably good quality code out of these things. I'm actually very happy with where the tech is right now and can't imagine ever going back to typing code by hand.

But I absolutely hate how companies and society at large are acting because of this stuff. It feels like all rationality has flown out the window. So I'm just staying in my sandbox with my little toys and hoping the mass psychosis blows over at some point.