Comment by daitangio

1 year ago

Software metrics are hard, indeed :) Be prepared, in an AI-code world, for the fact that more code does not mean better code.

I've been watching my colleagues' adoption of Copilot with interest. From what I can tell, the people who are the most convinced that it improves their productivity have an understanding of developer productivity that is very much in line with that of the managers in this story.

Recently I refactored about 8,000 lines of vibe-coded bloat down into about 40 lines that ran ten times as fast, required 1/20 as much memory, and eliminated both the defect I was tasked with resolving and several others that I found along the way. (Tangentially, LLM-generated unit tests never cease to amaze me.) The PHBs didn't particularly appreciate my efforts, either. We've got a very expensive Copilot Enterprise license to continue justifying.

  • I see a stratified software market in the future.

There will be vibe-coded, amateur, banged-out hustle trash, which will be the cheap plastic cutlery of the software world.

There will be code lovingly hand-crafted by experts (possibly using some AI, but in the hands of someone who knows their shit) that will be like the fine stuff and will cost many times more.

A lot of stuff will get prototyped as crap and then, if it gets traction, reimplemented with quality.

• Back in the day, if you went to a website you could always tell who wrote their HTML by hand and who used a tool like GruntPage, Dreamweaver, etc., even without looking at the META tags. The by-hand stuff was like a polished jewel that had only as much layout, styling, and markup as needed to get the desired effect. The proprietary web editor stuff was encrusted with extraneous tags and vendor-specific extensions (like mso: attributes and styles in GruntPage).

Then as now, if you let the machine do the thinking for you, the result was a steaming mess. Up to you if that was acceptable (and for many, it was).


    • This was said about large frameworks like electron on the desktop, but outside of some specific technical niches it literally doesn’t matter to end users.


    • A beautiful vision.

If the vision were true, we should see it happen with normal goods too. Quality physical goods do not beat the shit goods in the market: crap furniture is the canonical example (with blog articles discussing the issue).

      Software (and movies) is free for subsequent copies, so at first sight you might think software is completely different from physical goods.

      However for most factory produced goods, designing and building the factory is the major cost. The marginal cost of producing each copy of an item might be reasonably low (highly dependent on raw materials and labor costs?).

      Many expensive physical goods are dominated by the initial design costs, so an expensive Maserati might be complete shit (bought for image status or Veblen reasons, not because it is high quality). There's a reason why the best products are often midrange. The per unit 2..n reproduction cost of cheap physical goods is always low almost by definition.

      Some parts of iPhone software are high quality (e.g. the security is astounding). Some parts are bad. Apple monetisation adds non-optional features that have negative value to me: however those features have positive value to Apple.


  • I don’t believe your numbers unless your colleagues are exceptionally bad programmers.

    I’m using AI a lot too. I don’t accept all the changes if they look bad. I also keep things concise. I’ve never seen it generate something so bad I could delete 99 percent of it.

• 90%+ is a stretch. Anecdotally, I have cleaned up a vibe-coded PR and removed at least half of the code. The thing with LLMs is that they will often make some decision up front that has downstream ramifications for how the entire project's code is structured. I don't think I've seen an LLM revisit its assumptions; instead they code around them.

In the case I saw, it was Rust code and the LLM typed some argument as an Arc<Mutex<_>> when it absolutely did not need to, which caused the entire PR to inflate. The vibe coder apparently didn't catch this and just kept on vibing... Technically the code did what it needed to do, but it was super inefficient.

      It would have been easy for me to just accept the PR. It technically worked. But it was garbage.
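      A minimal sketch of that pattern (hypothetical function names, not the actual PR): once one argument is wrapped in Arc<Mutex<_>> unnecessarily, every caller has to clone the Arc and take the lock, and the ceremony spreads through the code.

      ```rust
      use std::sync::{Arc, Mutex};

      // Over-wrapped version: the value is never actually shared across
      // threads, but every caller must now produce an Arc and take the lock.
      fn sum_wrapped(data: Arc<Mutex<Vec<u64>>>) -> u64 {
          let guard = data.lock().unwrap(); // lock taken for a purely local read
          guard.iter().sum()
      }

      // The same logic with a plain borrow: no allocation, no locking,
      // and callers keep their own ownership structure.
      fn sum_borrowed(data: &[u64]) -> u64 {
          data.iter().sum()
      }

      fn main() {
          let data = vec![1u64, 2, 3];
          assert_eq!(sum_borrowed(&data), 6);
          assert_eq!(sum_wrapped(Arc::new(Mutex::new(data))), 6);
      }
      ```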


    • I've never seen 8000 -> 40, but I have done ~10 kLoC -> ~600.

      Aggggressively "You can write Java in any language" style JavaScript (`Factory`, `Strategy`, etc) plus a whole mini state machine framework that was replaceable with judicious use of iterators.

      (This was at Google, and I suspected it was a promo project gone metastatic.)
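      The original was JavaScript, but the same shift sketches easily in Rust (a hypothetical toy, not the actual Google code): a hand-rolled state-machine type versus the equivalent iterator chain.

      ```rust
      // "Framework" style: an explicit struct tracking its own cursor state,
      // just to keep the even numbers and double them.
      struct EvenDoubler<'a> {
          data: &'a [i32],
          pos: usize,
      }

      impl<'a> Iterator for EvenDoubler<'a> {
          type Item = i32;
          fn next(&mut self) -> Option<i32> {
              while self.pos < self.data.len() {
                  let v = self.data[self.pos];
                  self.pos += 1;
                  if v % 2 == 0 {
                      return Some(v * 2);
                  }
              }
              None
          }
      }

      // Judicious use of iterators: same behavior, no custom type,
      // no manual position tracking.
      fn even_doubled(data: &[i32]) -> Vec<i32> {
          data.iter().copied().filter(|v| v % 2 == 0).map(|v| v * 2).collect()
      }

      fn main() {
          let input = [1, 2, 3, 4];
          let by_machine: Vec<i32> = EvenDoubler { data: &input, pos: 0 }.collect();
          assert_eq!(by_machine, even_doubled(&input)); // both yield [4, 8]
      }
      ```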

• The original used a Shlemiel the painter algorithm, a whole bunch of "enterprise" coding patterns, and its own implementations of a bunch of things we already had, including domain objects, which meant that a whole bunch of excess glue code was needed to interface with the rest of the system.
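      For anyone unfamiliar with the reference: a Shlemiel-the-painter algorithm redoes all prior work on every step. A hypothetical Rust sketch of the shape (not the original code), using repeated string building:

      ```rust
      // Shlemiel style: each append copies the entire accumulated string
      // again, so joining n pieces costs O(n^2) total copying.
      fn join_shlemiel(pieces: &[&str]) -> String {
          let mut out = String::new();
          for p in pieces {
              out = format!("{}{}", out, p); // re-copies everything so far
          }
          out
      }

      // Linear style: push_str appends in place, amortized O(1) per byte.
      fn join_linear(pieces: &[&str]) -> String {
          let mut out = String::new();
          for p in pieces {
              out.push_str(p);
          }
          out
      }

      fn main() {
          let pieces = ["to", "be", "or", "not"];
          assert_eq!(join_shlemiel(&pieces), join_linear(&pieces));
          println!("{}", join_linear(&pieces)); // prints "tobeornot"
      }
      ```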

  • Every now and then, in between reasonable and almost-reasonable suggestions, Copilot will suggest a pile of code, stylistically consistent with the function I'm editing, that extends clear off the bottom of the page. I haven't been inspired to hit tab a couple times and try to reverse engineer the resulting vomit of code, but I can easily imagine a new programmer accepting the code because AI! or, perhaps worse, hitting tab without even noticing.

  • "8,000 lines of vibe-coded bloat down into about 40 lines" ... I just saw a vision of my future and shuddered.

    I mean, I like killing crappy code as much as the next guy, but I don't want that to be my daily existence. Ugggh.

  • > Tangentially, LLM-generated unit tests never cease to amaze me.

    In a good or bad way?

I've found AI pretty helpful for writing tests, especially if you already have an existing one as a template.

    • I guess it depends on how much you like things like well-obfuscated smoke tests and mocks that don't accurately simulate relevant parts of the behavior of the module they're mocking.

  • I would love to know the time balance between the two activities. It takes nothing to generate slop, but it could take weeks to extricate it.