Comment by akersten

6 hours ago

> The code it generated was awful. The kind of garbage that people who don’t know any better would ship: it looked right and it worked. But it was instantly a maintenance dead end.

In the Tailwind thread the other day I was explicitly told that the intended experience of many frameworks is "write-only code" so maybe this is just the way of the future that we have to learn to embrace. Don't worry how it's all hooked up, if it works it works and if it stops working tell the AI to fix it.

It's kind of liberating I guess. I'm not sure if I've reached AI nirvana on accepting this yet, but I do think that moment is close.

I'm pretty you wouldn't want the same for code that runs healthcare, banks or transport. Only useless shitty web projects could embrace what's you're saying. And no there's no "Claude review the code and improve it" magical formula

  • I work in the health software space and there are tons of internal tools which aren't production code that can benefit massively from throwaway "write-only code". Putting a web UI on top of a management CLI tool so support ops can run things without needing an oncall engineer can be a huge win. I recently built a testing UI that doubles a demo-scenario-setup tool. Is it well-engineered? Who cares - it pokes the right things into the database and runs the right backend tasks, and has helped me catch and fix dozens of real bugs in the UIs that customers see.

    There is an enormous untapped market for crappy low-effort apps which previously weren't worth the time - but with the effort so low put together a simple dashboard or one-off tool it becomes much more attractive.

The problem is it’s impossibly hard to test all the edge cases

Which is probably why so many random buttons in microsoft/apple/spotify just stop working once you get off the beaten path or load the app in some state which is slightly off base

  • The problem is worse than that.

    The number of edge cases in a software is not fixed at all. One of the largest markers of competence in software development is being able to keep them at minimum, and LLMs tend to make that number higher than humanely possible.

  • Yeah, the biggest thing I've noticed from LLMs is that large tech products now have even more bugs. Turns out the humans weren't so bad after all...

    • > Turns out the humans weren't so bad after all...

      The people pushing AI _over_ humans never thought they were. They just don't care about 'good' or 'bad', only 'time-to-market'. A bad app making money is better than a good one that isn't deployed yet. And who cares about anything past the end of the quarter? That's the next guy's problem.

      1 reply →

    • I'm wondering if companies are 'diverting' engineering resources from core products to AI products with the view that the former are legacy. Kind of two sides of the same coin though.

      1 reply →

Easy, have Claude review the code, tell it to be critical and that it needs to be easier to understand, follow Clean Code, SOLID principles and best practices. Lie to it, say you got this from a Junior developer, or "review it as if you were a Staff Level Engineer reviewing Junior code" the models can write better code, just nobody tells them to.

  • Lol, the only thing worse than a junior developer following Clean Code and SOLID has to be an LLM messing with code so it looks like it follows.

    • Clean Code has its really "meh" areas, but the core idea and spirit of it is sound, heck Python's best guide is PEP-8 if you follow that, it forces you to write much better Python code.

      In terms of "junior dev following" it would be the model trying to think and write it as a Senior or Staff Level engineer would.

  • Code review is the main thing I use LLMs for. I have found it to be remarkably candid when you tell it the code came from another LLM (even name it). I was running Kimi K2.6 Q4 locally, seeing if it could SIMD a bit-matrix transpose function, and it was slow enough that I would paste its thinking into Gemini every few minutes. Gemini was savage.

    • > Gemini was savage.

      Humorously, this could be the result of LLMs vacuuming up all the sentiment on the web that the code that LLMs produce is trash-tier.

  • This is it. I've had a similar experience in just playing around I asked it to clean up some code it wrote to increase maintainability and readability by humans. After a few iterations it had generated quite solid code. It also broke the code a couple of times along the way. But it does get me thinking that these pipelines with agents doing specific tasks makes a lot of sense. One to design and architect, one to implement, one to clean, one to review, one to test (actually there's probably a bunch of different agents for testing -- testing perf/power, that it matches the requirements/spec, matches the design, is readable/maintainable, etc...).

    • I built GuardRails after some frustrations with Beads which I love, and this whole exchange made me realize, because I have "gates" after tasks, I could add a "Review the code" type of gate, and probably get insanely better output, I already get reasonably good output because I spec out the requirements beforehand, that's the other thing, if you can tell the LLM HOW to build before it does, you will have better output.

  • Why wouldn't Claude just impose this same loop in the code it writes - or better, write better code before it needs such review?

    • Because language models don’t think before doing, they think by doing.

      Maybe a more idealized training set could improve things, but at least for today’s SOTA, you have to get the shitty first draft out and then improve it.

      Harnessing makes a difference, but it’s only shuffling around when and where the tokens get generated. It can trade being slower by doing a hidden first draft and only showing the output after doing a self review. But the models still need to generate it all explicitly.

    • Why would it? It doesn't do anything with intention without being prompted. When you ask it to do something it's going to give you what seems like the most likely result, it isn't striving to give you the most correct result, those things just have some overlap.

    • I assume it would involve wasting a lot more tokens reasoning about this. It is known that GPT uses less tokens than Claude, but Claude uses them to reason about problems more, which is part of its "secret sauce" and why so many swear by Claude Code.

  • Even better, if you have access to multiple models, tell it you got the code from another AI agent.

    I did an experiment on this a few weekends ago and Codex for example was a lot more adversarial and thorough in its review when given Claude-authored code compared to when given the same code with "I wrote this, can you review it?"

    • If it's within its context window, it will know you're lying, so either compact or start a new chat (don't do this on Claude, it dings your usage, always has).

  • Is this a joke? Smartest people on the planet never thought about telling AI to just write better code?

    • Kind of wild that you have to tell an LLM things like "do it right" and "make the code maintainable" and "don't make mistakes". Shouldn't that be the default? I wouldn't accept a calculator application that got math wrong unless you pressed a button labeled "actually solve the problem."

      2 replies →

    • You forget that this all takes tokens from the model, so it has to be very stingy and whatever it comes up with "first" is what it goes with. I've seen people do the same as me, tell the model NOT TO GUESS but to do research first, which yields better output and saves time. Models today are better when they review the context directly, the focus shifted from it knowing everything in its training data to being able to dynamically learn new things and use that information in a meaningful way.

      For example, I built up a programming language from scratch with Claude, it knows nuances about my languages syntax, and can write code in my language effectively. I did it mostly as a test. It definitely helped that my language is heavily mostly Python based.

I have been wondering recently that if the cost of just throwing everything out and building it from scratch again gets low enough, maybe maintainability becomes less of a priority? Can we just embrace the thing like those Zen carpenters who build wooden fire shrines do where they just accept that the thing will keep burning down and they make a discipline around getting really good at rebuilding it?

Granted, the load bearing thing here is whether we’re actually getting good at rebuilding up to any sort of standard of quality. Or if the tooling is even structurally capable of doing that rather than just introducing new baskets of problems with each build.