Comment by snarf21

7 hours ago

I feel the same but the question I struggle most with is this: "Does it matter when the people who are going to come along and maintain this are just going to use AI to fix or adjust this maintenance nightmare?"

At that point the code becomes a compile target, and then you need a new source of truth.

Which I think is perfectly worthy of exploration. Some people want to check in the prompts. Or even better, check in a plan.md or evenest betterest: some set of very well-defined specifications.

I'm not sure what the answer will be. Probably some mix of things. But today it is absolutely imperative that the code I write for the case I wrote it in is good quality and can be maintained by more than just me.

  • When we want to maintain a reliable, stable "product" in traditional software development (a binary executable artifact that ships out to users, or the binary engine of some SaaS the company sells to users), we don't just check in (to the source of truth repo) the actual application-layer source code. We also check in build instructions (think autoconf/cmake/etc) and have some concept of compiler compatibilities / versions, build environments, and papering over their runtime differences. And then our official executable output is not just defined by "Tag v1.23.45 of the application source code repo" - it's additionally defined by the build environment (including, critically, the compiler version, among many others).

    It's tempting to move out a layer and try making prompts and plan.md the "source code", and then the generated actual-source-code becomes just another ephemeral form of "intermediate representation" in the toolchain while building the final executable product. But then how are you versioning the toolchain and maintaining any reasonable sense of "stability" (in terms of features/bugs/etc) in the final output?

    Example: last week, someone ran our "LLM inputs" source code through AgentCo SuperModel-7-39b, and produced a product output that users loved and it seemed to work well. Next week, management asks for a new feature. The "developer" adds the new feature to the prompting with a few trial iterations, but the resulting new product now has 339 new subtle bugs in areas that were working fine in last week's build owing the fact that, in the meantime, AgentCo has tweaked some weights in SuperModel-7-39b under the hood because of some concern about CSAM results or whatever and this had subtle unrelated effects. Or better yet: next month, management has learned that OtherCo MegaModel-42.7c seems to be the new hotness and tells everyone to switch models. Re-building from our "source" with the new model fixes 72 known bugs filed by users, fixes another 337 bugs nobody had even noticed yet, and causes 111 new bugs to be created that are yet-unknown.

    If you treat the output source code as a write-only messy artifact, and you don't have stable, repeatable models, and don't treat model updates/changes as carefully as switching compiler vendors and build environments, this kind of methodology can only lead to chaos.

    And don't even get me started on the parallel excuses of "Your specifications should be more-perfect" (perfection is impossible), or "An expansive testsuite should catch and correct all new bugs" (also impossible. testing is only as good as the imperfect specification, and then layers in its own finite capabilities to boot).

  • I don't see the benefit of checking in either prompts or specs.

    I never tried spec driven development for myself, but if I review other's MRs I am typically exhausted after the first 10 lines.

    And there are hundreds of lines, nearly always with major inaccuracies.

    For myself I always found the plan mode to work well. Once the implementation is done, the code is the source of truth. If it works, it works.

    When I want to add more functionality or change it, I just tell the agent what I want changed.

    I doubt walls of semi-accurate existing specs are going to be beneficial there, but maybe my work differs from yours.

    • Those checked-in specs become the requirements for the system. So the next time you ask the AI to make a fix, it can use those specs as part of the solution and not break another requirement. Basically the code underneath keeps getting rewritten over and over, but that doesn't matter as long as it hits the required specs.

      1 reply →

    • I value traceability, and I value understanding the "why" of the code. For me, the prompts are useful for both.

Same. Messy code makes it harder for us to understand and thus maintain the code (which is why people often refer to code as a liability), but is that the case for AI tools as well? If not, it seems like clean code may not matter as much anymore.

The problem with crappy frontend code is not only the maintenance. It's that stuff such as responsive design, accessibility or cross-browser compatibility that work nearly for free with elegant code won't work at all.

The problem is that technical debt is compounding. Bad LLM architectural and implementation decisions just blend in to the background and you build layer upon layer of a mess. At some point it becomes difficult and expensive (token wise) to maintain this code, even for an agent.

I mitigate this by few things: 1. Checkpoints every few days to thoroughly review and flag issues. Asking the LLM to impersonate (Linus Torvalds is my favorite) yields different results. 2. Frequent refactors. LLMs don't get discouraged from throwing things out like humans do. So I ask for a refactor when enough stuff accumulates. 3. Use verbose, typed languages. C# on the backend, TypeScript on the frontend.

Does it produce quality code? Locally yes, architecturally I don't know - it works so far, I guess. Anyway, my alternative is not to make this software I'm writing better but not making it at all for the lack of time, so even if it's subpar it still brings business value.