Comment by noelwelsh

2 days ago

I wish people would describe in more detail the tasks they use LLMs to code. My experience is that simple components in an existing architecture are fine, but anything requiring architectural considerations quickly becomes a mess. On my projects (e.g. a ui framework), running multiple agents in parallel would just increase the speed at which it can stuff up the project.

I get this question a lot, and I found it hard to answer briefly, so I ended up writing a longer post about how I work:

https://www.trigosec.com/insights/mob-programming-for-one/

The short version is that I don’t let AI agents work unsupervised on my code. I treat them like participants in a mob programming session instead of autonomous developers. Different agents get different roles (implementer, reviewer, architect, security reviewer, etc.), and I stay involved throughout the process.

I also agree with your point about architecture. Generating isolated components is relatively easy; preserving and evolving the architectural boundaries across a larger codebase is much harder.

We’re still missing a good way to express and measure architectural quality. Until then, architecture-heavy work requires much closer supervision than implementation-heavy work.

  • > We’re still missing a good way to express and measure architectural quality

    Architectural complexity[1]! There are several really good papers on this.

    Unfortunately it never caught on and we don’t have great automated tools to spit out a number. Also the majority of people just don’t care enough. Research in this field kinda died out when we invented microservices and started treating those as a silver bullet to The Architecture Problem (it’s not [2])

    [1] https://swizec.com/blog/why-taming-architectural-complexity-...

    [2] https://youtu.be/y8OnoxKotPQ
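One number that shows up in the architectural-complexity literature is propagation cost: the share of module pairs (A, B) where A depends on B directly or transitively, so a change to B could ripple into A. The sketch below is a minimal illustration under stated assumptions: the dependency maps and module names are hypothetical, and nothing in the thread confirms this is the exact metric the linked post has in mind.

```python
def propagation_cost(deps):
    """Fraction of (module, module) pairs connected by a direct or
    transitive dependency. `deps` maps a module name to the set of
    modules it depends on directly (hypothetical input format)."""
    modules = sorted(set(deps) | {d for ds in deps.values() for d in ds})
    if not modules:
        return 0.0
    reachable_pairs = 0
    for start in modules:
        seen = {start}  # a module counts as reaching itself
        stack = [start]
        while stack:
            current = stack.pop()
            for nxt in deps.get(current, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        reachable_pairs += len(seen)
    return reachable_pairs / (len(modules) ** 2)


# Hypothetical examples: a layered design versus a dependency cycle.
layered = {"ui": {"core"}, "api": {"core"}, "core": set()}
tangled = {"ui": {"api"}, "api": {"core"}, "core": {"ui"}}

print(propagation_cost(layered))
print(propagation_cost(tangled))
```

A lower value means changes stay more local. The cycle scores the maximum because every module can reach every other one.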

    • > Also the majority of people just don’t care enough.

      Yet! It is the next frontier, and we will need it for agents like those described in the post to really work.


  • > The short version is that I don’t let AI agents work unsupervised on my code. I treat them like participants in a mob programming session instead of autonomous developers.

    I wonder if OS maintainers would have a leg up in defining workflows to better leverage this. Of course, OS contributors are autonomous developers, but maybe a trick or two might transfer across

i've been running claude in what the blog calls phase 0 for the last 6-7 months. i'm perfectly happy with it, my development velocity has increased while i still have a good grasp of the entire app, and i've actually been making decent progress with web development for a personal project, which is something i've bounced off several times in the past. also i do not get stuck as often on stuff like "how do i get django to statically serve up a js bundle with relative imports" which is more about knowing specific APIs of specific frameworks than any feature of my code or architecture.

i would not want to go down the "take myself out of the loop" path because yes, i do have to micromanage the claude session, often course-correcting every commit and then doing large scale refactoring every so often. but i'm perfectly happy doing that - i see claude as more of a tool than a coder i can hand work off to.

  • i just ran into a concrete example of why i would not want to run a tree of unsupervised agents churning out code. i have a project that generates large but repetitive .docx documents. i asked claude to add some graphics to it, it did a very good job of figuring out the xml graphics elements, locating where in the document structure it could insert them, and even printing to pdf and checking visually to get them perfectly lined up with the text. it took some 5 minutes, i would likely have spent an hour doing all that from scratch including several trips to google.

    then i looked at the code and asked it to benchmark, hinting that it looked like it was doing a lot in the inner loop. and sure enough, adding a few simple graphics to every page more than doubled the time it took to generate the largest size of document (~1s -> ~2.2s for ~400 pages). without any more prompting claude figured out that it had an accidentally-quadratic loop, and fixed that.

    i then had to tell it "look, we are using a template to avoid regenerating boilerplate with every page. you can add a placeholder to the template and replace it with graphics using xml patching code you already wrote for another part of the doc generation". the final code was a lot cleaner and ran in ~1.2s, which claude (again unprompted, to its credit) did fine-grained benchmarking to prove was the unavoidable overhead of simply inserting all those large chunks of xml into the document.

    i wouldn't even say it was a coincidence that i ran into this right after writing my comment about having to micromanage the LLM, because this sort of thing happens all the time. i can say that i had a much easier time doing this because i looked at the code generated in a single commit and could easily see that it smelt off. i would not have wanted to do this at the end of 20 commits all building on each other.
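The two shapes described in this comment can be sketched roughly as follows. This is not the commenter's code: the slot markers, the template, and the function names are hypothetical, and plain strings stand in for the real .docx XML. The first function patches the full document once per page, which is the accidentally-quadratic pattern. The second patches the page template once and then renders each page in a single pass.

```python
# Hypothetical slot markers; the real project's template format is not shown in the thread.
BODY_SLOT = "<!--BODY-->"
GRAPHIC_SLOT = "<!--GRAPHIC-->"
PAGE_TEMPLATE = "<page>" + BODY_SLOT + GRAPHIC_SLOT + "</page>"


def render_quadratic(bodies, graphic):
    """Render every page first, then patch the whole document once per page."""
    doc = "".join(PAGE_TEMPLATE.replace(BODY_SLOT, b) for b in bodies)
    for _ in bodies:
        # Each call scans and copies the entire document: O(n) work
        # inside an O(n) loop, so roughly O(n^2) overall.
        doc = doc.replace(GRAPHIC_SLOT, graphic, 1)
    return doc


def render_with_template(bodies, graphic):
    """Patch the template once, then render each page in one pass."""
    patched = PAGE_TEMPLATE.replace(GRAPHIC_SLOT, graphic)
    return "".join(patched.replace(BODY_SLOT, b) for b in bodies)


bodies = ["a", "b", "c"]
print(render_with_template(bodies, "<g/>"))
```

Both functions produce the same document. Only the amount of rescanning and copying differs, which is why this kind of problem tends to show up in a benchmark rather than in the output.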

I built this with 94% written by coding agents: https://buildermark.dev/

The complete log of all prompts and commits is here: https://demo.buildermark.dev/projects/u020uhEFtuWwPei6z6nbN

  • This demo tool is really cool, kudos on that!

    I clicked that link first even though it’s listed second bc I wanted to see the prompts. I didn’t expect the level of detail or mapping to each commit. It is rad!

    That being said the landing page is soooo obviously “vibe coded” (read: AI generated).

    It has that design style that Claude likes to ~ab~use. & if I’m being honest, had I clicked on the website link first, I would never have gotten to the demo bc I would’ve just dismissed it as AI slop.

I'm currently using it to do a large migration from one Relay environment to another, but this is possible because

1. We've done it by hand for another route already, which the LLM uses as reference

2. There's a strong validation setup/harness I've set up for it with storybooks and component tests

3. It's a _mostly_ mechanical transform. Not entirely, as the two environments/APIs are not 1:1, but it's close enough

But! My team and I are still reviewing everything *shrug*. It is "faster" because I get to have this running while I'm in meetings planning other, more interesting projects

And this isn't really that many agents in parallel. Yeah, plenty of fan-out subagents, but that IMO doesn't count/isn't really the same as what others are talking about

  • I think a problem here is you're overestimating how hard it is to rewrite something when you have one example of how to do it right. Even in the 2000s, I remember a junior essentially rewriting our entire codebase from old school asp vbscript to .Net in a few months. About 100 pages back then.

    Your team could have done it pre-AI, but you just thought it was hard so you didn't try.

    I remember migrating a code base from MySQL to SQL Server in the 2010s. I thought it would take me weeks, if not months. It took me a couple of days.

    Immediately made me sour on the "hot" idea in the 2010s that your data layer should be provider agnostic so you could switch if you needed to. That was never a real thing, it was a made up justification for unnecessary over-engineering, by people who had clearly never tried to port an app from one data source to another. There are other reasons for a clear separation, but switching a few hundred SQL statements is not it.

    In reality, mechanical ports are not that hard, you can sit down, put some music on and blitz it in a few days. Programmers just over-estimate how hard they will be.

    • No, I'm not. Because, as I said, we've literally done it first for the first half of our application, and it took us eight engineers for ~5 months.

      It's genuinely weird to have you say that so confidently lol


    • Yes if you know exactly how something should work it is fairly quick to implement. The hard and slow parts are when you only have vague requirements or have to experiment and iterate.

I personally limit LLMs to single files only at the moment. Self-contained components.

Using LLMs in a larger scope can sometimes work, but it has the real risk of turning a project into a mess after which you will have to undo the work and lose a lot of time.

Also, using LLMs this way with less clear boundaries will make reading and maintaining the code more cumbersome.

  • I use this strategy, too. I liken it to limiting the blast radius. If the LLM truly fouls things up it’s easier to pick up the pieces if you keep the scope limited.
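One way to make the blast-radius limit mechanical is to check an agent's changes against an explicit allow-list before accepting them. The sketch below is only an illustration: the function name and the example paths are hypothetical. In practice the list of changed paths could come from the output of `git diff --name-only`.

```python
from pathlib import PurePosixPath


def out_of_scope(changed_paths, allowed_paths):
    """Return the changed paths that fall outside the allowed set."""
    allowed = {PurePosixPath(p) for p in allowed_paths}
    return sorted(p for p in changed_paths if PurePosixPath(p) not in allowed)


# Hypothetical example: the agent was only supposed to touch one component.
changed = ["src/button.py", "src/app.py"]
violations = out_of_scope(changed, ["src/button.py"])
if violations:
    print("Out-of-scope changes:", violations)
```

An empty result means the change stayed inside the agreed scope. Anything else is a cue to review or revert before more work builds on top of it.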

Me when not trying to meet management expectations: only as smarter code completion, code formatting, basic code analysis, and help copy-pasting code examples between languages.

Me when meeting management expectations: agent orchestration tools like Boomi and Workato calling into tools, doing with AI what a few years ago would have been done with BPEL.

You have to make those architectural decisions and feed them to the agents. Be very specific. That's been my experience.

I used LLMs to develop Whistle Enterprise (https://whistle-enterprise.com) from the ground up, from scratch.

It's taken _a lot_ of time and effort, but this is an example of what can be developed using LLMs alone.

You have to have dedication and a goal to reach, but you can absolutely build anything if you're building with the right foundations in mind.

  • I think the relevant question isn’t what can be built but the amount of effort in comparison to doing this the old fashioned way.

    What do you think the productivity gain was from using an LLM? This question assumes you’re already an experienced developer.

    • n=1 but, a friend of mine spent the last few months working on an experimental music software with Claude. What he built is amazing and far beyond my abilities (I have been programming for 20 years). He doesn't know any programming.

      In fact, it's far beyond what I would even attempt, because I've just spent two decades building up a data bank of how hard things are supposed to be.

      He doesn't know it's supposed to be hard, so he just does it.


    • There’s no free lunch, it takes time and effort still. And expertise if you need it to be robust.

      In terms of velocity, let me offer some numbers. In 6 months I generated >150k lines of code and merged 10k PRs to ship and iterate on https://plotalong.app

      I follow best practices and isolate agents to continuously deployed dev environments, semi-manually review PRs and gate the release process between multiple protected envs. The project is getting close to 500 end-to-end tests in Playwright.

      That’s just working nights and weekends. Before AI, it took my team at the office 4 years to produce this much work. There are some qualitative differences but the speed and results are real

    • Thank you for the assumption, I'm actually not a developer at all.

      I'm from a hardware / networking / infrastructure background. I've had extensive exposure to (web) application development as I'm working closely with development teams and I do have the bash/powershell scripting knowledge.

      But honestly, if I tried this "the old fashioned way" it probably would have taken me about 6 to 7 years to develop that application, that's an optimistic estimate. You really do have to have a passion for what you're building, I didn't know that voice transcription and local LLMs would be such a driving force for me, but it's all I think about, so much that I find it hard to go to sleep sometimes.

  • neat. I saw the "no bot joins the call". Is it obvious to others in the virtual meeting that you are using this tool?

    • Thank you! No, they cannot tell. It is your responsibility, per the laws of your country, to notify the other party if you're going to use it.

The true test challenge should be how far an AI can shrink a given fucked up codebase while keeping full functionality.

I also think that writing large codebases into a sort of functional transformer tree, as an information compression stage, would let them reason about large codebases more easily by providing a large lossless overview with minimal token usage.

In the last week we have done a complete analytics dashboard overhaul with Fable/Opus. The baseline was really bad, for we have no front-end engineers, so we largely felt comfortable not reading anything but the auth code (where we did find one subtle edge case handled incorrectly).

The pipelines and data serving design was all human since it did have to deal with some data scale but the javascript/api layer was all slop, and it seems fine and good.

If you have a really high quality piece of code that needs to meet a high bar of quality/reliability, then I think the risk of letting the AI loose on it is very high and I wouldn't do it. If you have a pile of code you already know is a pile of garbage despite being human written, well, it can't get much worse :)

I also built an agent orchestration meta harness that runs on k8s and uses the k8s agents sandbox for running codex/claude code in the cloud. This was almost entirely just handed over to Fable and I have not asked a single architectural detail. The quality of this product is mediocre, but the fact that it largely works after I went through a few iterations of clicking around is impressive. I would have preferred to buy something off the shelf, but nothing even really came close (though maybe now I would have forked Omnigent)

It's great for people who are just maintaining something. Less so for someone building something from scratch, in the earlier phases.

There are hour-long YouTube videos where people explain the process using a complex toy project. Search for them.

Architectural considerations are easy. Figuring out what to actually do from the super vague requirements is even worse I think.