Automating myself out of development

5 days ago (thoughtfultechnologist.com)

I wish people would describe in more detail the tasks they use LLMs to code. My experience is that simple components in an existing architecture are fine, but anything requiring architectural considerations quickly becomes a mess. On my projects (e.g. a ui framework), running multiple agents in parallel would just increase the speed at which it can stuff up the project.

  • I get this question a lot, and I found it hard to answer briefly, so I ended up writing a longer post about how I work:

    https://www.trigosec.com/insights/mob-programming-for-one/

    The short version is that I don’t let AI agents work unsupervised on my code. I treat them like participants in a mob programming session instead of autonomous developers. Different agents get different roles (implementer, reviewer, architect, security reviewer, etc.), and I stay involved throughout the process.

    I also agree with your point about architecture. Generating isolated components is relatively easy; preserving and evolving the architectural boundaries across a larger codebase is much harder.

    We’re still missing a good way to express and measure architectural quality. Until then, architecture-heavy work requires much closer supervision than implementation-heavy work.

    • > We’re still missing a good way to express and measure architectural quality

      Architectural complexity[1]! There are several really good papers on this.

      Unfortunately it never caught on and we don’t have great automated tools to spit out a number. Also the majority of people just don’t care enough. Research in this field kinda died out when we invented microservices and started treating those as a silver bullet to The Architecture Problem (it’s not [2])

      [1] https://swizec.com/blog/why-taming-architectural-complexity-...

      [2] https://youtu.be/y8OnoxKotPQ


    • > The short version is that I don’t let AI agents work unsupervised on my code. I treat them like participants in a mob programming session instead of autonomous developers.

      I wonder if OS maintainers would have a leg up in defining workflows to better leverage this. Of course, OS contributors are autonomous developers, but maybe a trick or two might transfer across

  • i've been running claude in what the blog calls phase 0 for the last 6-7 months. i'm perfectly happy with it, my development velocity has increased while i still have a good grasp of the entire app, and i've actually been making decent progress with web development for a personal project, which is something i've bounced off several times in the past. also i do not get stuck as often on stuff like "how do i get django to statically serve up a js bundle with relative imports" which is more about knowing specific APIs of specific frameworks than any feature of my code or architecture.

    i would not want to go down the "take myself out of the loop" path because yes, i do have to micromanage the claude session, often course-correcting every commit and then doing large scale refactoring every so often. but i'm perfectly happy doing that - i see claude as more of a tool than a coder i can hand work off to.

    • i just ran into a concrete example of why i would not want to run a tree of unsupervised agents churning out code. i have a project that generates large but repetitive .docx documents. i asked claude to add some graphics to it, it did a very good job of figuring out the xml graphics elements, locating where in the document structure it could insert them, and even printing to pdf and checking visually to get them perfectly lined up with the text. it took some 5 minutes, i would likely have spent an hour doing all that from scratch including several trips to google.

      then i looked at the code and asked it to benchmark, hinting that it looked like it was doing a lot in the inner loop. and sure enough, adding a few simple graphics to every page more than doubled the time it took to generate the largest size of document (~1s -> ~2.2s for ~400 pages). without any more prompting claude figured out that it had an accidentally-quadratic loop, and fixed that.

      i then had to tell it "look, we are using a template to avoid regenerating boilerplate with every page. you can add a placeholder to the template and replace it with graphics using xml patching code you already wrote for another part of the doc generation". the final code was a lot cleaner and ran in ~1.2s, which claude (again unprompted, to its credit) did fine-grained benchmarking to prove was the unavoidable overhead of simply inserting all those large chunks of xml into the document.

      i wouldn't even say it was a coincidence that i ran into this right after writing my comment about having to micromanage the LLM, because this sort of thing happens all the time. i can say that i had a much easier time doing this because i looked at the code generated in a single commit and could easily see that it smelt off. i would not have wanted to do this at the end of 20 commits all building on each other.
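
      For anyone curious what that kind of accidentally-quadratic loop can look like, here is a minimal sketch. It is not the actual code from the project: the function names and the `{{GRAPHIC}}` placeholder are made up for illustration, and plain strings stand in for the docx XML.

```python
# Hypothetical sketch: names and the placeholder token are illustrative,
# not taken from the commenter's project.
PLACEHOLDER = "{{GRAPHIC}}"


def build_quadratic(pages, graphic_xml):
    """Re-scans the whole document built so far for every page: O(n^2)."""
    doc = ""
    for body in pages:
        doc = doc + "<page>" + body + "</page>"
        # str.replace walks the entire accumulated document on each iteration
        doc = doc.replace(PLACEHOLDER, graphic_xml)
    return doc


def build_linear(pages, graphic_xml):
    """Patches each page exactly once, then joins the parts: O(n)."""
    parts = []
    for body in pages:
        page = "<page>" + body + "</page>"
        parts.append(page.replace(PLACEHOLDER, graphic_xml))
    return "".join(parts)
```

      Both functions return the same document as long as the graphic itself does not contain the placeholder. The difference only shows up in a benchmark on a large page count, which is why it is easy to miss in review.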

  • I built this with 94% written by coding agents: https://buildermark.dev/

    The complete log of all prompts and commits is here: https://demo.buildermark.dev/projects/u020uhEFtuWwPei6z6nbN

    • This demo tool is really cool, kudos on that!

      I clicked that link first even though it’s listed second bc I wanted to see the prompts. I didn’t expect the level of detail or mapping to each commit. It is rad!

      That being said the landing page is soooo obviously “vibe coded” (read: AI generated).

      It has that design style that Claude likes to ~ab~use. & if I’m being honest, had I clicked on the website link first, I would never have gotten to the demo bc I would’ve just dismissed it as AI slop.

  • I'm currently using it to do a large migration from one Relay environment to another, but this is possible because

    1. We've done it by hand for another route already, which the LLM uses as reference

    2. There's a strong validation setup/harness I've set up for it with storybooks and component tests

    3. It's a _mostly_ mechanical transform. Not entirely, as the two environments/APIs are not 1:1, but it's close enough

    But! I and my team are still reviewing everything *shrug*. It is "faster" because I get to have this running while I'm in meetings planning other more interesting projects

    And this isn't really that many agents in parallel. Yeah, plenty of fan-out subagents, but that IMO doesn't count/isn't really the same as what others are talking about

    • I think a problem here is you're overestimating how hard it is to rewrite something when you have one example of how to do it right. Even in the 2000s, I remember a junior essentially rewriting our entire codebase from old school asp vbscript to .Net in a few months. 100 or so pages back then.

      Your team could have done it pre-AI, but you just thought it was hard so you didn't try.

      I remember migrating a code base from MySQL to SQL Server in the 2010s. I thought it would take me weeks, if not months. It took me a couple of days.

      Immediately made me sour on the "hot" idea in the 2010s that your data layer should be provider agnostic so you could switch if you needed to. That was never a real thing, it was a made up justification for unnecessary over-engineering, by people who had clearly never tried to port an app from one data source to another. There are other reasons for a clear separation, but switching a few hundred SQL statements is not it.

      In reality, mechanical ports are not that hard, you can sit down, put some music on and blitz it in a few days. Programmers just over-estimate how hard they will be.


  • I personally limit LLMs to single files only at the moment. Self-contained components.

    Using LLMs in a larger scope can sometimes work, but it has the real risk of turning a project into a mess after which you will have to undo the work and lose a lot of time.

    Also, using LLMs this way with less clear boundaries will make reading and maintaining the code more cumbersome.

    • I use this strategy, too. I liken it to limiting the blast radius. If the LLM truly fouls things up it’s easier to pick up the pieces if you keep the scope limited.
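
      One way to enforce that kind of scope limit mechanically is to compare the files an agent touched against the files it was allowed to touch. A minimal sketch, assuming you already have the list of changed paths (for example from `git diff --name-only`); the function name and the example paths are hypothetical:

```python
from pathlib import PurePosixPath


def out_of_scope(changed_paths, allowed_paths):
    """Return the changed paths that fall outside the allowed scope, sorted."""
    # PurePosixPath normalizes spellings such as "./src/a.py" and "src/a.py"
    allowed = {PurePosixPath(p) for p in allowed_paths}
    return sorted(p for p in changed_paths if PurePosixPath(p) not in allowed)
```

      An empty result means the change stayed inside the agreed blast radius; anything else is a cue to stop and look before accepting the commit.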

  • Me when not trying to meet management expectations, only as smarter code completion, formatting code, basic code analysis, and helping copy pasting code examples between languages.

    Me when meeting management expectations, agent orchestration tools like Boomi and Workato calling into tools, doing with AI what a few years ago would be done with BPEL.

  • You have to make those architectural decisions and feed them to the agents. Be very specific. That's been my experience.

  • I used LLMs to develop Whistle Enterprise (https://whistle-enterprise.com) from the ground up, from scratch.

    It's taken _a lot_ of time and effort, but this is an example of what can be developed using LLMs alone.

    You have to have dedication and a goal to reach, but you can absolutely build anything if you're building with the right foundations in mind.

    • I think the relevant question isn’t what can be built but the amount of effort in comparison to doing this the old fashioned way.

      What do you think the productivity gain was from using an LLM? This question assumes you’re already an experienced developer.


    • neat. I saw the "no bot joins the call". Is it obvious to others in the virtual meeting that you are using this tool?


  • In the last week we have done a complete analytics dashboard overhaul with Fable/Opus. The baseline was really bad, for we have no front-end engineers, so we largely felt comfortable not reading anything but the auth code (where we did find one subtle edge case handled incorrectly).

    The pipelines and data serving design was all human since it did have to deal with some data scale but the javascript/api layer was all slop, and it seems fine and good.

    If you have a really high quality piece of code that needs to meet a high bar of quality/reliability, then I think the risk of letting the AI loose on it is very high and I wouldn't do it. If you have a pile of code you already know is a pile of garbage despite being human written, well, it can't get much worse :)

    I also built an agent orchestration meta harness that runs on k8s and uses the k8s agents sandbox for running codex/claude code in the cloud. This was almost entirely just handed over to Fable and I have not asked a single architectural detail. The quality of this product is mediocre, but the fact that it largely works after I went through a few iterations of clicking around is impressive. I would have preferred to buy something off the shelf, but nothing even really came close (though maybe now I would have forked Omnigent)

  • The true test challenges should be how far an AI can minimize a given fucked up codebase and keep full functionality.

    I also think that writing large codebases into a sort of functional transformer tree as an information compression stage would allow them to reason more easily about large code bases by having a large lossless overview with minimal token usage.

  • It's great for people who are just maintaining something. Less so for someone building something from scratch, in the earlier phases.

  • There are hour long youtube videos where people explain the process by using a complex toy project. Search for them.

  • Architectural considerations are easy. Figuring out what to actually do from the super vague requirements is even worse I think.

Interestingly, despite it being much more detailed and a lot more process and procedure than what I currently do - which is more akin to the version 0 described, but in parallel - we come up against the same final problem: reviews and quality assurance.

I sign off on the code I merge, which is part of company policy but also just to be sure it is actually decent. But reviewing has become the real draining bottleneck: even with stacked PRs, if they total 5-6k lines, reviewing is not a 5min job. Even if I brainstormed and set the plan, that's really the part that doesn't scale right now for me in this. But the author is very shy about that: either the changes aren't that big in the end or they trust the process enough to review in a more casual manner. Being equally untrusting I can't do that ...

  • For decades, engineers understood that large code reviews are harder than small ones. Out of both politeness and a desire to receive better code reviews, we learned to break our large changes into smaller chunks. Some engineers took things even further and replaced code reviews with pair programming. But then LLMs showed up and everyone seems to have forgotten those lessons.

    They can still be applied now using coding agents, if you're willing to push back against the default setup and change your mode of thinking a little bit. Of course it doesn't help that an entire industry is dedicated to persuading us that maximizing token spend is the only way to get shit done.

    I appreciate this probably seems like an extremist take, but I wrote some more about it here in case there's anybody out there who identifies with it:

    https://philbooth.me/blog/agentic-coding-and-mental-models

    • > They can still be applied now using coding agents, if you're willing to push back against the default setup and change your mode of thinking a little bit. Of course it doesn't help that an entire industry is dedicated to persuading us that maximizing token spend is the only way to get shit done.

      Yeah the problem is the executives and managers around us are demanding we ship massive features as quickly as possible, and I like having a job and dread having to find a new one in this market...


    • Agree with this completely. This push for more autonomy I think is the completely wrong direction for how to use LLMs.

      I want less code to maintain not more that I don't even fully understand.

      I think research and very supervised coding with lots of guardrails is the way to actually gain productivity from these tools.

    • I think that's reasonable. My only gripe is that making small sets of changes is often faster to do by hand than waiting on llm reasoning, so I've found it amounts to very little speedup.

  • Proper review should take longer than writing it yourself, because you need to know the correct solution, understand the proposed solution, and evaluate the difference between the two. When designing it yourself, you just need to know the correct solution and write it, and with modern high-level languages and IDEs with autocomplete writing it is hardly a bottleneck.

  • If I'm attentive during spec/plan creation I sort of build this "expectation" of what the actual PR will look like, the mental model of it. Then it's somewhat easier to review. But the mental load is brutal tbh, and still not sure if it's "worth it"

Good writeup. I think the main difference in my workflow is that I skipped the sandboxing part and accepted the coding agent having access to the entire 24/7 dev machine, so I'm still running on worktrees. Also, the "idea enrich" steps in my workflow are less formal - I tend to write most details in a feature spec myself. I also do my workflow on my own self-hosted custom interface which comes with a kanban board for project tracking, so I don't need Github. The rest of the workflow looks pretty similar.

>Automating myself out of development

>I want to start by saying that I’m neither an AI-fanatic

Kind of like saying you are a fanatic before saying you aren't.

I don't think there's too much here (e.g. "spec driven development") I haven't seen elsewhere.

  • > I don't think there's too much here I haven't seen elsewhere.

    Isn't that the rhyme here. I can't think of any article or discussion on AI here that contains anything new or noteworthy. And yet all those articles we've read before and all those "discussions" we've had before keep coming and coming. I have gotten bored and I'm just waiting for anything decisive to happen.

*siiiighhh... Slop automation. Removing self from loop, automating brainstorming. It's madness. No way that code is any good, shippable beyond 2 users or even maintainable beyond auto-slapping on more slop. Sad.

I don't know if I’m overly critical but there’s gotta be a middle ground between totally AI pilled people that otherwise have no talents, and control freak veteran developers who can't let go

My current process is also using Github projects in a normal scrum style way, with many tickets written or fleshed out and state managed by the LLM, and it doubling as the memory system

Completely leapfrogging all these other open and closed source concoctions and being more effective

But it's effective enough that I don’t need OP’s final form state of still approving everything

Auto-mode is fine. Worktrees are built into Claude Code now. I just tell it to classify tickets as sequential or parallel possible and spawn subagents to tackle all of the tickets in the todo list

They all get their own context window; it's pretty perfect now
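
As a rough illustration of the sequential-vs-parallel split: if each ticket lists the tickets it depends on, the todo list can be grouped into waves, where everything inside a wave can go to subagents at the same time. This is only a sketch under that assumption. In my setup the LLM does the classification; the explicit dependency map, the `plan_waves` name, and the ticket ids here are hypothetical.

```python
def plan_waves(deps):
    """Group tickets into waves of tickets that can run in parallel.

    deps maps a ticket id to the set of ticket ids it depends on.
    Every dependency must itself be a key of deps.
    """
    remaining = {ticket: set(d) for ticket, d in deps.items()}
    waves = []
    while remaining:
        # Tickets with no unmet dependencies are ready to run now
        ready = sorted(t for t, d in remaining.items() if not d)
        if not ready:
            raise ValueError(
                "unresolvable dependencies (cycle or unknown ticket): "
                + ", ".join(sorted(remaining))
            )
        waves.append(ready)
        for ticket in ready:
            del remaining[ticket]
        # Finished tickets no longer block anything
        for d in remaining.values():
            d.difference_update(ready)
    return waves
```

Waves run one after another, so the sequential tickets end up alone in their wave and the independent ones fan out together.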

in the meantime I work in a couple tabs of Claude Design for different flows of any client side app. My philosophy has been that devs could pick up graphic and UI/UX design easily, it's just still a full time job to make variations of layouts and portray their states.

UI/UX is not a full time job anymore.

And I use Claude chat to flesh out aspects of the overall idea

I think you may be overcomplicating your workflow in the concluding state.

Overall I agree that planning and intention is now most of the time, before a 10 subagent precision strike is initiated

  • I just do turn based development with Cline. I design my UIs in the browser first then let something like Claude wire it up, correcting it as I go. Way faster than before, easy to correct mistakes, doesn't require self-sacrifice or submission.

    I shudder when I hear about some people's (wildly overcomplicated) setups. I get the allure but there's something nice about pair programming with an LLM in a singular chat.

  • Could be (the overcomplicating part), I'm just not yet comfortable losing the mental model of the final application. At least not in all types of tickets. Are you not seeing that?..

    • I focus on one side project at a time, alongside work applications

      Both are giving me skillsets to excel in the other domain

      I watch the subagents, push back on some choices, look at commits and glance at pull requests

  • > control freak veteran developers who can't let go

    It is not control freak behavior to want to be in control when you are the one accountable for it if it breaks.

  • All these people saying UI/UX is dead, then I see their designs and they're absolutely the worst (but they're always swearing by how incredible it is).

    Sorry, access to an LLM (even if it could center a div reliably and make responsive designs, it can't) does not give you taste or intuition, or make you good at building user interfaces. You people/sloppers have no idea the amount of sweat that gets poured into great UX.

    It's insulting when you people say these things and I'm not even a designer or frontend dev.

    I actually think UI/UX designers and devs will be the last to fall. I will want beautiful products that were built by beautiful minds, that's how you will set yourself apart from the slop. And fortunately it will be even easier when 80% of everything is half assed cranked out UI by llm design tools. The contrast is already glaring.

    • > I actually think UI/UX designers and devs will be the last to fall. I will want beautiful products that were built by beautiful minds

      as an aside, i do find it interesting how people around me are more reluctant to have an ai design the look of an app vs a human, yet having an ai design the more important parts (how it works) is totally fine to them...

      > You people/sloppers have no idea the amount of sweat that gets poured into great UX.

      i could say the same about code though... why is ui/ux some sacred cow but code isn't? it's just as important, no?

      > It's insulting when you people say these things and I'm not even a designer or frontend dev.

      playing devil's advocate, but again, why is code any different?


    • UI/UX or dev isn't dead.

      It will be shrinking. Less grunt work.

      Internal projects can get done with less of either.

      Nobody really cares about great UX or about how great someone can implement a CRUD app.

      So there will be less need/fighting over such resources.

      If I can just generate a usable UI for a hobby project I don't need to find some company to build it out. Sure, it will miss out on a lot of stuff but it's a trade off.

      If someone else can build a product and needed a basic web shop / crud app, they don't need to find someone to implement that at a massive overcharge.

    • I’ve seen that slop but

      Claude Design has barely been out for a month

      And it’s fulfilled my needs better than v0, lovable, playwright via LLM or just iterating in the coding LLM. I’ve worked with graphic designers my whole career and have also contracted design agencies to do style guides and collaborate on branding and layouts. I’ve gotten the output that I’m looking for with Claude Design

      eventually you’ll see examples but it's not in my purview to publicly link any of my projects as being vibe coded


I am completely calm regarding AI and development.

First, nobody sane wants to give their domain IP to OpenAI/Anthropic. That's why local AI will eventually prevail and flourish: people who actually have some IP will have no problem buying a 10k+ EUR machine to run some pretty good models on it. However, if your main job is just doing CRUD stuff, then you are screwed.

Secondly, hallucination is really the Achilles heel of every LLM. Sure, you can recreate an application which exists in thousands of variations on the internet, but the moment you try to go deeper into domain knowledge you will start struggling more and more.

Try to make a CAN driver for ESP32: easy, it is probably going to work. Try to make a CAN driver for STM32F7xx: now the AI will start having problems, but will probably be able to produce something that works after a lot of debugging. Now let's make a CAN driver for MPC5555: the AI will start writing fairy tales about registers which do not exist. All of the processors above have reference manuals and sometimes example git repositories available on the open internet.
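
A cheap guard against the fairy-tale registers is to check generated driver code against a register list you extracted from the reference manual yourself. A minimal sketch, assuming register macros share a common prefix; the `CAN_` prefix and every register name below are placeholders, not real MPC5555 names:

```python
import re


def unknown_registers(source, known_registers, prefix="CAN_"):
    """Return prefixed identifiers in source that are not in known_registers."""
    # Match whole identifiers such as CAN_REG_A; prefix is escaped for safety
    pattern = re.compile(r"\b" + re.escape(prefix) + r"[A-Z0-9_]+\b")
    found = set(pattern.findall(source))
    return sorted(found - set(known_registers))


# Placeholder names only; a real list would come from the reference manual.
generated = "CAN_REG_A |= 1; CAN_REG_MADE_UP = 0; x = CAN_REG_B;"
print(unknown_registers(generated, {"CAN_REG_A", "CAN_REG_B"}))
```

This only catches invented names. It does nothing for a register that exists but is described wrongly, which still means reading the manual.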

  • > First, nobody sane wants to give their domain IP to OpenAI/Anthropic.

    Pretty much the whole industry has zero problem giving OpenAI/Anthropic full access to their systems and codebases.

    You're putting way more thought into it than the vast majority, most companies seem to go with the momentum

  • > First, nobody sane wants to give their domain IP to OpenAI/Anthropic. That's why local AI will eventually prevail and flourish: people who actually have some IP will have no problem buying a 10k+ EUR machine to run some pretty good models on it. However, if your main job is just doing CRUD stuff, then you are screwed

    Replace OpenAI/Anthropic with AWS and this is not too dissimilar to the arguments in 2009 about cloud providers.

    It’s not that there's nobody for whom this is true, it’s just that there’s enough of everyone else to build an empire with.

    • > It’s not that there's nobody for whom this is true, it’s just that there’s enough of everyone else to build an empire with.

      But those everyone else are racing to the bottom because all their ideas are being soaked up by AI and then being given to their competitors on a silver platter as AI output.


  • Did you try this by giving it access to the materials? Human programmers also don't memorize all this stuff. If this is the reason for your calmness it's quite shortsighted.

    There are problems when you rely too much on AI generated code, but these shallow dismissals are quite annoying.

    • I did, the problem is that

      1. There can be massive differences between chips which sound plausibly the same, and thanks to the way LLMs work, models mangle these variations together

      2. Registers are often named in very similar ways across different manufacturers, so models make up registers for the MPC5555 which are coincidentally registers in Renesas processors doing the same thing.

      3. There is no standard for reference manuals. Sometimes there are literally missing chunks of knowledge thanks to translation to English, or there are pieces which you can only get from Application Notes which have code as a screenshot.

      And then you will find out that all those descriptions are wrong and through trial and error you will get it working in 2 weeks' time.

      Bonus point: Random people have public Git repositories for obscure processors, but with bad or completely non-working implementations of drivers for them. However, the LLM will just output a variation of this garbage at you, because there are 3 public repositories on the whole internet. Sometimes I have a feeling that this must be on purpose to poison the well.


  • > All of the processors above have reference manuals and sometimes example git repositories available on the open internet.

    okay? then give it those reference manuals and git repositories? I haven't heard of something LLMs can't get around and figure out?