Comment by PaulHoule

3 days ago

Kinda funny, but I think LLM-assisted workflows are frequently slow -- that is, if I use the "refactor" features in my IDE, it's done in a second; if I ask the faster kind of assistant, it comes back in 30 seconds; if I ask the "agentic" kind of assistant, it comes back in 15 minutes.

I asked an agent to write an HTTP endpoint at the end of the work day when I had just 30 min left -- my first thought was "it took 10 minutes to do what would have taken a day", but then I thought, "maybe it was 20 minutes for 4 hours' worth of work". The next day I looked at it and found the logic was convoluted; it tried to write good error handling but didn't succeed. I went back and forth and ultimately wound up recoding a lot of stuff manually. In 5 hours I had it done for real, certainly with a better test suite than I would have written on my own and probably better error handling.

See https://www.reddit.com/r/programming/comments/1lxh8ip/study_...

As a counter example (re: agents), I routinely delegate simple tasks to Claude Code and get near-perfect results. But I've also had experiences like yours where I ended up wasting more time than saved. I just kept trying with different types of tasks, and narrowed it down to the point where I have a good intuition for what works and what doesn't. The benefit is I can fire off a request on my phone, stick it in my pocket, then do a code review some time later. This process is very low mental overhead for me, so it's a big productivity win.

  • Sounds like a slot machine. Insert API tokens, get something that's pretty close to right, insert more tokens and hope it works this time.

    • Except the tokens you insert have meaning, and some yield better results than others. Not like a slot machine at all, really. Last I checked, those only have 1 possible input, no way to improve your odds.

    • How's that different from a human developer? Give the same task to different developers and you'll get different levels of correctness and quality. Give the same task to the same developer on different days and you'll see the same variation.

  • The cost is in the context switching. Throw out three tasks that come back 15, 20, and 30 minutes later. The first is mostly OK, so you finish it by hand. The second has some problems, so you ask for a rework. Then the third comes back and, while OK, has some design problems, so you ask for another rework. Then the second one comes back, and you have to remember the original task and what changes you asked for.

  • That's cool, how are you integrating your phone with your Claude workflow?

    • I don't know how to do it with Claude Code, but I was on a beach vacation for the past few days and I was studying French on my phone with a webapp that I made. Sometimes I'd notice something bug me, and I used Cursor's "background agents" tool to ask it to make a change. This is essentially just a website where you can type in your request, and they allocate a VM, check out your repository, run the Cursor LLM agent inside that VM to implement your requested changes, then push it and create a pull request to your repo. Because I have CI/CD set up, I then just merged the change and waited for it to deploy (usually going for a swim in-between).

      I realized as I was doing it that I wouldn't be able to tell anyone about it because I would sound like the most obnoxious AI bro ever. But it worked! (For the simple requests I used it on.) The most annoying part was that I had to tell it to run rustfmt every time, because otherwise it would fail CI and I wouldn't be able to merge it. And then it would take forever to install a rust toolchain and figure out how to run clippy and stuff. But it did feel crazy to be able to work on it from the beach. Anyway, I'm apparently not very good at taking vacations, lol

    • My dev environment works perfectly on Termux, and so does Claude Code. So I just run `claude` like normal, and everything is identical to how I do it on desktop.

      Edit: clarity

I've already written about this several times here. I think the current trend of LLMs chasing benchmark scores is going in the wrong direction, at least for programming tools. In my experience they get it wrong often enough that I always need to check the work. So I end up in a back-and-forth with the LLM, and because of the slow responses it becomes a really painful process; I could often have done the task faster if I had sat down and thought about it. What I want is an agent that responds immediately (and I mean in subseconds), even if some benchmark score is 60% instead of 80%.

  • Programmers (and I'm including myself here) often go to great lengths to not think, to the point of working (with or without a coding assistant) for hours in the hope of avoiding one hour of thinking. What's the saying? "An hour of debugging/programming can save you minutes of thinking," or something like that. In the end, we usually find that we need to do the thinking after all.

    I think coding assistants would end up being more helpful if, instead of trying to do what they're asked, they would come back with questions that help us (or force us) to think. I wonder if a context prompt that says, "when I ask you to do something, assume I haven't thought the problem through, and before doing anything, ask me leading questions," would help.
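
    Something like this dropped into a CLAUDE.md or an equivalent system prompt might do it (the wording here is just a guess, not a tested prompt):

        When I ask you to implement something, assume I have not thought
        the problem through. Before writing any code, ask me a few leading
        questions about requirements, edge cases, and alternatives I may
        have missed, and wait for my answers before proceeding.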

    I think Leslie Lamport once said that the biggest resistance to using TLA+ - a language that helps you, and forces you, to think - is that thinking is the last thing programmers want to do.

    • > Programmers (and I'm including myself here) often go to great lengths to not think, to the point of working (with or without a coding assistant) for hours in the hope of avoiding one hour of thinking. What's the saying? "An hour of debugging/programming can save you minutes of thinking," or something like that. In the end, we usually find that we need to do the thinking after all.

      This is such a great observation. I'm not quite sure why this is. I'm not a programmer, but a signal-processing/systems engineer/researcher. The weird thing is that it seems to be the process of programming itself that causes the "not-thinking" behaviour. E.g. when I program a simulation and I find that I must have a sign error somewhere in my implementation (sometimes you can see this from the results), I end up switching every possible sign around instead of taking pen and paper and comparing theory and implementation; if I do other work, e.g. theory, that's not the case. I suspect we try to avoid the cost of the context switch and try to stay in the "programming flow".

    • I do both. I like to develop designs in my head, and there’s a lot of trial and error.

      I think the results are excellent, but I can hit a lot of dead ends, on the way. I just spent several days, trying out all sorts of approaches to PassKeys/WebAuthn. I finally settled on an approach that I think will work great.

      I have found that the old-fashioned “measure twice, cut once” approach is highly destructive. It was how I was trained, so walking away from it was scary.

    • Sometimes thinking and experimenting go together. I had to do some maintenance on some TypeScript/yum code that I didn't write but had previously done a little maintenance on.

      TypeScript can produce astonishingly complex error messages when types don't match up, so I went through a couple of rounds of showing the errors to the assistant and getting suggested fixes that were wrong, but I got some ideas and did more experiments. Over the course of two days (making desired changes along the way) I figured out what was going wrong and cleaned up the use of types such that I was really happy with my code, and when I saw a red squiggle I usually knew right away what was wrong; if I did ask the assistant, it would also get it right right away.

      I think there's no way I would have understood what was going on without experimenting.

    • I like that prompt idea. Because I hate hate hate when it just starts "doing work". Those things are much better as a sounding board for ideas and for clarifying my thinking than for writing one-shot code.

    • > assume I haven't thought the problem through

      This is the essence of my workflow.

      I dictate rambling, disorganized, convoluted thoughts about a new feature into a text file.

      I tell Claude Code or Gemini CLI to read my slop, read the codebase, and write a real functional design doc in Markdown, with a section on open issues and design decisions.

      I'll take a quick look at its approach, edit the doc to tweak it and answer a few open questions, then tell it to answer the remaining open questions itself and update the doc.

      When that's about 90% good, I'll tell the local agent to write a technical design doc to think through data flow, logic, API endpoints and params and test cases.

      I'll have it iterate on that a couple more rounds, then tell it to decompose that work into a phased dev plan where each phase is about a week of work, and each task in the phase would be a few hours of work, with phases and tasks sequenced to be testable on their own in frequent small commits.

      Then I have the local agent read all of that again, the codebase, the functional design, the technical design, and the entire dev plan so it can build the first phase while keeping future phases in mind.

      It's cool because the agent isn't only a good coder, it's also a decent designer and planner too. It can read and write Markdown docs just as well as code and it makes surprisingly good choices on its own.

      And I have complete control to alter its direction at any point. When it methodically works through a series of small tasks it's less likely to go off the rails at all, and if it does it's easy to restore to the last commit and run it again.

    • I agree with your comment in general; however, I would say that in my field, the resistance to TLA+ isn't having to think, but rather having to code twice without guarantees that the code actually maps to the theoretical model.

      Tools like Lean and Dafny are much more appreciated, as they generate code from the model.

    • > "An hour of debugging/programming can save you minutes of thinking,"

      I get what you're referring to here, when it's tunnel-vision debugging. Personally I usually find that coding/writing/editing is thinking for me. I'm manipulating the logic on screen and seeing how to make it make sense, like a math problem.

      LLMs help because they immediately think through a problem and start raising questions and points of uncertainty. Once I see those questions in the <think> output, I cancel the stream, think through them, and edit my prompt to answer the questions beforehand. This often causes the LLM's responses to become much faster and shorter, since it doesn't need to agonise over those decisions any more.

    • It's funny, I feel like I'm the opposite, and it's why I truly hate working with stuff like Claude Code that constantly wants to jump into implementation. I want to be in the driver's seat fully and think about how to do something thoroughly before doing it. I want the LLM to be, at most, my assistant: taking on the task of being a rubber duck, doing some quick research for me, etc.

      It's definitely possible to adapt these tools to be more useful in that sense... but it definitely feels counter to what the hype bros are trying to push out.

    • In general agreement about the need to think it through, though we should be careful not to praise the other extreme.

      > "An hour of debugging/programming can save you minutes of thinking"

      The trap so many devs fall into is assuming code behaves like they think it does. Or believing documentation or seemingly helpful comments. We really want to believe.

      People's mental image is more often than not wrong, and debugging tremendously helps bridge the gap.

    • Absolutely! I used Copilot for a few weeks and then stopped when I worked on a machine that didn't have Copilot installed and immediately struggled with even basic syntax. Now I often use LLMs as advanced rubber ducks. By describing my problems, the solution often comes to my mind on its own, and sometimes the responses I get are enough for me to continue on my own. In my opinion, letting LLMs directly code can be really harmful for software developers, because they forget to think for themselves. Maybe I'm wrong and am just slow to accept the new reality, but I try to keep writing most of my code on my own and improve my coding skills more than my prompting skills (while still using these tools, of course). For me, LLMs are like a grumpy and cynical old senior dev who is forced to talk in a very positive manner and who has fun trickling in some completely random bullshit between his actual helpful advice.

  • World of LLMs or not, development should always strive to be fast. In the LLM world, users should always have control over accuracy vs. speed (though we can try to improve both rather than one at the expense of the other). For example, at rtrvr.ai we use Gemini Flash as our default and benchmarked with Flash too, at 0.9 min per task in the benchmark while still yielding top results. That said, I have to accept there are certain web tasks on tail-end sites that need Pro to navigate accurately at this point. This is the limitation of relying on Gemini models straight up; once we move to our own models trained on web trajectories, this hopefully won't be a problem.

    If you use off-the-shelf LLMs, you will always be bottlenecked by their speed.

The only thing I've found where an LLM speeds up my work is a sort of advanced find-and-replace.

A prompt like " I want to make this change in the code where any logic deals with XXX. To be/do XXX instead/additionally/somelogicchange/whatever"

It has been pretty decent at these types of changes and saves the time of poking through and finding all the places I would have updated manually, in a way that find/replace never could. Though I've never tried this on a huge code base.

  • > A prompt like " I want to make this change in the code where any logic deals with XXX. To be/do XXX instead/additionally/somelogicchange/whatever"

    If I reached a point where I would find this helpful, I would take this as a sign that I have structured the code wrongly.

    • You would be right about the code but probably wrong about the you. I’ve done such requests to clean up code written over the years by dozens of other people copying patterns around because ship was king… until it wasn’t. (They worked quite well, btw.)

    • Sometimes you want a cut point for a refactor, and only that refactor. And it turns out there is no nice abstraction that is useful beyond that refactor.

    • I knew someone would make this comment. I almost added an "I'm probably not leet enough to avoid these situations" disclaimer. It seemed a bit pointlessly self-deprecating.

      You don't always get to choose the state of or the way a system you work in/with is designed. In this case I was working in a limited scripting language that I have no choice about.

      Keep that nose turned up. I'm sure you are leet10xninja. Maybe work on your reading comprehension before you dump on someone, though, as I already specified that I greatly simplified for the comment's sake.

  • I supposed you haven’t tried emacs grep mode or vim quickfix? If the change is mechanical, you create a macro and be done in seconds. If it’s not, you still got the high level overview and quick navigation.

    • Finding and jumping to all the places is usually easy, but non-trivial changes often require some understanding of the code beyond just line-based regex replace. I could probably spend some time recording a macro that handles all the edge cases, or use some kind of AST-based search and replace, but the Cursor agent does it just fine in the background.

    • I'm decent at that kind of stuff. However, that's not really what I'm talking about. For instance, today I needed two logic flows: one for data flowing in one direction, then a basically-but-not-quite-reversed version of the same logic for when the data comes back. I was able to write the first version, then tell the LLM:

      "Now duplicate this code but invert the logic for data flowing in the opposite direction."

      I'm simplifying this whole example, obviously, but that was the basic task I was working on. It was able to spit out in a few seconds what would have taken me probably more than an hour and at least one tedium-headache break. I'm not aware of any pre-LLM way to do something like that.

      Or a little while back I was implementing a basic login/auth for a website. I was experimenting with high-output-token LLMs (I'm not sure that's the technical term) and asked one to make a very comprehensive login handler. I had to stop it somewhere in the triple digits of cases and functions. Perhaps not a great "pro" example of LLMs, but even though it was a hilariously over-complex setup, it did give me some ideas I hadn't thought about. I didn't use any of the code though.

      It's far from the magic LLM sellers want us to believe in, but it can save time, the same as various Emacs/Vim tricks can for devs who want to learn them.

    • Emacs macros aren't the same. You need to look at the file, observe a pattern, then start recording the macro and hope the pattern holds. An LLM can just do this.

I guess it depends? The "refactor" stuff, if your IDE or language server can handle it, then yeah, I find the LLM slower for sure. But there are other cases where an LLM helps a lot.

I was writing some URL canonicalization logic yesterday. Because we rolled this out as an MVP, customers had put URLs in all sorts of formats and we stored them in the DB. My initial pass at the logic failed on some cases. Luckily URL canonicalization is pretty trivially testable. So I took the most-used customer URLs from our DB, sent them to Claude, and told Claude to come up with the "minimum spanning test cases" that cover this behavior. This took maybe 5-10 sec. I then told Zed's agent mode, using Opus, to make me a test file and use these test cases to call my function. I audited the test cases and ended up removing some silly ones. I iterated on my logic and that was that. Definitely faster than having to do this myself.
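
As a rough illustration of the shape of that test file, here's a hypothetical sketch in Python -- the canonicalize_url function, its rules, and the example URLs below are made up for illustration, not taken from the actual codebase:

    # Hypothetical sketch: a toy canonicalize_url plus the kind of
    # table-driven tests the agent was asked to generate.
    from urllib.parse import urlsplit, urlunsplit

    import pytest

    def canonicalize_url(raw: str) -> str:
        """Assume https, lowercase scheme/host, strip www. and trailing slash."""
        raw = raw.strip()
        if "://" not in raw:
            raw = "https://" + raw
        scheme, netloc, path, query, _fragment = urlsplit(raw)
        netloc = netloc.lower().removeprefix("www.")
        return urlunsplit((scheme.lower(), netloc, path.rstrip("/"), query, ""))

    @pytest.mark.parametrize("raw,expected", [
        ("Example.com", "https://example.com"),
        ("https://www.example.com/", "https://example.com"),
        ("http://example.com/pricing/?q=1", "http://example.com/pricing?q=1"),
    ])
    def test_canonicalize_url(raw, expected):
        assert canonicalize_url(raw) == expected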

All the references to LLMs in the article seemed out of place, like poorly done product placement.

LLMs are the antithesis of fast. In fact, being slow is a perceived virtue with LLM output. Some sites like Google and Quora (until recently) simulate the slow typed-output effect for their pre-cached LLM answers, just for credibility.

I'm consistently seeing personal and shared anecdotes of a 40%-60% speedup on targeted senior work.

As much as I like agents, I am not convinced the human using them can sit back and get lazy quite yet!

  • Eeeh, I spend less time writing code, but way more time reviewing and correcting it. I'm not sure I come out ahead overall, but it does make development less boilerplate-y and more high-level, which leads to code that otherwise wouldn't have been written.

    • I wonder if you observe this when you use it in a domain you know well versus a domain you know less well.

      I think LLM assistants help you become functional across a broader context -- and I completely agree that testing and reviewing become much, much more important.

      E.g. a front-end dev optimizing database queries, but also being given nonsensical query parameters that don't exist.

  • That sounds plausible if the senior does lots of simple coding tasks and moves that work to an agent. Then the senior basically has to be a team lead and do code reviews/QA.

  • Curious, what do you count as senior work?

    • Roughly:

      A senior can write, test, deploy, and possibly maintain a scalable microservice or similarly sized project without significant hand-holding in a reasonable amount of time.

      A junior might be able to write a method used by a class but is still learning significant portions and concepts either in the language, workflow orchestration, or infrastructure.

      A principal knows how each microservice fits into the larger domain it serves, whether or not they understand all the services and all the domains being served.

      A staff engineer has significant principal-level understanding across many or all of the domains an organization uses, builds, and maintains.

      AI code assistants help increase breadth and, with oversight, improve depth. One can move from a "T"-shaped to a "V"-shaped skillset far more easily, but one must never fully trust AI code assistants.

I switch to VS Code from Cursor many times a day just to use its Python refactoring feature. The Pylance server that comes with Cursor doesn't support refactoring.

Not only that, I am already typing enough for coding; I don't want to type in chat windows as well, and so far the voice assistance is so-so.