Comment by cycomanic
3 days ago
I've already written about this several times here. I think the current trend of LLMs chasing benchmark scores is going in the wrong direction, at least for programming tools. In my experience they get it wrong often enough that I always need to check the work. So I end up in a back and forth with the LLM, and because of the slow responses it becomes a really painful process; I could often have done the task faster if I had sat down and thought about it. What I want is an agent that responds immediately (and I mean in subseconds), even if some benchmark score is 60% instead of 80%.
Programmers (and I'm including myself here) often go to great lengths to not think, to the point of working (with or without a coding assistant) for hours in the hope of avoiding one hour of thinking. What's the saying? "An hour of debugging/programming can save you minutes of thinking," or something like that. In the end, we usually find that we need to do the thinking after all.
I think coding assistants would end up being more helpful if, instead of trying to do what they're asked, they would come back with questions that help us (or force us) to think. I wonder if a context prompt that says, "when I ask you to do something, assume I haven't thought the problem through, and before doing anything, ask me leading questions," would help.
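As a sketch only (the exact wording here is my guess, not a tested prompt), such a context instruction might look something like this:

```
When I ask you to implement something, assume I have not thought the
problem through. Before writing any code, ask me a few leading questions
about requirements, edge cases, and constraints I haven't mentioned,
wait for my answers, and only then propose an approach. Do not start
implementing until I explicitly say so.
```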
I think Leslie Lamport once said that the biggest resistance to using TLA+ - a language that helps you, and forces you, to think - is that thinking is the last thing programmers want to do.
> Programmers (and I'm including myself here) often go to great lengths to not think, to the point of working (with or without a coding assistant) for hours in the hope of avoiding one hour of thinking. What's the saying? "An hour of debugging/programming can save you minutes of thinking," or something like that. In the end, we usually find that we need to do the thinking after all.
This is such a great observation. I'm not quite sure why this is. I'm not a programmer, but a signal-processing/systems engineer/researcher. The weird thing is that it seems to be the process of programming that causes the "not-thinking" behaviour. E.g. when I program a simulation and find that I must have a sign error somewhere in my implementation (sometimes you can see this from the results), I end up switching every possible sign around instead of taking pen and paper and comparing theory and implementation. If I do other work, e.g. theory, that's not the case. I suspect we try to avoid the cost of the context switch and try to stay in the "programming flow".
This is your brain trying to conserve your energy/time by recollecting/brute-forcing/following known patterns, instead of diving into the unknown. Otherwise known as "being lazy" / procrastinating.
There is an illusion that the error is tiny, and its nature is obvious, so it could be fixed by an instant, effortless tweak. Sometimes that's true (when the compiler complains about a forgotten semicolon), and sometimes it may be arbitrarily deeply wrong (even if it manifests just as a reversed sign).
I do both. I like to develop designs in my head, and there’s a lot of trial and error.
I think the results are excellent, but I can hit a lot of dead ends on the way. I just spent several days trying out all sorts of approaches to PassKeys/WebAuthn. I finally settled on an approach that I think will work great.
I have found that the old-fashioned “measure twice, cut once” approach is highly destructive. It was how I was trained, so walking away from it was scary.
> I have found that the old-fashioned “measure twice, cut once” approach is highly destructive. It was how I was trained, so walking away from it was scary.
To be fair it’s great advice when you’re dealing with atoms.
Mutable patterns of electrons, not so much (:
Sometimes thinking and experimenting go together. I had to do some maintenance on some Typescript/yum that I didn't write but had done a little maintenance on before.
TypeScript can produce astonishingly complex error messages when types don't match up, so I went through a couple of rounds of showing the errors to the assistant and getting suggestions to fix them that were wrong. But I got some ideas and did more experiments, and over the course of two days (making desired changes along the way) I figured out what was going wrong and cleaned up the use of types such that I was really happy with my code. When I saw a red squiggle I usually knew right away what was wrong, and if I did ask the assistant it would also get it right right away.
I think there's no way I would have understood what was going on without experimenting.
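For what it's worth, a lot of that kind of cleanup comes down to naming intermediate types explicitly, so the compiler reports the one offending field instead of dumping a huge inferred structure. A toy TypeScript sketch (invented types, not the code from this story):

```typescript
// Toy sketch with invented types, not the code from this story.
type ApiUser = { id: number; profile: { name: string; tags: string[] } };

function loadUsers(): ApiUser[] {
  return [{ id: 1, profile: { name: "Ada", tags: ["admin"] } }];
}

// Fully inferred: if a field deep inside this shape drifts, the error message
// describes the entire inferred structure, which is hard to read.
const inferredRows = loadUsers().map(u => ({ ...u.profile, id: u.id }));

// Explicitly named: a mismatch is now reported against `Row`, pointing at the
// single offending field instead of the whole shape.
type Row = { id: number; name: string; tags: string[] };
const typedRows: Row[] = loadUsers().map(u => ({
  id: u.id,
  name: u.profile.name,
  tags: u.profile.tags,
}));

console.log(inferredRows, typedRows);
```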
Agree. Also, LLMs change the balance of plan vs. do for me; sometimes it's cheaper to do and review than to plan up front.
When you can see what goes wrong with the naive plan you then have all the specific context in front of you for making a better plan.
If something is wrong with the implementation, I can then ask the agent to make a plan which avoids the issues / smells I call out. This itself could probably be automated.
The main thing I feel I'm "missing" is, I think it would be helpful if there were easier ways to back up in the conversation such that the state of the working copy was restored also. Basically I want the agent's work to be directly integrated with git such that "turns" are commits and you can branch at any point.
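As a rough sketch of the glue I have in mind (hypothetical wrapper code, not a feature any current agent ships), committing after every turn would make "backing up" just a git operation:

```typescript
// Hypothetical wrapper, not an existing agent feature: snapshot the working
// copy after each agent "turn" so any turn can later be branched from.
import { execSync } from "node:child_process";

function git(args: string): string {
  return execSync(`git ${args}`, { encoding: "utf8" }).trim();
}

// Record the current working copy as one "turn" commit and return its hash.
export function commitTurn(turn: number, summary: string): string {
  git("add -A");
  git(`commit --allow-empty -m ${JSON.stringify(`agent turn ${turn}: ${summary}`)}`);
  return git("rev-parse HEAD");
}

// "Back up in the conversation": start a new branch from an earlier turn.
export function branchFromTurn(commitHash: string, branchName: string): void {
  git(`switch -c ${branchName} ${commitHash}`);
}
```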
I like that prompt idea. Because I hate hate hate when it just starts "doing work". Those things are much better as a sounding board for ideas and for clarifying my thinking than for writing one-shot code.
> assume I haven't thought the problem through
This is the essence of my workflow.
I dictate rambling, disorganized, convoluted thoughts about a new feature into a text file.
I tell Claude Code or Gemini CLI to read my slop, read the codebase, and write a real functional design doc in Markdown, with a section on open issues and design decisions.
I'll take a quick look at its approach, edit the doc to tweak it and answer a few open questions, then tell it to answer the remaining open questions itself and update the doc.
When that's about 90% good, I'll tell the local agent to write a technical design doc to think through data flow, logic, API endpoints and params and test cases.
I'll have it iterate on that a couple more rounds, then tell it to decompose that work into a phased dev plan where each phase is about a week of work, and each task in the phase would be a few hours of work, with phases and tasks sequenced to be testable on their own in frequent small commits.
Then I have the local agent read all of that again, the codebase, the functional design, the technical design, and the entire dev plan so it can build the first phase while keeping future phases in mind.
It's cool because the agent isn't only a good coder, it's also a decent designer and planner too. It can read and write Markdown docs just as well as code and it makes surprisingly good choices on its own.
And I have complete control to alter its direction at any point. When it methodically works through a series of small tasks it's less likely to go off the rails at all, and if it does it's easy to restore to the last commit and run it again.
1. Shame on you, that doesn't sound like fun vibe coding, at all!
2. Thank you for the detailed explanation, it makes a lot of sense. If AI is really a very junior dev that can move fast and has access to a lot of data, your approach is what I imagine works - and, crucially, it explains why there is such a difference in outcomes using it. Because what you're saying is, frankly, a lot of work. Now, based on that work you can probably double your output as a programmer, but considering the many code bases I've seen that have 0 documentation and 0 tests, I think there is a huge chunk of programmers who would never do what you're doing because "it's boring".
3. Can you share maybe an example of this, please:
> and write a real functional design doc in Markdown, with a section on open issues and design decisions.
Great comment, I've favorite'd it!
I agree with your comment in general; however, I would say that in my field the resistance to TLA+ isn't having to think, but rather having to code twice without guarantees that the code actually maps to the theoretical model.
Tools like Lean and Dafny are much more appreciated, as they generate code from the model.
But both Dafny and Lean (which are really hard to put in the same category [1]) are used even less than TLA+, and the problem of formally tying a spec to code exists only when you specify at a level that's much higher than the code, which is what you want most of the time because that's where you get the most bang for your buck. It's a little like saying that the resistance to blueprints is that a rolled blueprint makes a poor hammer.
TLA+ is for when you have a 1MLOC database written in Java or a 100KLOC GC written in C++ and you want to make sure your design doesn't lead to lost data or to memory corruption/leak (or for some easier things, too). You certainly can't do that with Dafny, and while I guess you could do it in Lean (if you're masochistic and have months to spare), it wouldn't be in a way that's verifiably tied to the code.
There is no tool that actually formally ties spec to code in any affordable way and at real software scale, and I think the reason people say they want what doesn't exist is precisely because they want to avoid the thinking that they'll have to do eventually anyway.
[1]: Lean and TLA+ are sort-of similar, but Dafny is something else altogether.
> "An hour of debugging/programming can save you minutes of thinking,"
I get what you're referring to here, when it's tunnel-vision debugging. Personally I usually find that coding/writing/editing is thinking for me. I'm manipulating the logic on screen and seeing how to make it make sense, like a math problem.
LLMs help because they immediately think through a problem and start raising questions and points of uncertainty. Once I see those questions in the <think> output, I cancel the stream, think through them, and edit my prompt to answer the questions beforehand. This often causes the LLM's responses to become much faster and shorter, since it doesn't need to agonise over those decisions any more.
it's funny, I feel like I'm the opposite and it's why I truly hate working with stuff like claude code that constantly wants to jump into implementation. I want to be in the driver's seat fully and think about how to do something thoroughly before doing it. I want the LLM to be, at most, my assistant. Taking on the task of being a rubber duck, doing some quick research for me, etc.
It's definitely possible to adapt these tools to be more useful in that sense... but it definitely feels counter to what the hype bros are trying to push out.
In general agreement about the need to think it through, though we should be careful not to praise the other extreme.
> "An hour of debugging/programming can save you minutes of thinking"
The trap so many devs fall into is assuming the code behaves the way they think it does. Or believing documentation or seemingly helpful comments. We really want to believe.
People's mental image is more often than not wrong, and debugging tremendously helps bridge the gap.
Absolutely! I used Copilot for a few weeks and then stopped when I worked on a machine that didn't have Copilot installed and immediately struggled with even basic syntax. Now I often use LLMs as advanced rubber ducks. By describing my problems, the solution often comes to mind on its own, and sometimes the responses I get are enough for me to continue on my own. In my opinion, letting LLMs code directly can be really harmful for software developers, because they forget to think for themselves. Maybe I'm wrong and I am just slow to accept the new reality, but I try to keep writing most of my code on my own and to improve my coding skills more than my prompting skills (while still using these tools, of course). For me, LLMs are like a grumpy and cynical old senior dev who is forced to talk in a very positive manner and who has fun trickling in some completely random bullshit between his actual helpful advice.
World of LLMs or not, development should always strive to be fast. In the LLM world, users should always have control over the accuracy-vs-speed trade-off (though we can try to improve both rather than trade one for the other). For example, at rtrvr.ai we use Gemini Flash as our default and benchmarked on Flash too, at 0.9 min per task in the benchmark while still yielding top results. That said, I have to accept there are certain web tasks on tail-end sites that need Pro to navigate accurately at this point. This is a limitation of our relying on Gemini models straight up; once we move to our own models trained on web trajectories, this hopefully will not be a problem.
If you use off-the-shelf LLMs, you will always be bottlenecked by their speed.
GitHub Copilot's inline completions still exist, and are nearly instant!