Comment by lloeki

2 days ago

I tried exactly that, several times, over and over.

Except in "hello world" situations (which I guess make up a solid chunk of the corpus LLMs are trained on), these tools were consistently slower.

The last time involved several files that were subtly different in a section that essentially does the same thing, and needed to be aligned and made consistent†.

Time to - begrudgingly - do it manually: 5min

Time to come up with a one-shot shell incantation: 10min

Time to very dumbly mark the areas by hand with ===BEGIN=== and ===END=== and come up with a one-shot shell incantation (sketched below): 3min

Time to have the LLM do it: 45min††; it also required regular petting every ~20 commands, so there was zero chance of letting it run while I did something else†††.

Time to review + manually fix the LLM output, which missed two sections, left obsolete comments, and modified four files that were entirely unrelated yet explicitly declared out of scope in the prompt: 5min
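
(For the curious, the marker trick amounts to something like the sketch below. This is a purely illustrative Python equivalent of the shell one-liner, not the actual incantation; the canonical_block.txt name and passing the target files as arguments are made up for the example.)

```python
import re
import sys
from pathlib import Path

# Hypothetical "golden" copy of the section (file name invented for this sketch).
canonical = Path("canonical_block.txt").read_text().rstrip("\n")

# Keep the markers themselves, replace whatever sits between them.
pattern = re.compile(r"(===BEGIN===\n).*?(\n===END===)", re.DOTALL)

for path in map(Path, sys.argv[1:]):
    text = path.read_text()
    if "===BEGIN===" in text:
        path.write_text(pattern.sub(lambda m: m.group(1) + canonical + m.group(2), text))
```

Run it as something like `python align_sections.py file1 file2 ...` and every marked section gets swapped for the canonical block.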

Consistently, proponents have been telling me "yeah, you need to practice more; I'm getting fine results, so you're holding it wrong; we can do a session together and I'll show you how to do it". Which they do, and then it doesn't work, and they're like "well, I'll look into it and circle back", and I never hear from them again.

As for suggestions: for every good completion I accept with an "oh well, why not", 99 get rejected. The majority are complete hallucinations, absolutely unrelated to the surrounding logic; about a third are broken or introduce non-working code; and 1-5 are _actively dangerous_ in some way.

The only places where I found LLMs vaguely useful are:

- Asking questions about an unknown codebase. It still hallucinates and misdirects, or is excessively repetitive about some things (even with rules), but it can crudely draw a rough "map" and make non-obvious connections between two distant areas, which can be welcome.

- Asking for a quick code review in addition to the ones I ask of humans; 70% of such output is laughably useless (although harmless beyond the noise + energy cost), 30% duplicates the human reviews but arrives earlier, and sometimes it unearths a good point that had been overlooked.

† No, the specific section cannot+should not be factored out

†† And that's only because I interrupted it when it started modifying files that it should not have.

††† A bit of a lie, because I did the other three approaches during that time. Which is also telling: the times for the other approaches would actually be _lower_, because I was being interrupted by / had to keep tabs on what the AI agent was doing.