Comment by overgard
9 hours ago
My feeling is that the code it generates is locally OK, but globally kind of bad. What I mean is, in a diff it looks fine. But when you start comparing it to the surrounding code, there's a pretty big lack of coherence, and it'll happily march down a very bad architectural path.
In fairness, this is true of many human developers too, but they're generally not doing it at a thousand miles per hour, and they at least theoretically learn and get better at working with your codebase over time. LLMs will always get worse as your codebase grows, and I just watched a video about how AGENTS.md usually results in worse outcomes, so it's not like you can just start treating MD files as memory and hope it works out.
> But when you start comparing it to the surrounding code, there's a pretty big lack of coherency and it'll happily march down a very bad architectural path.
I had an idea about this earlier this week, but haven’t had a chance to try it. Since the agent can now “see” the whole stack, or at least most of it, by having access to the repos, there’s less reason to assume it can’t take the whole stack into account when proposing a change.
The idea is that it’s like grep: you can call grep by itself, but when a match is found you only see one line per match, not any surrounding context. But that’s what the -A and -B flags are for!
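To make the analogy concrete, here’s a minimal shell sketch (the file and pattern are made up for illustration): plain grep shows only the matching line, while -B and -A pull in the surrounding lines.

```shell
# Build a tiny sample file so the example is self-contained.
printf 'alpha\nbeta\ngamma\ndelta\nepsilon\n' > /tmp/sample.txt

# Plain grep: only the matching line, no context.
grep 'gamma' /tmp/sample.txt
# prints:
# gamma

# -B 1 adds one line of context before the match, -A 1 one line after.
grep -B 1 -A 1 'gamma' /tmp/sample.txt
# prints:
# beta
# gamma
# delta
```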
So you could tell the agent that if its proposed solution lies at layer N of the system, it needs to consider at least layers N-1 (dependencies) and N+1 (consumers) to prevent the local optimum problem you mentioned.
The model should avoid writing a pretty solution in the application layer that conceals, rather than addresses, a deeper issue below, and it should keep whatever contract it has with higher-level consumers in good standing.
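As a rough sketch of what that instruction might look like in an agent’s instructions file (entirely hypothetical wording — I haven’t tested any of this):

```
## Architectural context rule
Before proposing a change at layer N of the stack:
1. Read the relevant code at layer N-1 (the modules this layer depends on).
2. Read the relevant code at layer N+1 (the modules that consume this layer).
3. If the root cause lies below layer N, propose the fix there instead of
   concealing it at the application layer.
4. Do not change the contract exposed to layer N+1 without flagging it.
```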
Anyway, I haven’t tried that yet, but hope to next week. Maybe someone else has done something similar and (in)validated it, not sure!