Comment by jumploops
8 hours ago
I’ve found that LLMs seem to work better on LLM-generated codebases.
Commercial codebases, especially private internal ones, are often messy. It seems this is mostly due to the iterative nature of development in response to customer demands.
As a product gets larger and addresses a wider audience, there’s an ever-increasing chance that the initial assumptions diverge from the new requirements.
We call this tech debt.
Combine this with a revolving door of developers, and you start to see Conway’s law in action, where the system resembles the organization of the developers rather than the “pure” product spec.
With this in mind, I’ve found success in using LLMs to refactor existing codebases to better match the current requirements (e.g. splitting out helpers, modularizing, renaming, etc.).
Once the legacy codebase is “LLMified”, the coding agents seem to perform more predictably.
YMMV here, as it’s hard to do large refactors without tests for correctness.
(Note: I’ve dabbled with a test-first refactor approach; I haven’t gone far enough with it to claim it works, but I believe it could. A rough sketch is below.)
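For what it’s worth, a minimal sketch of what that test-first guard could look like (the helper and the expected values are made up for illustration, not from a real codebase): pin the current behaviour with a characterization test, then let the LLM restructure behind it.

```python
import unittest


def calculate_invoice_total(items, tax_rate):
    """Stand-in for a hypothetical legacy helper an LLM will later split up/rename."""
    return round(sum(items) * (1 + tax_rate), 2)


class CharacterizationTest(unittest.TestCase):
    """Pin today's behaviour before any LLM-driven refactor touches it."""

    def test_known_inputs_keep_known_outputs(self):
        # Expected values are captured from the code as it behaves now, not from a spec;
        # if the refactor changes any of them, it changed behaviour, not just structure.
        cases = [
            (([], 0.2), 0.0),
            (([10.0, 5.5], 0.2), 18.6),
            (([100.0], 0.0), 100.0),
        ]
        for args, expected in cases:
            self.assertAlmostEqual(calculate_invoice_total(*args), expected)


if __name__ == "__main__":
    unittest.main()
```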
Are LLM codebases not messy?
Claude by default, unless I tell it not to, will write stuff like:
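(A made-up but representative example; the function and field names are placeholders, not code from my actual codebase.)

```python
def is_user_eligible(user):
    # Check if the user is active
    if user.is_active:
        # Check if the user has verified their email
        if user.email_verified:
            # The user is active and verified, so they are eligible
            return True
        else:
            # The user has not verified their email, so they are not eligible
            return False
    else:
        # The user is not active, so they are not eligible
        return False
```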
instead of the very simple boolean logic that could express this in one line, with the "this code does what it obviously does" comments added all over the place.
Generally, unless you tell it not to, it does things in very verbose ways that most humans never would, and since there are infinitely many ways for it to invent absurd verbosity, it’s hard to preemptively prompt against all of them.
To be clear, I am getting a huge amount of value out of it for executing a bunch of large refactors and "modernization" of a (really) big legacy codebase at scale and in parallel. But it’s not outputting the sort of code that I see when someone prompts it "build a new feature ...", and a big part of my prompts is screaming at it not to do certain things, or to refuse the task if it at any point becomes unsure.
Yeah, to be clear, it will have the same issues as a fly-by contributor if you prompt it that way.
Meaning, if you ask it to “handle this new condition”, it will happily throw in a hacky conditional and get the job done.
I’ve found the most success in having it reason explicitly about the current architecture, then propose a set of changes (2-5 options) to accomplish the task, review them, and then implement the changes that best suit the scope of the larger system.
The failure mode is missing constraints, not “coding skill”. Treat the model as a generator that must operate inside an explicit workflow: define the invariant boundaries, require a plan/diff before edits, run tests and static checks, and stop when uncertainty appears. That turns “hacky conditional” behaviour into controlled change.
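As a rough sketch of that gate (the commands are placeholders for whatever test/lint/type tooling the project already uses, not a prescription), the "accept the change only if checks pass, otherwise stop" step can be as small as:

```python
import subprocess
import sys

# Hypothetical gate, run after the model proposes a diff and before it is applied.
# Swap in your project's real test, lint, and type-check commands.
CHECKS = [
    ["pytest", "-q"],        # behaviour: unit/characterization tests
    ["ruff", "check", "."],  # static analysis / lint
    ["mypy", "src"],         # type checks
]


def gate() -> int:
    for cmd in CHECKS:
        result = subprocess.run(cmd)
        if result.returncode != 0:
            # Stop here: reject the proposed change and feed the failure back
            # to the model (or a human) instead of merging a hacky conditional.
            print(f"check failed: {' '.join(cmd)}", file=sys.stderr)
            return result.returncode
    print("all checks passed; change can be applied")
    return 0


if __name__ == "__main__":
    sys.exit(gate())
```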
Surely it’s because LLM-generated code is part of the model’s training data, so the code and patterns it’s asked to work with sit closer to what it was trained on.