← Back to context

Comment by smusamashah

5 hours ago

The code base I work on at $dayjob$ is legacy, has few files with 20k lines each and a few more with around 10k lines each. It's hard to find things and connect dots in the code base. Dont think LLMs able to navigate and understand code bases of that size yet. But have seen lots of seemingly large projects shown here lately that involve thousands of files and millions of lines of code.

I’ve found that LLMs seem to work better on LLM-generated codebases.

Commercial codebases, especially private internal ones, are often messy. It seems this is mostly due to the iterative nature of development in response to customer demands.

As a product gets larger, and addresses a wider audience, there’s an ever increasing chance of divergence from the initial assumptions and the new requirements.

We call this tech debt.

Combine this with a revolving door of developers, and you start to see Conway’s law in action, where the system resembles the organization of the developers rather than the “pure” product spec.

With this in mind, I’ve found success in using LLMs to refactor existing codebases to better match the current requirements (i.e. splitting out helpers, modularizing, renaming, etc.).

Once the legacy codebase is “LLMified”, the coding agents seem to perform more predictably.

YMMV here, as it’s hard to do large refactors without tests for correctness.

(Note: I’ve dabbled with a test first refactor approach, but haven’t gone to the lengths to suggest it works, but I believe it could)

  • Surely because LLM generated code is part of the training data for the model, so code/patterns it can work with is closer to its training data.