
Comment by bryanrasmussen

2 months ago

maybe there should be an LLM trained on a corpus of deletions and cleanups of code.

I'm guessing there's a very strong prior toward "just keep generating more tokens," as opposed to deleting code, that needs to be overcome. Maybe this is done already, but since every git project comes with its own history, you could take a notable open-source project (like LLVM) and then do RL training against each individual patch committed.
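As a rough sketch of the per-patch idea, one could start from `git log --numstat` output and turn the history into one record per commit, each a candidate training episode. The `PatchRecord` name and the parsing approach below are illustrative assumptions, not any existing tool; it assumes the default `--numstat --format=%H` layout (a hash line followed by tab-separated `added<TAB>deleted<TAB>path` rows, with `-` for binary files):

```python
from dataclasses import dataclass, field

@dataclass
class PatchRecord:
    """One commit, treated as one training episode: given the parent tree,
    the model would be rewarded for reproducing the committed diff.
    (Hypothetical structure for illustration.)"""
    sha: str
    files: dict = field(default_factory=dict)  # path -> (added, deleted)

def parse_numstat(log_text):
    """Parse `git log --numstat --format=%H` style output into PatchRecords."""
    records, current = [], None
    for line in log_text.splitlines():
        if not line.strip():
            continue
        parts = line.split("\t")
        if len(parts) == 3:
            added, deleted, path = parts
            if added != "-":  # '-' marks binary files; skip them
                current.files[path] = (int(added), int(deleted))
        else:
            # a line with no tabs is the commit hash starting a new record
            current = PatchRecord(sha=line.strip())
            records.append(current)
    return records
```

From there, commits where deletions dominate could be oversampled to push back against the "keep generating" prior.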

  • Perhaps the problem is that you RL on one patch at a time, failing to capture the overarching long-term theme: an architecture change being introduced gradually over many months, one that exists in the maintainer's mental model but not really explicitly in the diffs.

  • right, it would have to be a specialized tool that you used to do analysis of the codebase every now and then, or of the parts that you thought should be cleaned up.

    Obviously there is a "just keep generating more tokens" bias in software management, since so many developer metrics over the years do various lines-of-code-style analyses on things.

    But just as experience and management practice have over time come to say this is a bad bias for ranking devs, it should be clear it is a bad bias for LLMs to have.

I think this is in the training data, since they use commit data from repos, but I imagine code deletions are rarer than they should be in the real data as well.
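Whether deletions really are underrepresented is easy to check on any given repo. A minimal sketch, again assuming tab-separated `git log --numstat` rows: if the deletion share comes out well below 0.5 across a corpus, that would support the hunch.

```python
def deletion_share(numstat_lines):
    """Fraction of changed lines that are deletions, given rows of the
    form "added<TAB>deleted<TAB>path" from `git log --numstat`."""
    added = deleted = 0
    for line in numstat_lines:
        a, d, _ = line.split("\t")
        if a == "-":  # binary file entries carry no line counts
            continue
        added += int(a)
        deleted += int(d)
    total = added + deleted
    return deleted / total if total else 0.0
```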

  • deletion and code cleanup are perhaps more an expression of seniority and personal preference. Maybe there should be the same kind of style transfer with code that you see with graphical generative AI: "rewrite this code path in the style of Donald Knuth"

    • I imagine there would be value in not just throwing all GitHub commits in as training data, but also rating their quality.
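A quality rating could start as a crude heuristic weight per commit. The signals and weights below are purely illustrative assumptions; a real filter would want review metadata, CI status, and human labels.

```python
def commit_quality(msg, touches_tests, from_reviewed_pr):
    """Toy heuristic (illustrative only) for weighting a commit as
    training data. Weights are arbitrary placeholders."""
    score = 0.0
    if from_reviewed_pr:
        score += 0.5  # survived code review
    if touches_tests:
        score += 0.3  # behavior change came with tests
    if len(msg.split()) >= 5 and not msg.lower().startswith("wip"):
        score += 0.2  # descriptive, non-WIP commit message
    return score
```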