Comment by HarHarVeryFunny

5 hours ago

> What may save us it that agents are unreasonably good at reading exhaustively. An agent will read every PR comment, every closed issue, every commit message, every stale design doc ...

> Not just “this module exists,” but “this module is weird because the migration had to preserve old behavior,” or “this benchmark matters because a previous optimization silently changed the distribution.”

The thesis here is that an LLM will document code better than a human would (although based on human artifacts), since churning through huge quantities of text is what LLMs are good at.

A few thoughts:

1) Yes, an LLM may be able to pull comments out of commits and PR comments and put them back in the code where they belong, but I question how often a developer too lazy to put a vital comment in the code would put it in a commit message instead!

2) "The truth is in the code" has always been true, and will always remain true. If the comments differ from the code, the code defines the truth. Pulling comments from stale external documentation and putting them in the code does more harm than good.

3) Comments that can be auto-generated from the code don't add much value (lda #1; load 1 into the accumulator).

4) Comments about the purpose or motivation of the code, distinct from 3), such as the "we had to preserve backwards compatibility" example, or "this code does this non-obvious tricky thing because ...", are where the value is, but the LLM is highly unlikely to be able to discern any unwritten motivation by itself. If the human developer left a comment somewhere, then great (assuming it is still relevant).
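To make the distinction in 3) and 4) concrete, here is a toy sketch (the function names and the backstory are invented for illustration): the first comment merely restates the code and could be auto-generated; the second records motivation that cannot be recovered from the code alone.

```python
def apply_discount(price: float) -> float:
    # Multiply price by 0.9 -- restates the code; adds no value.
    return price * 0.9

def apply_legacy_discount(price: float) -> float:
    # 0.9 (not the advertised 0.85) deliberately matches the old billing
    # service: downstream reconciliation compares our invoices against its
    # output, and changing the rate breaks every pre-migration account.
    return price * 0.9
```

The two functions are byte-for-byte identical; only the second comment tells a maintainer (or an LLM) why the constant must not be "fixed".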

Most of the discussion we see about LLM coding is about how fast it can churn out thousands of LOC on a greenfield project, or how good they can be at finding bugs, but neither of these is very relevant to the main job of developers, which is maintaining and extending existing codebases. It would be lovely if most projects were greenfield, but they are not.

In any large project that has been maintained over a few years or more, there will inevitably be an ever growing accumulation of bug fixes and patches for specific issues that have been discovered in production, likely poorly documented and out of sync with any original documentation that may have existed (which anyway tends to be more idealistic and architectural in nature, not capturing these types of post-deployment detail and special cases).

The natural tendency of an LLM is to want to rewrite code to match the statistics of what it was trained on, and they need to be reined in via prompting to resist this and not touch more code than is minimally needed for what is being asked. Of course asking an LLM to do something is a bit like asking a dog to do something - sometimes it will, and sometimes it won't. I expect over the next few years we'll be experiencing, and reading about, more and more cases where LLMs have introduced bugs and regressions into mature code bases because of this - rewriting code that should have been left alone. The general rule is that if you are tempted to rewrite something, you had better first understand why it was there, coded the way it is, in the first place.
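As a hypothetical illustration of that rule (the function and the bug it guards against are invented here), this is the kind of code an over-eager rewrite would break: the string arithmetic looks clumsy next to the "obvious" one-liner, but the obvious version is subtly wrong.

```python
def cents(text: str) -> int:
    # Tempting "cleanup": return int(float(text) * 100). Don't -- binary
    # floats cannot represent 0.29 exactly, so int(float("0.29") * 100)
    # truncates to 28. The string arithmetic below stays exact.
    whole, _, frac = text.partition(".")
    frac = (frac + "00")[:2]  # pad/truncate fraction to two digits
    return int(whole) * 100 + int(frac)
```

A model (or a human) that rewrites this to the one-liner has matched the statistics of typical code while destroying the reason it was coded this way.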

I can't help but compare the current state of "AI" (LLMs) to the early days of things like computer speech recognition or language translation when they were considered amazing, and everyone was gushing about them, but at the end of the day the accuracy still wasn't good enough to make them very useful - that would take another 10-20 years.

Another historical lesson/perspective would be expert systems, which at the time were considered AI and the future of machine intelligence (the Japanese "5th generation systems" were going to take over the world; CYC promised to offer human-level intelligence), but in retrospect were far less important. It won't be until we move on from LLMs to something more brain-like, deserving to be called AGI, that LLMs will be put in their historical perspective.

At the moment DeepMind seems to be the only one of the big labs admitting/recognizing that scaling LLMs isn't going to achieve AGI and that "a few more transformer-level breakthroughs" are needed. Hassabis has however talked about LLMs (GPTs) still being a part of what they are envisaging, which one could either regard as a pragmatic stepping stone to real AGI, or perhaps that they are not being ambitious enough - building something that still needs to be spoon-fed language rather than being capable of learning it from scratch.

Even if writing new code is not the problem or the bottleneck anymore, a major blocker is the one CoreService.java file someone wrote 10 years ago that is keeping the whole system glued together. Only they know how it works, and how to keep it working. Parroting my own words from a while back: inferred intent is not the same as initial intent. This is just reverse engineering, except somewhat automated.

It also bakes the quality of the LLM at the time the documentation was generated into the documentation itself. That potentially worsens the performance of future LLMs if they ingest documentation produced by older ones. It's not clear why documentation wouldn't instead be generated on demand, using the newest SOTA LLM.