Comment by crabbone

2 hours ago

> systemic tech debt is now addressable at scale with LLMs.

Is there any reason to believe this? So far I've only seen evidence to the contrary.

My experience with AI coding aids is that they, generally:

1. Don't have an opinion.

2. Are trained on code written using practices that increase technical debt.

3. Lack a bigger-picture perspective, focusing instead on the concrete, the superficial, and the immediate.

I think I need to elaborate on the first point and explain how it's relevant to the question. I'll start with an example. We have an AI reviewer, and we recently migrated a bunch of the company's repositories from Bitbucket to GitLab, which also prompted a bunch of CI changes. Some Python projects I'm involved with, but don't have much authority over, switched to complicated builds that involve pyproject.toml (often including dynamic generation of this cursed file), as well as integration with a bunch of novelty (but poor-quality) Python infrastructure tools used for building distributable artifacts.

In the projects where I do have authority, I removed most of the third-party integration. None of them use pyproject.toml, setup.cfg, or any similar configuration for a third-party build tool. The project code contains bespoke code to build the artifacts.
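
To make the contrast concrete, the bespoke approach can be as small as a single script along these lines (a hypothetical sketch; the names and layout are invented, not the actual project code):

```python
# build.py -- hypothetical bespoke build script (illustrative sketch only)
import zipfile
from pathlib import Path

def build_artifact(src="src", dist="dist"):
    """Package the project as a plain zip archive, with no pyproject.toml,
    setup.cfg, or third-party build tool involved."""
    out = Path(dist)
    out.mkdir(exist_ok=True)
    artifact = out / "app.zip"
    with zipfile.ZipFile(artifact, "w") as zf:
        for path in Path(src).rglob("*.py"):
            zf.write(path, path.relative_to(src))
    return artifact

if __name__ == "__main__":
    print(f"built {build_artifact()}")
```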

These two approaches are clearly at odds. A living, breathing person would believe one or the other to be the right approach. The AI reviewer had no problem with this situation: it made some pedantic comments about style and about fantastically impossible error cases, but completely ignored the fact that, moving forward, these two approaches are bound to collide. It appears to have an opinion about the style of quotation marks, yet it doesn't care at all about strategic decisions.

My guess as to why this happens is that such situations are genuinely rarely addressed in code review. Most productive PRs, the ones an AI could learn from, are designed around small, well-defined features in a pre-agreed-upon context. The context itself is never discussed in PRs because that's impractical (it would usually require too large a change, so developers don't even bring up the issue).

And this is where the real, glacier-sized deposits of tech debt live: in the issues developers are afraid to mention, because they understand they will never be given the authority and resources to deal with them.

You are not wrong about anything you're saying, but like I said, this misses the forest for the trees. I'm talking about roughly the next ~2 years. There is a common idea that we don't understand this technology or what will happen performance-wise. We know a lot more about what's going to happen than people think, because none of this is new. We've known about neural nets since the 40s; we know how RL works on a fundamental level, and it has been an active and beautiful field of research for at least 30-40 years; and we know what happens when you combine RL with verifiable rewards and throw a lot of compute at it.

One big misconception is that these models are trained to mimic humans and are therefore limited by the quality of the human training data. This is not true, and it's also basically the entire reason why you see so much bullishness and premature adoption of agentic coding tools.

Coding agents use human traces as a starting point. Technically you don't have to do this at all, but that's an academic point; in practice (today) you can't skip it. The early training stages with human traces (and with verified synthetic traces from your last model) get you to a point where RL is stable and efficient, and RL pushes you the rest of the way. It's synthetic data that really powers this, via rejection sampling: you generate a bunch of traces, figure out which ones pass verification, and keep those as training examples.
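
For concreteness, here is a minimal sketch of that rejection-sampling loop. Everything in it (`sample_trace`, `passes_tests`, the task strings) is a hypothetical stand-in, not any real training stack:

```python
# Minimal sketch of rejection sampling against a verifiable reward.
import random

def sample_trace(prompt: str) -> str:
    # Stand-in for a model's generate() call: one candidate solution.
    return f"candidate solution {random.randint(0, 9)} for: {prompt}"

def passes_tests(task: str, trace: str) -> bool:
    # Stand-in for the verifiable reward, e.g. running the task's unit tests.
    return trace.startswith("candidate solution 7")

def collect_training_traces(tasks, samples_per_task=16):
    """Keep only the traces that pass verification; these become the
    synthetic training examples for the next round of training."""
    kept = []
    for task in tasks:
        for _ in range(samples_per_task):
            trace = sample_trace(task)
            if passes_tests(task, trace):
                kept.append((task, trace))
    return kept

if __name__ == "__main__":
    data = collect_training_traces(["fix the failing build"], samples_per_task=32)
    print(f"kept {len(data)} verified traces")
```

The point of the sketch is that verification, not human imitation, decides what ends up in the training set.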

So because

- we know how this works on a fundamental level and have for some time

- human training data is a bootstrap; it's not a fundamental limitation

- you are absolutely right about your observations, yet look at where you are today and then look at, say, Claude Sonnet 3.x. It's an entire world away in like a year

- we have imperfect benchmarks, all with various weaknesses, yet all of them tell the same compelling story. Plus you have adoption numbers and walled-garden data as the proof in the pudding

The onus is on the people who say "this is plateauing" or "this has some fundamental limitation that we will not get past fairly quickly" to make that case.

  • > look at, say, Claude Sonnet 3.x. It's an entire world away in like a year

    In the area I work in, I find them to be of very little value, both then and now... I see no real difference. They help with marginal tasks, e.g. catching typos, or helping new programmers explore the existing codebase faster.

    So far, I haven't used a single line of AI-generated code, even though I've seen thousands. Some of it served to draw attention to a problem, but none of it solved one successfully. It was all pretty lame.

    I see no reason to believe it's going to get better. Waving hands more forcefully isn't helping; there's no argument behind the promise that "it will get better". No reason to believe it will...