Comment by devin

4 hours ago

> If you can go from producing 200 lines of code a day to 2,000 lines of code a day, what else breaks? The entire software development lifecycle was, it turns out, designed around the idea that it takes a day to produce a few hundred lines of code. And now it doesn’t.

It is so embarrassing that LOC is being used as a metric for engineering output.

LOC is useful here not because it's a metric for output but because it's a metric for _understandability_. Reviewing 200 lines is a very different workload than reviewing 2000.

  • That's assuming the 200 lines are logical and consistent. Many of my most frustrating LLM bugs are caused by things that look right and are even supported by lengthy comments explaining their (incorrect) reasoning.

  • It’s still a bad metric.

    I have worked with code where 1000s of lines are very straightforward and linear.

    I’ve worked on code where 100 lines is crucial and very domain specific. It can be exceptionally clean and well-commented and it still takes days to unpack.

    The skills and effort required to review and understand those situations are quite different.

    One is like distance driving a boring highway in the Midwest: don’t get drowsy, avoid veering into the indistinguishable corn fields, and you’ll get there. The other is like navigating a narrow mountain road in a thunderstorm: you’re 100% engaged and you might still tumble or get hit by lightning.

    • The number of bugs tends to scale linearly with the lines of code written, meaning fewer lines of code for the same functionality will have fewer bugs.

      So I’m pretty skeptical that reviewing 2000 lines of code won’t take any more time than reviewing 200 lines of code.

      Furthermore, how do you know the AI-generated lines are the open-highway lines of code and not the mountain-road ones? There might be hallucinations that pattern-match as perfectly reasonable but have a hard-to-spot flaw.

  • It's still possible to run any LLM in a loop and optimize for LoC while preserving the wanted outcome.
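A minimal sketch of that loop, assuming hypothetical `llm_refactor` and `tests_pass` stand-ins for a real model call and a real test runner (neither is a real API):

```python
# Hypothetical sketch: ask the model for a smaller version each round, and
# keep a candidate only if it both shrinks LOC and still passes the tests.
# `llm_refactor` and `tests_pass` are stand-ins, not real library calls.

def loc(source: str) -> int:
    """Count non-blank lines of code."""
    return sum(1 for line in source.splitlines() if line.strip())

def minimize_loc(source, llm_refactor, tests_pass, max_rounds=5):
    """Shrink `source` round by round while preserving the wanted outcome."""
    best = source
    for _ in range(max_rounds):
        candidate = llm_refactor(best)
        # Accept only behavior-preserving reductions.
        if tests_pass(candidate) and loc(candidate) < loc(best):
            best = candidate
        else:
            break  # no further safe reduction found
    return best
```

The key design point is that the test suite, not the model, is the gatekeeper: a candidate that fails tests is discarded no matter how short it is.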

I experimented with vibe coding (not looking at the code myself) and it produced around 10k LOC even after refactors etc.

I rewrote the same program using my own brain, with ChatGPT just as Google and autocomplete (my normal workflow), and produced the same thing in 1500 LOC.

The effort difference was not that significant either, tbh, although my hand-coded approach probably benefited from my having designed the vibe-coded one first, so I had already thought through what I wanted to build.

  • Sounds like a great opportunity to understand your own development process, and codify it in such detail that the agent can replicate how you work and end up with less code that does the same thing.

    My experience was the same as yours when I started using agents for development about a year ago. Every time I noticed it did something less than optimal, or just "not up to my standards", I'd hash out exactly what those things meant to me and add it to my reusable AGENTS.md; the code the agent outputs today is fairly close to what I "naturally" write.

    • or go with this, and use the agent to prototype ideas, and write it yourself once you know what you want

LoC is perfectly fine as a metric for engineering output. It is terrible as a standalone measure of engineering productivity, and the problems occur when one tries to use it as such.

It's still useful, however, because that is the only metric that is instantly intuitively understandable and comparable across a wide variety of contexts, i.e. across companies and teams and languages and applications.

As we know, within the same team working on the same product, a 1000 LoC diff could take less time to review than a one-line bug fix that took days to debug. Hence we really cannot compare PRs or product features or story points across contexts. If the industry could come up with a standard measure of developer productivity, you can bet everyone would use it, but it's basically unfeasible for this very reason.

So, when such comparisons are made (and in this case it was clearly a colloquial usage), it helps to assume the context remains the same. Like, team A working on product P at company C, using tech stack T with specific software quality processes Q, produced N1 lines of code yesterday, but today with AI they're producing N2 lines of code. Over time the delta between N1 and N2 approximates the actual impact.

(As an aside, this is also what most of the rigorous studies in AI-assisted developer productivity have done: measure PRs across the same cohorts over time with and without AI, like an A/B test.)
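The cohort comparison described above is simple arithmetic once the data is collected; a sketch with made-up numbers (all values are purely illustrative, not from any study):

```python
# Illustrative numbers only: same team, same product, same stack, so the
# before/after delta isolates the change in output, as described above.

def daily_mean(loc_per_day):
    """Average LOC per day over an observation window."""
    return sum(loc_per_day) / len(loc_per_day)

before_ai = [180, 220, 200, 190]  # N1: LOC/day for the cohort pre-AI
with_ai   = [520, 610, 480, 550]  # N2: LOC/day for the same cohort with AI

n1 = daily_mean(before_ai)
n2 = daily_mean(with_ai)
delta = n2 - n1  # over time, this delta approximates the actual impact
```

The point is not the arithmetic but the controlled design: because team, product, and process are held fixed, the delta is attributable to the tooling change rather than to context.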

He's not using LOC as a metric, he's making an observation about the impact of a change in the typical volume of LOC.

Is it? The whole point of the article is that the rate of output for writing code has surpassed the rate at which it can be reviewed by humans. LOC as an input for software review makes a lot of sense, since you literally need to read each line.

Agreed. And LOC has historically been one of the things we've collectively fought management over as a way to evaluate a "productive" developer!

  • Why?

    We should have gone the other way: generated a lot of code and demanded pay raises. Look at the LOC I cranked out! The company is now in my debt!

    If they weren't going to care enough as managers to learn, and "line go up" is all that matters to them, then making all the lines go up = winning.

    You all think there's more to this than performative barter for coin to spend on food/shelter.

    • Because not everyone is just out to earn the most money; some people also want to enjoy the workplace where they work. Personally, the quality of the codebase and infrastructure matters a lot for how much I enjoy working in it, and I'd much rather work in a codebase I enjoy and earn half than in one made by cranking out as many LOC as possible and earn double.

      Although this requires you to take pride in your profession and what you do.


Humans are also incredibly varied and different.

Do you reject all stats that treat the number of people involved (e.g. 2 million people protested X) as "embarrassing" ... because they lump incredibly varied people together and pretend they're equal?

I read somewhere that measuring software engineering output by LoC is like measuring aerospace engineering by pounds added to the plane and I thought that was an apt comparison.

Totally. I thought Simon was wiser than this; even he couldn't resist getting swept up by breathless hype. The moment you start typing "LOC as a metric", alarm bells should go off in your head.

  • This was a podcast, not a pre-scripted talk. I suggest listening to the audio version - it makes it clearer that this was thinking out loud, not a careful weighing of every word.

    • I see, fair point. Sorry for taking a dig at you. Please know that I do appreciate a lot of work that you do. I was just worried for a moment when just reading that bit.

  • LOC is very much an effective metric for general productivity for the median feature. You can't code golf most lines of code out of existence.

    We're also assuming LOC vibe coded by competent engineers who should be able to tell when something is overengineered.