Comment by yurimo

16 hours ago

What I still can't understand is why is massive amount of code generated is a flex? I don't feel that software has gotten a lot better in past 3 years, only sloppier. It's surprising to me that people who know about reward hacking choose a simple objective like lines of code generated as a signal for quality. I'd argue you have to optimize for less lines generated as possible while secondary optimization should be readability for humans. I suspect it's not seen as a problem by providers because more lines generated means more tokens used and hence more billing put out on customers.

And if I am working on an existing codebase then isn't a good commit often a negative sum between added and removed lines? I don't want to bloat my codebase but make it more polished and elegant. After reading that I wonder if what they have done could have been accomplished for a far fewer LoC budget.

Lines of code has always been a terrible metric. But all else being equal it is a measure. If all else is not equal, which is usually the case, then it's not.

A lot of the focus has been on AI recently.

Three years ago we didn't have software where a non-software engineer can describe what they want in English and get working (-ish) software generated by other software? Is that not "software has gotten a lot better"?

Other than that I'm not sure how we measure "software has gotten better". New applications? More features? How do we measure sloppier? Is Google Maps suddenly taking you the wrong way more often? I'm not really doubting your subjective experience but seriously how do tell? I mean a doc is a doc and a spreadsheet is a spreadsheet.

We're also only about 10 months into models that are powerful enough to potentially make a bigger difference and we are still figuring out how to use them best.

  • I personally don't view coding agents making software as "software gotten better" you are comparing a tool and the end result, these are two different things. Agent you use going down and your product going down mean two different things to you customers. I will not deny that we made incredible progress in coding and hell, even design over the past 3.5 years, this technology is here to stay.

    That being said while I agree that measuring better quality of software is vague (part of the reason it is hard for models as well), there are universal things I believe every engineer will agree on. Reliability, uptime, customer feedback, legibility of your engineering, performance, these are things we often optimized for. Google Maps is a bit of a strawman because neither of us (unless you work on it), knows how much agent code there is, I think it is likely that it's little since it was working fine prior to 2023. I could bring up github reliability as an example, given how much copilot usage they promote at MS, but once again only folks there know for certain. I do, however, see scores of various AI powered SAAS that looks like it is in a perpetual MVP state. I think you are right in that even if agents give us "good enough" results and we can swallow failure rates and our increasingly lesser understanding of what we, or more so model, created, then it is still progress overall, but this is progress not to human-AI collaboration but to AI-only engineering IMO, this is good or bad depending on how you view the future.

    I'm a scientist and most of code I currently write is somewhere on the intersection of critical software and machine learning, squaring these two is not easy and I guess the way I was taught to reason about engineering informs my opinions on this. Maybe it's just a matter of time before codex can help here in an unconstrained manner as well, but I am skeptical at the moment.

  • > Lines of code has always been a terrible metric. But all else being equal it is a measure.

    A terrible metric is _worse_ than no metric. A terrible metric can _only_ lead you in the wrong direction. "No metric" means saying we don't know, and that leads us to stop and reconsider. But we've taken "move fast and break things" as a mantra, and we'd rather run towards any direction than stay still.

    Using LoC as a metric for quality of LLMs will promote LLMs that write more code. It's better to say we have no way to compare different LLMs than it is to say "let's use the LLMs that produced more LoC because at least we can measure that". We, as an industry, should be focusing on developing better metrics for quality, not on improving LLMs based on known-bad metrics. We should be turning to the computer scientists, not to the venture capitalists.

    When a pundit talks about how many lines of code an LLM has created, we should lose all respect for them. It's as if someone talking about physics measured the phlogiston, or as if a doctor started measuring our skulls. We know these theories don't work, and anyone using them should be mocked.

  • I also can't help but notice they didn't mention how many tokens were burned, or how much that translates to in terms of cost over the 5 months at enterprise AI prices. I'm going to guess this wasn't a cheap demo.

  • > Is Google Maps suddenly taking you the wrong way more often?

    Funny you mention that because I had that issue in a cab just yesterday. Google decided to drive us of the main road to a series of small roads which happened to be a dead end. My guess is that the AI decided that this is a shorter road? less busier road?

    That being said, Google maps have been gradually degrading. Most notably, its search function is quasi-broken now.

Well, to be fair, the amount of goalpost shifting that is going on is quite intense. AI not being able to work in a "serious" project, and being limited to "toy projects" has been a long standing critique.

But also, bigger projects need some amount of loc written and it's a bit silly to pretend that this is not the case or a bad thing.

So the answer to the question is roughly: Establishing that an agent can work in a large-ish code base is valuable, because 1) them not being able to do so has been a critique and 2) it's something that is required for a lot of software projects.

  • Should we not be counting function points rather than LOC’s.

    Lines of Code is a meaningless measure. It should also be easy to count function points using AI.

    • I'd argue LoC isn't actually a meaningless measure, but people use it the wrong way. The same program with the same features but less LoC is more likely to have a proper design and architecture, and is most likely easier to change and maintain in the future. Of course, only if it's less LoC because of proper design, not because you've folded everything to one line.

      So if anything, we should find a way to aim for as little lines of code as possible. If you have two agents, and one can build exactly the same program as another, but with half the LoC, then most likely the first agent is better at software engineering and particularly software design.

      Of course, as the author of an experiment that investigated exactly this, I'm slightly biased. Cursor's browser had millions lines of code which sounded weird to me based on the features and functionality it had. Meanwhile, I built the same thing but actually thinking about the design with the agent and ended up with ~20K lines of code instead.

    • Sure; But that's not the point that is argued about here.

      (To state it in AI lingo:)

      It's not about the best measure for "amount of code".

      It's about wether "amount of code" is a good metric to begin with.

  • I don’t think it’s solvable. And I think Anthropic etc know it. LLMs can only reconstitute things in its training data and they are so hungry they can’t do a good job in long lived codebase full of complexity and novelty. There’s never going to be enough similar code on the open internet.

    • > LLMs can only reconstitute things in its training data

      Such as a 4D raytracing engine in Metal? Or integrating APIs for features first released months after their knowledge cut-off date?

      LLMs have shown an ability to transfer "knowledge" and capabilities across domains, languages, and use-cases outside their training data.

      Case in point: GPT-2 "learning" to translate English to French and vice versa despite non-English examples having been voluntarily (and almost entirely) removed from the dataset.

      2 replies →

The "lines of code" at this point are basically the same thing as binary code that comes out of a compiler - something you almost never look at and certainly won't try to touch by hand.

The actual "code" is everything driving the harness.

The current problem for this is that the harness is not (yet) deterministic, so it's sort of like having a compiler where your output program works slightly differently every build, and then the compiler tries to just patch the binary programs when you recompile to minimise this problem, or even worse, disassembles the whole thing to figure out what it does, makes the chance, and then recompiles it.

  • I think the telling part is in this line:

    > Because the repository is entirely agent-generated, it’s optimized first for Codex’s legibility

    I asked a question from a perspective of a human engineer, as in, I will have to read the code and understand, fix it once it breaks. OpenAI approach is opposite, even if it is breaking it is the agent that will be doing the fixing, millions of lines and inelegant designs don't matter because human readability doesn't matter. In any case you use more tokens so you fork over more money.

    I will say, however, that IMHO there is objectively bad and good code in terms what it can do and performance, if I can do the same thing in 50 lines as opposed to 1000 lines, this difference still matters for the model. Smaller context usage, better approach that informs downstream generation.

>I suspect it's not seen as a problem by providers because more lines generated means more tokens used and hence more billing put out on customers.

I have also grown skeptical of token usage in order to run up my bill! But since I feel like it takes me MORE effort to write LESS lines of code myself, I'd expect a quick and dirty AI-generated solution to be MORE lines of code and cost LESS to generate than a concise/elegant solution in LESS lines of code.

  • The latest frontier models will write code better than you and more elegant, with less lines of code, in 100th of the time, with full test coverage. Hand coding is like writing out assembly/machine code rather than using a compiler.

    • This hasn’t been my experience. State of the art models available to the public still do all sorts of bandaids and bad hacks. Putting code where it doesn’t belong. Stapling types onto variable (in TypeScript) when abstractions/types already exist to use. I use it to generate code, but still have to review every line and have corrections/steering basically every time.

      Maybe you have access to some other model?

    • > The latest frontier models will write code better than you and more elegant

      They often do, but they often don’t. I regularly have to push for more elegant, or less lazy solutions.

    • This.

      Insisting on writing code by hand when LLMs are available is not software engineering in 2026. Engineers find the most cost-effective solution for the problem at hand that meets the requirements.

Far from everybody is using it for massive code generation - code size is still very much a liability. I currently use it (ds4-flash, local, code isn't open to non-local) to modify and trim a certain internal project for a different use case. Tons of review passes, fuzz test generation and running, "give me your thoughts on this", etc, etc - basically iterating over and over and over. Funnily enough, my intrinsic knowledge of the codebase improves massively like this. It is involved, but the acceleration is real - and I don't have to worry about wrist nerves as much.

The only people I know that have LoC/token use/etc metrics imposed on them work for big corps where such things are (or used to be) en vogue.

I think it's a rebuttal against claims that LLMs are incompatible with large code bases. It's not so much a flex about the quality of the code, it's more a flex about the complexity of the code and the LLMs ability to deal with such complexity.

Whether or not that complexity is warranted is a different story.

The codebase may be bloated by a factor of 10 but if the costs associated with that are less than the costs of developing the software from a business standpoint the choice is clear.

Yeah so all of personal computing—text editing, SVG antialiasing, etc, fits in 20,000 LOC (VPRI's STEPS project) so a million lines of code is 50 reïnventions of personal computing. BUT: it is unlikely that humans would have solved this problem in 20 kLOC. Sussman said “we really don't know how to compute!” as his talk title and LLMs had to ossify some pre-existing voice as the forever programming habitus and it chose a persona that doesn't know how to program—because we don't —and now we are stuck with it. Claude is our tickets, our implementations, our documentation... And if you tell it “hey the node role should not have those permissions, that should be a service account” it will happily do the right thing, but it has no intrinsic sense of taste and the error message it's trying to clear just says “the node role doesn't have that permission and the system prompt says “keep it short, stupid ” and graybeards might be our last bulwark.

I've heard it said that measuring productivity of a software developer by lines of code added, is akin to measuring the productivity of an aerospace engineer by mass added.

It is a metric. It is often not a good metric. But it is easy to measure.

  • Indeed. The more you do to add complexity or generate without ensuring it isn't pointless code adds bloat. It is still probably more MVP than junk developers will do, but not ever anywhere near as great as someone who studies a programming language.

> It's surprising to me that people who know about reward hacking choose a simple objective like lines of code generated as a signal for quality.

The simple answer is that promoting locs as a relevant metric is also reward hacking. Is it easier to promote big loc counts as a key metric, or is it easier to prove agentic engineering against harder metrics?

On a more general note, software practice marketers have been pushing in that direction for quite a while. "You need cloud", "Here's how to do agile at scale", "microservice everything", etc.

> I suspect it's not seen as a problem by providers because more lines generated means more tokens used and hence more billing put out on customers.

To generate elegant code with more restrictions, it means more thinking tokens and more stronger adherence to instructions. So tha naive view that they are doing it for billing is wrong.

It's not massive amount of code for the sake of it, it's that it built a complex large piece of software itself. If you compared the linux kernel LOC to some guys toy kernel on GitHub the LOC difference should give you an idea on the complexity. "Yeah why is Tolstoy flexing about his number of pages in his book, I wrote Spot goes to school and its only 20 pages"

People want to do X, so the metric is how much X can be done.

Everyone is over-complicating the explanation. The answer for "why are we fixating on this bad metric" is almost always the same pattern.

Broad audiences need simple metrics to talk about. If the metric itself requires nuance, it's hard to communicate and hard to reason about. It's easier to push the need for nuance from understanding the metric itself down the road to where the metric is applied, which allows everyone to ignore it in immediate conversation.

It is a very valid question. My intution (no grounding) is to the model training. Optimizations traditionally have worked well in human wrote software with either experience of the developer , usage of architectural patterns or a second ir third pass of fine tuning. In case of model written code - (e/p one token at a time), only possible orchitectural optimization is either with a strict guardrail on patterns to use for a specific implementation OR by giving a second or third optmization path. All of which burns more tokens, but can lead to better software.

I don’t think the flex here is the amount of code alone. Their goal is to show that AI can improve productivity, the number of lines is just the proxy to that. This article is a marketing piece after all.

Now someone can argue that lines of code are not a good proxy of engineering productivity, but I wouldn’t be surprised if the audience they target with this content is not the HN commenters of this thread.

  • Correct on the first part, partially correct on the second. LOC is a bad metric, but it is at least a legible one. Lots of people working on better ways to measure Software Productivity!

Don't compare human loc with machine loc.

Compare machine to machine (as these headlines come) and discount that by a factor.

  • You can't really do that here because one of the key arguments for this, as people in the thread focus on is "1/10th of time" estimate, the comparison with humans is here already, albeit it is just an estimate and no actual comparison has been done.

    This is a problem of conflicting incentives that exists today in my opinion. Companies will market greater human-AI collaboration in science and engineering but focus on releasing things like this where it is clear that downstream goal is complete agent ownership over the product, from inception to testing to monitoring. Maybe the speculative future agents will use their own very efficient language to code that won't be readable for people at all. They focus on agent code being readable by agent in the article, as you've said. But in my mind in at least near future, there is a case where your prod will break, you won't be able to understand it or the attempted fixes. Maybe agent will fail to fix it at all and start a massive rewrite. In any case is this different from kicking technical debt down the road along with worse interpretability of what you have built?

    I do think there is a way where agent can write great solid code that we can read, but with the way LLMs are built this requires something new in terms of reward that accounts for "taste" and constant refinement so it might take more than 1/10th of a time to produce something good.

Because sloppy code and doesn't matter. He wrote he completed it in 1/10th time. That's equivalent to 1/10th the engineers.

That is a business win. That is really all that matters in capitalism.

The flex is a direct insult to your face. He is shitting on the faces of all software engineers (me included). It is equivalent to saying we don't need you to code anymore. One man can produce 10x the code.

So why am i voting him up even though he's shitting on my face? Because what he says is true. I value honesty and people who say things like it is. Yes my identity as a software engineer is getting dismantled before my very eyes. But the solution to this problem isn't some delusional statement about not understanding what he's flexing about. We're not stupid. Everyone on this thread understands his flex. The difference is some people like you don't want to understand it.

Like seriously. He literally wrote it was completed in 1/10th of the time and you expect me to believe that YOU don't know what HE is flexing about? Be real. You're not stupid.

  • The real flex is delivering it in one-tenth of the time. Mentioning lines of code is mostly noise.

    I’ve worked with 20-year-old codebases and products that grew organically over decades and still sit well below a million lines of code. Using LOC as some kind of health or success metric makes me more suspicious than impressed.

It's a huge flex if the alternative is no code at all. Reward hacking aside, LOC resonates with me in the sense that I've seen 10+ projects to fruition that wouldn't have even begun without an agentic harness and an LLM.

It's like the difference between doing stock price predictions with binary "up" or "down" histories and trying to figure out how to normalize actual price histories (basically impossible). The binary work gives a well-defined signal.