← Back to context

Comment by jaredklewis

15 hours ago

> what's the evidence

What’s the evidence for anything software engineers use? Tests, type checkers, syntax highlighting, IDEs, code review, pair programming, and so on.

In my experience, evidence for the efficacy of software engineering practices falls into two categories:

- the intuitions of developers, based in their experiences.

- scientific studies, which are unconvincing. Some are unconvincing because they attempt to measure the productivity of working software engineers, which is difficult; you have to rely on qualitative measures like manager evaluations or quantitative but meaningless measures like LOC or tickets closed. Others are unconvincing because they instead measure the practice against some well defined task (like a coding puzzle) that is totally unlike actual software engineering.

Evidence for this LLM pattern is the same. Some developers have an intuition it works better.

My friend, there’s tons of evidence of all that stuff you talked about in hundreds of papers on arxiv. But you dismiss it entirely in your second bullet point, so I’m not entirely sure what you expect.

  • I’ve read dozens of them and find them unconvincing for the reasons outlined. If you want a more specific critique, link a paper.

    I personally like and use tests, formal verification, and so on. But the evidence for these methods are weak.

    edit: To be clear, I am not ragging on the researchers. I think it's just kind of an inherently messy field with pretty much endless variables to control for and not a lot of good quantifiable metrics to rely on.

You can measure customer facing defects.

Also, lines of code is not completely meaningless metric. What one should measure is lines of code that is not verified by compiler. E.g., in C++ you cannot have unbalanced brackets or use incorrectly typed value, but you still may have off-by-one error.

Given all that, you can measure customer facing defect density and compare different tools, whether they are programming languages, IDEs or LLM-supported workflow.

  • > Also, lines of code is not completely meaningless metric.

    Comparing lines of code can be meaningful, mostly if you can keep a lot of other things constant, like coding style, developer experience, domain, tech stack. There are many style differences between LLM and human generated code, so that I expect 1000 lines of LLM code do a lot less than 1000 lines of human code, even in the exact same codebase.