Comment by bartread

7 days ago

Do you have a sense for how much overhead this is all adding? Or, to put it another way, what I’m really asking is what productivity gain (or loss) are you seeing versus traditional engineering?

In our experience, it depends on the task and the language. In the case of trivial or boilerplate code, even if someone pushes 3k-4k lines of code in one day, it's manageable because you can just go through it. However, 3k lines of interconnected modules, complex interactions, and intricate logic require a lot of brainpower and time to review properly, and in most cases there are multiple bugs, unconsidered edge cases, and other issues scattered throughout the code.

  • And empirical studies on informal code review show that human reviewers have only a small impact on error rates, and even that small effect disappears once they read more than roughly 200 SLOC in an hour.

    • Interesting, do you have a link to the study? Our experience is different: at least when reviewing LLM-generated code, we find quite a few errors, especially beyond 200 LOC. It also depends on what you're reviewing; 200 LOC != 200 LOC. A boilerplate 200 LOC change? A security-sensitive 200 LOC change? A purely algorithmic and complex 200 LOC change?

Isn't the current state of things such that it's really hard to tell? I think the METR study showed that self-reported productivity boosts aren't necessarily reliable.

I have been messing with vibe engineering on a solo project and I have such a hard time telling if there's an improvement. It's this feeling of "what's faster, one lead engineer coding or one lead engineer guiding 3 energetic but naive interns"?

Very curious to hear responses about this too

  • The problem with this is that software engineering is a very unorganized and fashion/emotion-driven domain.

    We don't have reliable productivity numbers for basically... anything.

    I <feel> that I'm more productive with statically typed languages but I haven't seen large scale, reliable studies. Same with unit tests, integration tests, etc.

    And then there are all the types of software engineering: web frontend, web API, mobile frontend, command line frontend, Windows GUI, MacOS GUI, Linux backend (10 million different stacks), Windows backend (1 million different stacks), throwaway projects, WordPress webpages, etc, etc.

    • Yeah I agree.

      A controlled experiment done with a representative sample would be lovely. In the long run it comes down to the financial impact that accrues incrementally because of LLMs.

      In the short run, from what I see, firms are trying to play up the operational efficiency gains they have achieved. That signals promise to investors in the stock market, who translate it into expectations about the future, which are then reflected in the present value of equity.

      But in reality they seem to be reducing headcount because they over-hired before the hype and furore of LLMs.

    • I wanted to point you at https://neverworkintheory.org/, which attempted to bridge the gap between academia and software engineering. It turns out the site shut down, because (quoting their retrospective):

      > Twelve years after It Will Never Work in Theory launched, the real challenge in software engineering research is not what to do about ChatGPT or whatever else Silicon Valley is gushing about at the moment. Rather, it is how to get researchers to focus on problems that practitioners care about and practitioners to pay attention to what researchers discover. This was true when we started, it was true 10 years ago, and it remains true today.

      The entire retrospective [1] is well worth a read, and it unfortunately reinforces your exact point about software development being fashion/emotion-driven.

      [1] https://www.computer.org/csdl/magazine/so/2024/03/10424425/1...

    • The other problem is the perennial one: how much of what we do actually has value?

      Churning out 5x as much code (or whatever - I'm deliberately being a bit hyperbolic) sounds great on the face of it, but what does it matter if little to none of it is actually valuable?

      You correctly identify that software development is often driven by fashion and emotion, but the much, much bigger problem is that product and portfolio management is driven by fashion and emotion. How much stuff is built based on the whims of CEOs or other senior stakeholders without any real evidence to back it up?

      I suppose the big advantage of being more “productive” is that you can churn through more wrong ideas more quickly and thus perhaps improve your chances of stumbling across something that is valuable.

      But, of course, as I’ve just said: if that’s to work it’s absolutely predicated on real (and very substantial) productivity gains.

      Perhaps I'm thinking about this wrong, though: it's not about production, where standards and the need to be vigilant are naturally high; the gains should really be seen mostly in terms of prototyping and validating many solutions and ideas.
