Comment by systemf_omega

3 days ago

> B2B SaaS

Perhaps that's part of it.

People here work in all kinds of industries. Some of us are implementing JIT compilers, mission-critical embedded systems, or distributed databases. In codebases like these you can’t just wing it without breaking a million things, so LLM agents tend to perform really poorly.

> People here work in all kinds of industries.

Yes, it would be nice to have a lot more context (pun intended) when people post how many LoC they introduced.

B2B SaaS? Then can I assume that a browser is involved and that a big part of that 200k LoC is the verbose styling DSL we all use? On the other hand, Nginx, a production-grade web server, is 250k LoC (251,232 to be exact [1]). These two things are not comparable.

The point being that, as I'm sure we all agree, LoC is not a helpful metric for comparison without more context, and different projects have vastly different amounts of information/feature density per LoC.

[1] https://openhub.net/p/nginx

  • I primarily work in C# during the day but have been messing around with simple Android TV dev on occasion at night.

    I’ve been blown away sometimes by what Copilot puts out in the context of C#, but using ChatGPT (paid) to get me started on an Android app - totally different experience.

    Stuff like giving me code that’s using a mix of different APIs and sometimes just totally non-existent methods.

    With Copilot I find it’s sometimes brilliant, but it seems completely random as to when that will be.

    • > Stuff like giving me code that’s using a mix of different APIs and sometimes just totally non-existent methods.

      That has been my experience as well. You can rein in the surprising API choices with basic prompt files that spell out what to use in your project and how. However, with less-than-popular tools whose source code is not available, the hallucinations are unbearable and a complete waste of time.
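
      For example, here is a minimal sketch of the kind of prompt file I mean, assuming GitHub Copilot's .github/copilot-instructions.md convention (other tools have their own equivalents; the ApiClient wrapper named below is made up):

        # Copilot instructions for this repo
        - Route all HTTP calls through our ApiClient wrapper; never use a raw HttpClient.
        - Use System.Text.Json for serialization; do not suggest Newtonsoft.Json.
        - Target .NET 8; avoid APIs that only exist in preview releases.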

      The lesson to be learned is that LLMs depend heavily on their training set; to a first approximation, they at best interpolate between the data they were fed. If an LLM is not trained on a corpus covering a specific domain, then you can’t expect usable results from it.

      This has some unintended consequences. Companies like Microsoft can create an incentive to use their tech stack by training their LLMs on a very thorough and complete corpus covering their own technologies. If Copilot works miracles with .NET while its Java output is unusable, developers have one more reason to adopt .NET to lower the cost of delivering and maintaining software.

  • > when people post how many LoC they introduced.

    Pretty ironic you and the GP talk about lines of code.

    From the article:

      Garman is also not keen on another idea about AI – measuring its value by what percentage of code it contributes at an organization.
    
      “It’s a silly metric,” he said, because while organizations can use AI to write “infinitely more lines of code” it could be bad code.
    
      “Often times fewer lines of code is way better than more lines of code,” he observed. “So I'm never really sure why that's the exciting metric that people like to brag about.”
    

    I'm with Garman here. There's no clean metric for how productive someone is when writing code. At best, this metric is naive, but usually it is just idiotic.

    Bureaucrats love LoC, commits, and/or Jira tickets because they are easy to measure, but here’s the truth: to judge the quality of code, you have to be capable of producing code of (approximately) that quality or better.

    Data isn’t just "data" that you can treat as a black box and throw into algorithms. Data requires interpretation, and there is no "one size fits all" solution. Data is nothing without its context; it is always biased, and if you avoid nuance you’ll quickly convince yourself of falsehoods. Even with expertise it is easy to convince yourself of falsehoods. Without expertise it is hopeless.

    Just go look at Reddit or any corner of the internet where armchair experts confidently talk about things they know nothing about. It is always void of nuance and vastly oversimplified. But humans love simplicity. We need to recognize our own biases.

    • > Pretty ironic you and the GP talk about lines of code.

      I was responding specifically to the comment I replied to, not the article, and mentioning LoC as a specific example of things that don't make sense to compare.

On the other hand, fault-intolerant codebases are also often tightly specified and almost always have rigorous automated tests already, two contexts in which coding agents specifically excel.

I work on brain-dead CRUD apps much of the time and get nothing from LLMs.

  • Try Claude Code. You’ll literally be able to automate 90% of the coding part of your job.

      We really need to attach some kind of risk to making claims like this, just to keep things interesting. I have followed the kind of advice you’re giving here on more occasions than I can remember, at least once for every major revision of every major LLM, and always walked away frustrated because it hindered me more than it helped.

      > This is actually amazing now, just use [insert ChatGPT, GPT-4, 4.5, 5, o1, o3, Deepseek, Claude 3.5, 3.9, Gemini 1, 1.5, 2, ...] it's completely different from Model(n-1) you've tried.

      I'm not some mythical 140 IQ 10x developer and my work isn't exceptional, so this shouldn't happen.

    • I've been working on macOS and Windows drivers. Can't help but disagree.

      Because of the absolute dearth of high-quality open-source driver code and the huge proliferation of absolutely bottom-barrel general-purpose C and C++, the result is... Not good.

      On the other hand, I asked Claude to convert an existing, short-ish Bash script to idiomatic PowerShell with proper cmdlet-style argument parsing, and it returned a decent result that I barely had to modify or iterate on. I was quite impressed.
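
      For reference, "cmdlet-style argument parsing" means roughly this shape. This is my own minimal sketch, not Claude's actual output, and the parameter names are illustrative:

        [CmdletBinding()]
        param(
            # Mandatory path, replacing a bare $1 positional in the Bash original
            [Parameter(Mandatory = $true)]
            [string]$InputPath,

            # Optional switch, replacing an ad-hoc "--force" string check
            [switch]$Force
        )

        if ($Force) { Write-Verbose "Overwriting existing output" }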

      Garbage in, garbage out. I’m not altogether dismissive of AI and LLMs, but you really do need to know where their limits are.

  • I found the opposite - I get a 50% improvement in productivity for day-to-day coding (a mix of backend and frontend), mostly in JavaScript, though it has helped in other languages too. But you have to review carefully, and have extremely well-written test cases if you’re going to blindly generate or replace existing code.

> In codebases like these you can’t just wing it without breaking a million things, so LLM agents tend to perform really poorly.

This is a false premise. LLMs themselves don't force you to introduce breaking changes into your code.

In fact, coding agents were lauded at their inception as a major improvement to the developer experience precisely because they let the LLM react automatically to feedback from test suites, speeding up implementation while preventing regressions.

If tweaking your code can break a million things, that is a problem with your code and with how you worked to make it resilient. LLMs can only introduce regressions if your automated tests fail to catch any of those million things breaking. If that is the case, then your problems are far greater than the existence of LLMs, and at best LLMs only point out the elephant in the room.
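
To make this concrete: the guardrail is nothing exotic, just ordinary tests that pin down current behavior before an agent touches the code. Here is a minimal sketch in Pester (PowerShell's test framework), where Get-InvoiceTotal and its 10% bulk-discount rule are made up for illustration:

  Describe 'Get-InvoiceTotal' {
      It 'still applies the 10% bulk discount at 100 units' {
          # 100 * 2.50 = 250, minus the 10% bulk discount = 225
          Get-InvoiceTotal -Quantity 100 -UnitPrice 2.50 | Should -Be 225.00
      }
  }

If an agent's edit silently drops that discount, the suite fails and the failure goes straight back to the agent as feedback, instead of the regression landing in production.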