Comment by margorczynski

19 days ago

With how stochastic the process is, it's basically unusable for any large-scale task. What's the plan? To roll the dice until the answer pops up? That might be viable if there were a way to evaluate the output automatically with 100% accuracy, but with a human required in the loop it becomes untenable.

> What's the plan?

Call me old school, but I find the "divide and conquer" workflow to be as helpful when working with LLMs as without them. Although what counts as a "large scale task" varies by model and implementation. Some models/implementations (seemingly Copilot) struggle with even the smallest change, while others breeze through it. Lots of trial and error is needed to find that line for each model/implementation :/

  • The relevant scale is the number of hard constraints on the solution code, not the size of the task as measured by "hours it would take the median programmer to write".

    So e.g., one line of code that needs to handle dozens of hard constraints on the system (e.g., using a specific class and method, with a specific device, specific memory management, etc.) will very rarely be output correctly by an LLM (see the sketch below).

    Likewise "blank-page, vibe coding" can be very fast if "make me X" has only functional/soft-constraints on the code itself.

    "Gigawatt LLMs" have brute-forced there way to having a statistical system capable of usefully, if not universally, adhreading to one or two hard constraints. I'd imagine the dozen or so common in any existing application is well beyond a Terawatt range of training and inference cost.

    • Keep in mind that the model of using an LLM assumes the underlying dataset converges to production-ready code. That's never been proven, because we know they scraped source code without attribution.

  • It's hard for me to think of a small, clearly defined coding problem an LLM can't solve.

    • There are several in the linked post, primarily:

      "Your code does not compile" and "Your tests fail"

      If you have to tell an intern that more than once on a single task, there's going to be conversations.

  • I mean I guess this isn't very ambitious, but it's a meaningful time saver if I basically just write code in natural language, and then Copilot generates the real code based on that. I don't have to look up syntax details, or what some function somewhere was named, etc. It will perform very accurately this way. It probably makes me 20% more efficient. It doubles my efficiency in a language I'm unfamiliar with.

    I can't fire half my dev org tomorrow with that approach, I can't really fire anyone, so I guess it would be a big letdown for a lot of execs. Meanwhile though we just keep incrementally shipping more stuff faster at higher quality so I'm happy...

    This works because it treats the LLM like what it actually is: an exceptionally good if slightly random text transformer.
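
    A toy sketch of that comment-to-code workflow (purely illustrative; the function and field names are made up, not actual Copilot output): you write the intent in natural language, and the assistant fills in the syntax and library calls you would otherwise have to look up.

        from collections import defaultdict

        # "sum the order totals per customer" -- the natural-language line you write;
        # the function below is the kind of code the assistant generates from it.
        def totals_by_customer(orders):
            """orders: iterable of dicts with 'customer_id' and 'total' keys."""
            sums = defaultdict(float)
            for order in orders:
                sums[order["customer_id"]] += order["total"]
            return dict(sums)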

I suspect that the plan is that MS has spent a lot, really a LOT, of money on this nonsense, and there is now significant pressure to put something, anything, out even if it is worse than useless.

The plan is to improve AI agents from their current ~intern level to a level of a good engineer.

  • They are not intern level.

    Even if it could perform at a similar level to an intern at a programming task, it lacks a great deal of the other attributes that a human brings to the table, including how they integrate into a team of other agents (human or otherwise). I won't bother listing them, as we are all humans.

    I think the hype is missing the forest for the trees, and I think exactly this multi-agent dynamic might be where the trees start to fall down in front of us. That, and the currently insurmountable issues of context and coherence over long time horizons.

    • My impression is that Copilot acts a lot like one of my former coworkers, who struggled with:

      - Being a parent to a small child and the associated sleep deprivation.

      - His reluctance to read documentation.

      - There being a language barrier between him and the project owners. Emphasis here, as the LLM acts like someone who speaks through a particularly good translation service, but otherwise doesn't understand the language spoken.

    • The real missing-the-forest-for-the-trees is thinking that software, and the way users use computers, are going to remain static.

      Software today is written to accommodate every possible need of every possible user, and then a bunch of unneeded selling-point features on top of that. These are massive, sprawling code bases built to deliver one-size-fits-all utility.

      I don't need 3 million LOC Excel 365 to keep track of who is working on the floor on what day this week. Gemini 2.5 can write an applet that does that perfectly in 10 minutes.
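
      For a sense of scale, something like the following is the whole job (a rough sketch of what such an applet could look like, not actual Gemini output; the file name and schema are assumptions):

          import json
          from datetime import date, timedelta
          from pathlib import Path

          SCHEDULE = Path("floor_schedule.json")  # one flat file instead of a spreadsheet

          def load():
              return json.loads(SCHEDULE.read_text()) if SCHEDULE.exists() else {}

          def assign(day, names):
              """Record who is on the floor for a given ISO date."""
              data = load()
              data[day] = names
              SCHEDULE.write_text(json.dumps(data, indent=2))

          def show_week():
              """Print this week's assignments, Monday through Sunday."""
              data = load()
              monday = date.today() - timedelta(days=date.today().weekday())
              for i in range(7):
                  day = (monday + timedelta(days=i)).isoformat()
                  print(day, ", ".join(data.get(day, ["-"])))

          if __name__ == "__main__":
              assign(date.today().isoformat(), ["Alice", "Bob"])
              show_week()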


  • Seems like that is taking a very long time, on top of some very grandiose promises being delivered today.

    • I look back over the past 2-3 years and am pretty amazed at how quickly change and progress have been made. The promises are indeed large, but the speed of progress has been fast. Not defending the promise, but "taking a very long time" does not seem to be an accurate representation.


  • You are really underselling interns. They learn from a single correction, sometimes even without a correction, all by themselves. Their ability to integrate previous experience in the context of new problems is far, far above what I've ever seen in LLMs.

  • Yes, but they were supposed to be PhD level five years ago, if you listen to sama et al.

    • Especially ironic considering he's neither a developer nor a PhD. He's the smooth talking "MBA idea guy looking for a technical cofounder" type that's frequently decried on HN.

  • Without handholding (aka being used as a tool by a competent programmer instead of as an independent “agent”), they’re currently significantly worse than an intern.

  • This looks much worse than an intern. This feels like a good engineer who has brain damage.

    When you look at it from afar, it looks potentially good, but as you start looking into it for real, you realize none of it makes any sense. Then you make simple suggestions, and it does something that looks like what you asked for, yet completely misses the point.

    An intern, no matter how bad they are, could only waste so much time and energy.

    This makes wasting time and introducing mind-bogglingly stupid bugs infinitely scalable.

  • The plan went from AI being a force multiplier to a resource-hungry beast that has to be fed in the hope that it's good enough to justify its hunger.

  • I mean, I think this is a _lot_ worse than an intern. An intern isn't constantly going to make PRs with failing CI, for a start.