Comment by maccard

1 day ago

> Generative AI, as we know it, has only existed ~5-6 years, and it has improved substantially, and is likely to keep improving.

Every 2/3 months we're hearing there's a new model that just blows the last one out of the water for coding. Meanwhile, here I am with Opus and Sonnet for $20/mo, and they're regularly failing at basic tasks, with Antigravity getting stuck in loops and burning credits. We're talking "copy basic examples and don't hallucinate APIs" here, not deep complicated system design topics.

It can one-shot a web frontend, just like v0 could in 2023. But that's still about all I've seen it work on.

You’re doing exactly the thing that the parent commenter pointed out: Complaining that they’re not perfect yet as if that’s damning evidence of failure.

We all know LLMs hallucinate. We know they get things wrong. We know they get stuck in loops.

There are two types of people: The first group learns to work within these limits, using the tools where they’re helpful and writing the code themselves where they’re not.

The second group gets frustrated every time it doesn’t one-shot their prompt and declares it all a big farce. Meanwhile the rest of us are out here having fun with these tools, however limited they are.

  • Someone else said this perfectly farther down:

    > The whole discourse around LLMs is so utterly exhausting. If I say I don't like them for almost any reason, I'm a luddite. If I complain about their shortcomings, I'm just using it wrong. If I try and use it the "right" way and it still gets extremely basic things wrong, then my expectations are too high.

    As I’ve said, I use LLMs, and I use tools that are assisted by LLMs. They help. But they don’t work anywhere near as reliably as people talk about them working. And that hasn’t changed in the 18 months since I first prompted v0 to make me a website.

Sure, but think about what it's replacing.

If you hire a human, it will cost you thousands a week. Humans will also fail at basic tasks, get stuck in useless loops, and you still have to pay them for all that time.

For that matter, even if I'm not hiring anyone, I will still get stuck on projects and burn through the finite number of hours I have on this planet trying to figure stuff out and being wrong for a lot of it.

It's not perfect yet, but these coding models have, in my mind, gotten pretty good if you're specific about the requirements, and even though they misfire fairly often, they can still be useful.

I've made this analogy before, but to me they're like really eager-to-please interns; not necessarily perfect, and there's even a fairly high risk you'll have to redo a lot of their work, but they can still be useful.

  • I am an AI skeptic, but I would agree this looks impressive from certain angles, especially if you're an early startup (maybe) or you are very high up the chain and just want to focus on cutting costs. On the other hand, if you are about to be unemployed, this is less impressive. Can it replace a human? I would say no, it still has a long way to go, but a good salesman can convince executives that it does, and that's all that matters.

    • > On the other hand, if you are about to be unemployed, this is less impressive

      > salesman can convince executives that it does

      I tend to think that reality will temper this trend as the results develop. Replacing 10 engineers with one engineer using Cursor will result in a vast velocity hit. Replacing 5 engineers with 5 "agents" assigned to autonomously implement features will result in a mess eventually. (With current technology -- I have no idea what even 2027 AI will do). At that point those unemployed engineers will find their phones ringing off the hook to come and clean up the mess.

      Not unlike what happens when companies fire teams and offshore the whole thing to a team of average developers 180 degrees of longitude away who have no domain knowledge of the business and no connections to the stakeholders. The pendulum swings back in the other direction.

  • You’ve missed my point here - I agree that gen AI has changed everything and is useful, _but_ I disagree that it’s improved substantially - which is what the comment I replied to claimed.

    Anecdotally, I’ve seen no difference from model changes in the last year, but going from a bare LLM to Claude Code (where we told the LLMs they can use tools on our machines) was a game changer. The improvement there was the agent loop and the support for tools.
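
    To make "the agent loop and the support for tools" concrete, here is a toy sketch of the idea; model_step and run_tool are hypothetical stand-ins for illustration, not any real API. The host asks the model what to do, runs whatever tool call it proposes, feeds the output back, and repeats until the model stops asking:

        /* Toy agent loop: the model proposes tool calls, the host executes them
           and feeds the results back, until the model signals it is done.
           model_step() and run_tool() are hypothetical stubs, not a real API. */
        #include <stdio.h>

        /* hypothetical model: looks at the last tool result and returns the
           next tool call, or NULL when it is ready to answer */
        static const char *model_step(const char *last_result, int turn) {
            (void)last_result;               /* a real model would read this */
            if (turn == 0) return "ls";
            if (turn == 1) return "git status";
            return NULL;
        }

        /* hypothetical host-side tool runner: execute the command, capture output */
        static const char *run_tool(const char *cmd) {
            printf("[host] running tool: %s\n", cmd);
            return "(tool output)";
        }

        int main(void) {
            const char *result = "";
            for (int turn = 0; ; turn++) {
                const char *call = model_step(result, turn); /* model decides next step   */
                if (call == NULL) break;                     /* no more tool calls        */
                result = run_tool(call);                     /* execute, feed output back */
            }
            printf("[model] final answer, informed by the tool output\n");
            return 0;
        }

    The loop itself is trivial; the leverage comes from letting the model see real tool output instead of guessing.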

    In 2023 I asked v0.dev to one-shot me a website for a business I was working on, and it did it in about 3 minutes. I feel like we’re still stuck there with the models.

    • My experience with 2024 AI tools like Copilot was that if the code compiled first time it was an above-average result, and I’d still need a lot of manual tweaking.

      There were definitely languages where it worked better (JS), but when I told people here that I had to spend a lot of time tweaking afterwards, at least half of them assumed I was being really anal about spacing or variable names, which was simply not the case.

      That’s still the case for cheaper models (GPT-mini remains a waste of my time), but there are mid-level models like Minimax M2 that can produce working code, and stuff like Sonnet can produce usable code.

      I’m not sure the delta is enough for me that I’d pay for these tools on my own though…

    • I've been coding with LLMs for less than a year. As I mentioned to someone in an email a few days ago: in the first half of that time, when an LLM solved a problem differently from me, I would probe why, and more often than not I would overrule it and instruct it to do it my way.

      Now it's reversed. More often than not its method is better than mine (e.g. leveraging a better function/library than I would have).

      In general, it's writing idiomatic code much more often. It's been many months since I had to correct it and tell it to be idiomatic.

    • In my experience it has gotten considerably better. When I get it to generate C, it often gets the pointer logic correct, which wasn't the case three years ago. Three years ago, ChatGPT would struggle with even fairly straightforward LaTeX, but now I can easily get it to generate fairly elaborate LaTeX, and I've even had good success generating LuaTeX. I've been able to have it generate TLA+ specs from existing code, which didn't work even a year ago when I tried it.

      Of course, sample size of one, so if you haven't gotten those results then fair enough, but I've at least observed it getting a lot better.

There’s a subtle moment when you HAVE to take the wheel back from the AI. All the issues I see come from people insisting on using it far beyond the point where it stops being useful.

It is a helper, a partner; it is still not ready to go the last mile.

  • It's funny how many people don't get that. It's like adding a pretty great senior or staff-level engineer to sit on call next to every developer and assist them, for basically free (I've never used any of the expensive stuff yet, just things like Copilot, Grok Code in JetBrains, or asking Gemini to write bits of code for me).

    If you hired a staff engineer to sit next to me, and I just had them write 100% of the code and never tried to understand it, that would be an unwise decision on my part, and I'd have little room to complain about the times they made mistakes.

  • As someone else said in this thread:

    > The whole discourse around LLMs is so utterly exhausting. If I say I don't like them for almost any reason, I'm a luddite. If I complain about their shortcomings, I'm just using it wrong. If I try and use it the "right" way and it still gets extremely basic things wrong, then my expectations are too high.

    I’m perfectly happy to write code and to use these tools. I do use them, and sometimes they work (well). Other times they have catastrophic failures. But apparently it’s my fault for not understanding the tool or for expecting too much of it, while others are screaming from the rooftops about how this new model changes everything (which happens every 3 months at this point).

    • There's no silver bullet. I’m not a researcher, but I’ve done my best to understand how these systems work—through books, video courses, and even taking underpaid hourly work at a company that creates datasets for RLHF. I spent my days fixing bugs step-by-step, writing notes like, “Hmm… this version of the library doesn’t support protocol Y version 4423123423. We need to update it, then refactor the code so we instantiate ‘blah’ and pass it to ‘foo’ before we can connect.”

      That experience gave me a deep appreciation for how incredible LLMs are and the amazing software they can power, but it also completely demystified them. So by all means, let’s use them. But let’s also understand there are no miracles here. Go back to Shannon’s papers from the ’40s and ’50s, and you'll understand that what seem to you like "emergent behaviors" are quite explainable from an information-theory background. Learn how these models are built. Keep up with the latest research papers. If you do, you’ll recognize their limitations before those limitations catch you by surprise.

      There is no silver bullet. And if you think you’ve found one, you’re in for a world of pain. Worse still, you’ll never realize the full potential of these tools, because you won’t understand their constraints, their limits, or their pitfalls.

      1 reply →

> We're talking "copy basic examples and don't hallucinate APIs" here, not deep complicated system design topics.

If your metric is an LLM that can copy/paste without alterations, and never hallucinate APIs, then yeah, you'll always be disappointed with them.

The rest of us learn how to be productive with them despite these problems.

  • > If your metric is an LLM that can copy/paste without alterations, and never hallucinate APIs, then yeah, you'll always be disappointed with them.

    I struggle to take comments like this seriously - yes, it is very reasonable to expect these magical tools to copy and paste something without alterations. How on earth is that an unreasonable ask?

    The whole discourse around LLMs is so utterly exhausting. If I say I don't like them for almost any reason, I'm a luddite. If I complain about their shortcomings, I'm just using it wrong. If I try and use it the "right" way and it still gets extremely basic things wrong, then my expectations are too high.

    What, precisely, are they good for?

    • I think what they're best at right now is the initial scaffolding work of projects. A lot of the annoying bootstrap shit that I hate doing is actually generally handled really well by Codex.

      I agree that there’s definitely some overhype around them right now. At least for the stuff I’ve done, they have gotten considerably better though, to the point where the code they generate is often usable, if sub-optimal.

      For example, about three years ago, I was trying to get ChatGPT to write me a fairly basic ZeroMQ program in C. It generated something that looked correct, but it would crash pretty much immediately, because it kept trying to use a pointer after freeing it.

      I tried the same thing again with Codex about a week ago; this time it worked out of the box, and I was even able to get it to do more stuff.
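
      For anyone who hasn't hit that class of bug, the failure mode was roughly the pattern below. This is a hypothetical reconstruction of the kind of use-after-free, not the code either model actually produced:

          /* Minimal sketch of the use-after-free pattern described above
             (hypothetical reconstruction, not the actual generated code). */
          #include <stdlib.h>
          #include <string.h>
          #include <zmq.h>

          int main(void) {
              void *ctx  = zmq_ctx_new();
              void *sock = zmq_socket(ctx, ZMQ_REQ);
              zmq_connect(sock, "tcp://localhost:5555");

              char *msg = malloc(6);
              memcpy(msg, "hello", 6);
              free(msg);                   /* bug: buffer released here...        */
              zmq_send(sock, msg, 5, 0);   /* ...then read after free, crash/UB   */
              /* correct order: call zmq_send first, free(msg) after it returns,  */
              /* since plain zmq_send copies the buffer before returning          */

              zmq_close(sock);
              zmq_ctx_destroy(ctx);
              return 0;
          }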

      1 reply →

    • It just seems like such a weird and rigid way to evaluate it? I am a somewhat reasonable human coder, but I can't copy and paste a bunch of code from memory without alterations either. Can someone still find a use for me?

    • For a long time, I've wanted to write a blog post on why programmers don't understand the utility of LLMs[1], whereas non-programmers easily see it. But I struggle to articulate it well.

      The gist is this: Programmers view computers as deterministic. They can't tolerate a tool that behaves differently from run to run. They have a very binary view of the world: If it can't satisfy this "basic" requirement, it's crap.

      Programmers have made their careers (and possibly lives) being experts at solving problems that greatly benefit from determinism. A problem that doesn’t? Either it needs to be solved by sophisticated machine learning, or by a human. They’re trained to essentially ignore those problems - it’s not their expertise.

      And so they get really thrown off when people use computers in a nondeterministic way to solve a deterministic problem.

      For everyone else, the world, and its solutions, are mostly non-deterministic. When they solve a problem, or when they pay people to solve a problem, the guarantees are much lower. They don't expect perfection every time.

      When a normal human asks a programmer to make a change, they understand that communication is lossy, and even if it isn't, programmers make mistakes.

      Using a tool like an LLM is like using any other tool. Or like asking another human to do something.

      For programmers, it's a cardinal sin if the tool is unpredictable. So they dismiss it. For everyone else, it's just another tool. They embrace it.

      [1] This, of course, is changing as they become better at coding.

      6 replies →

    • It's strong enough to replace humans at their jobs and weak enough that it can't do basic things. It's a paradox. Just learn to be productive with them. Pay $200/month and work around its little quirks. /s

>Every 2/3 months we're hearing there's a new model that just blows the last one out of the water for coding

I haven't heard that at all. I hear about models that come out and are a bit better. And other people saying they suck.

>Meanwhile, here I am with Opus and Sonnet for $20/mo, and they're regularly failing at basic tasks, with Antigravity getting stuck in loops and burning credits.

Is it bringing you any value? I find it speeds things up a LOT.

I have a hard time believing that the v0 of 2023 achieved results comparable to Gemini 3 in web design.

Gemini now often produces output that looks significantly better than what I could produce manually, and I'm a web expert, although my expertise is more in tooling and package management.

Frankly, I think the 'latest' generation of models from a lot of providers, which switch between 'fast' and 'thinking' modes, are really just the 'latest' because they encourage users to use cheaper inference by default. In ChatGPT I still trust o3 the most. It gives me fewer flat-out wrong or nonsensical responses.

I suspect that once these models hit 'good enough' for ~90% of users and use cases, the providers started optimizing for cost instead of quality, while still benchmarking and advertising on quality.