Comment by SupremumLimit

2 days ago

It's an interesting review but I really dislike this type of techno-utopian determinism: "When models inevitably improve..." Says who? How is it inevitable? What if they've actually reached their limits by now?

Models are improving every day. People are figuring out thousands of different optimizations to training and to hardware efficiency. The idea that right now in early June 2025 is when improvement stops beggars belief. We might be approaching a limit, but that's going to be a sigmoid curve, not a sudden halt in advancement.

  • I think at this point we're reaching more incremental updates, which can score higher on some benchmarks but simultaneously behave worse with real-world prompts, especially ones that were prompt-engineered for a specific model. I recall Google updating their Flash model on their API with no way to revert to the old one, and it caused a lot of people to complain that everything they'd built no longer worked because the model was behaving differently than when they wrote all their prompts.

    • Isn't it quite possible they replaced that Flash model with a distilled version, saving money rather than increasing quality? This just speaks to the value of open-weights more than anything.

  • 5 years ago a person would be blown away by today’s LLMs. But people today will merely say “cool” at whatever LLMs are in use 5 years from now. Or maybe not even that.

    • For most of the developers I know personally who have been radicalized by coding agents, it happened within the past 9 months. It does not feel like we are in a phase of predictable, boring improvement.

      5 replies →

    • 5 years ago GPT-2 was already outputting largely coherent text; there's been progress, but it's not all that shocking.

  • It is copium to think that it will suddenly stop and the world they knew before will return.

    ChatGPT came out in Nov 2022. "Attention Is All You Need" was published in 2017, so the product was already five years behind the research. And from 2022 to now, papers and research have been increasing exponentially. Even if SOTA models were frozen, we still have years of research to apply and optimize in various ways.

    • I think it's equally copium that people keep assuming we're just going to compound our way into intelligence that generalizes well enough that we can stop handholding the AI, as much as I'd genuinely enjoy that future.

      Lately I spend all day post-training models for my product, and I want to say 99% of the research specific to LLMs doesn't reproduce and/or doesn't matter once you actually dig in.

      We're getting exponentially more papers on the topic, and they're getting worse on average.

      Every day there's a new paper claiming an X% gain by post-training some ancient 8B-parameter model and comparing it to a bunch of other ancient models, after overfitting on the public dataset of a given benchmark and giving the model best-of-5.

      And benchmarks won't ever show it, but even GPT-3.5 Turbo has better general world knowledge than a lot of models people consider "frontier" models today, because post-training makes it easy to cover up those gaps with very impressive one-prompt outputs and strong benchmark scores.

      -

      It feels like things are getting stuck in a local maximum: we are making forward progress, and the models are useful and getting more useful, but the future people are envisioning requires reaching a completely different goalpost, and I'm not at all convinced we're making exponential progress towards it.

      There may be an exponential number of techniques claiming to be groundbreaking, but what has actually unlocked new capabilities that can't just as easily be attributed to how much more focused post-training has become on coding and math?

      Test-time compute feels like the only one, and we're already seeing cracks form in terms of its effect on hallucinations. There's also a clear ceiling on the performance the current iteration unlocks, since all these models are converging on pretty similar performance after just a few model releases.

    • The copium, I think, is that many people got comfortable post-financial-crisis with nothing much changing or happening. Many people really liked a decade-long stretch with not much more than web framework updates and smartphone versioning.

      We are just back on track.

      I just read "Oracular Programming: A Modular Foundation for Building LLM-Enabled Software" the other day.

      We don't even have a new paradigm yet. I would be shocked if, in 10 years, I don't look back at this era of writing a prompt into a chatbot and then pasting the code into an IDE as completely comical.

      The most shocking thing to me is we are right back on track to what I would have expected in 2000 for 2025. In 2019 those expectations seemed like science fiction delusions after nothing happening for so long.

      1 reply →

What is ironic: if we buy into the theory that AI will write the majority of code in the next 5-10 years, what is it going to train on afterwards? ITSELF? This theoretical trajectory of "will inevitably get better" seems to hold only if humans are producing quality training data. The quality of code LLMs create is roughly proportional to how mature and ubiquitous the languages/projects are.

  • I think you neatly summarise why the current pre-trained LLM paradigm is a dead end. If these models were really capable of artificial reasoning and learning, they wouldn’t need more training data at all. If they could learn like a human junior does, and actually progress to being a senior, then I really could believe that we’ll all be out of a job—but they just do not.

It is "inevitable" in the sense that in 99% of the cases, tomorrow is just like yesterday.

LLMs have been continually improving for years now. The surprising thing would be them not improving further. And if you follow the research even remotely, you know they'll improve for a while, because not all of the breakthroughs have landed in commercial models yet.

It's not "techno-utopian determinism". It's a clearly visible trajectory.

Meanwhile, if they didn't improve, it wouldn't make a significant change to the overall observations. It's picking a minor nit.

The observation that strict prompt adherence plus prompt archival could shift how we program is both true and a phenomenon we've observed several times in the past. Nobody keeps the assembly output from the compiler around anymore, either.

There's definitely valid criticism of the passage, and it is overly optimistic, in that most non-trivial prompts are still underspecified and have multiple possible implementations, not all of them correct. That's both a more useful criticism and one not tied to LLM improvements at all.
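To make the prompt-archival idea above concrete, here is a minimal hypothetical sketch of treating the checked-in prompt file as the source and the generated code as a build artifact, the way compiler output is treated today. `generate_code` is a placeholder for whatever model or agent you'd actually call, and the file paths are made up for illustration.

```python
# Hypothetical sketch: prompt file as source, generated code as build artifact.
# `generate_code` is a placeholder, not a real API.
from pathlib import Path


def generate_code(prompt: str) -> str:
    """Placeholder: send the prompt to an LLM/agent and return source code."""
    raise NotImplementedError("wire up your model of choice here")


def build(prompt_path: str, output_path: str) -> None:
    # The prompt file is the artifact you commit and review, like a .c file.
    prompt = Path(prompt_path).read_text()

    # The generated module is treated like compiler output: regenerated from
    # the prompt on each build, not hand-edited.
    code = generate_code(prompt)
    Path(output_path).write_text(code)


if __name__ == "__main__":
    # Hypothetical paths, purely for illustration.
    build("prompts/parser.prompt.md", "generated/parser.py")
```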

Models have improved significantly over the last 3 months. Yet people have been saying 'What if they've actually reached their limits by now?' for pushing 3 years.

  • This is just people talking past each other.

    If you want a model that's getting better at helping you as a tool (which, for the record, I do), then you'd say things got better in the last 3 months, between Gemini's long-context performance, the return of Claude Opus, etc.

    But if your goalpost is replacing SWEs entirely... then it's not hard to argue we definitely didn't overcome any new foundational issues in the last 3 months, and not many were solved in the last 3 years either.

    In the last year the only real foundational breakthrough has been RL-based reasoning with test-time compute delivering real results. But what that does to hallucinations, plus DeepSeek catching up with just a few months of post-training, shows that in its current form the technique doesn't blow past barriers the way people were originally touting it would.

    Overall, models are getting better at things we can trivially post-train and synthesize examples for, but it doesn't feel like we're breaking unsolved problems at a substantially accelerated rate (yet).

  • For me, improvement means no hallucination, but that only seems to have gotten worse and I'm interested to find out whether it's actually solvable at all.

    • Why do you care about hallucination for coding problems? You're in an agent loop; the compiler is ground truth. If the LLM hallucinates, the agent just iterates (see the loop sketched at the end of the thread). You don't even see it unless you make the mistake of looking closely.

      23 replies →
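A minimal sketch of the "compiler as ground truth" agent loop described in the comment above, under the assumption of a placeholder `ask_llm` function standing in for whatever model or agent is in use. The point is only that the compile (or test) step, not the model's own confidence, decides whether to stop, so a hallucinated function name just costs another iteration.

```python
# Sketch of an agent loop where compiler feedback is the ground truth.
# `ask_llm` is a placeholder, not a real API.
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Optional


def ask_llm(prompt: str) -> str:
    """Placeholder: return candidate source code for the prompt."""
    raise NotImplementedError("plug in your model here")


def agent_loop(task: str, max_iters: int = 5) -> Optional[str]:
    feedback = ""
    for _ in range(max_iters):
        prompt = task if not feedback else f"{task}\n\nFix these errors:\n{feedback}"
        code = ask_llm(prompt)

        with tempfile.TemporaryDirectory() as tmp:
            src = Path(tmp) / "candidate.py"
            src.write_text(code)
            # "Compiler" check: here just a syntax/bytecode compile; a real
            # agent would also run the test suite.
            result = subprocess.run(
                [sys.executable, "-m", "py_compile", str(src)],
                capture_output=True, text=True,
            )

        if result.returncode == 0:
            return code           # ground truth satisfied; stop iterating
        feedback = result.stderr  # otherwise feed the errors back in

    return None                   # give up after max_iters attempts
```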