Comment by Insanity

6 days ago

I wonder how much the 'inflection point' is a thing vs marketing. I'm sure the models got somewhat better, but even now when I'm trying to 'vibe code' a game with the latest models (combination of Codex w/ gpt5.5 and gpt5.3-codex), they really do struggle.

They definitely get something barebones up and running, but it's far from a fully fledged application.

I remember this very clearly myself. Before opus 4.5, I was doing a lot of hand holding and was coding a lot myself, but I have not written code since that day more or less.

I did write some code myself just to learn how the Enigma encryption machine worked. But professionally, I stopped coding in November.

  • It is sad. I like programming; if I couldn't do it and had to write text (which I do hate, I'm not a writer) it would make for quite a sad world.

  • Exact same experience here. Prior to Opus 4.5 I'd sometimes use AI for some frontend webdev stuff (I am a C/C++/Python programmer; my HTML/CSS/JS knowledge is probably on par with a first-year uni student) and I'd have to manually edit things and retry, tell it not to attempt a paradigm that had failed before or cycle between models in Cursor just to try and get one that could make a simple widget that worked properly.

    Now, I'm using Claude or Codex (GPT-5.5) for frontend and backend and it just gets it right first time more often than not. I've been making use of things like LSPs, Context7 and CLAUDE.md (global and per-repo) and it just stops doing the dumb LLM things that I hate.

  • How do you justify your salary given that you're just using a tool that any of us could use for $20 an hour in your role?

    • How do you justify your salary given that you're just using an OSS compiler/editor any of us could use for free in your role?

      AI just changed how I edit code - I still see coworkers (senior developers) failing with Claude/Codex and getting stuck when there are trivial solutions if you understand the full problem space. Right now AI is just a productivity tool.

    • I don't feel the need to justify my salary, since I'm simply lucky in that regard. But I'm pretty sure you couldn't do my job just because you had access to a coding agent. Most of my time at the office is spent discussing high-level architecture and strategy, ideas, customer requests, backward compatibility, safety, security, quality assurance, etc.

      Writing the actual code is a significant part of that, but the codebase is so complex that even Opus 4.7 and GPT-5.5 struggle with it without being fed a *lot* of context and constraints. And even then, they need a *lot* of steering due to making bad decisions that only someone with an intimate knowledge of the theory behind our software is able to catch.

      I can only assume that people who think coding agents can completely replace an actual developer mostly deal with trivial software regarding both scope and the type of customers they serve (individuals instead of big companies in industry).

    • They're using a tool that anyone can use for $20 an hour, sure. But that's not what they're "just" doing. This is what is so insane about non-technical people talking about code - writing the actual syntax is not really the hard part.

      What you're saying is like "how do you justify your salary as a NASA engineer when anyone can use Simulink and generate the code?"

      It is extremely ignorant.

    • Please see Ben Evans’ podcast for a good take on this. Coding is just one of the tasks you do in your job; it is not the job, or at least it probably is not. You do not get paid to code, you get paid to make a set of decisions that create value for the company. If this is automated then yes, sadly, your salary is not justified.

    • Never feed the trolls ... but, how does my carpenter deserve $100 an hour when he is using an electric drill and power saw I can get at Home Depot for $100?

      Most good developers are not employed just because they can code well.

      What is over is: fizzbuzz and trivial CS algorithm regurgitation as a gate.

    • How do you justify your salary given that you sit in a chair all day, likely making the world worse, and make 5x as much as someone saving lives, building houses, or teaching kids how to read?

    • Someone competent using them is today a requirement, and for a while this will make the marginal utility of skilled workers greater than that of unskilled ones. The justification is that they are much more productive than they were before.

    • no engineers on staff and stakeholders think the company is incompetent

      Coinbase is paying the price for that for every UX glitch, after the CEO was gleeful about HR personnel shipping production code

    • You can build things quickly with AI, but you can’t delegate your responsibilities to AI. Once the AI starts struggling, you’ll need to take over and figure it out.

    • This is _the_ question we must all be able to answer, so here goes my attempt: we all have access to the same tools. Before Stack Overflow it was forums and books/manuals, so it's always been about “getting there, showing up, figuring it out”. Your hypothetical boss has other things to do than kick an LLM around at that price.

    • I don't think you understand how programming as a job works, writing code is the final output of the process but it's not the job in itself.

    • There is no good justification for anyone's salary really, except perhaps doctors and underwater welders.

    • Because the tool will happily give you a "solution" that kinda works for a few inputs. It will happily correct itself when you give it more incorrect tests.

      It will almost never converge on the general solution that will pass tests you haven't given it yet.

      This is why AI is sooo good at Javascript and related slop. A solution that "kinda works" is good enough 9 times out of 10 and if some tests fail well ... YOLO and the web page will probably render anyway.

      Contrast that to using Scheme or Lisp where AI will have trouble simply keeping the parentheses balanced.

  • > Before opus 4.5, I was doing a lot of hand holding and was coding a lot myself, but I have not written code since that day more or less.

    I still must hand hold it every day, as it always does things wrong. Especially after it got seriously nerfed in March.

    Note: experiences vary a lot depending on the programming language used, and projects. And the experience of the person coding.

  • Same experience here. I now think AI writes much better code than me. So I shifted my focus to finding requirements, analyzing possibilities, and making good plans.

Paradox - you can get multiple inflection points even as systems start to have diminishing marginal returns in core capability. I think this is due to 'threshold crossing', where something 'becomes good enough for a specific purpose' - it just unlocks capabilities.

'Nail Guns' used to be heavy, required heavy power cords, and were extremely expensive. When they got lighter, cheaper, battery-powered ... at some point, they blended seamlessly into the roofer's process and dramatically multiplied the work that can be done. Marginal improvements beyond that may not yield the same 'unlocks' because the threshold has been crossed.

I've "vibed" some non-trivial stuff lately using a combination of Codex with 5.5 and Claude Code with Opus 4.7.

Key has been to spend a fair amount of time on initial overall design document, which is split into tangible and limited phases. I go back and forth between them on this document until we're all happy.

For each phase an implementation plan is made. At the end, a summary document of what was delivered and what was discovered. This becomes input to next phase.

I do check the documents, and what they're doing. I also check the tests, some more thorough. And some spot checks on the code to see if I like the structure.

I have mainly used Claude for coding and Codex for design and code review after phases. I ask both to check test coverage after phases.

Managed to implement some tools and libraries without writing a single line of code this way, which have been very beneficial to us.

Since it's so async I can work on other stuff while they plod along.

I think it's not universal though. But stuff that can be tested easily and which you have a firm grasp of what you want to achieve, but not necessarily exactly how, that I've been impressed with.

  • > Key has been to spend a fair amount of time on initial overall design document, which is split into tangible and limited phases.

    > For each phase an implementation plan is made. At the end, a summary document of what was delivered and what was discovered.

    > I do check the documents, and what they're doing. I also check the tests, some more thorough.

    Sounds like programming, but with extra steps.

    • It's software development, but with much less actual programming (in my case none).

      When I said I check the documents, the initial design document was the only one I really took a hard look at. The intermediary ones I just skimmed, looking for red flags or something I had forgotten to tell them. Those documents served as a basis for their work, and as a record of what was done.

      Overall I spent perhaps a few hours on each project, over the course of a few days. I'd check in every half hour or whenever I had time, tell Claude "Great, let's do the next deliverable", or GPT "We're done with phase 4, please do a detailed code review, reference the design document and documentation of previous phases". Then I'd leave them cooking.

    • Also the least fun part of development. Maybe I’m the weird one but I like to just jump right in, planning every last detail before writing code is boring.

  • That’s not vibing, but waterfall development.

    • Waterfall was famous for wasting developer time and extending delivery dates in exchange for simplifying management. If Claude time is comparatively inexpensive, but human oversight remains necessary, we will switch back to waterfall because the relative importance of the two resources will invert.

    • It's vibing in the sense that I'm not really writing code, and I'm leaving a lot of decisions to the models. I let them drive a lot of the design document details, I just made sure it contained the salient points. Implementation plans I just skimmed. Didn't write any code, just did some checks here and there.

      But yes, I did think that it sorta felt like being a team lead for some eager programmers.

  • Do you use anything to orchestrate multiple agents pitted against each other (coder, reviewer, tester, etc)?

    • Currently just manual. I'm not pushing the frontier here, just getting my feet wet.

      While both Claude Code and Codex are capable harnesses, I definitely think there's a lot more to be gained from the harnesses. Quite a few of the times I needed to nudge the steering wheel it was things that a separate agent with the right prompt could have picked up on.

  • None of it is non-trivial tho. You might think so, but it’s not.

    • It wasn't trivial in that I used a lot of my programming and domain knowledge, both when iterating on the design document and skimming implementation plans.

      I didn't use it often, but when it was needed it was needed.

I find it gets you past the starting line but when you dig into the code it’s a mess of duplicated code, muddled responsibilities, poor architecture, 10k line files that eat your tokens, etc.

I’m building something using LLMs to scrape websites/socials for unstructured event data from combined text/images and the only way I’ve managed to get 100% consistent results for a reasonable cost is to break the task down into very small pieces that reduce the scope of mistakes significantly.

At present, for reasonably complex tasks, Codex/Claude will happily code you into an expensive corner.

  • Indeed. To add to this, the obvious solution (ask the AI to break down the tasks to whatever METR says they'd be capable of 80% of the time) is of limited utility, as the AI are only so-so at estimating task complexity.

    (Even when they're getting the planning part right, I do also recommend checking the LLM-generated unit tests, because in my experience some of those are "regex the source code" not "execute functions and check outputs").
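
    A minimal sketch of that difference, assuming a hypothetical `add` function with a deliberate bug (every name here is illustrative, not taken from any real project): the regex-style check passes because the source text merely looks right, while the check that actually executes the function catches the bug.

```python
import re

# Hypothetical source under test; the body contains a deliberate bug.
SOURCE = '''
def add(a, b):
    return a - b
'''

# Load the function so it can actually be called.
namespace = {}
exec(SOURCE, namespace)
add = namespace["add"]


def regex_style_test(source: str) -> bool:
    """Weak check: only confirms the source text contains the signature."""
    return re.search(r"def add\(a, b\):", source) is not None


def behaviour_test(func) -> bool:
    """Stronger check: runs the function and compares the output."""
    return func(2, 3) == 5


print("regex-style test passes:", regex_style_test(SOURCE))
print("behaviour test passes:", behaviour_test(add))
```

    Here the first check reports success on broken code, which is why it is worth skimming generated tests for ones that never call the code they claim to cover.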

Opus 4.5 in November 2025 was legitimately, unironically an inflection point and is the sole reason for the current hysteria.

GPT 5.5 is a significant improvement over GPT 5.4 but I wouldn't call it an inflection.

  • 5.2 and the first codex model were step function changes in capability

I feel the change. It went from an autocomplete tool, to an agent running 5 tasks in parallel while I just supervise. The improvement is enormous.

While some people got it to work better, for me vibe coding games still hasn't reached the level of regular sites/web apps. Physics, creativity, assets and UI/UX still need a lot of hand-holding with the models. Games that are more interface-based, like point-and-click or something like Reigns, are easier though.

It's very real. Just in the past 2 months or so IMO there's been a pretty big improvement in Claude for local dev (although I think a lot of that is less model strength and more harness capability). 1m context is a huge difference (~30 min vs 2.5hr between compact significantly increases the scope of what I get the AI to do before it goes stupid). The other biggest difference I've noticed is a better balance of actually doing the work vs pushing back on bad ideas. I want the AI to tell me if it thinks the thing I am telling it is wrong or a bad idea, but if I confirm, I want it to do that anyway. A couple months ago, Claude was a lot more likely to either say "This is too much work I'm not going to do all of it", tell me the idea was genius (and then pretend to do it) or something equally useless.

  • >1m context is a huge difference (~30 min vs 2.5hr between compact significantly increases the scope of what I get the AI to do before it goes stupid)

    I think the smart zone stays within the first 100k tokens, no matter if the context window is 240k or 1 million.

    I divide the work to fit within that 100k and use subagents for the tasks.
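
    A rough sketch of that kind of splitting, under loud assumptions: the ~4 characters per token estimate is a crude stand-in for a real tokenizer, and `pack_tasks` is a hypothetical helper, not part of any harness. It greedily groups tasks so each batch (one subagent run) stays under the budget.

```python
def estimate_tokens(text: str) -> int:
    # Assumption: roughly 4 characters per token; a real tokenizer will differ.
    return max(1, len(text) // 4)


def pack_tasks(tasks, budget=100_000):
    """Greedily group (name, context) pairs into batches under a token budget.

    Each batch is meant to be handed to a separate subagent.
    """
    batches, current, used = [], [], 0
    for name, context in tasks:
        cost = estimate_tokens(context)
        if cost > budget:
            raise ValueError(f"task {name!r} alone exceeds the budget; split it further")
        # Start a new batch when adding this task would overflow the budget.
        if current and used + cost > budget:
            batches.append(current)
            current, used = [], 0
        current.append(name)
        used += cost
    if current:
        batches.append(current)
    return batches


# Illustrative tasks with estimated costs of 50k, 60k and 25k tokens.
tasks = [
    ("parser", "x" * 200_000),
    ("renderer", "x" * 240_000),
    ("tests", "x" * 100_000),
]
print(pack_tasks(tasks))
```

    Greedy packing keeps the original task order, which matters when later tasks depend on earlier ones; it does not try to find the tightest possible grouping.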

Anecdata of 1 but it is real. At the end of last year they passed some invisible threshold and became useful. I don't think it is models themselves, but mostly the much more powerful harnesses and I guess their tool calling abilities.

What changed, I think, was the context harvesting capability of the models. For most programmers, debugging and figuring out how something works were the time-consuming parts; the fix was usually trivial. And now models can do in seconds what took a developer an hour or more.

If right now we created a smart grep that just gathers everything relevant to a piece of code, and outlawed LLMs, we would not regress to the previous level. Developers needed this context as much as LLMs do to do their job.

It is all marketing. The easiest way to tell is that a year ago the same people said the inflection point was X or Y model.

When people claim LLMs just don't work for them, the first question is whether they're using the latest model or not, and if not, dismissing the poster.

The thing is that that same question was being asked a year ago, and even a year before that, but with the models that lead to a dismissal today.

Just run the experiment yourself: wait 6 months, say LLMs just aren't working for the software engineering that you do, and people will dismiss you if you say that you use Opus 4.5 and not the latest model Claude MegaMind 8.8 pro max gigathinking, despite Opus 4.5 being touted as the inflection point in this article.

  • I think it's because both sides are talking about different things. If you go in expecting it to be good enough to make developers obsolete today (a reasonable impression to get from the way a lot of people hype it) you would be disappointed, and after the first couple of tries every few months you would probably not try it much with the next generations. Reasonable if it's considered a dichotomy.

    But a lot of people excited about new generations (including me, now) are not seeing it as a dichotomy but rather a spectrum where models are getting better, and indeed once a year or even every 6 months at times there comes a sudden growth which feels like an inflection point from what came before. Practically, it's a tool like any other: you evaluate it based on whether it's worth the effort and cost for the benefit you get from it, and if it is and has a good DX you use it. If the calculation doesn't work for you, it doesn't. For me, it has gone from a novelty, to good for some kind of quick manual search, to I guess it can debug some kind of errors at times in very specific conditions, to hey I think I am getting a bit addicted to autocomplete in IDE provided by them even if I don't use them for anything intelligent but it's becoming indispensable now but only this part, to it's good for areas I lack expertise in, to agentic sucks I will stick with discussing algorithms and architecture with it on greenfield projects, to holy shit it can do agentic decently well now, I am skeptical of giving it access in more than limited cases, to now I am getting close to letting it run free on my device in the not so distant future I guess. Some of these were big jumps; at each point I was skeptical of growth. Every time I thought now the growth will slow down, from the days of a 2k context window to millions now. From basic chat completion to working on complex adaptive systems, game theoretic modelling, heuristics and constraint modelling and other things I throw at it. I am still needed in the loop; it can be so smart at times and then will do something so stupid, but the frequency of stupidity is rapidly decreasing. I am still needed, I don't think it could accomplish alone all that it has done for me. But I do at times at night remain awake reflecting on my self worth for the potential day when I don't add that value. When I have a harder time keeping up.

    Also, had someone told me even in 2019 that in 2026 we could have NLP models do what they do today, I would have dismissed it all as sci-fi, and here I am waking up in awe of the world we live in and how quickly we adapt.

    • You're completely twisting what I said. I've never talked about people claiming it's not making developers obsolete. We are obviously extremely far from that. I'm talking about people who say it doesn't work to build basic features in their projects correctly.

      Just take a look at this comment on a different topic, which lists all the prerequisites for those AI models to work well, from the perspective of someone who has bought into the hype: https://news.ycombinator.com/item?id=48157235

      If this is everything needed for an LLM to generate acceptable code, what is even the point of them?

  • My take is there was one big inflection point around opus 4.5 when they got the agentic stuff working and now whether or not it works depends on whether your use case/area of software engineering is profitable enough for the companies to have spent a bunch of money generating synthetic data to RL on, or if it's similar enough to areas that they've done that for. With similar enough being a very loose constraint given how much overlap there is in a lot of coding fundamentals. Tbh if the models aren't working for you now I don't think they're gonna be working for you in 6 months

It's very real but probably very domain specific. It got really good at a lot of traditional web dev stuff, bash, sql, and writing one off scripts to accomplish random tasks (hence all the agent stuff taking off). And they got good at staying on task. That may not translate to game dev because from what I understand a lot of these gains are basically around post training methods driven by synthetic data generation etc (with potential caveats on how synthetic that data actually is lol). I wouldn't be surprised if the areas of code the llms are good at now are straight up just product decisions of where to allocate budget for generating those synthetic data sets, and game dev stuff might not be at the top of the list because the customer base for that might not be as big

Counterpoint, I'm also vibecoding a game, and even before doing the "proper" setup (a good AGENTS.md, skills people have published for my chosen game engine, Godot), mechanically, the game was pretty spot on. It looked boring, so I used Claude Design to create a few mockups to choose from, chose the one I liked the most, and told Claude Code to redo the game UI with it.

There have been plenty of small issues like tables not having the columns aligned, or the game menu being a bit offset, or one graph being a placeholder instead of connected to the actual value. And of course I've had to instruct it on all the flavour I want.

But honestly, for a simulation strategy game, especially without doing the "proper" setup from the start, it's been _very_ good.

  • UI fit and finish is really hard for these models, even with text-mode UIs. The super fiddly stuff still needs to be done by hand, at least for now.

Pure vibe coding won't work. You need to define an excellent architecture, have great specs, a solid plan, divide the plan into small phases that fit well in a context window, use TDD and automated code reviews for implementing each phase, do QA and some code review.

At any point you need to have agents review, verify and test the other agents' output and iterate until the output is perfect.

And also, have good e2e tests.

IMO, if you don't spend at least a few tens of millions of tokens per day, you aren't doing it properly.

  • Sounds very self-confident to claim such a thing. Something like "If you don't do how me is doing, then you are doing it wrong"

  • At what point is it easier and faster to just code it yourself? I don't trust myself to write better specs than code.

It's real for me as a non-coder. Previously, uploading a Python script and asking it to add this function or that function used to break it; now it usually just works, at least with Claude and ChatGPT models. Google Gemini still breaks stuff, but rumors are that their new flash model, which will be announced soon, is very good. I am usually working with data in CSV files and generating spreadsheets, PDFs etc., and the results for that have improved dramatically.

  • That’s me. Built a scraper to dump stuff to a CSV of a list of images for further OCR and OpenCV processing. Now I have a convenient list of hits once I run the batch that used to be a loooot of manual sifting.

    Once I work out the kinks, I’ll be able to further automate it.

    Would have taken 10-100x as long for me to build it without AI and the AI version is probably better.

    But yeah, I have enough knowledge to know what prompts are needed and figure out those “oh, I think it’s running slow or failing because of xyz” and further prompt to improve it based on what I think it should do instead.

    And I know where to make slight changes without burning my allotments.

  • "flash" or "fast" AI models are worse than useless at coding for me. they make my codebase much worse. It's a maintenance burden.

    Gemini Pro on the other hand can be quite a pleasant experience.

I mean this blog post and many from this author are pure evangelism and marketing. Can you find anything critical or any dissent from this author about LLMs?