I wonder how much the 'inflection point' is a thing vs marketing. I'm sure the models got somewhat better, but even now when I'm trying to 'vibe code' a game with the latest models (combination of Codex w/ gpt5.5 and gpt5.3-codex), they really do struggle.
They definitely get something barebones up and running, but it's far from a fully fledged application.
I remember this very clearly myself. Before opus 4.5, I was doing a lot of hand holding and was coding a lot myself, but I have not written code since that day more or less.
I did write some stuff myself just to learn how the Enigma encryption machine worked, so that was coding in order to learn. But professionally, I stopped coding in November.
Exact same experience here. Prior to Opus 4.5 I'd sometimes use AI for some frontend webdev stuff (I am a C/C++/Python programmer; my HTML/CSS/JS knowledge is probably on par with a first-year uni student) and I'd have to manually edit things and retry, tell it not to attempt a paradigm that had failed before or cycle between models in Cursor just to try and get one that could make a simple widget that worked properly.
Now, I'm using Claude or Codex (GPT-5.5) for frontend and backend and it just gets it right first time more often than not. I've been making use of things like LSPs, Context7 and CLAUDE.md (global and per-repo) and it just stops doing the dumb LLM things that I hate.
Same experience here.
I now think AI writes much better code than me. So I shifted my focus to finding requirements, analyzing possibilities, and making good plans.
Paradox - you can get multiple inflection points even as systems start to have diminishing marginal returns in core capability. I think this is due to 'threshold crossing', where something 'becomes good enough for a specific purpose' - it just unlocks capabilities.
'Nail Guns' used to be heavy, required heavy power cords, and were extremely expensive. When they got lighter, cheaper, battery pack ... at some point, they blended seamlessly into the roofer's process and dramatically multiplied the work that can be done. Marginal improvements beyond that may not yield the same 'unlocks' because the threshold has been crossed.
I've "vibed" some non-trivial stuff lately using a combination of Codex with 5.5 and Claude Code with Opus 4.7.
Key has been to spend a fair amount of time on an initial overall design document, which is split into tangible and limited phases. I go back and forth between them on this document until we're all happy.
For each phase an implementation plan is made. At the end, a summary document of what was delivered and what was discovered. This becomes input to the next phase.
I do check the documents, and what they're doing. I also check the tests, some more thoroughly than others. And I do some spot checks on the code to see if I like the structure.
I have mainly used Claude for coding and Codex for design and code review after phases. I ask both to check test coverage after phases.
Managed to implement some tools and libraries without writing a single line of code this way, which have been very beneficial to us.
Since it's so async I can work on other stuff while they plod along.
I think it's not universal though. But for stuff that can be tested easily, and where you have a firm grasp of what you want to achieve but not necessarily exactly how, I've been impressed.
I find it gets you past the starting line but when you dig into the code it’s a mess of duplicated code, muddled responsibilities, poor architecture, 10k line files that eat your tokens, etc.
I’m building something using LLMs to scrape websites/socials for unstructured event data from combined text/images and the only way I’ve managed to get 100% consistent results for a reasonable cost is to break the task down into very small pieces that reduce the scope of mistakes significantly.
At present, for reasonably complex tasks, Codex/Claude will happily code you into an expensive corner.
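A minimal sketch of the "very small pieces" idea from the comment above, assuming a hypothetical `ask(question, text)` callable that stands in for whatever model call is used. The function name, the questions, and the field names are illustrative assumptions, not the commenter's actual pipeline. Each step asks for one narrow thing and validates it before moving on, so a bad answer fails loudly instead of leaking into the result.

```python
import re
from datetime import date


def extract_event(text, ask):
    """Extract one event by asking several narrow questions instead of one broad one.

    `ask(question, text)` is a hypothetical stand-in for a model call.
    """
    title = ask("What is the event title? Answer with the title only.", text).strip()
    if not title:
        raise ValueError("empty title")

    raw_date = ask("What is the event date? Answer as YYYY-MM-DD only.", text).strip()
    if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", raw_date):
        raise ValueError(f"bad date format: {raw_date!r}")
    when = date.fromisoformat(raw_date)  # also rejects impossible dates

    return {"title": title, "date": when}


def fake_ask(question, text):
    # Deterministic stub so the sketch runs without any model or network access.
    return "2026-03-14" if "date" in question else "Spring Concert"


event = extract_event("Spring Concert, 14 March 2026, town hall", fake_ask)
print(event["title"], event["date"].isoformat())
```

Because every step has its own check, a failure points at one small question rather than at a whole page of extracted data.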
Indeed. To add to this, the obvious solution (ask the AI to break down the tasks to whatever METR says they'd be capable of 80% of the time) is of limited utility, as the AIs are only so-so at estimating task complexity.
(Even when they're getting the planning part right, I do also recommend checking the LLM-generated unit tests, because in my experience some of those are "regex the source code" not "execute functions and check outputs").
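To illustrate the difference between the two kinds of test mentioned above, here is a small sketch. `total_price` and both test helpers are made-up names for this example only. The source is kept as a string so the same two checks can be run against a correct version and a deliberately broken one.

```python
import re

GOOD_SOURCE = """
def total_price(items, tax_rate):
    subtotal = sum(price * qty for price, qty in items)
    return round(subtotal * (1 + tax_rate), 2)
"""

# Same shape of code, but the tax is subtracted instead of added.
BUGGY_SOURCE = GOOD_SOURCE.replace("1 + tax_rate", "1 - tax_rate")


def load(source):
    """Execute the source string and return the function it defines."""
    namespace = {}
    exec(source, namespace)
    return namespace["total_price"]


def source_regex_test(source):
    """Weak test: only checks that expected text appears in the source."""
    return re.search(r"def total_price\(items, tax_rate\)", source) is not None


def behavioral_test(source):
    """Stronger test: runs the function and compares against known outputs."""
    func = load(source)
    return func([(10.0, 2), (5.0, 1)], 0.10) == 27.5 and func([], 0.10) == 0


for name, source in [("good", GOOD_SOURCE), ("buggy", BUGGY_SOURCE)]:
    print(name, source_regex_test(source), behavioral_test(source))
```

The regex check passes for both versions, while the behavioral check only passes for the correct one, which is why it is worth reading what a generated test actually asserts.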
While some people got it to work better, for me vibe coding games still hasn't reached the level of regular sites/web apps. Physics, creativity, assets and UI/UX still need a lot of hand-holding with the models. Games that are more interface based, like point and click or something like Reigns, are easier though.
It's very real. Just in the past 2 months or so IMO there's been a pretty big improvement in Claude for local dev (although I think a lot of that is less model strength and more harness capability). 1m context is a huge difference (~30 min vs 2.5hr between compactions significantly increases the scope of what I get the AI to do before it goes stupid). The other biggest difference I've noticed is a better balance of actually doing the work vs pushing back on bad ideas. I want the AI to tell me if it thinks the thing I am telling it is wrong or a bad idea, but if I confirm, I want it to do that anyway. A couple months ago, Claude was a lot more likely to either say "This is too much work I'm not going to do all of it", tell me the idea was genius (and then pretend to do it) or something equally useless.
Anecdata of 1 but it is real. At the end of last year they passed some invisible threshold and became useful. I don't think it is models themselves, but mostly the much more powerful harnesses and I guess their tool calling abilities.
What changed I think was the context harvesting capability of the models. For most programmers, debugging and figuring out how something works were the time-consuming part - the fix was usually trivial. And now models can do in seconds what took a developer an hour or more.
If right now we create a smart grep that just takes everything for a piece of code and outlaw llm-s we will not regress to the previous level. The developers needed this context as much as llm-s to do their job.
It is all marketing. The easiest way to tell is that a year ago the same people said the inflection point was X or Y model.
When people claim LLMs just don't work for them, the first question is whether they're using the latest model or not, and if not, dismissing the poster.
The thing is that that same question was being asked a year ago, and even a year before that, but with the models that lead to a dismissal today.
Just make the experiment yourself, wait 6 months, say LLMs just aren't working for the software engineering that you do, and people will dismiss you if you say that you use Opus 4.5 and not the latest model Claude MegaMind 8.8 pro max gigathinking. Despite this model being touted as the inflection point in this article.
I think it's because both sides are talking about different things. If you go in expecting it to be good enough to make developers obsolete today (a reasonable impression to get from the way a lot of people hype it), you would be disappointed, and after the first couple of tries every few months you would probably not try it much with the next generations. That's reasonable if it's considered a dichotomy.
But a lot of people excited about new generations (including me, now) are not seeing it as a dichotomy but rather as a spectrum where models are getting better, and indeed once a year, or even every 6 months at times, there comes a sudden growth which feels like an inflection point from what came before. Practically, it's a tool like any other: you evaluate it based on whether it's worth the effort and cost for the benefit you get from it, and if it is and it has a good DX, you use it. If the calculation doesn't work for you, it doesn't. For me, it has gone from a novelty, to "good for some kind of quick manual search", to "I guess it can debug some kinds of errors at times in very specific conditions", to "hey, I think I am getting a bit addicted to the autocomplete they provide in the IDE, even if I don't use them for anything intelligent; it's becoming indispensable now, but only this part", to "it's good for areas I lack expertise in", to "agentic sucks, I will stick with discussing algorithms and architecture with it on greenfield projects", to "holy shit, it can do agentic decently well now, but I am skeptical of giving it access in more than limited cases", to "now I am getting close to letting it run free on my device in the not so distant future, I guess". Some of these were big jumps, and at each point I was skeptical of growth. Every time I thought the growth would now slow down, from the days of 2k context windows to millions now. From basic chat completion to working on complex adaptive systems, game theoretic modelling, heuristics and constraint modelling and other things I throw at it. I am still needed in the loop; it can be so smart at times and then will do something so stupid, but the frequency of stupidity is rapidly decreasing. I am still needed; I don't think it could accomplish alone all that it has done for me. But I do at times at night remain awake reflecting on my self worth for the potential day when I don't add that value. When I have a harder time keeping up.
Also, had someone told me even in 2019 that in 2026 we could have NLP models do what they do today, I would have dismissed it all as sci-fi, and here I am waking up in awe of the world we live in and how quickly we adapt.
My take is there was one big inflection point around opus 4.5 when they got the agentic stuff working and now whether or not it works depends on whether your use case/area of software engineering is profitable enough for the companies to have spent a bunch of money generating synthetic data to RL on, or if it's similar enough to areas that they've done that for. With similar enough being a very loose constraint given how much overlap there is in a lot of coding fundamentals. Tbh if the models aren't working for you now I don't think they're gonna be working for you in 6 months
It's very real but probably very domain specific. It got really good at a lot of traditional web dev stuff, bash, sql, and writing one off scripts to accomplish random tasks (hence all the agent stuff taking off). And they got good at staying on task. That may not translate to game dev because from what I understand a lot of these gains are basically around post training methods driven by synthetic data generation etc (with potential caveats on how synthetic that data actually is lol). I wouldn't be surprised if the areas of code the llms are good at now are straight up just product decisions of where to allocate budget for generating those synthetic data sets, and game dev stuff might not be at the top of the list because the customer base for that might not be as big
Counterpoint, I'm also vibecoding a game, and even before doing the "proper" setup (a good AGENTS.md, skills people have published for my chosen game engine, Godot), mechanically, the game was pretty spot on. It looked boring, so I used Claude Design to create a few mockups to choose from, chose the one I liked the most, and told Claude Code to redo the game UI with it.
There have been plenty of small issues like tables not having the columns aligned, or the game menu being a bit offset, or one graph being a placeholder instead of connected to the actual value. And of course I've had to instruct it on all the flavour I want.
But honestly, for a simulation strategy game, especially without doing the "proper" setup from the start, it's been _very_ good.
UI fit and finish is really hard for these models, even with text-mode UIs. The super fiddly stuff still needs to be done by hand, at least for now.
Pure vibe coding won't work. You need to define an excellent architecture, have great specs, a solid plan, divide the plan into small phases that fit well in a context window, use TDD and automated code reviews for implementing each phase, do QA and some code review.
At any point you need to have agents review, verify and test the other agents' output and iterate until the output is perfect.
And also, have good e2e tests.
IMO, if you don't spend at least a few tens of millions of tokens per day, you aren't doing it properly.
It's real for me as a non coder. Previously, uploading a Python script and asking it to add this function or that function used to break it; now it usually just works, at least with Claude and ChatGPT models. Google Gemini still breaks stuff, but rumors are that their new flash model, which will be announced soon, is very good. I am usually working with data in CSV files and generating spreadsheets, PDFs etc., and the results for that have improved dramatically.
That’s me. Built a scraper to dump stuff to a csv of a list of images for further OCR and OpenCV processing. Now I have a convenient list of hits once I run the batch that used to be a loooot of manual sifting.
Once I work out the kinks, I’ll be able to further automate it.
Would have taken 10-100x as long for me to build it without AI and the AI version is probably better.
But yeah, I have enough knowledge to know what prompts are needed and to figure out those “oh, I think it’s running slow or failing because of xyz” moments, and then prompt further to improve it based on what I think it should do instead.
And I know where to make slight changes without burning my allotments.
I mean this blog post and many from this author are pure evangelism and marketing. Can you find anything critical or any dissent from this author about LLMs?
> there’s zero chance any AI lab would train a model for such a ridiculous task.
A lot of people here stated that this is a ridiculous metric, but no one seems to remember that it was introduced in the initial GPT report ("Sparks of Artificial General Intelligence: Early experiments with GPT-4" [1]) by Microsoft about 3 years ago. Shortly after that it was parroted by a network of booster accounts and became a thing every clueless AI hype peddler does to "test" models.
For those curious, Simon's first public usage of it is Oct 25th, 2024[0]. While I'm not aware of any specific "pelican riding a bicycle" prompts being tested in a paper[1], the GPT paper did several SVG and tikz tests and the actual image is rather arbitrary. You wouldn't want to optimize for a singular image but also if you're doing halfway decent training a pelican riding a bicycle shouldn't be too hard to draw, and well... you can see several good examples if you look through different pages on [0].
My own informal test when generative AI came out has been "a picture of an old man riding a bicycle over a river". I just ran it for chatgpt with the standard model I have (5.5). It shows the old man on an old bicycle with the bicycle on a slack line and the slack line extending over the river with a medieval village in the background.
The point is that the prompt has a subtle ambiguity - "how is the old man going over the river?". My sense is that most humans would quickly imagine a conventional bridge with a road on it leading over a river, with the river in an area developed enough to allow a bridge going over it.
So the implication I draw is these things can find/generate stuff that roughly satisfies the conditions (and are getting better at this) but they still fail to add the assumptions that people would draw.
So my conclusion is that LLMs are getting better and better at "what they do" but there are going to be places where they fail to satisfy common human assumptions.
> but they still fail to add the assumptions that people would draw.
I have mixed feelings about this. I agree with the default assumptions you have as to "what people would draw", however what do you want from this cognitive automation?
Do you want, "what most people would do" or do you want "something creative, an outlier, that still satisfies conditions" ?
It's since november 2025, the so called "inflection point", that I'm still wondering for who coding agents become "really good".
All I observe is that they got better at tool calls and at answering questions about big codebases, especially if the question has a vague pattern to search, and they're super useful for that! For generating production code even with a lot of steering and baby sitting?
Absolutely not; not quite there, not even close in my experience.
But we should stop talking in 1s and 0s, especially with marketing hype trains. There exists a gradient of capabilities that agents have, which really depends on the intricacies of the codebase you're working on. I think everyone has yet to discover how to better apply these tools in their day to day work.
But that totally collides with the current narrative, that flattens out our work to be always the same and that can be automated easily in each case, it's not!
That's why the debate is so polarized imo, there isn't a shared experience
The polarization comes from the very disparate coding experiences and output quality that different people find when using these tools.
For example, I've had the opposite experience of yours, generating very high quality work using Claude (such as https://github.com/kstenerud/yoloai). Just in dealing with all the bugs and idiosyncrasies in the technologies I'm using, the agent has been a godsend in discovering and cataloguing them so that the implementation phase doesn't keep tripping over them: https://github.com/kstenerud/yoloai/blob/main/docs/dev/backe...
And the agents keep getting better all the time. Even in the past month I've noticed a considerable jump in its ability to anticipate issues and correctly infer implications as we build out research, design, architecture and planning docs. By the time it comes to coding, it's mostly a mechanical process that can be passed off to sonnet with a negligible defect rate.
I don't want to offend (it's AI coded anyway :)) but that does not scream "high quality" to me. The headline gif on that repo just paints a terrible picture. It can't draw a box correctly, there's random underscores all over the screen. The UI itself is just incredibly incoherent. I don't even know what I'm looking at.
Like, no it doesn't seem like very high quality work... It just seems like a vibe coded tool.
Edit: yes it's wrapping Claude. It's BREAKING the TUI. Not sure what people aren't getting here...
While reading this thread, I literally just caught an agent putting in the following CSS selector in a rule:
> .row > div > div, .alert
This is fairly simple CSS, not multi-threaded systems development. A bar low enough that you could trip over it. I catch this kind of stuff all the time (literally every run), but only because I read every line. Most of it wouldn't be the end of the world for any particular task, but would eventually result in a complete mess.
I think the people doing the heaviest breathing around the elimination of programmers either aren't very good at programming, or they're not paying close attention. Or they're hyping their book.
Don't want to be rough, but I'd like to read experiences about novel ideas that solve people's real problems in the real world; your project is just about selling new shovels.
As I commented on another thread
> If you're trying to solve a HARD problem people REALLY have, it's a novelty that agents can't help with, otherwise if it gets 97% there MAYBE it's just a signal that your idea isn't that novel!
> The polarization comes from the very disparate coding experiences and output quality that different people find when using these tools.
Not just when using tools, also when using humans. The frame of reference of what is considered 'production code' differs immensely between organizations, teams and people. The code I get from LLMs is usually much better than what I get from my peers. Maybe not one shot, but after some steering it gets there.
It also isn't lazy. When generating test cases for relatively simple pieces of code, it usually tests pretty much every path and doesn't stop right at the 80% code coverage quality gate.
I can imagine if you're at the level of Linus or something, you might conclude differently, but most people aren't there at all.
A standard Docker container, with the container UID/GID mirrored to the host user, holding the host user's API keys, with the host user's project directory bind-mounted. The tooling doesn't even use gVisor / Kata by default which could implement the claim made, but in reality this entire project appears to be security theater.
@hollowturtle I'm surprised - do you really find that sota models aren't good enough to generate production code with steering and babysitting? My experience (Claude Code, mostly Opus 4.6) is that it's fantastic at this. At least in JS + TS + Elixir + Ruby. It does indeed need babysitting, my mental model is that it's an exoskeleton not a junior dev, but IME it's a friggin badass exoskeleton, easily 10x-ing my speed on most work. Notably I do NOT --dangerously-skip-permissions nor use claude code's auto mode, I micromanage and lightly review every line it's writing as it writes it, so I rarely have more than 2 sessions generating simultaneously. I suspect that a lot of the disappointment comes in when people try to delegate to it and trust it to not go off the rails. It hasn't earned that trust from me yet (and hasn't needed to yet).
Granted, I'm mostly working in small-to-medium codebases, 20k-30k LOC incl test suite. I wonder if that's a factor in my positive experience. Curious to hear your thoughts.
It really depends on the task, but, in my experience, in small-to-medium and bigger codebases alike, the amount of steering needed to get quality code is not worth it.
I see patterns and solutions emerging from hand coding, not the other way around. I can't start with a prompt, unless again I have the feeling that the task can be one-shot with minimum effort and context.
Starting with a prompt, or in plan mode, is not how I trained as an engineer. I cannot foresee what something should be/look like until I explore it myself with code I can relate to, that I'm connected with and that I fully understand. For example, my muscle memory suggests a specific data structure only after I see some code patterns emerging. Hard to explain, hopefully it makes sense.
If I ask the agent to do that initial exploring, even with a tremendous amount of instructions, guidelines etc., it usually starts with a path I wouldn't have started with. What I tried in such cases is to stop it, correct it and generate again, only to end up with more prompt words than lines of code. This is true for every visual task I'm working on (I program non web UIs). Let alone doing it via spec files. If it's something I don't care about, yeah sure, maybe a little tool for entering/editing data, but alas it always defaults to slop web apps, and I get it, I mean most of the training set is web apps
Irl, (a) different people's ways of working with ai are a million little islands and (b) bottlenecks vary enormously by coder and codebase/task.
Also... I think our era has an intrinsic bias that change=progress, productivity, etc.
Take the "networked computing revolution" of 1990-2000. These computers did land on every desk and every pocket. They are administration powerhouses. Excellent for all manner of administration tasks.
But... what this netted out to is "change." We send a lot more emails than we did letters. We communicate a ton. Secretaries went extinct. But "administration" grew.
A university faculty typically has more admins. Companies hire more accountants, HR, project managers, etc.
Maybe administration was never really a bottleneck.
Code has a lot of this. Everyone has a road map, wishlist, etc. It appears as though "code capacity" is the bottleneck. But maybe most of those companies can't really generate much more value from more software.
Anecdotally, it seems that many mid-tier shops are migrating/ modernizing their stack, and suchlike.
I haven't heard of many belting out features, and increasing prices or sales.
Most bottlenecks are upstream of another bottleneck. Few are a "dam."
I don't know that there was an inflection point. I know that, over the past year, they definitely became useful to me as more than auto complete.
My most recent pet project is a transpiler from Wasm to Go, and I find it incredibly impressive that recent models (I've used Sonnet, Opus and Gemini, far more successfully than GPT) are able to just pick up the project and work at all these levels:
- Go code that implements the transpiler (parsing Wasm, building an AST)
- Go code that gets generated by serializing the AST to a .go file
- Go code that manipulates the AST (to optimize it), and its effect on the generated code
- Go code that's grafted to the generated code (to implement more advanced opcodes) and how to interact with it from the AST
- C code that gets compiled to Wasm, then translated to Go, then called by Go
- Go code that gets called by this C code to implement a C stdlib
- WAT and WAST files that are used to implement the Wasm spec tests
I find this impressive because I have to think hard about all these levels, and I feel many programmers would have a problem with this.
And it's very often way easier for me to just write: "I want to generate this code, build me the AST that does it", than go "count parentheses" in the Go code (I do have some LISP experience; it's still easier).
Feel free to scrutinize/criticize the code. Not vibe coded, but plenty of GenAI help.
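For readers who haven't worked on this kind of tool, the layering described above (build an AST from instructions, optimize the AST, serialize it to source text) can be sketched in a toy form. This is not the commenter's transpiler and not Go: it is a self-contained Python illustration with a made-up three-opcode instruction set, and every name in it is an assumption for the example.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Const:
    value: int


@dataclass
class Local:
    name: str


@dataclass
class BinOp:
    op: str
    left: "Node"
    right: "Node"


Node = Union[Const, Local, BinOp]

# Hypothetical mini instruction set, loosely modelled on stack-machine opcodes.
OPS = {"add": "+", "mul": "*"}


def build_ast(instructions):
    """Turn a flat list of stack instructions into an expression tree."""
    stack = []
    for instr in instructions:
        kind = instr[0]
        if kind == "const":
            stack.append(Const(instr[1]))
        elif kind == "local.get":
            stack.append(Local(instr[1]))
        elif kind in OPS:
            right = stack.pop()
            left = stack.pop()
            stack.append(BinOp(OPS[kind], left, right))
        else:
            raise ValueError(f"unknown instruction: {kind}")
    if len(stack) != 1:
        raise ValueError("expected exactly one value on the stack")
    return stack[0]


def fold_constants(node):
    """Optimization pass: evaluate operations whose operands are both constants."""
    if isinstance(node, BinOp):
        left = fold_constants(node.left)
        right = fold_constants(node.right)
        if isinstance(left, Const) and isinstance(right, Const):
            if node.op == "+":
                return Const(left.value + right.value)
            return Const(left.value * right.value)
        return BinOp(node.op, left, right)
    return node


def emit(node):
    """Serialize the tree to source text for the target language."""
    if isinstance(node, Const):
        return str(node.value)
    if isinstance(node, Local):
        return node.name
    return f"({emit(node.left)} {node.op} {emit(node.right)})"


program = [("const", 2), ("const", 3), ("add",), ("local.get", "x"), ("mul",)]
tree = build_ast(program)
print(emit(tree))                  # ((2 + 3) * x)
print(emit(fold_constants(tree)))  # (5 * x)
```

Even in this toy, a change to the optimization pass only shows up in the emitted text, which is the kind of cross-level reasoning the comment describes.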
I agree, but you contradicted yourself just one line above.
> For generating production code even with a lot of steering and baby sitting? Absolutely not
Moreover this is further in contradiction with several facts:
1. the majority of this industry has always been composed of mediocre/bad developers, often unable to write a fizz buzz
2. the majority of work in this industry is implementing mundane CRUDs to move and transform trivial data across the organization's stakeholders and/or customer or third parties
3. there's lots of stellar and respected engineers leveraging the tools on a regular basis even on problems that are far from trivial and outputting quality code much faster than they would've done otherwise. Mitchell Hashimoto has blogged about it in his work on Ghostty, Sanfilippo has blogged about it in his work on Redis and so did plenty of others. I know several open source stellar developers who benefitted greatly from these tools, yet you think it cannot improve the quality and output of the most mundane tasks out there?
> I agree, but you contradicted yourself just one line above.
>> > For generating production code even with a lot of steering and baby sitting? Absolutely not
With this last sentence I obviously meant in my experience; it's not that hard. I don't buy your facts: they are highly biased towards web development. It's a common mistake here on HN to think that's the totality of the industry; luckily it's not
I had a really fun day yesterday because Anthropic's limits on their normal $20 subscription allowed me to play around for the whole day without hitting a limit.
It's 'production' code because it's a small browser game which has very small to 0 requirements on security and being perfect but high requirements on 'ever even doing this' and 'fun'.
The code it generated had 0 compile-time errors. I was able to describe 10 things to do in one task and it just chugged along solving all of them.
This doesn't need to become so much better to be useful. It's already very useful for a lot of use cases, like researchers who have to verify the math anyway but are not good at writing code for filtering their test data, converting it and running it.
Small websites, fun projects, helper tools etc.
But while we speak, in the background stuff is still happening left and right. More compute, better algorithm, more RL etc.
We could already be at 95% at 'ai will take your coding job' without knowing because these 5% are so relevant.
> Really? What duplication did you actually find? I count a few small ones in buildMounts and ReadPrompt, maybe 20 lines or so, but hardly anything worthy of such an epithet
>> embedding-shape:
>>The duplication I'm seeing isn't just "same text repeated" but structural duplication. Doing a quick 5 minute look again just to give you some pointers; runtime.MountSpec construction in buildMounts, Workdir vs aux-dir mount-mode handling, repeated one-off mount append blocks, overlay detection and so on, the list goes on. Just those should account for 200+ lines.
If you don't see any errors or problems, is it because there aren't any problems to see, or because they take a trained eye to spot?
"We could already be at 95% at 'ai will take your coding job' without knowing because these 5% are so relevant."
This is nonsense. I'm not a SWE but a CEO; if that were true I'd be firing without a hitch. And yet this is not the activity we see. Why is that? Perhaps merely writing code is not the entire job.
I'm convinced that the polarization is that one's impression of AI has a direct 1:1 mapping with one's previous level of skill and sensitivity to quality. Most people are by definition average and they are impressed.
Is there anyone in the industry noted for their skill, quality, and taste, e.g. Jonathan Blow, who is impressed and thinks the AI is really good? I haven't seen any. In my personal circle, the best devs I know are either micromanaging or shunning AI; none of them think the agents are capable or really good. The mediocre devs I know are largely on board. This applies both online and off.
Couple this with the fact that no AI focused project has come out, not a single one, that meets a high quality bar with nontrivial complexity.
I am an AI quality sceptic. They can be useful if you don't care for quality, but I never don't care for quality. I live for quality.
My explanation for the lack of shared experience is that quality is very language dependent. I work in Go and it's gotten really really good. I have to pick the right abstraction and it can be overly verbose at times, but it can make in 5 minutes what would have taken me an hour.
Steve Yegge wrote about this in his book Vibe Coding. He says it takes about a year of experience before you're consistently getting good results. He writes about lots of different techniques for doing that, but also says a lot of it comes down to just getting a feel for when the LLM is going to go haywire.
> but also says a lot of it comes down to just getting a feel for when the LLM is going to go haywire
That has been my experience too. The days when I'm very focused, being extra deliberate and constantly questioning/examining/challenging things, the results are much better. Autopilot days just go through in a daze and the outcome is objectively worse. This has made me much more hands-on and pushed me towards models which are actually not that "clever" like codex at effort=low but fast. Given that I'm doing the meat of the thinking, might as well not be slowed down by the model and lose the flow.
It's been 4 years of using them for me, before writing a book I'd wait to have a decade of experience to share with others, otherwise it would have the same value as a book on a react tutorial
I'd say closer to 6 months for me but probably still some room to improve.
I think getting a decent setup with a fast feedback loop for the agent combined with context (in repo markdown)+memories goes a long way.
After having Claude Code "remember" my preferences and tools, it's more efficient.
It has a tendency to copy existing patterns so a good AGENTS.md with best practices and architectural goals goes a long way to prevent it from duplicating patterns you're trying to get rid of.
> It's since november 2025, the so called "inflection point", that I'm still wondering for who coding agents become "really good".
I think this may depend on the sorts of work you do. For those of us who mostly live in web using established frameworks, that's about when I came to conclude they could do everything and do it well.
I can have opencode discover third party APIs and generate fully working solutions that are well integrated into an existing long-lived codebase. I still review the MRs by hand but I only ever discover spec errors or style issues, not defects in the code itself. This was a big change from ~summer 2025.
This is a really well defined space though with strong conventions. If you're doing something more interesting YMMV.
I believe by now we know exactly what it's good at and what it's terrible at.
The problem is our CEOs' fear of the future, which pushes them to peculiar decisions that objectively make no sense (cf. the infamous discussion of the Microsoft employee on GitHub who couldn't force their agent to do the proper thing).
It's not the first time I've witnessed this kind of discrepancy and probably not the last; I've just learned to adapt to it.
I'm moderately horrified every time claude runs the same broken, YOLO SWAG git commands from stackoverflow, gets errors, tries a few more things, then finally figures out how to commit and push correctly.
Long term, it can be better to slowly refactor parts of your code base into the way the model expects it to be. Sometimes fighting the gradient of code’s uniqueness vs expectation is not worth it.
> It's since November 2025, the so called "inflection point", that I'm still wondering for who coding agents become "really good".
You can dig up my past comments semi-arguing with simonw where I said AI just isn't good enough yet, but lately I've been using Codex mostly just to review existing Godot/GDScript code: https://github.com/InvadingOctopus/comedot
and now I'd say that in this day and age one would have to be dumb to not use AI in SOME way :)
It's helped me catch a lot of bugs that would have taken me a long time to even notice on my own. I guess it helps that the project is modular enough where most files can be considered standalone, with just 1-2 dependencies and well-commented already, so the AI can look at each file on its own one at a time. You can see the AGENTS.md I use on that repo.
Most of my productivity in the last 3 or so months has been thanks to AI, though none of the code there is AI generated. I even bought a MacBook Neo just to use as an "AI thin client" while on travel, even though I already had a beefy MacBook Pro M2 Max that I just keep at home/hotel as a desktop now. Codex's recent remote control features have made it more useful for the moments when I get a cool idea while out at a cafe or on a walk.
I don't just copy-paste the AI's output, because it's often inefficient anyway (like creating redundant variables/functions), but I find its findings useful for manually cleaning up my shit. Maybe their training data is not that good with GDScript yet which is a bit of a jank language anyway.
So my core code is wholly made by meat, but I do have fun now and then telling Codex to make experimental games using only the library of modular components I have written so far, to test my framework and also the AI's abilities. This kind of work seems like a surprisingly good match for AI: It just has to put existing blocks together, that already have well-defined interfaces/contracts etc.
I've been on the $20 ChatGPT plan for about a year now, and only started using Codex since like maybe 4 months ago, almost always on the latest model with "Extended Thinking" or "Extra High", because I want my shared code to be as correct as possible because everything else I do depends on it, and I only hit limits like 2 times in the last 3 months.
Grok is OK for general stuff, never tried it for coding.
Gemini's UI/UX and lack of privacy and the AI itself is so terrible I tried it just maybe 2 times ever...and it refused to work on Google's own Flights website and reverse image search! (it told me to do it myself)
Deepseek refused to talk about Taiwan or Tiananmen Square so I'm not sure if I can trust it for anything else lol
> I've been on the $20 ChatGPT plan for about a year now, and only started using Codex since like maybe 4 months ago, almost always on the latest model with "Extended Thinking" or "Extra High", because I want my shared code to be as correct as possible because everything else I do depends on it, and I only hit limits like 2 times in the last 3 months.
I've recently tried codex, and I have it set to plan mode with 5.5 and I'm hitting the limits on a single task on a "medium" sized codebase.
Thanks for sharing your experience! I totally agree that if you "own your code", as in you're invested in it, coding it and documenting it, these tools can be really valuable for review, bug fixing and maintenance. It pushes you to do better, maybe one piece at a time like you said with a good modularized codebase. I think more devs should share experiences like that; we should push back against the marketing and the narratives from people who "don't code anymore since X".
I first started noticing they were actually useful around Dec 2025, through about February. I got pretty good at using them, and was amazed at their utility, especially Claude and Codex. Then sometime in March, they got really frustratingly dumb. Things that they used to get right in one shot suddenly took several tries, and I had to watch them like a hawk because they constantly made stupid mistakes, not following instructions that previously worked. I had one try to fix a failing test like this:
assert_eq x, true if x == true
Both Claude and Codex, both with the latest versions and the original versions that had been working.
Now I just use deepseek. It isn't any dumber, and it costs way less.
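For anyone wondering why the guarded assertion quoted above is useless as a test, here is a minimal Python sketch of the same pattern. The function names are made up for illustration, and the original snippet was not Python; this only mirrors its logic.

```python
def guarded_check(x):
    """Mirrors the quoted "fix": the assertion only runs when it already holds."""
    if x is True:
        assert x is True  # can never fail, because of the guard above
    return "passed"


def real_check(x):
    """What the test presumably meant: fail whenever x is not True."""
    assert x is True
    return "passed"


# The guarded version "passes" even for the failing value.
print(guarded_check(False))

# The unguarded version actually reports the failure.
try:
    real_check(False)
except AssertionError:
    print("real_check failed, as it should")
```

The guard makes the test green without fixing anything, which is exactly why it needs a human watching.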
I'm curious. What have you actually tried? Are you just prompting the LLM with one off tasks? For good results, you need to take the time to read the documentation for the harness you are using and configure your environment. This tuning can take dozens of hours to nail down. Then there's the actual approach for working on your projects. Many people that have good results with agentic coding actually spend the bulk of their time in plan mode where they go back and forth with the LLM designing a granular playbook for the task at hand before they ever have it write any code.
I'm curious. What makes you think that me sharing an example (which one of the many?) of what I actually tried would somehow add something to the conversation? What's the usefulness of just an anecdotal example?
As I said, we have plenty of different envs, codebases, requirements. Things are complex.
You're posing it like I tried just one time. It's been hundreds of hours of tries and I just found out what works best for me, like everyone should do. My original post above isn't that hard to understand.
Let me stress this again:
> That's why the debate is so polarized imo, there isn't a shared experience
Have had fairly good luck with Claude Code Opus 4.7 on xhigh effort.
I think it more reliably does IaC with established patterns especially when it can do a dry run.
Python is pretty decent but usually you need good prompting and a little bit of steering to prevent slop. The slop usually works tho
Codex w/ gpt-5.5 seems faster but maybe just a bit below Opus 4.7 quality.
I gave Opus access to a repl (pyrasite-ng) in a running Python process and it managed to find an 8 year old "memory leak"--a module level cache with no eviction. It did that using GC module and exploring the heap. I was pretty happy with that outcome. It would have been quite challenging for me to find myself without at least a few weeks of deep diving into memory leak hunting docs/resources.
you are experiencing reverse Dunning–Kruger effect.
For someone that just dabbled in coding prior, it went from AI building 80%, and struggling through to finish the 20% when trying to build an app/website.
Now it's like 97% and struggling with the last 3%. Yes, it'll look rough around the edges when evaluated by a senior dev, but being able to build MVP level things to completion with ease helps you stay engaged and motivated to continue and learn.
Please do not cite Dunning–Kruger effect at random.
Who needs to generate a dumb demo of a 97% done CRUD app? We had code generators for those. Every time I read claims like that and ask for further explanation, I discover it's people who were not productive before generating the so called "MVP level things to completion with ease".
If you're trying to solve a HARD problem people REALLY have, it's a novelty that agents can't help with, otherwise if it gets 97% there MAYBE it's just a signal that your idea isn't that novel!
> It's since november 2025, the so called "inflection point", that I'm still wondering for who coding agents become "really good".
The answer is "for lots of people, but not you".
You're doing a vague impression of being fair and even-handed, arguing for non-polarization, but underlying everything you're saying is an obvious attitude of polarizing superiority: That _your_ personal experience with AI is the real truth. That _your_ codebase is more intricate and more challenging than what other people are doing. That everyone else is being led by a "marketing hype train".
Somewhere right now some human artist is being tasked with drawing illustrations of pelicans riding bicycles to be used as training data at a big AI lab.
Every modern image-generation model can generate a pelican on a bicycle trivially. The point of the test is to generate SVG text that represents an image, which is more complicated.
Yes, there are ways to convert raster images to SVG for use in training data but it's not a good use of anyone's time.
The quality of the Gemini pelican was such a step change in one iteration, while the other benchmarks remained quite flat, that I think you are right. Although whether they targeted Pelicans in particular or just svg, I can't say.
I'm curious how the 6 months have looked from a non-programmer's perspective. What kind of co-working tools and similar optimizations have people from other fields experienced?
I am an instructor who helps deliver an apprenticeship. My new boss has been in our industry for about 20 years and is one of the most respected people in our company. They've just joined us to teach and are off doing a two week course. On the first day she was told to let AI write all of her lesson plans, and then feed the lesson plans to AI to make her slides...
Hopefully she rejects all this out of hand, but if she doesn't it'll mean that none of our trainees get the benefit of her experience, who she is as a person, and what she has to pass onto them.
We have 6 monthly reviews as instructors where we are told the same thing. "How could you use AI for your teaching?"
They don't even feel the need to justify why this would be desirable, or is needed at all. It's just pure bandwagonning. Unbelievably, most of my coworkers are extremely positive about AI, although none of them have told me they use it for anything besides preparing their lessons for them — they just use it instead of having to think, or spend time preparing...the only important thing they do at work.
I’m teaching a class at a university in Japan (on AI-related issues, as it happens). I’ve been teaching for more than 40 years, but at 106 registered students this is by far the largest class I have ever taught. AI tools are very helpful for class management, such as keeping track of attendance and homework submissions.
I have to consciously avoid using AI for more cognitive tasks, though. It would be very tempting to have Claude, ChatGPT, or Gemini summarize, classify, and grade the students’ assignments, write individual feedback, prepare my lesson plans, etc. However, I know that my engagement with the material and with the students would suffer. I also want to show the students that they are learning together with me and with each other, not with bots.
I am semiretired and have a light teaching load that gives me plenty of time to prepare for class. I can see that full-time teachers might find it hard to resist the lure of offloading their thinking to AI.
I've been a teacher (most of the time a college professor) for...a long time. Nowadays, when preparing a new course, I definitely work with AI: "Here's what I want, and who my audience is - give me a course outline".
That gives me a starting point. Of course, I modify it. Maybe I bounce back and forth to the AI for further refinements and suggestions, but ultimately I have to be happy with the result.
When prepping the individual lessons, the biggest time saver is coming up with examples to illustrate particular points. I could do this alone, but sometimes that involves staring at a blank screen for a while. It is faster to ask the AI for suggestions, pick the one I like, and refine it further myself.
- pre GPT-5.4: very limited use; some smart people got some mileage out of the models, but it always required serious work and a very suitable problem. Of course the models could solve homework problems, but that felt more like a downside to us who teach.
- since GPT-5.4 (Mar 2026): the "wow" release; suddenly answering MathOverflow-level problems that have previously been stumping experts. Still prone to hallucinations, but smart enough to use the built-in Python skill to verify its claims on small examples when possible. Probably a lot better at formula-heavy math than at the abstract "philosophical" kind.
- GPT-5.5: gave me a fascinating, significantly nontrivial and highly instructive "proof from the book" on an MO-hard problem that I'm in the process of writing up. Might have been luck and good prompting, though. Didn't really feel like a qualitative leap from 5.4, but I take quantitative any time. Still requires suitable problems, but it's much harder to rule out suitability from the get-go.
Claude and Gemini have been also-rans the whole time and still are. I use Claude for secretary-like tasks; occasionally it finds an easy proof too, but usually because I've missed something obvious.
Oh, and GPT, and to a lesser extent Claude, are great at hunting errors in maths. Probably 90% of my prompts so far have been for proofreading my writings.
I work at a company that deploys AI to enterprises
The average office worker is amazed at Copilot (not in the IDE - but the app bundled with Windows), and they mostly copy paste material into their enterprise provided ChatGPT / Gemini, and get tips from Facebook / Instagram on their top 5 best prompts for work productivity
Showing them agents that automate work at scale is a very magical experience
And then everyone that has to deal with their copy pasted output is too nice to say how bad it is and how much work it just offloads to the next person that’ll probably get frustrated and have an agent handle it.
Claude in Office was a tipping point for nontechnical folks around me. Everyone’s slides decks are immaculate now. Finance isn’t needing nearly as much BI help. It’s pretty impressive.
I find it really troubling finance are relying on LLMs (word generators!) for financial analysis - I mean I guess it means there will never be any annoying gaps in the data.
Interesting. I don't have to use PowerPoint much, but I hate it when I do. I don't want the llm to write the words but I do want it to make things look nice. So does this work well now?
In business: using coworking tools to review and propose filing of emails; manage my files and folders; on a daily basis scour the intranet for interesting and relevant content.
Personal: my wife tutors in her native language to non-native primary and high school kids. They are all using these tools now to generate fresh content for practice based on school lesson plans. The kids are improving much more quickly now than they were just a few months ago.
As a former data scientist, I started to use a code agent 3 months ago. Before that, I used chat completion on the web. Now I do nearly everything that outputs documents with a code agent.
I think Claude Cowork through the Microsoft thing which was copilot but is now named M365 (or something?) is likely creating every PowerPoint presentation within our organisation at this point.
We have whatever AI is in Teams transcribe every meeting, and it's scarily good at it. It's also extremely good at summarizing or finding things from previous meetings when tasked. One disadvantage of this is that I can see how stupid I sound in writing. I'll go "yeah, hmm, yeah, that's, yeah", but it really is pretty good.
I assume we're going to see a massive increase in AI with this Cowork inside the Microsoft client. We actually have a better tool available through a librechat where you can create and configure your own agents with the same filesystem access to your one drive, and a lot more tools and models than just Claude. Almost nobody has been capable of figuring out how to use it though, so they've been using the regular office365 copilot and it sucks so bad that a lot of people stopped believing in AI.
It's ironic that Microsoft, despite fumbling the ball on AI, is so good with enterprise customers (especially non-IT) that they'll likely be the company that sells us AI tools people will actually use. I have no idea why it's so hard for people to pick up the Librechat tool we're given access to through our equity fund. It's quite literally a copy of ChatGPT where you can point-and-click configure an agent, but we're seeing that even employees who use a lot of ChatGPT privately don't use this tool professionally. Meanwhile everyone has been capable of using the Microsoft thing (that I personally think is less user friendly since you will need to add your configuration files to every prompt).
"I have no idea why it's so hard for people to pick up the Librechat tool we're given access to through our equity fund"
That's because M365 is integrated with the whole Office/Exchange environment, especially in terms of security policies, etc. MS also guarantees that the data is private, which is very important for many companies both from the IP protection perspective and because of the liability for exposing user/customer data (think of the GDPR regulations in Europe).
I don't know who is behind Librechat, probably some good and friendly folks, but when it comes to privacy/security Microsoft has much more to lose, and if shit happens it is easier to sue them than some random VC-financed company from the USA.
At work the tools handed to most are still essentially chatbots. Getting access to coding tools is an uphill battle because there isn’t really a good way to manage risk yet. Hard enough to keep a coding agent in check locally and ensure it doesn't rm -rf anything. Scale that to thousands of people with limited skill and it doesn’t really work. So currently they just don’t.
That’s in a finance shop. I’d imagine it’s different in programming shops where handing people Claude code is a bit more plausible
Purely anecdotal, but in my team of 20 data analysts, we've seen a bunch of them become quite productive in producing tools and apps. These are analysts with mostly domain knowledge, and not so much programming knowledge - meaning that they knew the basics to write scripts, and wrangle data programmatically, but not enough to actually engage in software engineering.
Some of these are now contributors.
I also have a friend (beware, N=1 study) with zero prior programming knowledge that has released his first app.
They lag behind because we build for ourselves first. We are rolling out Claude to the biz team this week and they will get access to Cowork, which is still preview aiui.
Sales will be another big user of agent automations, for better or worse. Poor usage by Google to craft emails and slides for us is why the suits are getting an Anthropic sub. Stay human in the loop my friends!
I've always been a "power user", making little python programs and figuring out new ways to do things with seemingly unrelated systems. My knowledge is shallow, but very broad.
A year and a few jobs ago I was genuinely up against a wall I could not see breaking through, not if I wanted to ever sleep again. Hundreds of completely bespoke customers. Hideous archaic tooling. Two of us. It was bad times. So I started paying for Claude - desperation move, to try and vibe my way out. Honestly, it's been a little bit like having superpowers.
Not just code generation, which has been great, but gaining knowledge and understanding with incredible velocity - sort of like how RSS felt back in the day, or when Google stopped being worthless in the very end of the 20th C. When Wikipedia started.
So where am I now? Well, I ditched the hell job (I didn't really drink the koolaid of their "Enterprise Solution" anyway), and got a regular day job in my core competency. I guess I do a lot of what is called "vibe coding", all kinds of utilities, what I call my "extracurriculars". A graph view for Asciidoc in VSC to show includes, xrefs, partial includes. Graph view for everything actually - it's surprisingly insightful for PDM and config management. Analysis tools for sensor faults based on Python open source astronomy tools. All sorts of converters and aggregators and cleaners for a devil's piss bucket of enterprise systems. A bazillion new MapTools macros for gaming, making complex RPG systems nearly pushbutton. A little harvest of local LLM systems doing all sorts of things, like my "Reviewinator" for copy edit. I could type the rest of the day and wouldn't come close to the end of the list.
So, pretty amazing. Very interesting systems with what must be some N-dimensional geometry underlying, maybe a signal to an underlying principle of emergence. Who knows?
In the long term, it's going to be Enterprise Software that eats the big losses from these systems. For all sorts of reasons, but mostly because Enterprise is where software goes to die. It's all bespoke to hell, it's all ancient, no one is working there because they want to. So a domain expert, with AI assist and a little know how, is probably going to whip up a superior set of tools in a short enough time to make it really worthwhile. Watch that space: SAP, Siemens, Teamcenter, SalesForce. Watch their consulting revenue.
If you only read bad news (i.e. mass news these days since that sells better) this will be the picture. But I have personally seen some insane stuff happen in biotech. Like, I can't believe we're lucky enough to possibly live our life in this kind of future. We already have actual therapeutics developed using AlphaFold being tested right now in real clinical trials, but the next generation of stuff that will go into trials in the next 3-5 years will be insane. We will look back at current medicine like we look back at medieval times today.
My mother is going on 5 years with multiple myeloma, a cancer that would have offed her in 5 months if it weren’t for advances in maintenance chemotherapy.
Simon mentions further along in his article that given Jeff Dean’s post referencing the pelican-riding-a-bike task (and how good current models are at doing it), that it’s no longer a great benchmark to use. Enter the opossum riding an e-scooter!
> Why this test? Because pelicans are hard to draw, bicycles are hard to draw, pelicans can’t ride bicycles... and there’s zero chance any AI lab would train a model for such a ridiculous task.
At this juncture I'm left wondering why competing AI labs wouldn't train for this now well known "test".
Graphically perfect, but content-wise nonsense. The pelican's center of gravity is clearly behind the wheel. It needs to be above or very slightly ahead of the wheel.
The length of the pedals keeps changing, and you'll notice that neither of the pedals actually rotates around the hub: consistent with your point about the center of gravity being too far back, the circle the pedals are making is also shifted back too far.
Still impressed. And, to be honest, I don't think that this problem matters much. Physical accuracy is very nice, but it is not, for example, the most important aspect when I watch a fantasy movie. Or even a scifi one.
Google/Gemini has pretty impressive audio visual capabilities. I tried to have Claude add mulch to a landscape picture and it looked like someone hit it with the orange spray paint tool in MS Paint. Nano Banana actually produced something fairly realistic
It’s the opposite, non-creatives (if such roles even exist in those industries) should be worried. All those models offset technical skills, allowing to get from idea to implementation through a different route (which can be easier or harder depending on idea and model - good luck tweaking that pelican’s exact pose and movements to match your imagination precisely). Nothing touches creativity, not even in the slightest.
But there’s a lot of panicking, fear-mongering and all sorts of nonsense around this whole subject.
Willison chose this task because (unlike actual images of pelicans) it was clearly not in training data, but could be reasoned about and composed from what's there. But just like those "how many golf balls can you fit in a 747?" interview questions, it should now be retired.
Whether it turns out to be a good change or not remains to be seen.
The half-full view is that the models are so good at finding vulns that if you plug them into your build-pipeline then the amount of new vulns introduced will go down towards zero.
The half-empty view is that we're now producing more junior-level code with less review, so everything will have more vulns. It's also cheaper and easier to find them, so prepare for chaos.
Short term there is sure to be chaos either way as the models are clearly good enough to find all the old bugs, and not everyone has the resources or will to try to stay ahead of the curve like Mozilla is trying to do with their Mythos access https://blog.mozilla.org/en/firefox/ai-security-zero-day-vul...
There's a major caveat to the half-full view: You'll only stop adding new vulns that your model can find.
A threat actor with access to a better model or more money to burn on tokens may yet find more. Some of them have deep pockets, and not nearly every project will get the Glasswing treatment of free Mythos tokens.
Three deterministic Linux LPEs in a week, an LPE in BSD in execve (of all things...), nginx vulnerabilities, one or two new gnarly supply chain attacks. Linus noting that the linux-security mailing list is getting flooded with duplicated, AI-driven reports of varying quality. There are pretty crazy keycloak vulnerabilities getting discovered.
We're most likely entering a year or two of rapid vulnerability discovery and patching, as well as reducing and minimizing system footprints just to survive the onslaught of strange vulnerabilities from e.g. ancient and widely unused kernel modules.
I met a few people at PyCon this week who have been part of Glasswing (they're just starting to be allowed to talk about it) and it really does drive down the cost of finding vulnerabilities.
Broadly, I'm talking about the shift from building elaborate vulnerability research harnesses towards using the frontier models and their RL-optimized harnesses to build simpler vulnerability discovery pipelines. And then: the ensuing carnage.
Not op but just look at HN posts in the last couple weeks: supply chain worms, zero-day LPEs for all OSes seemingly every other day, researchers on X and here openly saying they’ve got more valid findings than they know what to do with
I swear to god that DeepSeek V4-Flash is the most useful model available right now. It's SO FAST and is good enough for so many tasks that I run it most of the time for almost everything. Even when it messes up, it's so cheap to iterate that I can fix most problems without changing the model to a more "capable" one.
I'm tired of the pelican bench, it made sense in the beginning, but at this point it got too popular and old to consider the assumptions from one year ago (absence in the sample/training/reinforcement) to still hold.
The tooling has become so good though - the eco-system around the LLM. The models have become really good, yes - but it's definitely slowed in my opinion. The tooling is what really has become great - "harness" is probably the best word. When folk like Elon/Schmidt/Thiel/etc. talk about singularities and industrial revolutions - it sounds extremely out of touch - or actually protective of the massive capex they've potentially sunk.
EDIT: Schmidt's booed commencement speech was probably one of the most out-of-touch speeches (outside of a tech interview) I've heard.
Reading through the thread, a lot of the inflection point debate seems to come down to people talking past each other about what got better. My read is that the models themselves didn't really jump in capability around November, but the harnesses around them got considerably more reliable, and the RLVR work earlier in 2025 had been training the models specifically to behave well inside those harnesses, so when the two met you got a compounding effect that felt like a step change even though neither piece was that dramatic on its own.
I think that's probably why everyone in this thread has such different experiences - someone whose workflow is mostly asking a model for code and pasting it in would have seen modest improvement and would reasonably wonder what the fuss is about whereas someone who was already running agents on 20-step loops would have felt a much bigger shift, because the thing that used to kill those runs was the failure at step 12 cascading into garbage by step 20, and that got a lot better.
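To put rough numbers on the 20-step point, here is a back-of-the-envelope sketch. It assumes every step succeeds independently with the same probability, which is a simplification, and the per-step rates are made up for illustration rather than measured.

```python
def chain_success(per_step: float, steps: int) -> float:
    """Chance that every one of `steps` independent steps succeeds."""
    return per_step ** steps


# Illustrative per-step success rates, not measurements of any real model.
for p in (0.90, 0.95, 0.99):
    print(f"per-step {p:.2f} -> 20-step run {chain_success(p, 20):.2f}")
# per-step 0.90 -> 20-step run 0.12
# per-step 0.95 -> 20-step run 0.36
# per-step 0.99 -> 20-step run 0.82
```

Under those assumptions a few points of per-step reliability barely register on a one-shot prompt, but they decide whether a 20-step run finishes at all, which would fit the split in experiences described here.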
The local model story Simon kind of glosses over is interesting for the same reason - a 20GB model drawing a decent pelican on a laptop is a cute data point in isolation. The thing worth noticing is that a competent local model inside a good harness now gets you closer to frontier performance than running the frontier model without a harness does.
How much of what is being generated by LLMs is actually value add? My perception is there are lots of great experiments, but little real value.
+ Developers are more productive, but are you all leaving work at 3p and enjoying a new found sense of work-life balance?
+ Companies are investing heavily in AI, yet I'm paying more for the same thing. Jamie Dimon still pays me 0% on my checking despite spending billions on AI.
It may be that simply adopting AI isn't enough. Could new startups that are born-in-AI buck this trend? I wonder what Clayton Christensen would say if he were still around.
December 2025 was the breakthrough for me.
January Claude was euphoric, ChatGPT was up there. February Gemini cooked for a second there. March amazing. April the big bad nerf. May GPT 5.5 is just pure bliss, although with 2x limits temporarily. Not sure about Claude: it's sort of okay, still not as good as it felt before, slowly increasing limits with more compute and rebuilding good will.
I think Opus 4.6 at its peak was the "how can anyone not get that this is good" for me.
Then the nerf, and the massive uplift in tokens for 4.7, a model which I find lazy and prone to hallucinate.
It's probably time to try GPT5.5. Like many I'm pretty heavily invested in the anthropic ecosystem at this point, which I suppose gives another strong reason to make the switch.
The openclaw ban pushed me over to 5.5 for some daily usage. I feel like Opus and 5.5 are good at very different things. 5.5 can be too literal, and it does not have as much of a ‘creative’ bent whether that’s toward design, UI/UX, interpreting vague instructions, etc. So, in that way, Opus had sort of spoiled me.
On the other hand, this year I’ve been in the habit of using codex as a bug finder / audit layer, where it shines, and I can tell you, Opus makes a lot of mistakes, and as we all know struggles with laziness — and has gotten good at encoding that laziness into the codebase (// Per instructions, pass this test by default) where it can live for a long time. So, Opus had spoiled me, but more with its ability to sketch holistically than its ability to put out perfect codebases.
Upshot - it was good to switch horses for a while, as you mention. Slightly different skill sets there. And I still reach for claude especially for initial design. But right now the daily driver is 5.5 / xhigh fast mode, and it’s very capable.
I only used Claude for the first time in April, previously only ChatGPT and Gemini. And I struggle to see what the hype is all about - yes it seems a tiny bit smarter than the pack, but on the $20 subscription it runs out of tokens in 5-20 minutes, and then you need to wait 3-4h.
ChatGPT 5.5 seems capable, although a bit stingy with “thinking” compared to earlier models, and I never run into session limits.
I was a dedicated Claude user but in March/April I started using GPT5.5 on a new project that Claude had tried and failed to execute successfully. GPT knocked it out of the park, and was able to do it within my subscription allocation of tokens. I'd recommend giving it a go at least. Something like OpenClaude can let you use the Claude tools you're used to
"there’s zero chance any AI lab would train a model for such a ridiculous task"
Hmm, given how small the nerd community is and how often I met that task either on Hacker News, or on various Substacks, I am not so sure that the AI labs would ignore it completely.
100% true - I only had five minutes so I had to edit it down to just a couple, but all of those models are excellent and keep leap-frogging each other.
Am I crazy, or are these differences between the best models so marginal that you’d get roughly the same performance if you use the same high-quality harness (ie preloaded instructions from md files, including custom skills)?
You will immediately notice the difference if you use it at the threshold.
It's like most people just watching a 'starting nba player' (not superstar, but just starting player) vs one that sits on the bench.
If you were just watching them play, work out, shoot - you'd never notice the difference.
Put them head to head and it's 98-54 and you start to see the patterns.
It's pretty interesting actually, someone tell me what the 'science' for this is, I'm sure there is some kind of information theory at work here.
Software has innumerable kinds of problems at varying level of complexity and so it provides the perfect testbed for seeing how far models can go in practice.
Should add: you're very right to hint that harness, tooling, and models tuned to both the harness and the kinds of things people do on the harness, as well as some other things, do make an enormous difference.
By and large, SOTA Codex/Claude Code are substantially better - at least for now. That may change.
No you're not wrong. Many people will see what you see. Enthusiasts will see it as monumental squeezing out that last drop of performance. In my opinion I think it is okay for enthusiasts to feel that way. I'm just satisfied with getting a tool as an aid.
Personal opinion: we need to focus more on efficiency instead of how large or complex a model can get, as that model creeps into more resource requirements. If the goal is to cost a billion dollars to operate then we've really lost the idea of what models are supposed to be achieving.
By definition the differences between "best models" are small. It's tautology. If a model is significantly dumber than the others then it's not one of the best models.
I have the same experience. I've been running sequential agents in my own harness that is a standard SDLC pipeline (plan, design, code, build, test). It has gates between each stage to control quality.
The big benefit of automating this for so long is that I have lots of data. I analyzed it and found that I can change the models out without much of a change in the output quality.
For one-off tasks, where there is no harness and you're just YOLOing with the TUI, yes, big difference. You need a harness.
The pipeline controls the quality far more than the model, empirically.
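For anyone wondering what "gates between each stage" could look like, here is a minimal sketch, not the commenter's actual harness. The `Stage` class, `run_pipeline`, and `fake_model` are hypothetical names, and the gate checks are trivial placeholders standing in for real quality checks (lint, build, tests, review). The point is only that each stage's output has to pass its gate before the next stage sees it, so the model call is swappable.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Stage:
    name: str
    run: Callable[[str], str]    # takes the previous artifact, returns a new one
    gate: Callable[[str], bool]  # True if the artifact may move to the next stage


def run_pipeline(stages, task, max_retries=2):
    """Run stages in order; retry a stage until its gate passes or retries run out."""
    artifact = task
    log = []  # (stage name, attempt number, gate passed) for later analysis
    for stage in stages:
        for attempt in range(1, max_retries + 2):
            candidate = stage.run(artifact)
            passed = stage.gate(candidate)
            log.append((stage.name, attempt, passed))
            if passed:
                artifact = candidate
                break
        else:
            # Every attempt failed the gate: stop instead of passing bad output on.
            raise RuntimeError(f"gate failed for stage {stage.name!r}")
    return artifact, log


def fake_model(label):
    # Hypothetical stand-in for a model call; swap in any model here.
    return lambda artifact: f"{artifact} -> {label}"


STAGE_NAMES = ["plan", "design", "code", "build", "test"]
stages = [
    Stage(name, fake_model(name), gate=lambda a, name=name: a.endswith(name))
    for name in STAGE_NAMES
]

result, log = run_pipeline(stages, "task")
print(result)
```

Because the log records every gate result per stage and attempt, it is also the kind of data you could compare across models, which is roughly the analysis described above.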
You have correctly identified that getting a "high-quality harness (ie preloaded instructions from md files, including custom skills)" is the (or at least a) hard part.
Because you have to adjust the harness to your problem space and provide that so you can say it is high-quality.
Many people will stop that discussion at the claude code vs. codex vs. opencode level and then merge that with discussing model performance.
And that is also why "Generate an SVG of a pelican riding a bicycle" is still a benchmark worth discussing. Because at least it is a defined problem space.
Haven’t noticed much significant progress in LLMs myself in 6 months (significant as in new or vastly improved capabilities or understanding, not new releases, there are plenty of those).
I feel like if anything people started to realise the significant limitations of LLMs when you try to use them as ‘agents’ which was the big direction LLM companies tried to push recently.
Best use of LLMs so far IMO is finding vulnerabilities (with human help) and pattern matching in other domains. For generating code and prose they are still mediocre and somewhat unreliable and for use as personal assistant agents I wouldn’t trust them.
So what’s happening with openclaw, the biggest experiment in agentic, vibe coded by the agents themselves? The thing that was so hot a few months ago.
They’re definitely RL training the models on the pelican test. They patch any kind of test that shows them performing poorly by hardcoding some answers into the model.
> One of my projects was a vibe-coded implementation of JavaScript in Python—a loose port of MicroQuickJS—which I called micro-javascript. You can try it out in your browser in this playground.
I'd like to remind everyone here that people on this forum used to actually code truly remarkable and pointless stuff like this, with zero LLMs, using nothing but their brains and motivation from who the heck knows where from.
We all have had the client from hell: they don't know what they want, they change their requirements all the time. Whenever they have a new half-baked idea, I need to scramble and re-design the architecture. They have no clue that a small change request has a big impact on the code.
Well... Now I can be that client. And let AI deal with my incomplete, always changing requirements. And get it done anyway.
Opencode has free access to Qwen 3.6 and Deepseek v4 Flash right now.
They're on par with Claude and Codex imo - when you still design architecture and know what the output should be. Claude and GPT 5.5 need less guidance with vibe coding, but we're not yet at a point where that's sustainable anyway even with those models.
>Starting from zero today, how would someone quickly get up to speed with the latest and greatest AI tooling on an extremely limited budget?
Z.AI, Moonshot.AI, Xiaomi, Minimax, Alibaba all have coding plans that allow heavy usage of GLM 5.1, Kimi k2.6, Minimax M2.7, Qwen 3.6 Plus, Xiaomi MiMo v2.5 Pro for cheap.
Pair those coding plans with the harness of choice including Claude Code and you are good to go.
I made an account on OpenRouter.ai, created an API key, plugged the API key into the Zed editor, and started asking free models questions about my codebase.
Once I felt I had some confidence on what the spend rate would be, I bought $20 USD worth of credits and would occasionally point my editor at a cheap paid model for some real-time questions.
I've still only spent less than $2 in credits so far, as often a free model can answer my question fast enough.
I have not yet tried agentic coding, but at least with OpenRouter API keys it's trivial to cost-cap keys so you can pay for lower latency and still cap your spending.
The honest summary that doesn't show up in the six-month roundup: the unevenness. Boilerplate, tests, scaffolding, glue code: dramatically faster, sometimes 5-10x. Architecture, data modeling, careful security work, judgment calls about what to build: same as before, sometimes slower because tab-completion sneaks in plausible-but-wrong defaults you then have to undo.
The thing headline numbers ("AI made me 3x faster") hide is which 30% of the work the AI sped up and which 70% didn't move. For a solo dev the survivable bet got smaller, and that's the real change, not raw productivity. AI made certain projects worth attempting at all that wouldn't have been viable six months earlier.
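To put rough numbers on that, here is a small sketch using Amdahl's law, which is my framing rather than something the comment names. It assumes the 30% figure is the share of work that gets faster and that the 5-10x figure applies to that share; both are read off the comment above and combined here as an assumption. `overall_speedup` is a hypothetical helper name.

```python
def overall_speedup(accelerated_fraction, local_speedup):
    """Amdahl's law: total speedup when only part of the work gets faster."""
    if not 0.0 <= accelerated_fraction <= 1.0:
        raise ValueError("accelerated_fraction must be between 0 and 1")
    if local_speedup <= 0:
        raise ValueError("local_speedup must be positive")
    # Time left = untouched share + accelerated share divided by its speedup.
    remaining_time = (1.0 - accelerated_fraction) + accelerated_fraction / local_speedup
    return 1.0 / remaining_time


for local in (5, 10):
    print(f"{local}x on 30% of the work -> {overall_speedup(0.30, local):.2f}x overall")

# Even an infinitely fast tool on that 30% cannot beat 1 / 0.70.
print(f"ceiling: {1 / 0.70:.2f}x")
```

Under those assumptions the whole job gets about 1.3-1.4x faster, with a ceiling near 1.43x, which is a long way from a headline "3x".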
what are your thoughts on Software engineer replacement. My team has already seen big reductions. Q/A team is gone. Software Engineer reduced by a third.
Scared for the future
Ditching the QA team when the single highest challenge is verifying that vibe-coded systems do what they're meant to is extraordinarily short-sighted.
Personally, the more time I spend working with coding agents the less worried I am for my career. Getting the best results out of them is really hard. They amplify existing skills and experience, so the more experience you have the better.
I believe that many of those saying that they "never write code anymore" or are experiencing "10x productivity," are heavily underestimating (or outright misrepresenting) how much they are guiding the model, and ignoring everything else that goes into shipping fit for purpose software. I frequently see zero measurements or factual arguments supplied to support such claims. I also see many people say that they are "vibe coding," when they are almost certainly reviewing, editing, or otherwise steering the output.
I wonder why there is such a mad dash to trump up the capabilities of coding agents. And why such loose terminology and lack of rigor? I thought programmers were supposed to be rational people (har har!)
Have you seen the automated tests that QA members deliver? My experience is that they are horrible, and it's not so hard to beat that low quality bar with an LLM.
I have a theory: if they were good at writing automated tests, they would have been developers instead of QA engineers.
Not saying that there aren't any high quality QA engineers, I worked with some. But LLM's raised the bar in a way that most QA engineers can't reach.
If you're famous, you'll be fine. If you're at retirement age, you don't care. Otherwise, good luck! We put ourselves on the street by not protesting what is happening.
There is an entire category of software engineers who exist entirely to knock out features on microservices or do easily automatable QA work whose jobs will disappear.
I think there will be larger markets, more companies, more jobs than before due to AI, but also a very painful transition period
AI reduces the cost of producing software (and other intellectual tasks), which greatly improves the viability for more and more ambitious projects. As far as we know the amount of problems software (and humanity) can solve is unbounded
It feels like the market has shifted in SWE yet again to heavily prioritize a new set of skills, of which those in the top quartile are desired more than ever
> Google released the Gemma 4 series of models, which are the most capable open weight models I’ve seen from a US company.
Implying another country has a better model? I'm being pokey here because I'm very curious! I know Gemma is efficient, but I also remember Qwen and Kiwi being referred to as optimized. The difference being that Gemma is using fewer tokens, but maybe Qwen/Kiwi's quality is higher? I don't know.
The consensus right now is that Qwen3.6 in its 27B and 35B-A3B versions is better for coding whereas Gemma4 is stronger when it comes to OCR, audio transcription and the like. Margins are slim though and the harness at these model sizes is the most important factor.
There's also an inflection point in Feb-April: Claude got considerably worse, and arguably has not really recovered since then. They claim it's fixed, but in my experience it is not as great as it once was. 4.7 is still useless.
Waiting for the next event at this point. Hoping that "inference becomes cheap" when Groq hardware gets delivered.
Also LinkedIn wars of people trying to claim throne as most AI-pilled, throwing down strawmen stories of luddites yelling at data centres who'll lose their job to a single person doing 100x work.
The top model changes every other month between Claude, GPT and Gemini, but overall it's dominated by GPT. Claude had taken the lead in coding tasks, but GPT 5.5 has come back stronger. Gemini was good in between, but overall it's dominated by GPT 5.5 and Claude. Coding is the area where disruption is hardest. OpenClaw early this year was a major breakthrough in agentic AI, and it is still making noise, becoming more mature and moving toward enterprise. Agentic coding is still in the adoption phase, where teams are trying it, trying to make sense of it, running it and not believing it, and eventually it becomes a discussion point over tea. But the needle has moved from being alien to being something real that teams have started discussing and using like a champ.
Why is there no talk about how the world is already run by AI by proxy? I.e. bureaucrats using ChatGPT to make their speeches, decisions, shopfront designs, etc. I just don't seem to read about this; instead it's more this nebulous specific date in the future.
My goal post for "AI will definitely replace most SWEs" was to reproduce a particular 90s programming game one shot and then add multiplayer support with minimal prompting.
I tried this a while ago, haven’t tried again recently. The models were producing code that was clearly lifted from stuff in their training data, and what I ended up with was a fairly decent game in html and js after a bit of tidy up, though it felt like several code paradigms smooshed together rather than a coherent whole, but it mostly worked. Not something I’d want to maintain but it was impressive at the time.
They were able to one-shot famous games (like asteroid or pong), I suspect because they had been trained on multiple versions of that game. So like producing Harry Potter, with the right prompt it was able to produce a license stripped version of code it had seen. I tried another arcade game like frogger and it failed really badly and took a lot longer, never got it working.
The whole exercise left me feeling they have a long way to go, I don’t see how anyone could think they would replace SWE unless they didn’t look at the code produced, even now.
Out of curiosity - what harness did you use, and what model? And how are you prompting? In my mind prompting like:
“You’re going to make frogger in javascript. I want a complete clone of functionality for level 1, with amazing 80s era pixel art sprites. I’m super lazy, so you’re going to have to test everything, right from the start. Pick a test harness, write the tests, including tests for having amazing graphics, gameplay, input, UI, sounds, etc, and write a full workplan, then work through that workplan, in parallel where you can. The workplan should emphasize getting a stripped down version up immediately and have workstreams for all the major requirements after that. Add a final test that assesses how fun the game is by reviewing a real video of a test run. Loop on that final test until you can’t improve things any more.”
Should produce something playable with no further input. As you say, I’m not sure it would produce a codebase we’d want to look at or work on. But, I’d be surprised if this weren’t successful.
'Producing Images' or even 'Some Code that is Valid and Compiles' is in some ways one of the most misleading ways we assess quality of the AI.
It is getting very good at producing code that compiles - at the algorithmic level.
This is definitely noteworthy - and the AI is crossing a critical 'productivity threshold'.
But 'Drawing of a Proper Duck' is almost arbitrary because it may have nothing to do with the 'Specific Duck You Wanted'.
Everyone has tried to get AI to 'Draw The Thing They Want' and you notice immediately how it's almost impossible to 'adjust the image' along the vector you want - because ... and this is key:
-> the AI doesn't really understand what a Duck is, its components, or fully how it made the duck <-
It just knows how to 'incant' the duck.
This becomes very clear when you try to get the AI to write proper documentation - it fails so miserably, even with direct guidance.
This is really strong evidence of how poorly the AI is generalizing, and that it is not 'understanding' rather it's 'synthesizing' from patterns.
We already kind of knew that - but we have not yet built an intuition for that until now.
Only now can we see 'how amazing the pattern synthesis' is - it's almost magic, and yet how it falls off a cliff otherwise
This has deep implications for the 'road ahead' and the kinds of things we're going to be able to do with AI.
In short: the AI is 'Wizard Level Code Helper, Researcher, and Worker' - but it very clearly lacks capabilities even one level of abstraction above the code itself.
LLMs were first trained by 'text' and now ... they are 'trained by our compilers'. Basically g++, javac, tsc are the 'Verifiable Human Rewards' in the post-training and reinforcement learning - and the AI is getting extremely good at producing 'code that compiles', but that's definitely an indirection from 'code that does what we want'.
It's astonishing that it took us all this time to internalize and start to discover what I think will be in hindsight a very obvious 'threshold' of its capabilities.
We are constantly 'amazed' at the work that it can do, and therefore over-project its capabilities.
I have no doubt that even with these limitations - the AI will unlock a lot more as it gets better - and - that it will 'creep up' the layers of abstraction of its understanding.
But I strongly believe that the AI is going to get much 'wider' (pattern matching dominance) before it gets 'higher' (intrinsic understanding) - and - that this may be a fundamental limitation.
This may be 'the LeCun' insight - when he talks about the limitations of LLMs in detail - I believe this is that insight writ large.
Even the term AI - or certainly 'AGI' may be a misleading metaphor - were we to have always called it 'Stochastic Algorithms' or something along those lines, it's possible that our intuition would be framed a bit better.
The most interesting thing is how it is definitely amazing, world changing, novel and powerful in some ways - and obviously useless in others at the same time. That's the 'threshold' we need to better understand.
> But 'Drawing of a Proper Duck' is almost arbitrary because it may have nothing to do with the 'Specific Duck You Wanted'.
That might be the case, but Simon's case "Generate an SVG of a pelican riding a bicycle" is very different.
The model actually has to understand what parts of a pelican and bicycle come together in something like an anatomically plausible way. That's a higher level of abstraction than something like passing the same prompt to Stable Diffusion etc
(The new Nano Banana/GPT Image 2.0 models are different though - they have significant world knowledge baked in)
No, it's not because it's seen 'anatomy' for Pelicans, Animals - even how it's represented in Animals.
If you try to get the AI to actually decompose it and start to 'draw pelicans' in very obscure ways, it will immediately fail.
Try to get the AI to draw the pelican from a very odd angle - like underneath, to the right, one wing extended, one wing not ... 0% chance.
Precisely because it does not understand those things.
FYI it's a slightly unfair case because it does not have 'world model' yet, which will actually solve that problem, but even then not through very much abstracting.
We're a long way away - but in the meantime, there's lots to unpack.
I think that there's a lot to be improved in harnesses and the way the models are interacting with harnesses. For example, the harness should be able to steer the model when thinking.
Yes, with good RLVR at scale you can greatly improve performance especially on benchmarks
The hope was that good RLVR on relatively contrived datasets (like benchmarks) would be generalized to good software taste, which has somewhat succeeded but also the models fail in horrible ways still
And the hope beyond that is that good skills in fundamental problem solving tasks (coding, math) would generalize to tasks beyond math and code, which did happen but less so
I would say that most improvements are in easily verifiable things like code or math. At least that's where all the amazing results seem to be coming from.
Other domains I am not sure but I've heard from people like Cal Newport that the rate of increase outside of code and math is not as impressive.
I mean yeah? It was a marketing campaign to boost the model providers and give Steinberger a cozy job at OpenAI. Hook, line and sinker.
Wake me up when we have an agent with constant learning and changing weights that I can have personally, not some LLM that can always fall prone to jailbreak and context injection attacks.
You think most of this stuff here is organic? Oh boy..
AI is like Sauron's Ring: it only amplifies the user's innate abilities.
It can either help you conquer the world if you were already doing that anyway or it can make you spend your life in a cave before throwing you into a fucking volcano.
It definitely seems like the point of no return has been passed.
The size of the codebase doesn't matter anymore. In fact, I am finding that the larger the codebase the better the performance. Starting from scratch with vague ambition is not the same as solving a specific stack trace over a mountain of decade-old code. The latter performs better and is also more exciting for the business. It would seem more callers = more constraints to verify against.
For the last 3 months I've felt like I've been dropping gps guided bombs from orbit. No one can tell the difference between AI authored and my hand written code, other than via the implication of the radically increased daily work volume. There's definitely AI in there, but it's like a homogeneous cybernetic blend of my work and the computer's. I own all of it, can explain all of it immediately, but I only wrote maybe 10% of it by hand.
The development team should be mostly "solved" by now with regard to the AI transformation. If you are still at Home Depot picking out your proverbial hammer, it's time to start heading for the self checkout. The rest of the business is where the real money and headlines will be made at this point. AI writing code is ancient news now. Custom harnesses that business people can use to automate workflows will print a lot more money. Bringing some bacon to the rest of the business may also help to preserve your career path in these uncertain times.
Remember what Jobs said about the customer. A lot of times, people don’t know what they want until you show it to them. Most people wouldn't have believed the iPhone was even remotely possible until the moment it was publicly revealed and made available for purchase. I am finding the same effect in the business with AI. What it can actually do when well engineered and applied to the domain will usually outperform the expectations of its users by a wide margin. All these fears about alignment, hallucinations, cost, ethics, the environment, my ego/career, etc., seem to melt away like some kind of luxurious chocolate once the performance becomes clear to the executive staff. I was able to convince the board with an unsolicited, 5 minute demo I didn't even personally deliver. I've never seen these people sign contracts so quickly.
As someone who uses AI daily (not in agent mode, just user-interactive), I have definitely noticed major quality improvements over the past few months. And that's surprising, because when you use something daily, you tend to overlook the big jumps.
I haven't looked into any sort of "agent" mode, just because I don't yet quite trust the AI not to do something dumb. Also, I don't use M365, where Copilot is integrated, so I suppose I would have to set it up myself.
I'm always surprised to see HN people saying models aren't good.
What are these guys building? The best engineers I know, from startup to big tech admit these models are incredible.
Including people I don't know personally, foundational engineers from every area. The average HN person though, is doing some quantum-alien computation that not even the best developers in the world can grasp.
I'm so glad Simon is documenting this. The field is evolving so fast, so rapidly, so hungry for data and money, that few are willing to zoom out and document everything big picture so we can see the changes over time.
I mean do you guys remember "Do anything now"? Just a distant memory, a funny party trick.
"Coding agents got really good - here, a bunch of non-relevant slop-pictures of pelicans riding bikes as a key benchmark AND a couple of hardly relevant edge-case demo-projects of mine to prove it right! "
Come on man, where is the AI writing all the code in 6 months? We're close to June and Amodei's latest statement from January does not look like it will be fulfilled over the next few weeks, does it now?
HN has a mechanism that causes popular blogs to stay popular.
It's a winner-takes-all karma prize for being first to post the article.
This causes a rush of people to post.
HN has a mechanism by which duplicate submissions count as upvotes toward the first submission.
This is a positive feedback for the desire to be first, which increases duplicate submissions and in turn the karma reward.
This effect means that good blogs stay well upvoted. This isn't altogether a bad thing, but it does mean some blogs require a string of poorly received posts before that effect wears off and people no longer rush to be first.
One way to fix this would be to attribute all karma to user simonw himself ( and do similar where attribution to an HN user is known. )
I didn't even submit this one. I didn't actually think this was a good fit for hacker news, the pelican bicycle thing is pretty much played out here already!
Years ago I used to read his blog on Django and found it quite interesting despite being neither a Django nor even a python user - this must have been at least 10 years ago and perhaps more.
When he resurfaced in my feeds as an AI commentator it took me quite a long while to join the dots that he was the same person!
There's something fitting about the mystical nature of LLMs and scrolling through a bunch of goofy pelicans on bicycles representing report cards for the bleeding edge of technology.
How are these even graded? Qwen3.6-35B-A3B gets high marks for a pelican with a gaping hole in its bill?
edit: Just noticed its feet are disconnected from its legs as well (but right on the pedals!). Pardon my French but that's Chinese af.
I wonder how much the 'inflection point' is a thing vs marketing. I'm sure the models got somewhat better, but even now when I'm trying to 'vibe code' a game with the latest models (combination of Codex w/ gpt5.5 and gpt5.3-codex), they really do struggle.
They definitely get something barebones up and running, but it's far from a fully fledged application.
I remember this very clearly myself. Before opus 4.5, I was doing a lot of hand holding and was coding a lot myself, but I have not written code since that day more or less.
I did write some stuff myself just to learn how the enigma encryption machine worked, so wrote myself to learn. But professionally, I stopped coding in November.
It is sad. I like programming, if I couldn't do it and had to write text (which I do hate, I'm not a writer) it would make for quite a sad world.
Exact same experience here. Prior to Opus 4.5 I'd sometimes use AI for some frontend webdev stuff (I am a C/C++/Python programmer; my HTML/CSS/JS knowledge is probably on par with a first-year uni student) and I'd have to manually edit things and retry, tell it not to attempt a paradigm that had failed before or cycle between models in Cursor just to try and get one that could make a simple widget that worked properly.
Now, I'm using Claude or Codex (GPT-5.5) for frontend and backend and it just gets it right first time more often than not. I've been making use of things like LSPs, Context7 and CLAUDE.md (global and per-repo) and it just stops doing the dumb LLM things that I hate.
How do you justify your salary given that you're just using a tool that any of us could use for $20 an hour in your role?
> Before opus 4.5, I was doing a lot of hand holding and was coding a lot myself, but I have not written code since that day more or less.
I still must hand hold it every day, as it always does things wrong. Especially after it got seriously nerfed in March.
Note: experiences vary a lot depending on the programming language used, and projects. And the experience of the person coding.
Same experience here. I now think AI writes much better code than me. So I shifted my focus to finding requirements, analyzing possibilities, and making good plans.
Paradox - you can get multiple inflection points even as systems start to have diminishing marginal returns in core capability, I think this is due to 'threshold crossing' where something 'becomes good enough for a specific purpose' - it just unlocks capabilities.
'Nail Guns' used to be heavy, required heavy power cords, they were extremely expensive. When they got lighter, cheaper, battery pack ... at some point, they blend seamlessly into the roofer's process, and multiply dramatically the work that can be done. Marginal improvements beyond that may not yield the same 'unlocks' because the threshold has been crossed.
Nitpick but commercial roofers prefer pneumatic over battery.
This is a great analogy. Jan/Feb this year was when the models crossed from useful to essential.
I've "vibed" some non-trivial stuff lately using a combination of Codex with 5.5 and Claude Code with Opus 4.7.
Key has been to spend a fair amount of time on initial overall design document, which is split into tangible and limited phases. I go back and forth between them on this document until we're all happy.
For each phase an implementation plan is made. At the end, a summary document of what was delivered and what was discovered. This becomes input to next phase.
I do check the documents, and what they're doing. I also check the tests, some more thorough. And some spot checks on the code to see if I like the structure.
I have mainly used Claude for coding and Codex for design and code review after phases. I ask both to check test coverage after phases.
Managed to implement some tools and libraries without writing a single line of code this way, which have been very beneficial to us.
Since it's so async I can work on other stuff while they plod along.
I think it's not universal though. But stuff that can be tested easily and which you have a firm grasp of what you want to achieve, but not necessarily exactly how, that I've been impressed with.
> Key has been to spend a fair amount of time on initial overall design document, which is split into tangible and limited phases.
> For each phase an implementation plan is made. At the end, a summary document of what was delivered and what was discovered.
> I do check the documents, and what they're doing. I also check the tests, some more thorough.
Sounds like programming, but with extra steps.
That’s not vibing, but waterfall development.
Do you use anything to orchestrate multiple agents pitted against each other (coder, reviewer, tester, etc)?
None of it is non-trivial tho. You might think so, but it’s not.
I find it gets you past the starting line but when you dig into the code it’s a mess of duplicated code, muddled responsibilities, poor architecture, 10k line files that eat your tokens, etc.
I’m building something using LLMs to scrape websites/socials for unstructured event data from combined text/images and the only way I’ve managed to get 100% consistent results for a reasonable cost is to break the task down into very small pieces that reduce the scope of mistakes significantly.
At present, for reasonably complex tasks, Codex/Claude will happily code you into an expensive corner.
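To make "very small pieces" concrete, here is a minimal sketch of one way to do it, not the commenter's actual pipeline: ask one narrow question per field and accept the answer only if it matches a strict format. The `ask` callable is a hypothetical stand-in for a single LLM call, and the field names, prompts and patterns are assumptions chosen for illustration.

```python
import re
from typing import Callable, Dict, Optional

# Hypothetical stand-in for one LLM call: takes a prompt, returns raw text.
Ask = Callable[[str], str]


def ask_validated(ask: Ask, prompt: str, pattern: str, retries: int = 2) -> Optional[str]:
    """Ask one narrow question; accept the answer only if it fully matches pattern."""
    for _ in range(retries + 1):
        answer = ask(prompt).strip()
        if re.fullmatch(pattern, answer):
            return answer
    # Give up explicitly instead of passing a malformed value downstream.
    return None


def extract_event(ask: Ask, post: str) -> Dict[str, Optional[str]]:
    """Extract each field with its own small, independently checked request."""
    return {
        "date": ask_validated(ask, f"Event date as YYYY-MM-DD only:\n{post}", r"\d{4}-\d{2}-\d{2}"),
        "time": ask_validated(ask, f"Start time as HH:MM (24h) only:\n{post}", r"\d{2}:\d{2}"),
        "title": ask_validated(ask, f"Event title only, max 80 chars:\n{post}", r".{1,80}"),
    }
```

A wrong answer for one field then costs one cheap retry, and a field that never validates comes back as `None` rather than silently corrupting the whole record.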
Indeed. To add to this, the obvious solution (ask the AI to break down the tasks to whatever METR says they'd be capable of 80% of the time) is of limited utility, as the AI are only so-so at estimating task complexity.
(Even when they're getting the planning part right, I do also recommend checking the LLM-generated unit tests, because in my experience some of those are "regex the source code" not "execute functions and check outputs").
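For anyone who hasn't run into this, a small sketch of the difference, using a made-up `apply_discount` function with a deliberate bug (everything here is hypothetical and only for illustration). The "regex the source code" style check passes because the text looks right; the check that actually calls the function fails.

```python
import re

# A deliberately buggy implementation, kept as text so both checks can see it.
BUGGY_SOURCE = '''
def apply_discount(price, percent):
    return price * (1 + percent / 100)  # bug: adds the discount instead of subtracting
'''

namespace = {}
exec(BUGGY_SOURCE, namespace)
apply_discount = namespace["apply_discount"]


def source_regex_check(source: str) -> bool:
    """Weak test: only inspects how the code is spelled, never runs it."""
    return re.search(r"percent\s*/\s*100", source) is not None


def behaviour_check(func) -> bool:
    """Real test: executes the function and compares the output."""
    return func(200.0, 25.0) == 150.0


print(source_regex_check(BUGGY_SOURCE))  # True: the weak test passes despite the bug
print(behaviour_check(apply_discount))   # False: the behavioural test catches it
```

A quick way to spot the weak kind when skimming generated tests: look for test bodies that read files or call `re` on source text and never invoke the function under test.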
Opus 4.5 in November 2025 was legitimately, unironically an inflection point and is the sole reason for the current hysteria.
GPT 5.5 is a significant improvement over GPT 5.4 but I wouldn't call it an inflection.
5.2 and the first codex model were step function changes in capability
I feel the change. It went from an autocomplete tool, to an agent running 5 tasks in parallel while I just supervise. The improvement is enormous.
While some people got it to work better, for me vibe coding games still hasn't reached the level of regular sites/web apps. Physics, creativity, assets and UI/UX still need a lot of hand-holding with the models. Games that are more interface-based, like point and click or something like Reigns, are easier though.
It's very real. Just in the past 2 months or so IMO there's been a pretty big improvement in Claude for local dev (although I think a lot of that is less model strength and more harness capability). 1m context is a huge difference (~30 min vs 2.5hr between compact significantly increases the scope of what I get the AI to do before it goes stupid). The other biggest difference I've noticed is a better balance of actually doing the work vs pushing back on bad ideas. I want the AI to tell me if it thinks the thing I am telling it is wrong or a bad idea, but if I confirm, I want it to do that anyway. A couple months ago, Claude was a lot more likely to either say "This is too much work I'm not going to do all of it", tell me the idea was genius (and then pretend to do it) or something equally useless.
>1m context is a huge difference (~30 min vs 2.5hr between compact significantly increases the scope of what I get the AI to do before it goes stupid)
I think the smart zone stays within the first 100k tokens, no matter if the context window is 240k or 1 million.
I divide the work to fit within that 100k and use subagents for the tasks.
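As a rough sketch of what "divide the work to fit within that 100k" could look like in practice (an illustration, not the commenter's tooling): greedily group tasks so each subagent's batch stays under the budget. The per-task token estimates are assumed to come from the caller, and the task names below are made up.

```python
from typing import List, Tuple

TOKEN_BUDGET = 100_000  # the "smart zone" figure from the comment above


def plan_subagent_batches(
    tasks: List[Tuple[str, int]], budget: int = TOKEN_BUDGET
) -> List[List[str]]:
    """Greedily group (name, estimated_tokens) tasks so each batch fits the budget."""
    batches: List[List[str]] = []
    current: List[str] = []
    used = 0
    for name, tokens in tasks:
        if tokens > budget:
            # A single task that cannot fit must be broken down further first.
            raise ValueError(f"{name} needs {tokens} tokens; split it further")
        if used + tokens > budget:
            # Close the current batch and start a fresh context for the next one.
            batches.append(current)
            current, used = [], 0
        current.append(name)
        used += tokens
    if current:
        batches.append(current)
    return batches


tasks = [("parser", 60_000), ("renderer", 50_000), ("tests", 40_000), ("docs", 10_000)]
print(plan_subagent_batches(tasks))
# [['parser'], ['renderer', 'tests', 'docs']]
```

The greedy pass keeps tasks in their original order, which matters when later phases depend on earlier ones; it does not try to find the minimum number of batches.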
Anecdata of 1 but it is real. At the end of last year they passed some invisible threshold and became useful. I don't think it is models themselves, but mostly the much more powerful harnesses and I guess their tool calling abilities.
What changed, I think, was the context harvesting capability of the models. For most programmers, debugging and figuring out how something works were the time-consuming parts; the fix was usually trivial. And now models can do in seconds what took a developer an hour or more.
If right now we created a smart grep that just gathers everything relevant to a piece of code, and outlawed LLMs, we would not regress to the previous level. The developers needed this context as much as LLMs do to do their job.
It is all marketing. The easiest way to tell is that a year ago the same people said the inflection point was X or Y model.
When people claim LLMs just don't work for them, the first question is whether they're using the latest model or not, and if not, dismissing the poster.
The thing is that that same question was being asked a year ago, and even a year before that, but with the models that lead to a dismissal today.
Just make the experiment yourself, wait 6 months, say LLMs just aren't working for the software engineering that you do, and people will dismiss you if you say that you use Opus 4.5 and not the latest model Claude MegaMind 8.8 pro max gigathinking. Despite this model being touted as the inflection point in this article.
I think it's because both sides are talking about different things. If you go in expecting it to be good enough to make developers obsolete today (a reasonable impression to get from the way a lot of people hype it), you would be disappointed, and after the first couple of tries every few months you would probably not try it much with the next generations. Reasonable if it's considered a dichotomy.
But a lot of people excited about new generations (including me, now) are not seeing it as a dichotomy but rather a spectrum where models are getting better, and indeed once a year, or even every 6 months at times, there comes a sudden growth which feels like an inflection point from what came before. Practically, it's a tool like any other: you evaluate it based on whether it's worth the effort and cost for the benefit you get from it, and if it is and has a good DX, you use it. If the calculation doesn't work for you, it doesn't. For me, it has gone from a novelty, to good for some kind of quick manual search, to I guess it can debug some kind of errors at times in very specific conditions, to hey I think I am getting a bit addicted to autocomplete in IDE provided by them even if I don't use them for anything intelligent but it's becoming indispensable now but only this part, to it's good for areas I lack expertise in, to agentic sucks I will stick with discussing algorithms and architecture with it on greenfield projects, to holy shit it can do agentic decently well now, I am skeptical about giving it access in more than limited cases, to now I am getting close to letting it run free on my device in the not so distant future I guess. Some of these were big jumps; at each point I was skeptical of growth. Every time I thought the growth would now slow down, from the days of 2k context windows to millions now. From basic chat completion to working on complex adaptive systems, game theoretic modelling, heuristics and constraint modelling and other things I throw at it. I am still needed in the loop; it can be so smart at times and then will do something so stupid, but the frequency of stupidity is rapidly decreasing. I am still needed, I don't think it could accomplish alone all that it has done for me. But I do at times at night remain awake reflecting on my self worth for the potential day when I don't add that value. When I have a harder time keeping up.
Also, had someone told me even in 2019 that in 2026 we could have NLP models do what they do today, I would have dismissed it all as sci-fi, and here I am waking up in awe of the world we live in and how quickly we adapt.
My take is there was one big inflection point around opus 4.5 when they got the agentic stuff working and now whether or not it works depends on whether your use case/area of software engineering is profitable enough for the companies to have spent a bunch of money generating synthetic data to RL on, or if it's similar enough to areas that they've done that for. With similar enough being a very loose constraint given how much overlap there is in a lot of coding fundamentals. Tbh if the models aren't working for you now I don't think they're gonna be working for you in 6 months
It's very real but probably very domain specific. It got really good at a lot of traditional web dev stuff, bash, sql, and writing one off scripts to accomplish random tasks (hence all the agent stuff taking off). And they got good at staying on task. That may not translate to game dev because from what I understand a lot of these gains are basically around post training methods driven by synthetic data generation etc (with potential caveats on how synthetic that data actually is lol). I wouldn't be surprised if the areas of code the llms are good at now are straight up just product decisions of where to allocate budget for generating those synthetic data sets, and game dev stuff might not be at the top of the list because the customer base for that might not be as big
Counterpoint, I'm also vibecoding a game, and even before doing the "proper" setup (a good AGENTS.md, skills people have published for my chosen game engine, Godot), mechanically, the game was pretty spot on. It looked boring, so I used Claude Design to create a few mockups to choose from, chose the one I liked the most, and told Claude Code to redo the game UI with it.
There have been plenty of small issues like tables not having the columns aligned, or the game menu being a bit offset, or one graph being a placeholder instead of connected to the actual value. And of course I've had to instruct it on all the flavour I want.
But honestly, for a simulation strategy game, especially without doing the "proper" setup from the start, it's been _very_ good.
UI fit and finish is really hard for these models, even with text-mode UIs. The super fiddly stuff still needs to be done by hand, at least for now.
Purely vibe code won't work. You need to define an excellent architecture, have great specs, a solid plan, divide the plan in small phases that fit well in a context window, use TDD and automated code reviews for implementing each phase, do QA and some code review.
At any point you need to have agents review, verify and test the other agents output and iterate until the output is perfect.
And also, have good e2e tests.
IMO, if you don't spend at least a few tens of millions tokens per day, you aren't doing it properly.
Sounds very self-confident to claim such a thing. Something like "If you don't do it the way I'm doing it, then you are doing it wrong".
At what point is it easier and faster to just code it yourself? I don't trust myself to write better specs than code.
It's real for me as a non-coder. Previously, uploading a Python script and asking it to add this function or that function used to break it; now it usually just works, at least with Claude and ChatGPT models. Google Gemini still breaks stuff, but rumors are that their new flash model, which will be announced soon, is very good. I am usually working with data in CSV files and generating spreadsheets, PDFs, etc., and the results for that have improved dramatically.
That’s me. Built a scraper to dump stuff to a CSV of a list of images for further OCR and OpenCV processing. Now I have a convenient list of hits once I run the batch that used to be a loooot of manual sifting.
Once I work out the kinks, I’ll be able to further automate it.
Would have taken 10-100x as long for me to build it without AI and the AI version is probably better.
But yeah, I have enough knowledge to know what prompts are needed and figure out those “oh, I think it’s running slow or failing because of xyz” and further prompt to improve it based on that what I think it should do instead.
And I know where to make slight changes without burning my allotments.
"flash" or "fast" AI models are worse than useless at coding for me. they make my codebase much worse. It's a maintenance burden.
Gemini Pro on the other hand can be quite a pleasant experience.
I mean this blog post and many from this author are pure evangelism and marketing. Can you find anything critical or any dissent from this author about LLMs?
> there’s zero chance any AI lab would train a model for such a ridiculous task.
A lot of people here stated that this is a ridiculous metric, but no one seems to remember that it was introduced in the initial GPT report ("Sparks of Artificial General Intelligence: Early experiments with GPT-4" [1]) by Microsoft about 3 years ago. Shortly after that it was parroted by a network of booster accounts and became a thing every clueless AI hype peddler does to "test" models.
100% marketing, 0% science.
[1] https://arxiv.org/pdf/2303.12712
For those curious, Simon's first public usage of it is Oct 25th, 2024[0]. While I'm not aware of any specific "pelican riding a bicycle" prompts being tested in a paper[1], the GPT paper did several SVG and tikz tests and the actual image is rather arbitrary. You wouldn't want to optimize for a singular image but also if you're doing halfway decent training a pelican riding a bicycle shouldn't be too hard to draw, and well... you can see several good examples if you look through different pages on [0].
[0] https://simonwillison.net/tags/pelican-riding-a-bicycle/?pag...
[1] I'm sure there is because of Simon's fame
My own informal test when generative AI came out has been "a picture of an old man riding a bicycle over a river". I just ran it for chatgpt with the standard model I have (5.5). It shows the old man on an old bicycle with the bicycle on a slack line and the slack line extending over the river with a medieval village in the background.
The point is that the prompt has a subtle ambiguity - "how is the old man going over the river?". My sense is that most humans would quickly imagine a conventional bridge with a road on it leading over a river and with the river background being in an area developed enough to allow bridge going over it.
So the implication I draw is these things can find/generate stuff that roughly satisfies the conditions (and are getting better at this) but they still fail to add the assumptions that people would draw.
So my conclusion is that LLMs are getting better and better at what they do, but there are going to be places where they fail to satisfy common human assumptions.
> but they still fail to add the assumptions that people would draw.
I have mixed feelings about this. I agree with the default assumptions you have as to "what people would draw", however what do you want from this cognitive automation?
Do you want, "what most people would do" or do you want "something creative, an outlier, that still satisfies conditions" ?
> The coding agents got really good
It's since november 2025, the so called "inflection point", that I'm still wondering for who coding agents become "really good".
All I observe is that they got better at tool calls and at answering questions about big codebases, especially if the question has a vague pattern to search for, and they're super useful for that! For generating production code even with a lot of steering and baby sitting?
Absolutely not, not quite there not even close in my experience.
But we should stop talking about 1s and 0s, especially with marketing hype trains. There exists a gradient of capabilities that agents have, and it really depends on the intricacies of the codebase you're working on. I think everyone has yet to discover how to better apply these tools in their day-to-day work.
But that totally collides with the current narrative, which flattens our work into something that is always the same and can be automated easily in each case. It's not!
That's why the debate is so polarized imo, there isn't a shared experience
The polarization comes from the very disparate coding experiences and output quality that different people find when using these tools.
For example, I've had the opposite experience of yours, generating very high quality work using Claude (such as https://github.com/kstenerud/yoloai). Just in dealing with all the bugs and idiosyncrasies in the technologies I'm using, the agent has been a godsend in discovering and cataloguing them so that the implementation phase doesn't keep tripping over them: https://github.com/kstenerud/yoloai/blob/main/docs/dev/backe...
And the agents keep getting better all the time. Even in the past month I've noticed a considerable jump in its ability to anticipate issues and correctly infer implications as we build out research, design, architecture and planning docs. By the time it comes to coding, it's mostly a mechanical process that can be passed off to sonnet with a negligible defect rate.
I don't want to offend (it's AI coded anyway :)) but that does not scream "high quality" to me. The headline gif on that repo just paints a terrible picture. It can't draw a box correctly, there's random underscores all over the screen. The UI itself is just incredibly incoherent. I don't even know what I'm looking at.
Like, no it doesn't seem like very high quality work... It just seems like a vibe coded tool.
Edit: yes it's wrapping Claude. It's BREAKING the TUI. Not sure what people aren't getting here...
While reading this thread, I literally just caught an agent putting in the following CSS selector in a rule:
> .row > div > div, .alert
This is fairly simple CSS, not multi-threaded systems development. A bar low enough that you could trip over it. I catch this kind of stuff all the time (literally every run), but only because I read every line. Most of it wouldn't be the end of the world for any particular task, but would eventually result in a complete mess.
I think the people doing the heaviest breathing around the elimination of programmers either aren't very good at programming, or they're not paying close attention. Or they're hyping their book.
Don't want to be rough, but I'd like to read about experiences with novel ideas that solve people's real problems in the real world; your project is just about selling new shovels.
As I commented on another thread
> If you're trying to solve a HARD problem people REALLY have, it's a novelty that agents can't help with, otherwise if it gets 97% there MAYBE it's just a signal that your idea isn't that novel!
> The polarization comes from the very disparate coding experiences and output quality that different people find when using these tools.
Not just when using tools, also when using humans. The frame of reference of what is considered 'production code' differs immensely between organizations, teams and people. The code I get from LLM's is usually much better than what I get from my peers. Maybe not one shot, but after some steering it gets there.
It also isn't lazy. When generating test cases for relatively simple pieces of code, it usually tests pretty much every path and doesn't stop right at the 80% code coverage quality gate.
I can imagine if you're at the level of Linus or something, you might conclude differently, but most people aren't there at all.
A standard Docker container, with the container UID/GID mirrored to the host user, holding the host user's API keys, with the host user's project directory bind-mounted. The tooling doesn't even use gVisor / Kata by default which could implement the claim made, but in reality this entire project appears to be security theater.
not going to look at your vibeslop
@hollowturtle I'm surprised - do you really find that sota models aren't good enough to generate production code with steering and babysitting? My experience (Claude Code, mostly Opus 4.6) is that it's fantastic at this. At least in JS + TS + Elixir + Ruby. It does indeed need babysitting, my mental model is that it's an exoskeleton not a junior dev, but IME it's a friggin badass exoskeleton, easily 10x-ing my speed on most work. Notably I do NOT --dangerously-skip-permissions nor use claude code's auto mode, I micromanage and lightly review every line it's writing as it writes it, so I rarely have more than 2 sessions generating simultaneously. I suspect that a lot of the disappointment comes in when people try to delegate to it and trust it to not go off the rails. It hasn't earned that trust from me yet (and hasn't needed to yet).
Granted, I'm mostly working in small-to-medium codebases, 20k-30k LOC incl test suite. I wonder if that's a factor in my positive experience. Curious to hear your thoughts.
It really depends on the task, but, in my experience, small to medium and bigger codebases, the amount of steering to get quality code is not worth it.
I see patterns and solutions emerging from hand coding, not the other way around. I can't start with a prompt, unless again I have the feeling that the task can be one-shot with minimum effort and context.
Starting with a prompt, or in plan mode, is not how I trained as an engineer. I cannot foresee what something should be/look like until I explore it myself with code I can relate to, that I'm connected with and that I fully understand. For example, my muscle memory suggests a specific data structure only after I see some code patterns emerging. Hard to explain, hopefully that makes sense.
If I ask the agent to do that initial exploring, even with a tremendous amount of instructions, guidelines etc., it usually starts down a path I wouldn't have started with. What I tried in such cases is to stop it, correct it and generate again, only to end up with more prompt words than lines of code. This is true for every visual task I'm working on (I program non-web UIs). Let alone doing it via spec files. If it's something I don't care about, yeah sure, maybe a little tool for entering/editing data, but alas it always defaults to slop web apps, and I get it, I mean most of the training set is web apps
Coding goodness is just "unevenly distributed."
Irl, (a) different people's ways of working with ai are a million little islands and (b) bottlenecks vary enormously by coder and codebase/task.
Also... I think our era has an intrinsic bias that change=progress, productivity, etc.
Take the "networked computing revolution" of 1990-2000. These computers did land on every desk and every pocket. They are administration powerhouses. Excellent for all manner of administration tasks.
But... what this netted out to is "change." We send a lot more emails than we did letters. We communicate a ton. Secretaries went extinct. But "administration" grew.
A university faculty typically has more admins. Companies hire more accountants, HR, project managers, etc.
Maybe administration was never really a bottleneck.
Code has a lot of this. Everyone has a road map, wishlist, etc. It appears as though "code capacity" is the bottleneck. But maybe most of those companies can't really generate much more value from more software.
Anecdotally, it seems that many mid-tier shops are migrating/ modernizing their stack, and suchlike.
I haven't heard of many belting out features, and increasing prices or sales.
Most bottlenecks are upstream of another bottleneck. Few are a "dam."
I don't know that there was an inflection point. I know that, over the past year, they definitely became useful to me as more than auto complete.
My most recent pet project is a transpiler from Wasm to Go, and I find it incredibly impressive that recent models (I've used Sonnet, Opus and Gemini, far more successfully than GPT) are able to just pick up the project and work at all these levels:
- Go code that implements the transpiler (parsing Wasm, building an AST)
- Go code that gets generated by serializing the AST to a .go file
- Go code that manipulates the AST (to optimize it), and its effect on the generated code
- Go code that's grafted to the generated code (to implement more advanced opcodes) and how to interact with it from the AST
- C code that gets compiled to Wasm, then translated to Go, then called by Go
- Go code that gets called by this C code to implement a C stdlib
- WAT and WAST files that are used to implement the Wasm spec tests
I find this impressive because I have to think hard about all these levels, and I feel many programmers would have a problem with this.
And it's very often way easier for me to just write: "I want to generate this code, build me the AST that does it", than go "count parentheses" in the Go code (I do have some LISP experience; it's still easier).
Feel free to scrutinize/criticize the code. Not vibe coded, but plenty of GenAI help.
https://github.com/ncruces/wasm2go
> But we should stop talking about 1s and 0s
I agree, but you contradicted yourself just one line above.
> For generating production code even with a lot of steering and baby sitting? Absolutely not
Moreover this is further in contradiction with several facts:
1. the majority of this industry has always been composed of mediocre/bad developers, often unable to write a fizz buzz
2. the majority of work in this industry is implementing mundane CRUDs to move and transform trivial data across the organization's stakeholders and/or customer or third parties
3. there's lots of stellar and respected engineers leveraging the tools on a regular basis even on problems that are far from trivial and outputting quality code much faster than they would've done otherwise. Mitchell Hashimoto has blogged about it in his work on Ghostty, Sanfilippo has blogged about it in his work on Redis and so did plenty of others. I know several open source stellar developers who benefitted greatly from these tools, yet you think it cannot improve the quality and output of the most mundane tasks out there?
>> But we should stop talking about 1s and 0s
> I agree, but you contradicted yourself just one line above.
>> > For generating production code even with a lot of steering and baby sitting? Absolutely not
With this last sentence I obviously meant in my experience; it's not that hard. I don't buy your facts: they are highly biased towards web development. It's a common mistake here on HN to think that's the totality of the industry; luckily it's not
I had a really fun day yesterday because anthropics limits on their normal 20$ subscription allowed me to play around for the whole day without hitting a limit.
Its 'production' code because its a small browser game which has very small to 0 requirements on security and being perfect but high requirements on 'ever even doing this' and 'fun'.
The code it generated hat 0 compiletime errors. I was able to descripe 10 things to do in one task and it just jugged along solving all of them.
This doesn't need to become so much better to be useful. Its already very useful for a lot ofuse cases like researchers which have to verify the math anyway but are not good in writing code for filtering their testdata, converting them and running it.
Small websites, fun projects, helper tools etc.
But while we speak, in the background stuff is still happening left and right. More compute, better algorithm, more RL etc.
We could already be at 95% at 'ai will take your coding job' without knowing because these 5% are so relevant.
> The code it generated hat 0 compiletime errors
And no spelling errors either!
Also,
> Really? What duplication did you actually find? I count a few small ones in buildMounts and ReadPrompt, maybe 20 lines or so, but hardly anything worthy of such an epithet
>> embedding-shape 1 hour ago | root | parent | next [–]
>>The duplication I'm seeing isn't just "same text repeated" but structural duplication. Doing a quick 5 minute look again just to give you some pointers; runtime.MountSpec construction in buildMounts, Workdir vs aux-dir mount-mode handling, repeated one-off mount append blocks, overlay detection and so on, the list goes on. Just those should account for 200+ lines.
If you don't see any errors or problems, is it because there aren't any problems to see, or because they take a trained eye to spot?
I don’t see how “fun projects” and “take our jobs” fit together in any voluntary sentence.
"We could already be at 95% at 'ai will take your coding job' without knowing because these 5% are so relevant."
This is nonsense. I'm not a SWE but a CEO; if that were true I'd be firing without a hitch. And yet this is not the activity we see. Why is that? Perhaps merely writing code is not the entire job.
I'm convinced that the polarization is that one's impression of AI has a direct 1:1 mapping with one's previous level of skill and sensitivity to quality. Most people are by definition average and they are impressed.
Is there anyone in the industry noted for their skill, quality, and taste, e.g. Jonathan Blow, who is impressed and thinks the AI is really good? I haven't seen any. In my personal circle, the best devs I know are either micromanaging or shunning AI; none of them think the agents are capable or really good. The mediocre devs I know are largely on board. This applies both online and off.
Couple this with the fact that no AI focused project has come out, not a single one, that meets a high quality bar with nontrivial complexity.
I am an AI quality sceptic. They can be useful if you don't care for quality, but I never don't care for quality. I live for quality.
This is contradicted by the amount of AI use at top tech firms.
My explanation for the lack of shared experience is that quality is very language-dependent. I work in Go and it's gotten really really good. I have to pick the right abstraction and it can be overly verbose at times, but it can make in 5 minutes what would have taken me an hour.
Steve Yegge wrote about this in his book Vibe Coding. He says it takes about a year of experience before you're consistently getting good results. He writes about lots of different techniques for doing that, but also says a lot of it comes down to just getting a feel for when the LLM is going to go haywire.
> but also says a lot of it comes down to just getting a feel for when the LLM is going to go haywire
That has been my experience too. The days when I'm very focused, being extra deliberate and constantly questioning/examining/challenging things, the results are much better. Autopilot days just go through in a daze and the outcome is objectively worse. This has made me much more hands-on and pushed me towards models which are actually not that "clever" like codex at effort=low but fast. Given that I'm doing the meat of the thinking, might as well not be slowed down by the model and lose the flow.
It's been 4 years of using them for me, before writing a book I'd wait to have a decade of experience to share with others, otherwise it would have the same value as a book on a react tutorial
I'd say closer to 6 months for me but probably still some room to improve.
I think getting a decent setup with a fast feedback loop for the agent combined with context (in repo markdown)+memories goes a long way.
After having Claude Code "remember" my preferences and tools, it's more efficient.
It has a tendency to copy existing patterns so a good AGENTS.md with best practices and architectural goals goes a long way to prevent it from duplicating patterns you're trying to get rid of.
> It's since november 2025, the so called "inflection point", that I'm still wondering for who coding agents become "really good".
I think this may depend on the sorts of work you do. For those of us who mostly live in web using established frameworks, that's about when I came to conclude they could do everything and do it well.
I can have opencode discover third party APIs and generate fully working solutions that are well integrated into an existing long-lived codebase. I still review the MRs by hand but I only ever discover spec errors or style issues, not defects in the code itself. This was a big change from ~summer 2025.
This is a really well defined space though with strong conventions. If you're doing something more interesting YMMV.
I believe by now we know exactly what it's good at and what it's terrible at.
The problem is our CEOs' fear of the future, which pushes them to peculiar decisions that objectively make no sense (cf. the infamous discussion of the Microsoft employee on GitHub who couldn't force their agent to do the proper thing).
It's not the first time I've witnessed this kind of discrepancy and probably not the last; I've just learned to adapt to it.
> Absolutely not, not quite there not even close in my experience.
Well... I don't know what you expect but so far I'd like all my colleagues to write code at the level of what I get from codex.
I'm moderately horrified every time claude runs the same broken, YOLO SWAG git commands from stackoverflow, gets errors, tries a few more things, then finally figures out how to commit and push correctly.
Long term, it can be better to slowly refactor parts of your code base into the way the model expects it to be. Sometimes fighting the gradient of code’s uniqueness vs expectation is not worth it.
> It's since November 2025, the so called "inflection point", that I'm still wondering for who coding agents become "really good".
You can dig up my past comments semi-arguing with simonw where I said AI just isn't good enough yet, but lately I've been using Codex mostly just to review existing Godot/GDScript code: https://github.com/InvadingOctopus/comedot
and now I'd say that in this day and age one would have to be dumb to not use AI in SOME way :)
It's helped me catch a lot of bugs that would have taken me a long time to even notice on my own. I guess it helps that the project is modular enough where most files can be considered standalone, with just 1-2 dependencies and well-commented already, so the AI can look at each file on its own one at a time. You can see the AGENTS.md I use on that repo.
Most of my productivity in the last 3 or so months has been thanks to AI, though none of the code there is AI generated. I even bought a MacBook Neo just to use as an "AI thin client" while on travel, even though I already had a beefy MacBook Pro M2 Max that I just keep at home/hotel as a desktop now. Codex's recent remote control features have made it more useful for the moments when I get a cool idea while out at a cafe or on a walk.
I don't just copy-paste the AI's output, because it's often inefficient anyway (like creating redundant variables/functions), but I find its findings useful for manually cleaning up my shit. Maybe their training data is not that good with GDScript yet which is a bit of a jank language anyway.
So my core code is wholly made by meat, but I do have fun now and then telling Codex to make experimental games using only the library of modular components I have written so far, to test my framework and also the AI's abilities. This kind of work seems like a surprisingly good match for AI: It just has to put existing blocks together, that already have well-defined interfaces/contracts etc.
I've been on the $20 ChatGPT plan for about a year now, and only started using Codex since like maybe 4 months ago, almost always on the latest model with "Extended Thinking" or "Extra High", because I want my shared code to be as correct as possible because everything else I do depends on it, and I only hit limits like 2 times in the last 3 months.
Claude on the other hand, terrible: https://i.imgur.com/jYawPDY.png
Grok is OK for general stuff, never tried it for coding.
Gemini's UI/UX and lack of privacy and the AI itself is so terrible I tried it just maybe 2 times ever...and it refused to work on Google's own Flights website and reverse image search! (it told me to do it myself)
Deepseek refused to talk about Taiwan or Tiananmen Square so I'm not sure if I can trust it for anything else lol
> I've been on the $20 ChatGPT plan for about a year now, and only started using Codex since like maybe 4 months ago, almost always on the latest model with "Extended Thinking" or "Extra High", because I want my shared code to be as correct as possible because everything else I do depends on it, and I only hit limits like 2 times in the last 3 months.
I've recently tried codex, and I have it set to plan mode with 5.5 and I'm hitting the limits on a single task on a "medium" sized codebase.
Thanks for sharing your experience! I totally agree that if you "own your code", as in you're invested in it, coding it and documenting it, these tools can be really valuable for review, bug fixing and maintenance. They push you to do better, maybe one piece at a time like you said, with a well-modularized codebase. I think more devs should share experiences like that; we should overthrow the marketing and the narratives of people who "don't code anymore since X".
I set up a hook that reviews every commit and highlights potential bugs (async) and writes a report to a dir.
Then I have a script that summarises those reports, which I usually run before pushing or at end of day.
Works quite well for improving both my code and the code AI wrote.
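For anyone wanting to try something similar, here is a minimal sketch of what the summarising half could look like. The report format (one finding per line, written as "SEVERITY: message"), the `.txt` extension and the function names are all assumptions for illustration, not a description of the actual setup above.

```python
from collections import Counter
from pathlib import Path


def summarise_reports(report_dir):
    """Count findings per severity across all *.txt reports in report_dir.

    Assumed (hypothetical) report format: one finding per line, written as
    "SEVERITY: message", e.g. "HIGH: possible None dereference in foo()".
    """
    counts = Counter()
    findings = []
    for path in sorted(Path(report_dir).glob("*.txt")):
        for line in path.read_text(encoding="utf-8").splitlines():
            severity, sep, message = line.partition(":")
            if not sep:
                continue  # no colon, so not a finding line in this format
            severity = severity.strip().upper()
            counts[severity] += 1
            findings.append((severity, path.name, message.strip()))
    return counts, findings


def format_summary(counts, findings):
    """Render totals per severity first, then every individual finding."""
    lines = [f"{sev}: {n}" for sev, n in sorted(counts.items())]
    lines += [f"[{sev}] {name}: {msg}" for sev, name, msg in findings]
    return "\n".join(lines)
```

Calling `print(format_summary(*summarise_reports("reports")))` before a push would give a quick end-of-day overview; the directory name there is just a placeholder.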
I first started noticing they were actually useful around Dec 2025, through about February. I got pretty good at using them, and was amazed at their utility, especially Claude and Codex. Then sometime in March, they got really frustratingly dumb. Things that they used to get right in one shot suddenly took several tries, and I had to watch them like a hawk because they constantly made stupid mistakes, not following instructions that previously worked. I had one try to fix a failing test like this:
Both Claude and Codex, both with the latest versions and the original versions that had been working.
Now I just use deepseek. It isn't any dumber, and it costs way less.
I'm curious. What have you actually tried? Are you just prompting the LLM with one-off tasks? For good results, you need to take the time to read the documentation for the harness you are using and configure your environment. This tuning can take dozens of hours to nail down. Then there's the actual approach for working on your projects. Many people that have good results with agentic coding actually spend the bulk of their time in plan mode where they go back and forth with the LLM designing a granular playbook for the task at hand before they ever have it write any code.
I'm curious. What makes you think that me sharing an example (which one of the many?) of what I actually tried would somehow add something to the conversation? What's the usefulness of just an anecdotal example?
As I said, we have plenty of different envs, codebases, requirements. Things are complex.
You're posing it like I tried just one time. It's been hundreds of hours of tries and I just found out what works best for me, like everyone should do. My original post above isn't that hard to understand.
Let me stress this again:
> That's why the debate is so polarized imo, there isn't a shared experience
Good is relative. If somebody struggles with getting their hand-written code to compile, the LLM coding agents will look like geniuses to them.
An idiosyncrasy of humanity is that the dumbest individuals tend to also be the loudest.
Which languages and subject matter do you work with?
c/c++, java, kotlin, go, some perl scripting, some javascript. Gaming industry
Have had fairly good luck with Claude Code Opus 4.7 on xhigh effort.
I think it more reliably does IaC with established patterns especially when it can do a dry run.
Python is pretty decent but usually you need good prompting and a little bit of steering to prevent slop. The slop usually works tho
Codex w/ gpt-5.5 seems faster but maybe just a bit below Opus 4.7 quality.
I gave Opus access to a repl (pyrasite-ng) in a running Python process and it managed to find an 8 year old "memory leak"--a module level cache with no eviction. It did that using GC module and exploring the heap. I was pretty happy with that outcome. It would have been quite challenging for me to find myself without at least a few weeks of deep diving into memory leak hunting docs/resources.
You are experiencing a reverse Dunning–Kruger effect.
For someone that just dabbled in coding prior, it went from AI building 80%, and struggling through to finish the 20% when trying to build an app/website.
Now it's like 97% and struggling with the last 3%. Yes, it'll look rough around the edges when evaluated by a senior dev, but being able to build MVP level things to completion with ease helps you stay engaged and motivated to continue and learn.
Please do not cite the Dunning–Kruger effect at random.
Who needs to generate a dumb demo of a 97% done CRUD app? We had code generators for those. Every time I read claims like that and ask for further explanation, I discover it's people who were not productive before who are generating the so-called "MVP level things to completion with ease".
If you're trying to solve a HARD problem people REALLY have, it's a novelty that agents can't help with, otherwise if it gets 97% there MAYBE it's just a signal that your idea isn't that novel!
LLMs can effectively validate your business idea
> It's since november 2025, the so called "inflection point", that I'm still wondering for who coding agents become "really good".
The answer is "for lots of people, but not you".
You're doing a vague impression of being fair and even-handed, arguing for non-polarization, but underlying everything you're saying is an obvious attitude of polarizing superiority: That _your_ personal experience with AI is the real truth. That _your_ codebase is more intricate and more challenging than what other people are doing. That everyone else is being led by a "marketing hype train".
When I say
> Absolutely not, not quite there not even close in my experience.
I obviously mean in my experience, not the real truth.
> That everyone else is being led by a "marketing hype
That is obvious instead, and I later say there are no 0s or 1s; every job has its intricacies
Somewhere right now some human artist is being tasked with drawing illustrations of pelicans riding bicycles to be used as training data at a big AI lab.
Every modern image-generation model can generate a pelican on a bicycle trivially. The point of the test is to generate SVG text that represents an image, which is more complicated.
Yes, there are ways to convert raster images to SVG for use in training data but it's not a good use of anyone's time.
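To make "SVG text" concrete: the model has to emit markup like the output below, getting the coordinates right without ever seeing pixels. This is a hand-written, deliberately crude sketch (two wheels and a frame, no pelican); the function name and all coordinates are made up for illustration.

```python
import xml.etree.ElementTree as ET


def bicycle_svg(width=200, height=120):
    """Return SVG markup for a very crude bicycle: two wheels and a frame."""
    wheel_y = height - 30
    parts = [
        f'<svg xmlns="http://www.w3.org/2000/svg" width="{width}" height="{height}">',
        f'<circle cx="50" cy="{wheel_y}" r="25" fill="none" stroke="black"/>',
        f'<circle cx="150" cy="{wheel_y}" r="25" fill="none" stroke="black"/>',
        # The frame: one polyline joining rear hub, seat, handlebars, front hub.
        f'<polyline points="50,{wheel_y} 85,45 135,45 150,{wheel_y}" fill="none" stroke="black"/>',
        "</svg>",
    ]
    return "\n".join(parts)


if __name__ == "__main__":
    markup = bicycle_svg()
    ET.fromstring(markup)  # raises ParseError if the markup is not well-formed XML
    print(markup)
```

Even this toy version needs the hubs, frame points and wheel radii to agree with each other, which is the geometric reasoning the test is probing.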
I don't understand this response. Human artists can and do make SVGs.
I wouldn't wish creating an SVG pelican on a bicycle on my worst enemy
> Every modern image-generation model can generate a pelican on a bicycle trivially.
Mistral seems to be the exception. Their new model from a few weeks ago is worse than self-hosted Gemma.
The quality of the Gemini pelican was such a step change in one iteration, while the other benchmarks remained quite flat, that I think you are right. Although whether they targeted Pelicans in particular or just svg, I can't say.
I'm curious how the 6 months have looked from a non-programmer's perspective. What kind of co-working tools and similar optimizations have people from other fields experienced?
I am an instructor who helps deliver an apprenticeship. My new boss has been in our industry for about 20 years and is one of the most respected people in our company. They've just joined us to teach and are off doing a two week course. On the first day she was told to let AI write all of her lesson plans, and then feed the lesson plans to AI to make her slides...
Hopefully she rejects all this out of hand, but if she doesn't it'll mean that none of our trainees get the benefit of her experience, who she is as a person, and what she has to pass onto them.
We have six-monthly reviews as instructors where we are told the same thing. "How could you use AI for your teaching?"
They don't even feel the need to justify why this would be desirable, or is needed at all. It's just pure bandwagoning. Unbelievably, most of my coworkers are extremely positive about AI, although none of them have told me they use it for anything besides preparing their lessons for them; they just use it instead of having to think, or spend time preparing...the only important thing they do at work.
It makes no sense to me.
I’m teaching a class at a university in Japan (on AI-related issues, as it happens). I’ve been teaching for more than 40 years, but at 106 registered students this is by far the largest class I have ever taught. AI tools are very helpful for class management, such as keeping track of attendance and homework submissions.
I have to consciously avoid using AI for more cognitive tasks, though. It would be very tempting to have Claude, ChatGPT, or Gemini summarize, classify, and grade the students’ assignments, write individual feedback, prepare my lesson plans, etc. However, I know that my engagement with the material and with the students would suffer. I also want to show the students that they are learning together with me and with each other, not with bots.
I am semiretired and have a light teaching load that gives me plenty of time to prepare for class. I can see that full-time teachers might find it hard to resist the lure of offloading their thinking to AI.
I've been a teacher (most of the time a college professor) for...a long time. Nowadays, when preparing a new course, I definitely work with AI: "Here's what I want, and who my audience is - give me a course outline".
That gives me a starting point. Of course, I modify it. Maybe I bounce back and forth to the AI for further refinements and suggestions, but ultimately I have to be happy with the result.
When prepping the individual lessons, the biggest time saver is coming up with examples to illustrate particular points. I could do this alone, but sometimes that involves staring at a blank screen for a while. It is faster to ask the AI for suggestions, pick the one I like, and refine it further myself.
AI is a tool. Use it appropriately.
In pure maths:
- pre GPT-5.4: very limited use; some smart people got some mileage out of the models, but it always required serious work and a very suitable problem. Of course the models could solve homework problems, but that felt more like a downside to us who teach.
- since GPT-5.4 (Mar 2026): the "wow" release; suddenly answering MathOverflow-level problems that have previously been stumping experts. Still prone to hallucinations, but smart enough to use the built-in Python skill to verify its claims on small examples when possible. Probably a lot better at formula-heavy math than at the abstract "philosophical" kind.
- GPT-5.5: gave me a fascinating, significantly nontrivial and highly instructive "proof from the book" on an MO-hard problem that I'm in the process of writing up. Might have been luck and good prompting, though. Didn't really feel like a qualitative leap from 5.4, but I take quantitative any time. Still requires suitable problems, but it's much harder to rule out suitability from the get-go.
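As a toy illustration of the "verify its claims on small examples" habit mentioned for GPT-5.4 above, a check like the following is usually enough to catch a hallucinated general statement. The two claims are textbook examples chosen here for illustration, not anything from an actual session.

```python
import math


def first_counterexample(claim, cases):
    """Return the first case where claim(case) is False, or None if all pass."""
    for case in cases:
        if not claim(case):
            return case
    return None


def sum_of_odds_is_square(n):
    # Claim (true): 1 + 3 + ... + (2n - 1) == n^2
    return sum(range(1, 2 * n, 2)) == n * n


def euler_polynomial_is_prime(n):
    # Claim (false in general): n^2 + n + 41 is prime for every n >= 0
    value = n * n + n + 41
    return all(value % d for d in range(2, math.isqrt(value) + 1))


if __name__ == "__main__":
    print(first_counterexample(sum_of_odds_is_square, range(200)))      # None
    print(first_counterexample(euler_polynomial_is_prime, range(200)))  # 40
```

The second claim survives every case up to n = 39 and only breaks at n = 40, where the value is 1681 = 41 * 41, which is why a handful of spot checks is a sanity test rather than a proof.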
Claude and Gemini have been also-rans the whole time and still are. I use Claude for secretary-like tasks; occasionally it finds an easy proof too, but usually because I've missed something obvious.
Oh, and GPT, and to a lesser extent Claude, are great at hunting errors in maths. Probably 90% of my prompts so far have been for proofreading my writings.
I work at a company that deploys AI to enterprises
The average office worker is amazed at Copilot (not in the IDE - but the app bundled with Windows), and they mostly copy paste material into their enterprise provided ChatGPT / Gemini, and get tips from Facebook / Instagram on their top 5 best prompts for work productivity
Showing them agents that automate work at scale is a very magical experience
And then everyone that has to deal with their copy pasted output is too nice to say how bad it is and how much work it just offloads to the next person that’ll probably get frustrated and have an agent handle it.
Claude in Office was a tipping point for nontechnical folks around me. Everyone’s slide decks are immaculate now. Finance isn’t needing nearly as much BI help. It’s pretty impressive.
I find it really troubling finance are relying on LLMs (word generators!) for financial analysis - I mean I guess it means there will never be any annoying gaps in the data.
Can I get Claude to view the slide decks for me so I don't waste my time?
Interesting. I don't have to use PowerPoint much, but I hate it when I do. I don't want the llm to write the words but I do want it to make things look nice. So does this work well now?
In business: using coworking tools to review and propose filing of emails; manage my files and folders; on a daily basis scour the intranet for interesting and relevant content.
Personal: my wife tutors in her native language to non-native primary and high school kids. They are all now using these tools to generate fresh content for practice based on school lesson plans. The kids are improving much more quickly now than they were just a few months ago.
As someone who works somewhere where the intranet is a bit of a jungle: what tool do you use to scour the intranet?
Thanks!
My day job is not in the tech industry. I am an editor. Literally nothing has changed for me in the last four years.
As a former data scientist, I started using a code agent 3 months ago. Before that, I used chat completion on the web. Now I do nearly everything that outputs documents with a code agent.
Can you give a sanitized example or a hypothetical scenario of what you mean by “output documents with code agents”? Thanks.
I think Claude Cowork through the Microsoft thing which was Copilot but is now named M365 (or something?) is likely creating every PowerPoint presentation within our organisation at this point.
We have whatever AI is in Teams transcribe every meeting, and it's scarily good at it. It's also extremely good at summarizing or finding things from previous meetings when tasked. One disadvantage of this is that I can see how stupid I sound in writing. I'll go "yeah, hmm, yeah, that's, yeah", but it really is pretty good.
I assume we're going to see a massive increase in AI with this Cowork inside the Microsoft client. We actually have a better tool available through a Librechat where you can create and configure your own agents with the same filesystem access to your OneDrive, and a lot more tools and models than just Claude. Almost nobody has been capable of figuring out how to use it though, so they've been using the regular Office 365 Copilot and it sucks so bad that a lot of people stopped believing in AI.
It's ironic that Microsoft fumbling the ball on AI, while being very good with enterprise customers (especially non-IT), means that they'll likely be the company that sells us AI tools that people will actually use. I have no idea why it's so hard for people to pick up the Librechat tool we're given access to through our equity fund. It's quite literally a copy of ChatGPT where you can point-and-click configure an agent, but we're seeing that even employees who use a lot of ChatGPT privately don't use this tool professionally. Meanwhile everyone has been capable of using the Microsoft thing (which I personally think is less user-friendly, since you need to add your configuration files to every prompt).
"I have no idea why it's so hard for people to pick up the Librechat tool we're given access to through our equity fund"
That's because M365 is integrated with the whole Office/Exchange environment, especially in terms of security policies, etc. MS also guarantees that the data is private, which is very important for many companies both from the IP protection perspective and because of the liability for exposing user/customer data (think of GDPR regulations in Europe).
I don't know who is behind Librechat, probably some good and friendly folks, but when it comes to privacy/security Microsoft has much more to lose, and if shit happens it is easier to sue them than some random VC-financed company from the USA.
At work the tools handed to most are still essentially chatbots. Getting access to coding tools is an uphill battle because there isn’t really a good way to manage risk yet. Hard enough to keep a coding agent in check locally and ensure it doesn't rm -rf anything. Scale that to thousands of people with limited skill and it doesn’t really work. So currently they just don’t.
That’s in a finance shop. I’d imagine it’s different in programming shops where handing people Claude code is a bit more plausible
Purely anecdotal, but in my team of 20 data analysts, we've seen a bunch of them become quite productive in producing tools and apps. These are analysts with mostly domain knowledge, and not so much programming knowledge - meaning that they knew the basics to write scripts, and wrangle data programmatically, but not enough to actually engage in software engineering.
Some of these are now contributors.
I also have a friend (beware, N=1 study) with zero prior programming knowledge that has released his first app.
They lag behind because we build for ourselves first. We are rolling out Claude to the biz team this week and they will get access to Cowork, which is still preview aiui.
Sales will be another big user of agent automations, for better or worse. Poor usage by Google to craft emails and slides for us is why the suits are getting an Anthropic sub. Stay human in the loop my friends!
for non-coders: local AI. a couple years ago you needed a dedicated GPU rig. now a 30B model fits on a laptop and runs offline.
I've always been a "power user", making little python programs and figuring out new ways to do things with seemingly unrelated systems. My knowledge is shallow, but very broad.
A year and a few jobs ago I was genuinely up against a wall I could not see breaking through, not if I wanted to ever sleep again. Hundreds of completely bespoke customers. Hideous archaic tooling. Two of us. It was bad times. So I started paying for Claude - desperation move, to try and vibe my way out. Honestly, it's been a little bit like having superpowers.
Not just code generation, which has been great, but gaining knowledge and understanding with incredible velocity - sort of like how RSS felt back in the day, or when Google stopped being worthless in the very end of the 20th C. When Wikipedia started.
So where am I now? Well, I ditched the hell job (I didn't really drink the koolaid of their "Enterprise Solution" anyway), and got a regular day job in my core competency. I guess I do a lot of what is called "vibe coding", all kinds of utilities, what I call my "extracurriculars". A graph view for Asciidoc in VSC to show includes, xrefs, partial includes. Graph view for everything actually - it's surprisingly insightful for PDM and config management. Analysis tools for sensor faults based on Python open source astronomy tools. All sorts of converters and aggregators and cleaners for a devil's piss bucket of enterprise systems. A bazillion new MapTools macros for gaming, making complex RPG systems nearly pushbutton. A little harvest of local LLM systems doing all sorts of things, like my "Reviewinator" for copy edit. I could type the rest of the day and wouldn't come close to the end of the list.
So, pretty amazing. Very interesting systems with what must be some N-dimensional geometry underlying, maybe a signal to an underlying principle of emergence. Who knows?
In the long term, it's going to be Enterprise Software that eats the big losses from these systems. For all sorts of reasons, but mostly because Enterprise is where software goes to die. It's all bespoke to hell, it's all ancient, no one is working there because they want to. So a domain expert, with AI assist and a little know how, is probably going to whip up a superior set of tools in a short enough time to make it really worthwhile. Watch that space: SAP, Siemens, Teamcenter, SalesForce. Watch their consulting revenue.
Last 6 months is humanity losing control of LLMs.
- Memory market cornering which mitigated the adoption of local AI despite great open models being released.
- Fast penetration of IP exfiltrating tools in companies world-wide.
- Developers producing more code than they can read.
- Autonomous agents killing Open Source by siphoning the attention economy
- Autonomous agents destroyed online communities (including HN)
- Autonomous agents being used in warfare (targeting, propaganda...)
- Widespread vulnerabilities discovered, Widespread supply chain attacks.
- Increasing inequality, fracture in perception, Green indicators, Grim realities.
If you only read bad news (i.e. mass news these days since that sells better) this will be the picture. But I have personally seen some insane stuff happen in biotech. Like, I can't believe we're lucky enough to possibly live our life in this kind of future. We already have actual therapeutics developed using AlphaFold being tested right now in real clinical trials, but the next generation of stuff that will go into trials in the next 3-5 years will be insane. We will look back at current medicine like we look back at medieval times today.
Protein structure is not a rate-limiting step in drug discovery.
My mother is going on 5 years with multiple myeloma, a cancer that would have offed her in 5 months if it weren’t for advances in maintenance chemotherapy.
Medicine has done amazing things in my lifetime.
Nothing ever happens.
See you in 10-30 years when people are still dying of the same shit as today like oesophageal cancer and glioblastoma.
Maybe in the next century but by that time you and me both will be under the ground, and no, Amodei's doubling of human lifespan simply won't happen.
AlphaFold is not an LLM. As such, it isn't a fitting example for "good news" related to LLMs.
AlphaFold isn't generative and using this as a rebuttal to OP is bad faith
I think it's just further exposed cracks in software engineering that were always there.
Ideally we'll come out of the AI hype cycle having learned better practices.
> Widespread vulnerabilities discovered
This is a good thing
> Widespread supply chain attacks.
This is a bad thing.
That is a half-truth.
Metal Gear Solid 2 was quaint and funny until 2025.
> - Memory market cornering (...)
Wait, what? What is that?
> - Fast penetration of IP exfiltrating tools in companies world-wide.
That goes on the benefit side, I believe.
> - Autonomous agents killing Open Source by siphoning the attention economy
Anything attention economy disappearing is a "good riddance" to me.
>Wait, what? What is that?
I believe they are just saying that RAM prices went crazy
> and there’s zero chance any AI lab would train a model for such a ridiculous task.
I'm not sure that's true anymore considering how popular Simon's blog is
> So maybe the AI labs have been paying attention after all!
> I think this mainly demonstrates that the pelican on the bicycle has firmly exceeded its limits as a useful benchmark.
As acknowledged in the article.
Gemini 3.1 basically takes it home on that benchmark, anyway, it's done.
Simon mentions further along in his article that, given Jeff Dean’s post referencing the pelican-riding-a-bike task (and how good current models are at doing it), it’s no longer a great benchmark to use. Enter the opossum riding an e-scooter!
Banana man on the Segway
That bit probably works better in the talk, it was a setup for a joke later on.
It's practically a benchmark now. Some friends have been specifically training models to count the R's in "strawberry"
I asked Gemini for a video of 'pelican riding a unicycle in hyde park' - I was blown away by the output:
https://gemini.google.com/share/55e250c99693
According to OP:
> Why this test? Because pelicans are hard to draw, bicycles are hard to draw, pelicans can’t ride bicycles... and there’s zero chance any AI lab would train a model for such a ridiculous task.
At this juncture I'm left wondering why competing AI labs wouldn't train for this now well known "test".
Given their proclivity to scrape the entire contents of the internet, it's only a matter of time, intentional or otherwise.
I've heard the same has happened with common benchmarks (they've ingested solutions into training data)
I'm surprised by Grok as well:
https://grok.com/imagine/post/8d1eab88-737f-4d46-ba92-9b6502...
Interesting that it does better at making the pelican pedal in the video generation than in image generation.
Graphically perfect, but content-wise nonsense. The pelican's center of gravity is clearly behind the wheel. It needs to be above or very slightly ahead of the wheel.
I don't think it's graphically perfect either.
The length of the pedals keeps changing, and you'll notice that neither of the pedals actually rotates around the hub: consistent with your point about the center of gravity being too far back, the circle the pedals are making is also shifted back too far.
Still impressed. And, to be honest, I don't think this problem matters much. Physical accuracy is very nice, but it is not the most important aspect when I watch a fantasy movie, for example. Or even a sci-fi one.
Maybe the pelican has something heavy in its mouth.
I do hope that JEPA can help resolve the nonsense from AI models.
Google/Gemini has pretty impressive audio visual capabilities. I tried to have Claude add mulch to a landscape picture and it looked like someone hit it with the orange spray paint tool in MS Paint. Nano Banana actually produced something fairly realistic
That’s really impressive, and slightly worrying for creatives involved in film, animation or modelling.
Even more worrying are the implications for fake news, propaganda, fraud, deception and mental health.
It’s the opposite, non-creatives (if such roles even exist in those industries) should be worried. All those models offset technical skills, allowing to get from idea to implementation through a different route (which can be easier or harder depending on idea and model - good luck tweaking that pelican’s exact pose and movements to match your imagination precisely). Nothing touches creativity, not even in the slightest.
But there’s a lot of panicking, fear-mongering and all sorts of nonsense around this whole subject.
I wouldn't be that concerned that animation is going anywhere. Both outputs look really off, especially around the feet.
It's really not
only SVG counts tho, don't know why
Willison chose this task because (unlike actual images of pelicans) it was clearly not in the training data, but could be reasoned about and composed from what's there. But just like those "how many golf balls can you fit in a 747?" interview questions, it should now be retired.
It's a test of text-based LLMs to see how good they are at SVG geometry. Video models are a different category of software entirely.
If you're a vulnerability researcher or a security person generally, there's a big inflection point from Spring of this year.
Whether it turns out to be a good change or not remains to be seen.
The half-full view is that the models are so good at finding vulns that if you plug them into your build-pipeline then the amount of new vulns introduced will go down towards zero.
The half-empty view is that we're now producing more junior-level code with less review, so everything will have more vulns; it's also cheaper and easier to find them, so prepare for chaos.
Short term there is sure to be chaos either way as the models are clearly good enough to find all the old bugs, and not everyone has the resources or will to try to stay ahead of the curve like Mozilla is trying to do with their Mythos access https://blog.mozilla.org/en/firefox/ai-security-zero-day-vul...
There's a major caveat to the half-full view: You'll only stop adding new vulns that your model can find.
A threat actor with access to a better model or more money to burn on tokens may yet find more. Some of them have deep pockets, and not nearly every project will get the Glasswing treatment of free Mythos tokens.
I'm a security person and would love to hear other people's input here as I don't have that much experience with this
Can you be more specific?
Three deterministic Linux LPEs in a week, an LPE in BSD in execve (of all things...), nginx vulnerabilities, one or two new gnarly supply chain attacks. Linus noting that the linux-security mailing list is getting flooded with duplicated, AI-driven reports of varying quality. There are pretty crazy keycloak vulnerabilities getting discovered.
We're most likely entering a year or two of rapid vulnerability discovery and patching, as well as reducing and minimizing system footprints just to survive the onslaught of strange vulnerabilities from e.g. ancient and widely unused kernel modules.
The Claude Mythos / Project Glasswing thing is real: https://www.anthropic.com/glasswing
I met a few people at PyCon this week who have been part of Glasswing (they're just starting to be allowed to talk about it) and it really does drive down the cost of finding vulnerabilities.
I've been collecting notes on that here: https://simonwillison.net/tags/ai-security-research/
Broadly, I'm talking about the shift from building elaborate vulnerability research harnesses towards using the frontier models and their RL-optimized harnesses to build simpler vulnerability discovery pipelines. And then: the ensuing carnage.
Not op but just look at HN posts in the last couple weeks: supply chain worms, zero-day LPEs for all OSes seemingly every other day, researchers on X and here openly saying they’ve got more valid findings than they know what to do with
Are you referring to Claude Mythos?
All I see is mention of how various models generate image of "pelican riding bicycle(s)"
Yes, the "pelican riding a bicycle" is the ultimate test of not understanding how LLMs work.
Well, a combination of that and believing that replication of test data is a good measure of progress.
Spicy — why does it show ultimate non-understanding?
We all know the true test of AI is Will Smith eating spaghetti.
Wait, are you saying you don't handcraft svgs of pelicans riding bicycles?
Something that’s largely been ignored: DeepSeek has made context caching virtually free with V4-Flash.
I swear to god that DeepSeek V4-Flash is the most useful model available right now. It's SO FAST and is good enough for so many tasks that I run it most of the time for almost everything. Even when it messes up, it's so cheap to iterate that I can fix most problems without changing the model to a more "capable" one.
I'm tired of the pelican bench, it made sense in the beginning, but at this point it got too popular and old to consider the assumptions from one year ago (absence in the sample/training/reinforcement) to still hold.
The tooling has become so good though - the ecosystem around the LLM. The models have become really good, yes - but it's definitely slowed in my opinion. The tooling is what really has become great - "harness" is probably the best word. When folk like Elon/Schmidt/Thiel/etc. talk about singularities and industrial revolutions - it sounds extremely out of touch - or actually protective of the massive capex they've potentially sunk.
EDIT: Schmidt's booed commencement speech was probably one of the most out-of-touch speeches (outside of a tech interview) I've heard.
Much of the recent improvement in models is in being trained specifically to make use of the tools the harnesses give them.
Reading through the thread, a lot of the inflection point debate seems to come down to people talking past each other about what got better. My read is that the models themselves didn't really jump in capability around November, but the harnesses around them got considerably more reliable, and the RLVR work earlier in 2025 had been training the models specifically to behave well inside those harnesses, so when the two met you got a compounding effect that felt like a step change even though neither piece was that dramatic on its own.
I think that's probably why everyone in this thread has such different experiences - someone whose workflow is mostly asking a model for code and pasting it in would have seen modest improvement and would reasonably wonder what the fuss is about whereas someone who was already running agents on 20-step loops would have felt a much bigger shift, because the thing that used to kill those runs was the failure at step 12 cascading into garbage by step 20, and that got a lot better.
The local model story Simon kind of glosses over is interesting for the same reason - a 20GB model drawing a decent pelican on a laptop is a cute data point in isolation. The thing worth noticing is that a competent local model inside a good harness now gets you closer to frontier performance than running the frontier model without a harness does.
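The "failure at step 12 cascading into garbage by step 20" point can be put in rough numbers. This is a minimal sketch, assuming each step succeeds independently with the same probability (a simplification; real agent runs can recover from some failures). The function name and the sample reliabilities are illustrative, not measurements from anyone in this thread.

```python
def run_success_probability(per_step_success: float, steps: int) -> float:
    """Chance that every step of a run succeeds, assuming independent steps."""
    if not 0.0 <= per_step_success <= 1.0:
        raise ValueError("per_step_success must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step_success ** steps


# Illustrative per-step reliabilities for a 20-step agent loop.
for p in (0.90, 0.95, 0.99):
    print(f"per-step {p:.2f} -> 20-step run {run_success_probability(p, 20):.2f}")
```

Under that assumption, going from 90% to 99% per-step reliability moves a 20-step run from roughly 12% to roughly 82% end-to-end, which would barely register for someone asking for one snippet at a time.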
How much of what is being generated by LLMs is actually value add? My perception is there are lots of great experiments, but little real value.
+ Developers are more productive, but are you all leaving work at 3p and enjoying a new found sense of work-life balance?
+ Companies are investing heavily in AI, yet I'm paying more for the same thing. Jamie Dimon still pays me 0% on my checking despite spending billions on AI.
It may be that simply adopting AI isn't enough. Could new startups that are born-in-AI buck this trend? I wonder what Clayton Christensen would say if he were still around.
December 2025 was the breakthrough for me. January Claude was euphoric, ChatGPT was up there. February Gemini cooked for a second there. March amazing. April the big bad nerf. May GPT 5.5 is just pure bliss, although with 2x limits temporarily. Not sure about Claude: it's sort of okay, still not as good as it felt before, slowly increasing limits with more compute and rebuilding goodwill.
I find your emotional language truly quite fascinating. I've heard people talk like that about drugs.
I actually thought it was a joke comment, but I'm worried now that it's not the case.
Similarly, I've heard people talk like that about things that are not drugs.
I think Opus 4.6 at its peak was the "how can anyone not get that this is good" for me.
Then the nerf, and the massive uplift in tokens for 4.7, a model which I find lazy and prone to hallucinate.
It's probably time to try GPT5.5. Like many I'm pretty heavily invested in the anthropic ecosystem at this point, which I suppose gives another strong reason to make the switch.
The openclaw ban pushed me over to 5.5 for some daily usage. I feel like Opus and 5.5 are good at very different things. 5.5 can be too literal, and it does not have as much of a ‘creative’ bent whether that’s toward design, UI/UX, interpreting vague instructions, etc. So, in that way, Opus had sort of spoiled me.
On the other hand, this year I’ve been in the habit of using codex as a bug finder / audit layer, where it shines, and I can tell you, Opus makes a lot of mistakes, and as we all know struggles with laziness — and has gotten good at encoding that laziness into the codebase (// Per instructions, pass this test by default) where it can live for a long time. So, Opus had spoiled me, but more with its ability to sketch holistically than its ability to put out perfect codebases.
Upshot - it was good to switch horses for a while, as you mention. Slightly different skill sets there. And I still reach for claude especially for initial design. But right now the daily driver is 5.5 / xhigh fast mode, and it’s very capable.
I only used Claude first time in April, previously only ChatGPT and Gemini. And I struggle to see what the hype is all about - yes it seems a tiny bit smarter than the pack, but on the 20$ subscription it runs out of tokens in 5-20 minutes, and then you need to wait 3-4h.
ChatGPT 5.5 seems capable, although a bit stingy with “thinking” compared to earlier models, and I never run into session limits.
I was a dedicated Claude user but in March/April I started using GPT5.5 on a new project that Claude had tried and failed to execute successfully. GPT knocked it out of the park, and was able to do it within my subscription allocation of tokens. I'd recommend giving it a go at least. Something like OpenClaude can let you use the Claude tools you're used to
"there’s zero chance any AI lab would train a model for such a ridiculous task"
Hmm, given how small the nerd community is and how often I met that task either on Hacker News, or on various Substacks, I am not so sure that the AI labs would ignore it completely.
Apart from GLM 5.1 and Qwen 3.6, there are other Chinese models that are noteworthy: Kimi K2.6, Xiaomi MiMo V2.5 Pro, Deepseek v4 and MiniMax M2.7.
100% true - I only had five minutes so I had to edit it down to just a couple, but all of those models are excellent and keep leap-frogging each other.
Looking forward to next time, hoping you mention speculative decoding and MTP :)
It would support your point about the performance of 20GB local models.
About Pelicans on bicycles:
> there’s zero chance any AI lab would train a model for such a ridiculous task
Well, I think this guy's tests have got enough visibility that I wouldn't be surprised if some AI models are trained on it specifically...
Yeah, Simon's blog posts have been on the front page multiple times now; I wouldn't be surprised if all of them added a special case for it.
Am I crazy, or are these differences between the best models so marginal that you’d get roughly the same performance if you use the same high-quality harness (ie preloaded instructions from md files, including custom skills)?
You will immediately notice the difference if you use it at the threshold.
It's like most people watching a 'starting NBA player' (not a superstar, just a starter) vs one that sits on the bench.
If you were just watching them play, work out, shoot - you'd never notice the difference.
Put them head to head and it's 98-54 and you start to see the patterns.
It's pretty interesting actually, someone tell me what the 'science' for this is, I'm sure there is some kind of information theory at work here.
Software has innumerable kinds of problems at varying levels of complexity and so it provides the perfect testbed for seeing how far models can go in practice.
Should add: you're very right to hint that harness, tooling, and models tuned to both the harness and the kinds of things people do on the harness, as well as some other things, do make an enormous difference.
By and large, SOTA Codex/Claude Code are substantially better - at least for now. That may change.
Head to head is interesting. I had not tried 2 agents on the same task simultaneously with 2 models.
No you're not wrong. Many people will see what you see. Enthusiasts will see it as monumental squeezing out that last drop of performance. In my opinion I think it is okay for enthusiasts to feel that way. I'm just satisfied with getting a tool as an aid.
Personal opinion: we need to focus more on efficiency instead of how large or complex a model can get, as that model creeps into more resource requirements. If the goal is to cost a billion dollars to operate, then we've really lost the idea of what models are supposed to be achieving.
By definition the differences between "best models" are small. It's tautology. If a model is significantly dumber than the others then it's not one of the best models.
To an extent. I've had GPT 5.5 solve problems that Opus 4.7 struggled with, using an identical AGENTS.md/CLAUDE.md and no skills.
The difference is very noticeable as your codebase gets bigger and you give higher and higher level tasks.
I've certainly had things that Opus fixed using some kind of work around that GPT-5.5 actually solved.
And the difference between the Sonnet/Gemini/DeepSeek tier to the Opus/GPT-5.5 tier is immediately obvious.
I have the same experience. I've been running sequential agents in my own harness that is a standard SDLC pipeline (plan, design, code, build, test). It has gates between each stage to control quality.
The big benefit of automating this for so long is that I have lots of data. I analyzed it and found that I can change the models out without much of a change in the output quality.
For one-off tasks, where there is no harness and you're just YOLOing with the TUI, yes, big difference. You need a harness.
The pipeline controls the quality far more than the model, empirically.
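For readers who have not built one, a gated sequential pipeline of the kind described above might look roughly like this. It is a minimal sketch, not the commenter's actual harness: `Stage`, `run_pipeline`, `stub_stage`, and the retry count are hypothetical names chosen for illustration, and the stub stages stand in for real model calls and real checks.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Stage:
    name: str
    run: Callable[[str], str]    # hypothetical: wraps a model call for this stage
    gate: Callable[[str], bool]  # hypothetical: lint, build, test run, review, etc.


def run_pipeline(stages: List[Stage], task: str, max_retries: int = 2) -> str:
    """Run stages in order; retry a stage until its gate passes or retries run out."""
    artifact = task
    for stage in stages:
        for _ in range(max_retries + 1):
            candidate = stage.run(artifact)
            if stage.gate(candidate):
                artifact = candidate
                break
        else:
            # No attempt passed the gate, so stop instead of passing garbage on.
            raise RuntimeError(f"stage {stage.name!r} never passed its gate")
    return artifact


def stub_stage(name: str) -> Stage:
    """Stand-in stage: appends its name, and the gate checks that it did."""
    return Stage(name, run=lambda a: f"{a} -> {name}", gate=lambda a: a.endswith(name))


stages = [stub_stage(n) for n in ("plan", "design", "code", "build", "test")]
print(run_pipeline(stages, "task"))
```

The design point is that a stage's output only moves forward once its gate accepts it, so swapping the model behind `run` changes how many retries are needed more than it changes what comes out the end.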
You have correctly identified that getting a "high-quality harness (ie preloaded instructions from md files, including custom skills)" is the (or at least a) hard part.
Because you have to adjust the harness to your problem space and provide that so you can say it is high-quality.
Many people will stop that discussion at the claude code vs. codex vs. opencode level and then merge that with discussing model performance.
And that is also why "Generate an SVG of a pelican riding a bicycle" is still a benchmark worth discussing. Because at least it is a defined problem space.
Haven’t noticed much significant progress in LLMs myself in 6 months (significant as in new or vastly improved capabilities or understanding, not new releases, there are plenty of those).
I feel like if anything people started to realise the significant limitations of LLMs when you try to use them as ‘agents’ which was the big direction LLM companies tried to push recently.
Best use of LLMs so far IMO is finding vulnerabilities (with human help) and pattern matching in other domains. For generating code and prose they are still mediocre and somewhat unreliable and for use as personal assistant agents I wouldn’t trust them.
So what’s happening with openclaw, the biggest experiment in agentic, vibe coded by the agents themselves? The thing that was so hot a few months ago.
https://github.com/openclaw/openclaw/pulse?period=daily
279 commits to main from 77 authors in the last 24 hours.
Why is there so much churn and how could you trust it with your data? This is changes in ONE day!
If these are useful changes, surely it’d be superhuman by now given months of this pace.
What are people using this for?
They’re definitely RL training the models on the pelican test. They patch any kind of test that shows them performing poorly by hardcoding some answers into the model.
> One of my projects was a vibe-coded implementation of JavaScript in Python—a loose port of MicroQuickJS—which I called micro-javascript. You can try it out in your browser in this playground.
I'd like to remind everyone here that people on this forum used to actually code truly remarkable and pointless stuff like this, with zero LLMs, using nothing but their brains and motivation from who the heck knows where.
We all have had the client from hell: they don't know what they want, they change their requirements all the time. Whenever they have a new half-baked idea, I need to scramble and re-design the architecture. They have no clue that a small change request has a big impact on the code.
Well... Now I can be that client. And let AI deal with my incomplete, always changing requirements. And get it done anyway.
Starting from zero today, how would someone quickly get up to speed with the latest and greatest AI tooling on an extremely limited budget?
Is the only choice to pay for the "max" plans?
Or just read so much about it that you bs your way through an interview and then use the company's resources?
Simon, I'm curious too how much you invest each month researching all the latest and great AI tech?
Opencode has free access to Qwen 3.6 and Deepseek v4 Flash right now.
They're on par with Claude and Codex imo - when you still design architecture and know what the output should be. Claude and GPT 5.5 need less guidance with vibe coding, but we're not yet at a point where that's sustainable anyway even with those models.
Thanks, playing with Opencode now. It just wrote a half-decent Android small app. Pretty good so far!
>Starting from zero today, how would someone quickly get up to speed with the latest and greatest AI tooling on an extremely limited budget?
Z.AI, Moonshot.AI, Xiaomi, Minimax, and Alibaba all have coding plans that allow massive usage of GLM 5.1, Kimi k2.6, Minimax M2.7, Qwen 3.6 Plus, and Xiaomi MiMo v2.5 Pro for cheap.
Pair those coding plans with the harness of your choice, including Claude Code, and you are good to go.
Also, Nvidia is offering free access to top models through NIM - but you have 40 RPM limits. https://blog.kilo.ai/p/nvidia-nim-kilo-code-free-kimi-k25
I made an account on OpenRouter.ai , created an API key, plugged the API key into the Zed editor, and started asking free models questions about my codebase.
Once I felt I had some confidence on what the spend rate would be, I bought $20 USD worth of credits and would occasionally point my editor at a cheap paid model for some real-time questions.
I've still only spent less than $2 in credits so far, as often a free model can answer my question fast enough.
I have not yet tried agentic coding, but at least with OpenRouter API keys it's trivial to cost-cap keys so you can pay for lower latency and still cap your spending.
The $20 ChatGPT pro plan gives pretty generous usage of both Codex and general chat.
Ah I'd read so much about the downgrading of that plan I didn't think that was still true?
Opencode go + pi.dev is 10$ a month.
The honest summary that doesn't show up in the six-month roundup: the unevenness. Boilerplate, tests, scaffolding, glue code: dramatically faster, sometimes 5-10x. Architecture, data modeling, careful security work, judgment calls about what to build: same as before, sometimes slower because tab-completion sneaks in plausible-but-wrong defaults you then have to undo.
The thing headline numbers ("AI made me 3x faster") hide is which 30% of the work the AI sped up and which 70% didn't move. For a solo dev the survivable bet got smaller, and that's the real change, not raw productivity. AI made certain projects worth attempting at all that wouldn't have been viable six months earlier.
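The 30%/70% split above is essentially Amdahl's law. As a rough sketch, assuming the comment's illustrative figures (30% of the work sped up 5-10x, 70% unchanged), the overall gain works out as follows. The function name is made up for this example.

```python
def overall_speedup(accelerated_fraction: float, local_speedup: float) -> float:
    """Amdahl's law: whole-job speedup when only part of the work gets faster."""
    if not 0.0 <= accelerated_fraction <= 1.0:
        raise ValueError("accelerated_fraction must be between 0 and 1")
    if local_speedup <= 0:
        raise ValueError("local_speedup must be positive")
    unchanged = 1.0 - accelerated_fraction
    return 1.0 / (unchanged + accelerated_fraction / local_speedup)


# Illustrative figures taken from the comment: 30% of the work, 5x or 10x faster.
for s in (5, 10):
    print(f"{s}x on 30% of the work -> {overall_speedup(0.3, s):.2f}x overall")
```

With those inputs the whole job gets about 1.3x to 1.4x faster, and even an infinite speedup on the 30% caps out near 1.43x, which is why a "3x" headline number needs the split stated alongside it.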
what are your thoughts on Software engineer replacement. My team has already seen big reductions. Q/A team is gone. Software Engineer reduced by a third. Scared for the future
Ditching the QA team when the single highest challenge is verifying that vibe-coded systems do what they're meant to is extraordinarily short-sighted.
Personally, the more time I spend working with coding agents the less worried I am for my career. Getting the best results out of them is really hard. They amplify existing skills and experience, so the more experience you have the better.
I believe that many of those saying that they "never write code anymore" or are experiencing "10x productivity," are heavily underestimating (or outright misrepresenting) how much they are guiding the model, and ignoring everything else that goes into shipping fit for purpose software. I frequently see zero measurements or factual arguments supplied to support such claims. I also see many people say that they are "vibe coding," when they are almost certainly reviewing, editing, or otherwise steering the output.
I wonder why there is such a mad dash to trump up the capabilities of coding agents. And why such loose terminology and lack of rigor? I thought programmers were supposed to be rational people (har har!)
Have you seen the automated tests that QA members deliver? My experience is that they are horrible, and it's not so hard to beat that low quality bar with an LLM.
I have a theory: if they were good at writing automated tests, they would have been developers instead of QA engineers.
Not saying that there aren't any high quality QA engineers, I worked with some. But LLM's raised the bar in a way that most QA engineers can't reach.
If you're famous, you'll be fine. If you're in retiring age, you don't care. Otherwise, good luck! We put ourselves on the street by not protesting what is happening.
I think the general population earning median wages will have very little sympathy for first world software engineers earning vast amounts of money.
What are you going to tell them? Suddenly you're earning what they're earning for sitting at a desk every day?
There is an entire category of software engineers who exist entirely to knock out features on microservices or do easily automatible QA work whose jobs will disappear.
I think there will be larger markets, more companies, more jobs than before due to AI, but also a very painful transition period
AI reduces the cost of producing software (and other intellectual tasks), which greatly improves the viability for more and more ambitious projects. As far as we know the amount of problems software (and humanity) can solve is unbounded
It feels like the market has shifted in SWE yet again to heavily prioritize a new set of skills, of which those in the top quartile are desired more than ever
The problems in any domain are infinite. But, alas, money is not.
What are these skills?
> Google released the Gemma 4 series of models, which are the most capable open weight models I’ve seen from a US company.
Implying another country has a better model? I'm being pokey here because I'm very curious! I know Gemma is efficient, but I also remember Qwen and Kimi being referred to as optimized. The difference being that Gemma uses fewer tokens, but maybe Qwen/Kimi's quality is higher? I don't know.
The consensus right now is that Qwen3.6 in its 27B and 35B-A3B versions is better for coding whereas Gemma4 is stronger when it comes to OCR, audio transcription and the like. Margins are slim though, and the harness at these model sizes is the most important factor.
In my experience the qwen models are best locally, but gemma ones have always been good. gemma4 is a notable improvement.
There's also an inflection point in Feb-April: Claude got considerably worse, and arguably has not really recovered since then. They claim it's fixed, but in my experience it is not as great as it once was. 4.7 is still useless.
Waiting for the next event at this point. Hoping that "inference becomes cheap" when Groq hardware gets delivered.
Also LinkedIn wars of people trying to claim throne as most AI-pilled, throwing down strawmen stories of luddites yelling at data centres who'll lose their job to a single person doing 100x work.
The top model changes every other month between Claude, GPT and Gemini, but GPT dominates overall. Claude took the lead in coding tasks, but GPT 5.5 has come back stronger. Gemini was good in between, but overall it's dominated by GPT 5.5 and Claude. Coding is the area where disruption is hardest. OpenClaw early this year was a major breakthrough in agentic AI, and it is still making noise, maturing, and moving toward enterprise. Agentic coding is still in the adoption phase: teams are trying it, trying to make sense of it, running it and not believing the results, and eventually it becomes a discussion point over tea. But the needle has moved from it being alien to being something real that teams have started discussing and using like a champ.
Why is there no talk about how the world is already run by AI by proxy? I.e. bureaucrats using ChatGPT to make their speeches, decisions, shopfront designs, etc. I just don't seem to read about this; instead it's more about this nebulous specific date in the future.
My goal post for "AI will definitely replace most SWEs" was to reproduce a particular 90s programming game one shot and then add multiplayer support with minimal prompting.
Opus 4.5 hit that point in November.
I tried this a while ago, haven’t tried again recently. The models were producing code that was clearly lifted from stuff in their training data, and what I ended up with was a fairly decent game in html and js after a bit of tidy up, though it felt like several code paradigms smooshed together rather than a coherent whole, but it mostly worked. Not something I’d want to maintain but it was impressive at the time.
They were able to one-shot famous games (like asteroid or pong), I suspect because they had been trained on multiple versions of that game. So like producing Harry Potter, with the right prompt it was able to produce a license stripped version of code it had seen. I tried another arcade game like frogger and it failed really badly and took a lot longer, never got it working.
The whole exercise left me feeling they have a long way to go, I don’t see how anyone could think they would replace SWE unless they didn’t look at the code produced, even now.
Out of curiosity - what harness did you use, and what model? And how are you prompting? In my mind prompting like:
“You’re going to make frogger in javascript. I want a complete clone of functionality for level 1, with amazing 80s era pixel art sprites. I’m super lazy, so you’re going to have to test everything, right from the start. Pick a test harness, write the tests, including tests for having amazing graphics, gameplay, input, UI, sounds, etc, and write a full workplan, then work through that workplan, in parallel where you can. The workplan should emphasize getting a stripped down version up immediately and have workstreams for all the major requirements after that. Add a final test that assesses how fun the game is by reviewing a real video of a test run. Loop on that final test until you can’t improve things any more.”
Should produce something playable with no further input. As you say, I’m not sure it would produce a codebase we’d want to look at or work on. But, I’d be surprised if this weren’t successful.
'Producing Images' or even 'Some Code that is Valid and Compiles' is in some ways one of the most misleading ways we assess quality of the AI.
It is getting very good at producing code that compiles - at the algorithmic level.
This is definitely noteworthy - and the AI is crossing a critical 'productivity threshold'.
But 'Drawing of a Proper Duck' is almost arbitrary because it may have nothing to do with the 'Specific Duck You Wanted'.
Everyone has tried to get AI to 'Draw The Thing They Want' and you notice immediately how it's almost impossible to 'adjust the image' along the vector you want - because ... and this is key:
-> the AI doesn't really understand what a Duck is, its components, or fully how it made the duck <-
It just knows how to 'incant' the duck.
This becomes very clear when you try to get the AI to write proper documentation - it fails so miserably, even with direct guidance.
This is really strong evidence of how poorly the AI is generalizing, and that it is not 'understanding' rather it's 'synthesizing' from patterns.
We already kind of knew that - but we have not yet built an intuition for that until now.
Only now can we see 'how amazing the pattern synthesis' is - it's almost magic, and yet how it falls off a cliff otherwise
This has deep implications for the 'road ahead' and the kinds of things we're going to be able to do with AI.
In short: the AI is 'Wizard Level Code Helper, Researcher, and Worker' - but it very clearly lacks capabilities even one level of abstraction above the code itself.
LLMs were first trained by 'text' and now ... they are 'trained by our compilers'. Basically g++, javac, tsc are the 'Verifiable Rewards' in the post-training and reinforcement learning - and the AI is getting extremely good at producing 'code that compiles', but that's definitely an indirection from 'code that does what we want'.
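The gap between 'code that compiles' and 'code that does what we want' is easy to show in a few lines. This is a toy sketch only: Python's built-in `compile()` stands in for g++/javac/tsc, and `compiles`, `passes_behavior_check`, and the `add` example are hypothetical names, not how any lab's reward function is actually written.

```python
def compiles(source: str) -> bool:
    """Cheap verifiable signal: does the source parse and compile at all?"""
    try:
        compile(source, "<candidate>", "exec")
        return True
    except SyntaxError:
        return False


def passes_behavior_check(source: str) -> bool:
    """Stricter signal: does the compiled code do what we asked (add two numbers)?"""
    namespace: dict = {}
    try:
        exec(compile(source, "<candidate>", "exec"), namespace)
        return namespace["add"](2, 3) == 5
    except Exception:
        return False


# A candidate that is syntactically fine but subtracts instead of adding.
candidate = "def add(a, b):\n    return a - b\n"
print(compiles(candidate), passes_behavior_check(candidate))
```

The candidate earns the first reward and fails the second, which is the indirection the comment is pointing at: the cheaper the check, the further it sits from intent.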
It's astonishing that it took us all this time to internalize and start to discover what I think will be in hindsight a very obvious 'threshold' of its capabilities.
We are constantly 'amazed' at the work that it can do, and therefore over-project its capabilities.
I have no doubt that even with these limitations - the AI will unlock a lot more as it gets better - and - that it will 'creep up' the layers of abstraction of its understanding.
But I strongly believe that the AI is going to get much 'wider' (pattern matching dominance) before it gets 'higher' (intrinsic understanding) - and - that this may be a fundamental limitation.
This may be 'the LeCun' insight - when he talks about the limitations of LLMs in detail - I believe this is that insight writ large.
Even the term AI - or certainly 'AGI' may be a misleading metaphor - were we to have always called it 'Stochastic Algorithms' or something along those lines, it's possible that our intuition would be framed a bit better.
The most interesting thing is how it is definitely amazing, world changing, novel and powerful in some ways - and obviously useless in others at the same time. That's the 'threshold' we need to better understand.
> But 'Drawing of a Proper Duck' is almost arbitrary because it may have nothing to do with the 'Specific Duck You Wanted'.
That might be the case, but Simon's case "Generate an SVG of a pelican riding a bicycle" is very different.
The model actually has to understand what parts of a pelican and bicycle come together in something like an anatomically plausible way. That's a higher level of abstraction than something like passing the same prompt to Stable Diffusion etc
(The new Nano Banana/GPT Image 2.0 models are different though - they have significant world knowledge baked in)
"That's a higher level of abstraction"
No, it's not because it's seen 'anatomy' for Pelicans, Animals - even how it's represented in Animals.
If you try to get the AI to actually decompose it and start to 'draw pelicans' in very obscure ways, it will immediately fail.
Try to get the AI to draw the pelican from a very odd angle - like underneath, to the right, one wing extended, one wing not ... 0% chance.
Precisely because it does not understand those things.
FYI it's a slightly unfair case because it does not have 'world model' yet, which will actually solve that problem, but even then not through very much abstracting.
We're a long way away - but in the meantime, there's lots to unpack.
> That might be the case, but Simon's case "Generate an SVG of a pelican riding a bicycle" is very different.
When it was new, sure. Right now, models can be trained on that because everybody uses it as a benchmark.
Wow! Actually a sensible comment under all the astroturfing that even this place is so full of now.
I think that there's a lot to be improved in harnesses and the way the models are interacting with harnesses. For example, the harness should be able to steer the model when thinking.
This is most impressive because the last 6 months in LLMs has actually been more like a hyper-compression of decades of tech progress.
"... though when it tried to animate it the bicycle bounced off into the top and the bicycle got warped."
Should be the pelican bounced off.
Spot on. Building our tool, we found AI is magic at scraping competitor data, but terrible at market validation. The 'why' is strictly human.
So, the best way to use LLMs is to wait for your competitors to do market validation and then scrape their data.
Hmmm......
It's always been much easier to copy an existing product than to make a new one nobody's thought of before.
Sorry, but how does this comment relate to the post it's commenting on?
https://imgur.com/a/UlGcBou
> I put together these annotated slides from my five minute lightning talk at PyCon US 2026
Is there a video or audio of this talk?
Is the RLVR the key breakthrough for the uplift or is there more to it?
Does that suggest the uplift was only for things that are easily verifiable like code?
Yes, with good RLVR at scale you can greatly improve performance especially on benchmarks
The hope was that good RLVR on relatively contrived datasets (like benchmarks) would be generalized to good software taste, which has somewhat succeeded but also the models fail in horrible ways still
And the hope beyond that is that good skills in fundamental problem solving tasks (coding, math) would generalize to tasks beyond math and code, which did happen but less so
I would say that most improvements are in easily verifiable things like code or math. At least that's where all the amazing results seem to be coming from.
Other domains I am not sure about, but I've heard from people like Cal Newport that the rate of increase outside of code and math is not as impressive.
RL we're gonna find out will get abandoned cuz we don't even know what is getting "aligned", just my naive gut feeling don't take it seriously
>pelicans can’t ride bicycles... and there’s zero chance any AI lab would train a model for such a ridiculous task.
humm
Is there a video of this talk?
The claw thing really came and went fast lol
I just started a new job and the person I report to was just excited to tell me about it, here in Mid May
"and then you have to get a mac mini, and then, and then"
smile and nod, it pays weekly
I mean yeah? It was a marketing campaign to boost the model providers and give Steinberger a cozy job at OpenAI. Hook, line and sinker.
Wake me up when we have an agent with constant learning and changing weights that I can have personally, not some LLM that can always fall prone to jailbreak and context injection attacks.
You think most of this stuff here is organic? Oh boy..
AI is like Sauron's Ring: it only amplifies the user's innate abilities.
It can either help you conquer the world if you were already doing that anyway or it can make you spend your life in a cave before throwing you into a fucking volcano.
It definitely seems like the point of no return has been passed.
The size of the codebase doesn't matter anymore. In fact, I am finding that the larger the codebase the better the performance. Starting from scratch with vague ambition is not the same as solving a specific stack trace over a mountain of decade-old code. The latter performs better and is also more exciting for the business. It would seem more callers = more constraints to verify against.
For the last 3 months I've felt like I've been dropping GPS-guided bombs from orbit. No one can tell the difference between AI-authored and my handwritten code, other than via the implication of the radically increased daily work volume. There's definitely AI in there, but it's like a homogeneous cybernetic blend of my work and the computer's. I own all of it, can explain all of it immediately, but I only wrote maybe 10% of it by hand.
The development team should be mostly "solved" by now with regard to the AI transformation. If you are still at Home Depot picking out your proverbial hammer, it's time to start heading for the self checkout. The rest of the business is where the real money and headlines will be made at this point. AI writing code is ancient news now. Custom harnesses that business people can use to automate workflows will print a lot more money. Bringing some bacon to the rest of the business may also help to preserve your career path in these uncertain times.
Remember what Jobs said about the customer. A lot of times, people don’t know what they want until you show it to them. Most people wouldn't have believed the iPhone was even remotely possible until the moment it was publicly revealed and made available for purchase. I am finding the same effect in the business with AI. What it can actually do when well engineered and applied to the domain will usually outperform the expectations of its users by a wide margin. All these fears about alignment, hallucinations, cost, ethics, the environment, my ego/career, etc., seem to melt away like some kind of luxurious chocolate once the performance becomes clear to the executive staff. I was able to convince the board with an unsolicited, 5 minute demo I didn't even personally deliver. I've never seen these people sign contracts so quickly.
As someone who uses AI daily (not in agent mode, just user-interactive), I have definitely noticed major quality improvements over the past few months. And that's surprising, because when you use something daily, you tend to overlook the big jumps.
I haven't looked into any sort of "agent" mode, just because I don't yet quite trust the AI not to do something dumb. Also, I don't use M365, where Copilot is integrated, so I suppose I would have to set it up myself.
What real world problem is closely linked to the skill of drawing a pelican riding a bicycle?
I'm always surprised to see HN people saying models aren't good. What are these guys building? The best engineers I know, from startups to big tech, admit these models are incredible. Including people I don't know personally, foundational engineers from every area. The average HN person, though, is doing some quantum-alien computation that not even the best developers in the world can grasp.
I'm so glad Simon is documenting this. The field is evolving so fast, so rapidly, so hungry for data and money, that few are willing to zoom out and document everything big picture so we can see the changes over time. I mean do you guys remember "Do anything now"? Just a distant memory, a funny party trick.
TL;DR:
"Coding agents got really good - here, a bunch of irrelevant slop-pictures of pelicans riding bikes as a key benchmark AND a couple of hardly relevant edge-case demo-projects of mine to prove it right!"
Come on man, where is the AI writing all the code in 6 months? We're close to June, and Amodei's latest statement from January does not look like it will be fulfilled over the next few weeks, does it now?
a 5-minute video version (with local TTS model) https://tldr-api.manatee.work/v/dmYg0U
Does this guy have a "publish to front page of HN" button on his blog editor?
HN has a mechanism that causes popular blogs to stay popular.
It's a winner-takes-all karma prize for being first to post the article.
This causes a rush of people to post.
HN has a mechanism by which duplicate submissions count as upvotes toward the first submission.
This is a positive feedback loop for the desire to be first, which increases duplicate submissions and in turn the karma reward.
This effect means that good blogs stay well upvoted. This isn't altogether a bad thing, but it does mean some blogs require a string of poorly received posts before that effect wears off and people no longer rush to be first.
One way to fix this would be to attribute all karma to user simonw himself (and do similar where attribution to an HN user is known).
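A toy model of the feedback loop described above, purely as a sketch: it takes this comment's description at face value (duplicates count as upvotes on the first submission, and every point on a submission becomes karma for whoever posted it first). `ToyFrontPage`, `Submission`, and the example URL are invented names, and none of this is HN's actual code or scoring.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Submission:
    submitter: str   # whoever posted the URL first
    points: int = 1  # a new submission starts with one point


@dataclass
class ToyFrontPage:
    submissions: Dict[str, Submission] = field(default_factory=dict)
    karma: Dict[str, int] = field(default_factory=dict)

    def submit(self, url: str, user: str) -> Submission:
        """Record a submission; a duplicate becomes an upvote on the original."""
        existing = self.submissions.get(url)
        if existing is None:
            existing = Submission(submitter=user)
            self.submissions[url] = existing
        else:
            # Assumption from the comment: the duplicate counts as an upvote.
            existing.points += 1
        # Assumption: every point is credited to the first submitter only.
        first = existing.submitter
        self.karma[first] = self.karma.get(first, 0) + 1
        return existing


if __name__ == "__main__":
    page = ToyFrontPage()
    for user in ["alice", "bob", "carol"]:
        page.submit("https://example.com/post", user)
    print(page.submissions["https://example.com/post"].points, page.karma)
```

In this model only the first poster is ever credited, which is the winner-takes-all incentive to rush. The proposed fix would amount to crediting the known author in `submit` instead of `existing.submitter`.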
He’s pretty well known in the HN community. https://en.wikipedia.org/wiki/Simon_Willison
That's a cool wiki picture.
I didn't even submit this one. I didn't actually think this was a good fit for Hacker News; the pelican bicycle thing is pretty much played out here already!
I liked the article, so if he has such a button I hope he keeps clicking it.
He's one of the main developers behind Django.
Years ago I used to read his blog on Django and found it quite interesting despite being neither a Django nor even a Python user. This must have been at least 10 years ago and perhaps more.
When he resurfaced in my feeds as an AI commentator it took me quite a long while to join the dots that he was the same person!
It's better than the ex-Google CEO spam I see astroturfed everywhere else.
He usually has good posts, so people usually upvote.
He has the most measured (and often quite detailed) posts on LLM and LLM progress, and is the opposite of hype.
We've banned this account.
We detached this comment from https://news.ycombinator.com/item?id=48189072 and marked it off topic.
Certainly a massive AI booster. What are the conflicts of interest?
I met Simon for the first time this year at PyCon. Wow, what a great guy.
It's good to see dates being hard-coded for improvements in the models that should deliver material gains.
As time progresses, one now has a yardstick to measure progress against. No more excuses - show me the money baby.
There's something fitting about the mystical nature of LLMs and scrolling through a bunch of goofy pelicans on bicycles representing report cards for the bleeding edge of technology.
How are these even graded? Qwen3.6-35B-A3B gets high marks for a pelican with a gaping hole in its bill?
edit: Just noticed its feet are disconnected from its legs as well (but right on the pedals!). Pardon my French but that's Chinese af.