Comment by timr
8 hours ago
> And now, with Opus 4.5 or Codex Max or Gemini 3 Pro we can write substantial programs one-shot from a single prompt and they work. Amazing!
People have been doing this parlor trick with various "substantial" programs [1] since GPT 3. And no, the models aren't better today, unless you're talking about being better at the same kinds of programs.
[1] If I have to see one more half-baked demo of a running game or a flight sim...
"And no, the models aren't better today"
Can you expand on that? It doesn't match my experience at all.
It’s a vague statement that I obviously cannot defend in all interpretations, but what I mean is: the performance of models at making non-trivial applications end-to-end, today, is not practically better than it was a few years ago. They’re (probably) better at making toys or one-shotting simple stuff, and they can definitely (sometimes) crank out shitty code for bigger apps that “works”, but they’re just as terrible as ever if you actually understand what quality looks like and care to keep your code from descending into entropy.
I think "substantial" is doing a lot of heavy lifting in the sentence I quoted. For example, I’m not going to argue that aspects of the process haven’t improved, or that Claude 4.5 isn't better than GPT 4 at coding, but I still can’t trust any of the things to work on any modestly complex codebase without close supervision, and that is what I understood the broad argument to be about. It's completely irrelevant to me if they slay the benchmarks or make killer one-shot N-body demos, and it's marginally relevant that they have better context windows or now hallucinate 10% less often (in that they're more useful as tools, which I don't dispute at all), but if you want to claim that they're suddenly super-capable robot engineers that I can throw at any "substantial" problem, you have to bring evidence, because that's a claim that defies my day-to-day experience. They're just constantly so full of shit, and that hasn't changed, at all.
FWIW, this line of argument usually turns into a mott and bailey fallacy, where someone makes an outrageous claim (e.g. "models have recently gained the ability to operate independently as a senior engineer!"), and when challenged on the hyperbole, retreats to a more reasonable position ("Claude 4.5 is clearly better than GPT 3!"), but with the speculative caveat that "we don't know where things will be in N years". I'm not interested in that kind of speculation.
Have you spent much time with Codex 5.1 or 5.2 in OpenAI Codex or a Claude Opus 4".5 in Claude code over the last ~6 weeks?
I think they represent a meaningful step change in what models can build. For me they are the moment we went from building relatively trivial things unassisted to building quite large and complex system that take multiple hours, often still triggered by a single prompt.
Some personal examples from the past few weeks.
- A spec-compliant HTML5 parsing library by Codex 5.2: https://simonwillison.net/2025/Dec/15/porting-justhtml/
- A CLI-based transcript export and publishing tool by Opus 4.5: https://simonwillison.net/2025/Dec/25/claude-code-transcript...
- A full JavaScript interpreter in dependency/free Python (!) https://github.com/simonw/micro-javascript - and here's that transcript published using the above-mentioned tool: https://static.simonwillison.net/static/2025/claude-code-mic...
- A WebAssembly runtime in Python which I haven't yet published
The above projects all took multiple prompts, but were still mostly built by prompting Claude Code for web on my iPhone in between Christmas family things.
I have a single-prompt one:
- A Datasette plugin that integrates Cloudflare's CAPTCHA system: https://github.com/simonw/datasette-turnstile - transcript: https://gistpreview.github.io/?2d9190335938762f170b0c0eb6060...
I'm not confident any of these projects would have worked with the coding agents and models we had had four months ago. There is no chance they would've worked with the January 2025 available models.