Comment by mattlangston

8 days ago

PhD physicist (Stanford/SLAC), Research Software Engineer doing low-level systems work in C/C++ and LLM research. Not a founder or investor — just a practitioner.

One data point for this thread: the jump from Opus 4.5 to 4.6 is not linear. The minor version number is misleading. In my daily work the capability difference is the largest single-model jump I've experienced, and I don't say that casually — I spent my career making precision measurements.

I keep telling myself I should systematically evaluate GPT-5.3 Codex and the other frontier models. But Opus is so productive now that I can't justify the time. That velocity of entrenchment is itself a signal, and I think it quietly supports the author's thesis.

I'm not a doomer — I'm an optimist about what prepared individuals and communities can do with this. But I shared this article with family and walked them through it in detail before I ever saw it on HN. That should tell you something about where I think we are.

one feels the llm wow moment whenever something they do in an area gets surpassed by an llm. newer versions of llms are probably trained on feedback from developer code-agent sessions; that's probably why pro developers have only recently started to feel the "wow".

the real challenge will be at the frontier of human knowledge, and whether llms will be able to push it forward or not.

ps1: i'm using 5.3/o4.6/k2.5/m2.5/glm5 and others daily for development, so my work has intensified about 1.5x; i tackle increasingly harder problems, but llms still fail badly at brand-new challenges, just as i do. so i'm more alert than ever.

ps2: syntactic autocomplete used to write 80% of my code; now llms have replaced autocomplete, but at a semantic level. i think, and the llm implements most of my actions, like a cerebellum handling muscle coordination; sometimes it even teaches me new info from the net.

  • The frontier-of-knowledge point is the right question. My own research is a case in point: I apply experimental physics methods to LLMs, measuring their equations of motion in search of a unified framework for how and why they work. Some of the answers I'm looking for may not exist in any training data.

    That's where the 4.5->4.6 jump hit me hardest: not routine tasks, but problems where I need the model to reason about stuff it hasn't seen. It still fails, but it went from confidently wrong to productively wrong, if that makes sense. I can actually steer it now.

    The cerebellum analogy resonates. I'd go further: it's becoming something I think out loud with, which is changing how I approach problems, not just how fast I solve them.

    • That shift in wrongness is the frontier labs working to remove their benchmaxxing bias: the models now have a concept of "I don't know" and rethink directions and goals more willingly. There was a lot of research on this topic last year, and it typically takes 6 to 12 months before it reaches models for general consumption.

      2026 will see further improvements for you.

If you use Claude Code, it will take you half a day to learn Codex and about 30 minutes to start being productive in it. The switching cost is almost zero. Just go test GPT 5.3; there is no reason not to.

  • It's a bit more than zero, because I have substantial tooling around Claude Code – subagents, skills, containerization, &c – that I'd have to (have Opus...) reimplement.