Comment by nl
7 hours ago
One of the interesting things to me about this is that Codex 5.2 found the most complex of the exploits.
That reflects my experience too. Opus 4.5 is my everyday driver - I like using it. But Codex 5.2 with Extra High thinking is just a bit more powerful.
Also, despite what people say, I don't believe progress in LLM performance is slowing down at all - instead, we're having more trouble generating tasks that are hard enough, and the frontier tasks they fail at or just barely manage are so complex that most people outside the specialized field aren't interested enough to sit through the explanation.
The “hard enough” tasks are all behind IP walls. If a task is “hard enough”, that generally means it’s a commercial problem, likely involving disparate workflows and requiring a real human who probably isn’t a) inclined and/or b) permitted to publish the task. The incentives are aligned to capture all the value from solving that task for as long as possible and only publish afterwards.
I solve plenty of hard problems as a hobby
The Anthropic models are great workers/tool users. OpenAI Codex High is a great reviewer/fixer. Gemini is the genius repainting your bathroom walls into a Monet from memory because you mentioned once, a few weeks ago, that you liked classical art and needed to repaint your bathroom. Gemini didn’t mention the task or that it was starting it. Afterwards, you had to admit it did a pretty good job.
Disagree about Codex - it's great at doing things too!
Gemini either does a Monet or demolishes your bathroom and builds a new tuna fishing boat there instead, and it is completely random which one you get.
It's a great model, but I rarely use it because what you get is so unpredictable.
GPT models are crazy good. They just take forever.