Comment by nl
7 hours ago
One of the interesting things to me about this is that Codex 5.2 found the most complex of the exploits.
That reflects my experience too. Opus 4.5 is my everyday driver - I like using it. But Codex 5.2 with Extra High thinking is just a bit more powerful.
Also, despite what people say, I don't believe progress in LLM performance is slowing down at all - instead, we're having more trouble generating tasks that are hard enough, and the frontier tasks they fail at or just barely manage are so complex that most people outside the specialized field aren't interested enough to sit through the explanation.
The “hard enough” tasks are all behind IP walls. If a task is “hard enough”, that generally means it’s a commercial problem, likely involving disparate workflows and requiring a real human who probably isn’t a) inclined and/or b) permitted to publish the task. The incentives are aligned to capture all the value from solving that task for as long as possible and only publish afterwards.
I solve plenty of hard problems as a hobby
The Anthropic models are great workers/tool users. OpenAI Codex High is a great reviewer/fixer. Gemini is the genius repainting your bathroom walls into a Monet from memory because you mentioned once, a few weeks ago, that you liked classical art and needed to repaint your bathroom. Gemini didn’t mention the task or that it was starting it. Afterwards, you had to admit it did a pretty good job.
Disagree about Codex - it's great at doing things too!
Gemini either does a Monet or demolishes your bathroom and builds a new tuna fishing boat there instead, and it is completely random which one you get.
It's a great model, but I rarely use it because what you get is so unpredictable.
GPT models are crazy good. They just take forever.