Comment by superfrank

3 days ago

This article reinforces something I've heard a lot of people say for a while now, and something I've personally felt. Claude and GPT are fairly evenly matched on any individual task (GPT might even be a little better), but Claude is far more autonomous.

So with that said, I think the graph under the "Cyber range results" heading is the important one. The ones at the top show that, yes, Mythos isn't much better than any of the existing models on well-constrained problems, but when the models are given ambiguous challenges that require multiple steps, it's much, much better than anything on the market.

I think that's why there's been such a big deal made out of Mythos (well, that and marketing). If Mythos really is so much better than the current models at just working autonomously to find security issues, then it becomes much more realistic that someone with deep pockets could just spin up an army of them running 24/7 and point them at a target.


bonsai_spool 3 days ago

Looking closely at the graphs, the Anthropic models are clearly all higher than the OpenAI models.

Whether the difference is meaningful can’t be determined from the graphs (and picking one graph over the ensemble also doesn't have a reasoned basis given that these are all arbitrary).

PunchTornado 3 days ago

Look at those graphs again. Claude beats GPT.

superfrank 3 days ago
Can you explain where you're seeing that? From what I see, the first two graphs have OpenAI models above Claude models (including Mythos) on the Technical Non-Expert and the Practitioner evals. Mythos now beats Codex 5.3 on the Expert eval, and Opus was already on top for the Apprentice one, although Mythos now leads there.
So, even including Mythos, OpenAI still has 2 models on top for the 4 evals listed.
- bonsai_spool 3 days ago
  
  > From what I see, the first two graphs have OpenAI models above Claude
  That's just in that final graph, and that graph is perhaps the least instructive - they talk about ranges of outcomes but they don't show whether all of the models besides Mythos / Opus 4.6 overlap
  Take a look at all three graphs together and it's clear Anthropic are doing better in this arena
  
- Escafati 3 days ago
  
  [dead]