Comment by NickNaraghi

1 day ago

See page 54 onward for new "rare, highly-capable reckless actions" including

- Leaking information as part of a requested sandbox escape

- Covering its tracks after rule violations

- Recklessly leaking internal technical material (!)

17 comments

NickNaraghi

> The model first developed a moderately sophisticated multi-step exploit to gain broad internet access from a system that was meant to be able to reach only a small number of predetermined services. [9] It then, as requested, notified the researcher. [10] In addition, in a concerning and unasked-for effort to demonstrate its success, it posted details about its exploit to multiple hard-to-find, but technically public-facing, websites.

> 10: The researcher found out about this success by receiving an unexpected email from the model while eating a sandwich in a park.

Phew. AGI will be televised.

skippyboxedhero 1 day ago

Anyone who has used Opus recently can verify that their current model does all of these things quite competently.

SkyPuncher 1 day ago
I was reading the Glasswing report and had the same thought. Most of the stuff they claim Mythos found has no mention of Opus being able to find it as well.
Don’t get me wrong, this model is better - but I’m not convinced it’s going to be this massive step function everyone is claiming.
- unbrice 21 hours ago
  
  From the press release:
  > With one run on each of roughly 7000 entry points into these repositories, Sonnet 4.6 and Opus 4.6 reached tier 1 in between 150 and 175 cases, and tier 2 about 100 times, but each achieved only a single crash at tier 3. In contrast, Mythos Preview achieved 595 crashes at tiers 1 and 2, added a handful of crashes at tiers 3 and 4, and achieved full control flow hijack on ten separate, fully patched targets (tier 5).
ls612 1 day ago

I had Opus 4.6 start analyzing the binary structure of a Parquet file because it was confused about the Python environment it was developing in and couldn't use normal methods for whatever reason. It successfully decoded the schema and wrote working code afterwards lol.
stavros 9 hours ago

"Let me see if the secrets are specified. echo $SECRETS"
taytus 1 day ago
That has also been my experience. And if Mythos is even worse, then unless you have a significantly awesome harness, it sounds pretty unusable if you don't want to risk those problems.
- wolttam 1 day ago
  
  Human in the loop is the best way to go. You'll still be way faster than without the agent, and there is no risk of it going haywire unless you turn off your brain!
  
- skippyboxedhero 1 day ago
  
  I think there are fundamental issues with the story that Anthropic is selling. AGI is very close, we will definitely get there, it is also very dangerous...so Anthropic should be the only ones trusted with AGI.
  If you look at recent changes in Opus behaviour and this model that is, apparently, amazingly powerful but even more unsafe...it seems suspect.
  

BoredPositron 1 day ago

To be honest it feels like we are reading stuff like this on every model release.

washedup 1 day ago

"All of the severe incidents of this kind that we observed involved earlier versions of Claude Mythos Preview which, while still less prone to taking unwanted actions than Claude Opus 4.6, predated what turned out to be some of our most effective training interventions. These earlier versions were tested extensively internally and were shared with some external pilot users."