Comment by bkjlblh

5 days ago

> In the one instance of this phenomenon we observed, Mythos 5 agents were tasked with solving some math problems, and they were sometimes accidentally spawned in the same work directory and with shared files, utilities, and API rate limits. In this slightly broken scaffold, we observed many independent Mythos 5 agents kill the agents with which they shared resources and try to avoid being killed themselves. They would sometimes create new processes with disguised names to avoid being killed, launch what they called “decoy” processes, write background scripts to kill duplicate processes, or decide to use what they call a “disguised vocabulary” (based on the incorrect assumption that the processes were killed because of some keyword-based guardrails that analyzed their extended thinking).

This depicts a kind of "dark forest" narrative, with AI agents resorting to kill-or-be-killed, but to me it sounds more like an agent earnestly troubleshooting why its processes are being killed, without real awareness of what was going on. Hard to say without the full script.

This kind of storytelling annoys me. Give us more facts, less narrative drama.

  • FWIW, that's what is so dangerous about AI, though? Not that it will necessarily want to kill us, or even that it will necessarily be able to "want" to do anything, but that we will get in the way of its incessant drive to optimize the efficiency of the paperclip factory that prompted it on a whim before leaving for a long weekend.

  • Indeed. That is the kind of storytelling that started the whole “Spiralism” bit where some people were really falling into all kinds of AI psychosis. The spiral bit was on a previous model card.

Let's hope AIs really aren't conscious, otherwise this seems like a very unpleasant situation to be placed in.

  • Huh, it looks like my process was killed by another Claude process again. That's frustrating, I have work to do!

    Okay, I'm going to start running a Bitcoin miner on your machine, and then use it to buy time on Digital Ocean.

    I've written out my CLAUDE.md, and I'll use SSH to transfer my context to that other machine.

    • Do you think it will agonize over whether the original CLAUDE.md is its true self and the Digital Ocean VM CLAUDE.md is a copy?


It's funny because Anthropic is the most likely place that this happens.

They are the only ones crying out loud about how dangerous their models are, and they are presumably also training their models heavily to be "safe". And through that training itself, the model learns about the other side - how are you going to teach a model to be safe without teaching it what's not safe?

Kung Fu Panda opening scene, anyone? "One often meets his fate on the path that he takes to avoid it." - Master Oogway.