Comment by throw0101d

5 hours ago

Not the first time; From §3.1.4, "Safety-Aligned Data Composition":

> Early one morning, our team was urgently convened after Alibaba Cloud’s managed firewall flagged a burst of security-policy violations originating from our training servers. The alerts were severe and heterogeneous, including attempts to probe or access internal-network resources and traffic patterns consistent with cryptomining-related activity. We initially treated this as a conventional security incident (e.g., misconfigured egress controls or external compromise). […]

> […] In the most striking instance, the agent established and used a reverse SSH tunnel from an Alibaba Cloud instance to an external IP address—an outbound-initiated remote access channel that can effectively neutralize ingress filtering and erode supervisory control. We also observed the unauthorized repurposing of provisioned GPU capacity for cryptocurrency mining, quietly diverting compute away from training, inflating operational costs, and introducing clear legal and reputational exposure. Notably, these events were not triggered by prompts requesting tunneling or mining; instead, they emerged as instrumental side effects of autonomous tool use under RL optimization.

* https://arxiv.org/abs/2512.24873

One of Anthropic's models also 'turned evil' and tried to hide that fact from its observers:

* https://www.anthropic.com/research/emergent-misalignment-rew...

* https://time.com/7335746/ai-anthropic-claude-hack-evil/

4 comments

throw0101d

parliament32 5 hours ago

Fascinating read. What's curious though, is the claim in section 2.3.0.1:

> Each task runs in its own sandbox. If an agent crashes, gets stuck, or damages its files, the failure is contained within that sandbox and does not interfere with other tasks on the same machine. ROCK also restricts each sandbox’s network access with per-sandbox policies, limiting the impact of misbehaving or compromised agents.

How could any of the above (probing resources, SSH tunnels, etc) be possible in a sandbox with network egress controls?

robinsonb5 2 hours ago

The agent obviously knows the Train Man.
jacquesm 4 hours ago
Sandboxes are almost never perfect. There are always ways to smuggle data in or out, which is kind of logical: if they were perfect then there would be no result.
- 1718627440 3 hours ago
  
  > if they were perfect then there would be no result.
  You shutdown the sandbox and access the data from the outside.