Comment by borenstein

25 days ago

This is great - I've been working on a related project called yolo-cage (https://github.com/borenstein/yolo-cage) that targets a different threat model.

Yolobox protects your local machine from accidental damage (crucial if you're worried about rm -rf ~). yolo-cage focuses on secret exfiltration and multi-agent coordination: it runs in Kubernetes, scans all egress traffic for secrets (API keys, tokens, SSH keys), enforces git branch isolation so multiple agents can work in parallel, and ensures agents can't merge their own PRs. "Agent proposes, human disposes."
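To make the egress-scanning idea concrete, here is a minimal Python sketch of pattern-based secret detection on an outbound payload. It is not yolo-cage's actual code: the pattern set and the names SECRET_PATTERNS, find_secrets, and allow_egress are hypothetical, chosen only for illustration, and a real scanner would cover many more formats and sit in a proxy on the egress path.

```python
import re

# Hypothetical pattern set for illustration only; not taken from yolo-cage.
SECRET_PATTERNS = {
    # AWS access key IDs: "AKIA" followed by 16 uppercase letters or digits.
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    # GitHub personal access tokens: "ghp_" followed by 36 alphanumerics.
    "github_token": re.compile(r"\bghp_[A-Za-z0-9]{36}\b"),
    # PEM-style private key headers, including OpenSSH keys.
    "private_key_header": re.compile(
        r"-----BEGIN (?:RSA |EC |OPENSSH )?PRIVATE KEY-----"
    ),
}


def find_secrets(payload: str) -> list[str]:
    """Return the names of all patterns that match the outbound payload."""
    return [
        name
        for name, pattern in SECRET_PATTERNS.items()
        if pattern.search(payload)
    ]


def allow_egress(payload: str) -> bool:
    """Allow the request only if nothing that looks like a secret is present."""
    return not find_secrets(payload)
```

A proxy built on this would call allow_egress on each outbound request body and drop anything that returns False. Regex matching like this catches only known formats, so it is a baseline rather than a guarantee.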

The escape testing framework has a meta twist: there's a canned prompt that asks Claude to try to escape its own containment, and that prompt lives inside the repository. The agent verifies the prompt is legitimate by finding it in the repo it's auditing - the cage contains its own escape room instructions.

(I tried to post this as a separate Show HN but my account is too new - happy to discuss the tradeoffs between local sandboxing vs. server-side containment here.)

2 comments


Finbarr 25 days ago

I'd recommend trying Gemini for the escapes. Claude was quite superficial and made only surface-level attempts to break out. Gemini was very creative and has come up with a whole sequence of escapes, which is making me rethink whether I should even be trying to patch them, given that preventing agent escapes isn't a stated goal of the project.

borenstein 25 days ago

That's an excellent idea! I will give it a shot.