Comment by bob1029
20 hours ago
My approach to safety at the moment is to mostly lean on alignment of the base model. At some point I hope we realize that the effectiveness of an agent is roughly proportional to how much damage it could cause.
I currently apply the same strategy we'd use if a senior developer or CTO went off the deep end: VM snapshots, point-in-time recovery (PITR) for databases and file shares, locked-down master branches, etc.
I wouldn't spend a bunch of energy inventing an entirely new kind of prison for these agents. I would focus on the same mitigation strategies that address a malicious human developer. Running VirtualBox on a sensitive host that another human is using is not how you'd go about it. Giving the developer a cheap cloud VM or physical host they can completely own is more typical. Locking things down at the network level is one of the simplest and most effective methods.
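To make the network point a bit more concrete, here is a rough sketch of a default-deny egress policy for the agent's VM. The address ranges and the `egress_allowed` helper are made up for illustration; in practice the rule would live in a firewall or cloud security group rather than in Python, and this only shows the shape of the policy.

```python
import ipaddress

# Hypothetical allowlist: the only destinations the agent's VM may reach.
# Both ranges are placeholders, not real infrastructure.
ALLOWED_NETWORKS = [
    ipaddress.ip_network("10.0.5.0/24"),    # assumed internal build/test subnet
    ipaddress.ip_network("192.0.2.10/32"),  # assumed single package-mirror host
]


def egress_allowed(dest_ip: str) -> bool:
    """Return True only if dest_ip falls inside an allowlisted network.

    Anything not explicitly listed is denied (default-deny).
    """
    addr = ipaddress.ip_address(dest_ip)
    return any(addr in net for net in ALLOWED_NETWORKS)


if __name__ == "__main__":
    for ip in ("10.0.5.17", "192.0.2.10", "8.8.8.8"):
        print(ip, "allowed" if egress_allowed(ip) else "denied")
```

The point is that the allowlist is short and everything else is dropped, so a misbehaving agent on a throwaway VM has nowhere interesting to go.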