Comment by saagarjha

15 hours ago

Doing this in general is really hard. Unfortunately the blog post doesn't really go into detail about how hard, though it does mention some cases. For example, if you run your agent in a VM with network access, it can come across something that prompt-injects it into embedding a secondary prompt injection in the artifact that comes out of the VM, which then infects your local, more privileged agent.

Another case that came up when we were doing computer use analysis at a previous role was figuring out whether user input could be trusted. Generally, if the user typed it, that would be OK, but what about the user's files? Or their calendar events? Well, the whole point of the product was that the agent would manage those for you, which meant they could no longer be trusted to be free of injections. (Hey, can you look up when the Super Bowl is and remind me to book plane tickets for that weekend?) If you do this kind of taint analysis, you will quickly find that it's super difficult to stop this kind of thing, and just putting a sandbox or VM around things often does not help.
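
To make the taint argument concrete, here is a minimal sketch of how trust might propagate through an agent's data. This is not the analysis we actually ran; `Trust`, `Tainted`, `combine`, and `may_act_on` are hypothetical names, and the policy (derived data is only as trusted as its least trusted input) is an assumption chosen for illustration.

```python
from dataclasses import dataclass
from enum import IntEnum


class Trust(IntEnum):
    # Hypothetical trust levels, lowest first.
    UNTRUSTED = 0  # web pages, VM artifacts, anything an attacker could write
    USER = 1       # text the user typed directly


@dataclass(frozen=True)
class Tainted:
    text: str
    trust: Trust


def combine(*parts: Tainted) -> Tainted:
    """Derived data is only as trusted as its least trusted input."""
    return Tainted(
        text="\n".join(p.text for p in parts),
        trust=min(p.trust for p in parts),
    )


def may_act_on(data: Tainted) -> bool:
    """Assumed policy: privileged actions require user-typed input only."""
    return data.trust >= Trust.USER


request = Tainted(
    "Remind me to book plane tickets for Super Bowl weekend", Trust.USER
)
# Content fetched from the network could carry an injection.
web_result = Tainted("(page text fetched while looking up the date)", Trust.UNTRUSTED)

# The agent writes a calendar event derived from both inputs,
# so the event inherits the lower trust level.
calendar_event = combine(request, web_result)

print(may_act_on(request))         # True
print(may_act_on(calendar_event))  # False
```

Under this policy the calendar stops being trustworthy the moment the agent writes anything web-derived into it, and nearly everything useful the agent does is web-derived. That is the sense in which a sandbox or VM does not help: the taint travels with the data across the boundary.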

