Matchlock[0] is probably the best solution I've come across so far WRT problems 1 and 2:
> Matchlock is a CLI tool for running AI agents in ephemeral microVMs - with network allowlisting, secret injection via MITM proxy, and VM-level isolation. Your secrets never enter the VM.
In a nutshell, it solves problem #2 through a combination of a network allowlist and per-host secret masking/injection. Secrets are never actually exposed inside the sandbox: a placeholder string is used inside the sandbox, and the MITM proxy layer replaces the placeholder with the actual secret outside the sandbox before forwarding the request on to its original destination.
Furthermore, because secrets are available to the sandbox only on a per-host basis, you can specify that you want to share OPENAI_API_KEY only with api.openai.com, and that is the only host for which the placeholder string will be replaced with the actual secret value.
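The per-host replacement described above can be sketched roughly like this (illustrative Python only, not Matchlock's actual code; `SECRET_MAP` and `rewrite_headers` are made-up names):

```python
# Hypothetical sketch of per-host secret injection at a MITM proxy layer.
# The proxy sees the sandbox's outbound request, and swaps the placeholder
# for the real secret only when the destination host matches the allowlist.

SECRET_MAP = {
    # placeholder seen inside the sandbox -> (real secret, allowed host)
    "MATCHLOCK_PLACEHOLDER_OPENAI": ("sk-real-key", "api.openai.com"),
}

def rewrite_headers(host: str, headers: dict) -> dict:
    """Replace placeholders with real secrets, but only for the allowed host."""
    out = {}
    for name, value in headers.items():
        for placeholder, (secret, allowed_host) in SECRET_MAP.items():
            if placeholder in value and host == allowed_host:
                value = value.replace(placeholder, secret)
        out[name] = value
    return out
```

A request to any other host goes out with the placeholder still in place, so exfiltrating the env var from inside the sandbox only leaks a useless string.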
edit to actually add the link
[0] https://github.com/jingkaihe/matchlock
problem 2 is actually scarier than most people realize because it compounds. your agent reads a README in some dependency, that README has injection instructions, now the agent is acting on behalf of the attacker with whatever permissions you gave it. filesystem sandboxing doesn't help because the dangerous action might be "write a backdoor into the file i already have write access to", which is completely within the sandbox rules.
the short-lived scoped credentials approach someone mentioned upthread is probably the best practical mitigation right now. but even that breaks down when the agent legitimately needs broad access to do its job - like if it's refactoring across a monorepo it kinda needs write access to everything.
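for concreteness, a short-lived scoped credential is basically a signed blob carrying a scope and an expiry (hypothetical sketch, not any real service's API; `mint`/`verify` and the scope format are made up):

```python
# Toy short-lived scoped credential: an HMAC-signed token carrying a scope
# and an expiry timestamp, checked by the verifier before each action.
import base64
import hashlib
import hmac
import json
import time

SIGNING_KEY = b"verifier-side-signing-key"  # held by the verifier, never by the agent

def mint(scope: str, ttl_seconds: float) -> str:
    """Mint a token valid for exactly one scope, expiring after ttl_seconds."""
    payload = json.dumps({"scope": scope, "exp": time.time() + ttl_seconds})
    body = base64.urlsafe_b64encode(payload.encode()).decode()
    sig = hmac.new(SIGNING_KEY, body.encode(), hashlib.sha256).hexdigest()
    return f"{body}.{sig}"

def verify(token: str, needed_scope: str) -> bool:
    """Accept only if the signature, expiry, and scope all check out."""
    try:
        body, sig = token.rsplit(".", 1)
    except ValueError:
        return False
    expected = hmac.new(SIGNING_KEY, body.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        return False
    payload = json.loads(base64.urlsafe_b64decode(body))
    return payload["exp"] > time.time() and payload["scope"] == needed_scope
```

even if an injected agent exfiltrates the token, it stops working in minutes and only ever worked for the one scope it was minted for. which is exactly why the monorepo case hurts: the scope you'd have to mint is "everything".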
i think the actual answer long term is something closer to capability-based security where each tool call gets its own token scoped to exactly what that specific action needs. but nobody has built that yet in a way that doesn't make the agent 10x slower.
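to make the capability idea concrete, here's a toy sketch of per-tool-call tokens (nothing like this exists in the tools discussed here; `CapabilityBroker` is a made-up name):

```python
# Toy capability broker: every tool call gets a fresh single-use token
# naming exactly one (action, resource) pair. The token is consumed on
# first check, so a prompt-injected agent can't replay or widen it.
import secrets

class CapabilityBroker:
    def __init__(self):
        self._grants = {}  # token -> (action, resource)

    def grant(self, action: str, resource: str) -> str:
        """Issue a one-shot token for exactly this action on this resource."""
        token = secrets.token_hex(16)
        self._grants[token] = (action, resource)
        return token

    def check(self, token: str, action: str, resource: str) -> bool:
        """Consume the token; allow only an exact (action, resource) match."""
        grant = self._grants.pop(token, None)
        return grant == (action, resource)
```

the latency cost the comment alludes to comes from the extra round trip to the broker on every single tool call, which is the part nobody has made cheap yet.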
Problem 2 is mitigated by only allowing trusted sources through firewall rules.
I think these are 2 independent axes:

1. Destructive by accident
2. Destructive because it was prompt-injected

And:

1. Fucks up the filesystem
2. Fucks up external systems via credentials