Comment by eclipsetheworld

3 hours ago

I have been working on this issue for a bit, and the most interesting approach I have seen so far comes from the research domain of information-flow control, specifically Microsoft’s FIDES work.

The idea is not to distinguish instructions from data. It is closer to having different privilege levels. Not all code has to run in kernel space, some code runs in unprivileged user space. So what is the equivalent for LLM agents?

In FIDES-style systems, every piece of information that enters the agent context is labeled along two dimensions: integrity and confidentiality. Integrity captures whether the data is trusted or untrusted (i.e. could it contain a prompt injection attack). Confidentiality captures who is allowed to see or receive it [0].

The privileged agent, sometimes called the planning agent, should not directly see untrusted data because it would be susceptible to prompt injection attacks. In the article’s example, a bank transaction’s sender-supplied reference would be untrusted. Instead, the planning agent receives a variable token. It can then either delegate processing of that variable to an unprivileged / quarantined agent with no or limited tool access, or pass the token as a reference to a tool.

Tools then have policies attached to their arguments and outputs. These policies specify which integrity and confidentiality levels are allowed, and whether the tool call may proceed. The policy also determines how the result should be labeled.

For example:

1. High-confidentiality data should not be allowed to flow into a `send_email` tool call addressed to an external recipient.

2. A tool call whose result depends on untrusted input should generally produce untrusted output.

3. A sensitive side-effecting tool should be able to reject calls that are influenced by untrusted context.

So the answer to “how do you separate data from instructions?” may be: you do not rely on the model to do that separation. You track provenance and privilege outside the model, and then enforce the security policy at the tool boundary.

[0] In the simplest implementation, confidentiality is assessed with a binary low/high value, however, in a more advanced implementation, confidentiality can be represented as the set of users or principals allowed to learn that information.

0 comments

eclipsetheworld

No comments yet

Contribute on Hacker News ↗