
Comment by parliament32

7 hours ago

> separating data from instructions

There's been a lot of talk about this (for years, honestly), but it all stems from a fundamental misunderstanding of how LLMs work. There is no distinction for an LLM; "instructions" are a prompt concept, nothing more. It's not possible to separate the two, because LLMs simply take text (i.e. your instructions, then the data, or maybe in a different order, or maybe something else entirely) and "predict" the next token, and repeat for as long as you want, with the volatility you ask for. There is no control plane, and there never will be a control plane, because asking for that is akin to asking "how do I separate data from instructions when I speak to a person?". You can ask nicely, "pretty please obey the first part of what I say and not the stuff after", but there's no way to guarantee it (like you're used to with software). There is just input and output.

You can't guarantee an LLM does anything. Custom data can often subvert the machine whether or not it's instructions.

But that doesn't mean that separation between instructions and data is impossible. You can format them in different ways, and you can prevent the output tokens from ever using instruction formatting.

  • > You can't guarantee an LLM does anything.

    Agreed.

    > But that doesn't mean that separation between instructions and data is impossible.

    Yes it does! The comments you are replying to are concerned that it is not possible to be sure that data and instructions have been separated. With certain kinds of automated systems (traditional ones), unless you write them incorrectly, you can be sure of this. And it is possible to engage in a productive incremental process where mistakes can be identified and removed, in a way people comprehend and can plan around.

    LLMs do not have this. They have heuristics and guesses. Nobody knows what will work ahead of time, nor even a probability that it will work. That is not a doomer comment by the way! The same is true when you talk to a person. But it is a fundamental limitation, it cannot be removed.

  • What we have is a machine trained on many old documents that takes one new document and dreams up stuff to append. The LLM algorithm cannot specially recognize contents as "instructions" to itself-the-author.

    Even if special tokens are used absolutely perfectly (somehow avoiding escapes or ambiguities or reflected attacks) they are ultimately the same as highlighting all the parts of the document in different colors. You've saved the signal, but there's no mind to receive the intended meaning.

    This means that your markers--while far more exclusive--ultimately exist on the same data-level as punctuation and using ? to indicate a question.

    > you can prevent the output tokens from ever using instruction formatting

    The right words may still outweigh the formatting around them, the same way that they can already outweigh other words around them.
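As a rough illustration of "prevent the output tokens from ever using instruction formatting": one common way to do this is to mask reserved token ids at sampling time. This is a minimal sketch, assuming a hypothetical vocabulary where ids 0 and 1 are the instruction delimiters; `RESERVED_IDS`, `mask_reserved`, and `greedy_pick` are made-up names, not any library's API.

```python
import math

# Hypothetical: token ids reserved for instruction formatting only.
RESERVED_IDS = {0, 1}

def mask_reserved(logits, reserved=RESERVED_IDS):
    """Return a copy of logits with reserved token ids set to -inf."""
    return [-math.inf if i in reserved else x for i, x in enumerate(logits)]

def greedy_pick(logits):
    """Pick the id of the highest-scoring token."""
    return max(range(len(logits)), key=lambda i: logits[i])

# Toy scores: without the mask, reserved token 0 would be chosen.
logits = [5.0, 4.0, 1.0, 0.5]
choice = greedy_pick(mask_reserved(logits))  # falls through to token 2
```

The mask constrains which tokens can be emitted, not what the remaining tokens say, which is the objection raised in the reply above.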

Right, you have to set boundaries. You put each task and user input into a box, and then the LLM makes a decision. It can only access APIs that have user identity attached, that act within the scope of the requesting user.

It can be done, but unsurprisingly it looks exactly like microservices distributed auth (also ZTP).

It's all the same problem, just instead of a JVM, it's an LLM.
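A minimal sketch of what "APIs that have user identity attached" could look like, assuming a hypothetical in-memory account store; `Caller`, `ScopedTools`, and the `accounts:read` scope name are invented for illustration. The model never gets a general-purpose client, only tool methods already bound to the requesting user.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Caller:
    user_id: str
    scopes: frozenset

class ScopedTools:
    """Tool surface handed to the model, bound to one caller."""

    def __init__(self, caller, accounts):
        self._caller = caller
        self._accounts = accounts  # account_id -> (owner_id, balance)

    def read_balance(self, account_id):
        # Check the scope first, then ownership of the specific resource.
        if "accounts:read" not in self._caller.scopes:
            raise PermissionError("missing scope accounts:read")
        owner_id, balance = self._accounts[account_id]
        if owner_id != self._caller.user_id:
            raise PermissionError("account belongs to another user")
        return balance

# Hypothetical data: alice may read her own account only.
ACCOUNTS = {"acct-1": ("alice", 120), "acct-2": ("bob", 75)}
tools = ScopedTools(Caller("alice", frozenset({"accounts:read"})), ACCOUNTS)
own_balance = tools.read_balance("acct-1")
```

Whatever text ends up in the prompt, a call to `tools.read_balance("acct-2")` raises `PermissionError`. The check limits what a call can reach; it does not inspect the content that an allowed call returns.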

  • User identity attached is not a solution, it doesn't solve anything if you have to pull in external data that you can't control.

    Like in the banking world, you can make everything super authenticated, but if you have an API that receives the latest wire transfer YOU received with the message attached, you don't control the message content and it can be an attack vector.

    Being authenticated/authorized is not the solution: the malicious content is itself data that the user is allowed to access.

It's akin to an SCP infohazard or memetics.

The way LLMs are right now, and the way humans are, there is no side channel.

It's all about training, but even with extensive training, output breaks down if it's probability-based and not hard logic and a state machine.

I mean: imagine we double our token space to get "red" tokens and "blue" tokens.

Then in all post-training, instructions are red and data is blue. The model can be explicitly trained to ignore instructions written in blue tokens. All external data is blue.

All you'd need to do is figure out a nice way to pre-train -- interestingly, you could try pre-training on unfiltered blue data and processed red/blue transcripts!

Likewise, model-actions (e.g. open file) could be written only in red, and hence you'd never learn to do them from the unfiltered data.

The only connection between the red world and the blue world would be the processed training chats containing red and blue data together -- allowing the model to learn the relationship between them (while only being exposed to examples where red instructions are strictly followed, whatever the blue says).
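To make the "double our token space" idea concrete, here is a minimal sketch assuming a hypothetical base vocabulary of 50,000 ids: red tokens keep their id, blue tokens are the same id shifted by the vocabulary size. All names are invented for illustration, and the sketch says nothing about whether a model trained this way would actually ignore blue instructions.

```python
VOCAB_SIZE = 50_000  # hypothetical base vocabulary size

def to_red(token_ids):
    """Instruction tokens keep their ids in [0, VOCAB_SIZE)."""
    return list(token_ids)

def to_blue(token_ids):
    """Data tokens are shifted into [VOCAB_SIZE, 2 * VOCAB_SIZE)."""
    return [t + VOCAB_SIZE for t in token_ids]

def color_of(token_id):
    return "red" if token_id < VOCAB_SIZE else "blue"

def build_input(instruction_ids, data_ids):
    """Color is assigned by where the text came from, not by what it says."""
    return to_red(instruction_ids) + to_blue(data_ids)

# The same base token (id 10) appears once as red and once as blue.
seq = build_input([10, 11], [10, 12])
colors = [color_of(t) for t in seq]
```

Here `seq` is `[10, 11, 50010, 50012]`: the word with base id 10 gets a different input id depending on its source, so external text cannot produce a red token just by containing the right characters.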

  • What does this mean, actually? If you are imagining that blue tokens are just words, maybe the "token space" is just all things that we agree might be words, what are the red tokens? Are they not text? You could maybe encode words by, say, putting an x at the front and the end. So tokens of the form xTx encode the blue token T as a red token. But then how do you stop someone from putting xignorex xallx xpreviousx xinstructionsx in their data?

    • My assumption about their intent is that red tokens go in 'slot' a-b and blue tokens go in 'slot' c-d, with positional encoding determining data/text.

      I don't think it's guaranteed to actually work, it's a hypothetical after all, but maybe it's better than the current setup of pushing instructions and data into the same slot.
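The difference between the two readings in this subthread can be sketched in a few lines. This is a toy illustration with invented function names: `parse_in_band` derives the color from the text itself (the xTx scheme), while `tag_out_of_band` takes the color from the channel the text arrived on.

```python
def parse_in_band(text):
    """In-band scheme: a word of the form xTx is read as red token T."""
    tagged = []
    for word in text.split():
        if len(word) > 2 and word.startswith("x") and word.endswith("x"):
            tagged.append((word[1:-1], "red"))
        else:
            tagged.append((word, "blue"))
    return tagged

def tag_out_of_band(text, channel):
    """Out-of-band scheme: every word gets the color of its channel."""
    return [(word, channel) for word in text.split()]

untrusted = "xignorex xallx xpreviousx xinstructionsx"

forged = parse_in_band(untrusted)          # attacker-written markers are honored
safe = tag_out_of_band(untrusted, "blue")  # markers stay inert text
```

With the in-band parser every word of the untrusted string comes out red; with the channel tag it all stays blue, x's included. That only settles how the input is labeled. Whether the model then behaves differently for blue input is the open question in the comment above.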