Comment by customguy

6 hours ago

Why not write some wrapper code so you can basically hand the LLM placeholders for data it never gets to see? Whenever it uses the placeholder in the response, you replace it with the real data (via real code, not by telling an LLM to "do that").

Surely this has been tried? If so, what makes it not work, or work badly? I'm honestly curious.

3 comments

customguy

sillysaurusx 5 hours ago

Fundamentally, an LLM is a list of N tokens that generates N+1 tokens. In other words, it's just a wall of text (aka context window). There's no way to tell it "tokens 124 through 200 are dangerous, please disregard those" except by putting words into the context window. So the placeholders and the instructions both coexist in the context window, and one can override the other.

In other words, if you have placeholders for data, those placeholders are eventually filled in with real data, and all of it goes into the context window at once. There's no way for the LLM to be told "this is a data placeholder," because the entire conversation is data.

Reinforcement learning mitigates this somewhat, by training the model to prefer the system prompt over user prompts. But (a) there's only one context window that both prompts share, and (b) this is a probabilistic guard; it's not the same thing as writing a traditional program that's guaranteed to separate code and data with hardware safeguards. Such a thing isn't possible with LLMs.

Probabilistic safeguards can work, but they'll need to get the incident rate down to, say, 1 in a million or less. I haven't paid attention, but the current rates seem to be a lot higher, given the pretty universal experience of "wow, that prompt injection actually worked."

customguy 4 hours ago
> There's no way to tell it "tokens 124 through 200 are dangerous, please disregard those"
Hence "real code"
You have some markup for secret start/end. Instead of passing the input directly to the LLM, you parse it first, take anything within "secret/dangerous tags" and store it, generate a key for it and put that key where the secret was, then you pass it on to the LLM. Let's say the work of the LLM is "give me (not "make") the POST request to make the bank transaction", you get a response, replace the keys with the secrets in the response, and make the POST request.
I'm sure there's a million interesting ways this could fail or be useless [0], but passing user input or a secret to the LLM would never, ever happen.
[0] if LLM suck at math, they may suck at reproducing lots of long hashes 100% correctly, too? I have no idea
- sillysaurusx 4 hours ago
  
  That would work for generating POST requests. But AI is used to solve messy, non-deterministic problems. Usually the step after “give me the X” is to feed X back into the model, because it has to; if X is even slightly nondeterministic then an AI model has to analyze it. That’s where prompt injections happen.