← Back to context

Comment by NitpickLawyer

8 hours ago

That's precisely why I am using a different analogy when talking about this. The SQL injection analogy only matches the injection part, not the rest. There is nothing to secure, because there is no SQL query. You want the agent to work on data, in a "general" way, otherwise you'd just use a script.

The better analogy is phishing. Because that's what's happening here. The "prompt injection" attack is trying to "phish" the LLM into doing something unintended. That's how we should all comunicate it, as it matches better with what's happening. Unfortunately there aren't really good defences for it, as we all know from phishing "education" / "campaigns". Your best bet is to secure it in layers, try to have warnings (i.e. classification models) you try to secure the next step (i.e. capabilities based tool execution) and so on. But it's not foolproof and it should be communicated clearly.

Why not write some wrapper code so you can basically hand the LLM placeholders for data it never gets to see? Whenever it uses the placeholder in the response, you replace it with the real data (via real code, not by telling an LLM to "do that").

Surely this has been tried? If so, what makes it not work, or work badly? I'm honestly curious.

  • Fundamentally, an LLM is a list of N tokens that generates N+1 tokens. In other words, it's just a wall of text (aka context window). There's no way to tell it "tokens 124 through 200 are dangerous, please disregard those" except by putting words into the context window. So the placeholders and the instructions both coexist in the context window, and one can override the other.

    In other words, if you have placeholders for data, those placeholders are eventually filled in with real data, and all of it goes into the context window at once. There's no way for the LLM to be told "this is a data placeholder," because the entire conversation is data.

    Reinforcement learning mitigates this somewhat, by training the model to prefer the system prompt over user prompts. But (a) there's only one context window that both prompts share, and (b) this is a probabilistic guard; it's not the same thing as writing a traditional program that's guaranteed to separate code and data with hardware safeguards. Such a thing isn't possible with LLMs.

    Probabilistic safeguards can work, but they'll need to get the incident rate down to, say, 1 in a million or less. I haven't paid attention, but the current rates seem to be a lot higher, given the pretty universal experience of "wow, that prompt injection actually worked."

    • > There's no way to tell it "tokens 124 through 200 are dangerous, please disregard those"

      Hence "real code"

      You have some markup for secret start/end. Instead of passing the input directly to the LLM, you parse it first, take anything within "secret/dangerous tags" and store it, generate a key for it and put that key where the secret was, then you pass it on to the LLM. Let's say the work of the LLM is "give me (not "make") the POST request to make the bank transaction", you get a response, replace the keys with the secrets in the response, and make the POST request.

      I'm sure there's a million interesting ways this could fail or be useless [0], but passing user input or a secret to the LLM would never, ever happen.

      [0] if LLM suck at math, they may suck at reproducing lots of long hashes 100% correctly, too? I have no idea

      1 reply →

> There is nothing to secure, because there is no SQL query.

Yet.

  • I thought the whole value proposition of this thing was supposed to be that the interface is "natural" human language. If interact with it using a structured and specified language... then what are we doing exactly? Is this AI? Maybe we just re-invented GraphQL or something?