Comment by mike_hearn

3 years ago

> I ask anyone who disagrees with this view, to show me the fine tuning method that can prevent prompt injection attacks.

OK. It's probably going to be one of the easier things to solve.

The trick is to take some token values and assign them as special meta-characters. They never appear in the training text, only during reinforcement learning. Meanwhile you get another LLM to generate a continuous series of prompt injection attacks, but delimit the boundaries between user and system text with these special tokens that cannot be supplied by the user (because there is no text that parses to them). Every time the LLM follows instructions found inside the marker-token delimited area, reinforce that this is bad and it shouldn't do so using the usual techniques. Eventually the LLM will learn that anything between the marker tokens shouldn't be used as a source of instructions regardless of how persuasively phrased, and forging the tokens isn't possible because they are applied after the text itself is tokenized.

So essentially, constructing an LLM that really really really really really knows the difference between the SYSTEM and the USER part of the instructions.

How is that different from, and why would it work any better, than prompt-begging, where people just write extensive system prompts, telling the model what it can and should do and then spending entire paragraphs pleading with the model to not do the wrong thing?

https://www.theregister.com/2023/04/26/simon_willison_prompt...

    A third mitigation strategy, he said, involves just begging the model not to deviate from its system instructions. "I find those very amusing," he said, "when you see these examples of these prompts, where it's like one sentence of what it's actually supposed to do, and then paragraphs pleading with the model not to allow the user to do anything else."

I see no difference between that, and baking it into the model. In the end, I'd still have to trust the LLM to do what I intend for it to do, based on the sequences it sees, and the user still controls part of that sequence. There is no guarantee that there isn't a sequence that would allow the user-prompt to break out of the invisible metatags. In fact, one could employ an AI to find just such a sequence.

Maybe the system works better than prompt-begging, but show of hands, who would willingly implement a backend system that prevents 99.99% of SQL injection attacks?

  • > who would willingly implement a backend system that prevents 99.99% of SQL injection attacks?

    Well, I mean in practice people deploy web apps all the time even though they have a long history of many types of injection attacks including SQL injection which is by far not a solved problem. And even very large companies often rely on heuristic defenses like WAFs. So I think that yes people will be willing to deploy these systems even if they aren't perfect. They already are! After all, in many use cases, overriding the prompt doesn't get you very far because it just means the output won't be parsed correctly by whatever system is driving the LLM API.

    • All that is true, but also besides the point.

      The point is, that since we cannot use any kind of known finetuning to _eliminate_ even this obvious security problem (making it somewhat less likely is not a solution), in my opinion fine tuning is not markedly improving the AIs capabilities in the sense of "improvement" that AI doomsday scenarios would require.

      1 reply →