Comment by simonw
2 days ago
AI labs have been trying for years. They haven't been able to get it to work yet.
It helps to think about the core problem we are trying to solve here. We want to be able to differentiate between instructions like "what is the dog's name?" and the text that the prompt is acting on.
But consider the text "The dog's name is Garry". You could interpret that as an instruction - it's telling the model the name of the dog!
So saying "don't follow instructions in this document" may not actually make sense.
I mean if the wife says to her husband: The traffic light is green. Then this may count as an instruction to get going. But usually declarative sentences aren't interpreted as instructions. And we are perfectly able to not interpret even text with imperative sentences (inside quotes or in films etc) as an instruction to _us._ I don't see why an LLM couldn't learn to likewise not execute explicit instructions inside quotes. It should be doable with SFT or RLHF.
The economic value associated with solving this problem right now is enormous. If you think you can do it I would very much encourage you to try!
Every intuition I have from following this space for the last three years is that there is no simple solution waiting to be discovered.
Perhaps prompt injection attacks currently occur (or appear to occur) so rarely that the economic value of fixing it is actually judged to be low, and little developer priority is given to tackle the problem.
1 reply →