Comment by cubefox
2 days ago
It seems they could easily fine-tune their models to not execute prompts in images. Or more generally any prompts in quotes, if they are wrapped in special <|quote|> tokens.
2 days ago
It seems they could easily fine-tune their models to not execute prompts in images. Or more generally any prompts in quotes, if they are wrapped in special <|quote|> tokens.
No amount of fine-tuning can prevent models from doing anything. All it can do is reduce the likelihood of exploits happening, while also increasing the surprise factor when they inevitably do. This is a fundamental limitation.
This sounds like "no amount of bug fixing can guarantee secure software, this is a fundamental limitation".
AI can't distinguish between user prompts and malicious data, until that fundamental issue is fixed no amount of mysql_real_secure_prompt will get you anywhere, we had that exact issue with sql injection attacks ages ago.
They're different. Most programs can in principle be proven "correct" -- that is, given some spec describing how it's allowed to behave, it can either be proven that the program will conform to the spec every time it is run, or a counterexample can be produced.
(In practice, it's extremely difficult both (a) to write a usefully precise and correct spec for a useful-size program, and (b) to check that the program conforms to it. But small, partial specs like "The program always terminates instead of running forever" can often be checked nowadays on many realistic-size programs.)
I don't know any way to make a similar guarantee regarding what comes out of an LLM as a function of its input (other than in trivial ways, by restricting its sample space -- e.g., you can make an LLM always use words of 4 letters or less simply by filtering out all the other words). That doesn't mean nobody knows -- but anybody who does know could make a trillion dollars quite quickly, but only if they ship before someone else figures it out, so if someone does know then we'd probably be looking at it already.
AI labs have been trying for years. They haven't been able to get it to work yet.
It helps to think about the core problem we are trying to solve here. We want to be able to differentiate between instructions like "what is the dog's name?" and the text that the prompt is acting on.
But consider the text "The dog's name is Garry". You could interpret that as an instruction - it's telling the model the name of the dog!
So saying "don't follow instructions in this document" may not actually make sense.
I mean if the wife says to her husband: The traffic light is green. Then this may count as an instruction to get going. But usually declarative sentences aren't interpreted as instructions. And we are perfectly able to not interpret even text with imperative sentences (inside quotes or in films etc) as an instruction to _us._ I don't see why an LLM couldn't learn to likewise not execute explicit instructions inside quotes. It should be doable with SFT or RLHF.
The economic value associated with solving this problem right now is enormous. If you think you can do it I would very much encourage you to try!
Every intuition I have from following this space for the last three years is that there is no simple solution waiting to be discovered.
2 replies →
It may seem that way, but there's no way that they haven't tried it. It's a pretty straightforward idea. Being unable to escape untrusted input is the security problem with LLMs. The question is what problems did they run into when they tried it?
Just because "they" tried that and it didn't work, doesn't mean doing something of that nature will never work.
Plenty of things we now take for granted did not work in their original iterations. The reason they work today is because there were scientists and engineers who were willing to persevere in finding a solution despite them apparently not working.
But that's not how LLMs work. You can't actually segregate data and prompts.
The fact that instruction tuning works at all is a small miracle, getting a rigorous idea of trusted vs untrusted input is not at all an easy task.
It should work like normal instruction tuning, except the SFT examples contain additional instructions in <|quote|> tokens which are ignored in the sample response. So more complex than ordinary SFT but not that much more.
There are LLM finetunes which do this, it is very far from watertight.
1 reply →