Comment by helltone
2 days ago
No amount of fine-tuning can reliably prevent a model from doing any particular thing. All it can do is reduce the likelihood of exploits, while also increasing the surprise factor when they inevitably happen. This is a fundamental limitation.
This sounds like "no amount of bug fixing can guarantee secure software, this is a fundamental limitation".
AI can't distinguish between user prompts and malicious data. Until that fundamental issue is fixed, no amount of mysql_real_secure_prompt will get you anywhere; we had that exact issue with SQL injection attacks ages ago.
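For anyone who didn't live through that era: the eventual fix wasn't better escaping, it was keeping untrusted data in a separate channel from the query itself. Rough sketch of the before/after (Python's sqlite3; the table and inputs are invented for illustration):

    # Ad hoc escaping vs. parameterized queries: the same "instructions
    # mixed with attacker data" problem the parent comment is pointing at.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT, is_admin INTEGER)")
    conn.execute("INSERT INTO users VALUES ('alice', 0)")
    conn.execute("INSERT INTO users VALUES ('bob', 1)")

    user_input = "alice' OR '1'='1"

    # Vulnerable: the attacker's input is spliced into the query string,
    # so the database can't tell instructions from data.
    unsafe = f"SELECT * FROM users WHERE name = '{user_input}'"
    print(conn.execute(unsafe).fetchall())               # returns every row

    # Fixed: a parameterized query keeps the input out of the code channel.
    safe = "SELECT * FROM users WHERE name = ?"
    print(conn.execute(safe, (user_input,)).fetchall())   # returns nothing

Nobody has shipped the equivalent of that second query for LLM prompts yet; everything so far is a smarter escape function.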
They're different. Most programs can in principle be proven "correct" -- that is, given some spec describing how the program is allowed to behave, either it can be proven that the program will conform to the spec every time it is run, or a counterexample can be produced.
(In practice, it's extremely difficult both (a) to write a usefully precise and correct spec for a useful-size program, and (b) to check that the program conforms to it. But small, partial specs like "The program always terminates instead of running forever" can often be checked nowadays on many realistic-size programs.)
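To make that "spec plus checker" contract concrete, here's a toy sketch (plain Python, everything invented for illustration): the spec says the output must be the sorted version of the input, and the checker exhaustively searches small inputs for a counterexample. Real verifiers reason symbolically over all possible inputs rather than enumerating them, but the shape of the answer is the same: conformance, or a concrete input that violates the spec.

    # Toy spec checking: either every searched input meets the spec,
    # or we get back a concrete counterexample.
    from itertools import product

    def buggy_sort(xs):
        # deliberately wrong: drops duplicates
        return sorted(set(xs))

    def meets_spec(f, xs):
        # spec: f(xs) is the sorted version of xs
        return f(list(xs)) == sorted(xs)

    def find_counterexample(f, max_len=3, values=range(3)):
        for n in range(max_len + 1):
            for xs in product(values, repeat=n):
                if not meets_spec(f, xs):
                    return list(xs)
        return None  # no counterexample within the searched bound

    print(find_counterexample(buggy_sort))  # e.g. [0, 0]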
I don't know any way to make a similar guarantee regarding what comes out of an LLM as a function of its input (other than in trivial ways, by restricting its sample space -- e.g., you can make an LLM always use words of 4 letters or fewer simply by filtering out all the other words). That doesn't mean nobody knows -- but anybody who does know could make a trillion dollars quite quickly, provided they ship before someone else figures it out, so if someone did know we'd probably be looking at it already.
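For concreteness, the trivial filtering case I mentioned looks roughly like this (toy vocabulary and scores invented; with a real model you'd mask the logits of disallowed tokens before sampling):

    # Restricting the sample space: the property holds by construction,
    # no matter how much the model "wants" to say the disallowed word.
    import random

    vocab_scores = {"the": 2.0, "cat": 1.5, "sat": 1.2,
                    "quietly": 3.0, "on": 0.8, "mat": 1.0}

    def sample_word(scores, max_len=4):
        allowed = {w: s for w, s in scores.items() if len(w) <= max_len}
        words, weights = zip(*allowed.items())
        return random.choices(words, weights=weights, k=1)[0]

    print([sample_word(vocab_scores) for _ in range(5)])
    # "quietly" can never appear, regardless of its score

That kind of guarantee is easy precisely because it ignores what the model is trying to do; guarantees about meaning or behavior are the part nobody knows how to get.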