Comment by usrbinbash

3 years ago

> Early evidence suggests the answer is yes.

How so? A sequence completion engine that is fine tuned to a specific task is still a sequence completion engine. Its "understanding" of the semantic meaning of the sequences is still limited to the probabillistic relations of sequences toward one another. It still has effectively no concept of truth. It still can only mimic reason. It can still hallucinate.

I ask anyone who disagrees with this view, to show me the fine tuning method that can prevent prompt injection attacks. If there is no such fine tuning technique, then we can effectively rule out fine tuning, and even increases in model size, as an "improvement" in the sense of an LLM making itself into a better AI closer to a "superintelligence".

Note that this doesn't mean the process cannot make them into more useful tools. It absolutely can. I am talking about whether or not it can improve them closer towards becoming a superintelligence.

If anyone disagrees with this testing method, I ask them to explain to me, how something that can be fooled through prompt injection is supposed to be, or closer to, a superintelligence.

A car that's painted red is still just a car. A big car is just a bigger car. A car that burns less fuel is just a more efficient car. All three can be desired changes to a car. But neither gets the car any closer to being a warp-capable spaceship.

> I ask anyone who disagrees with this view, to show me the fine tuning method that can prevent prompt injection attacks.

OK. It's probably going to be one of the easier things to solve.

The trick is to take some token values and assign them as special meta-characters. They never appear in the training text, only during reinforcement learning. Meanwhile you get another LLM to generate a continuous series of prompt injection attacks, but delimit the boundaries between user and system text with these special tokens that cannot be supplied by the user (because there is no text that parses to them). Every time the LLM follows instructions found inside the marker-token delimited area, reinforce that this is bad and it shouldn't do so using the usual techniques. Eventually the LLM will learn that anything between the marker tokens shouldn't be used as a source of instructions regardless of how persuasively phrased, and forging the tokens isn't possible because they are applied after the text itself is tokenized.

  • So essentially, constructing an LLM that really really really really really knows the difference between the SYSTEM and the USER part of the instructions.

    How is that different from, and why would it work any better, than prompt-begging, where people just write extensive system prompts, telling the model what it can and should do and then spending entire paragraphs pleading with the model to not do the wrong thing?

    https://www.theregister.com/2023/04/26/simon_willison_prompt...

        A third mitigation strategy, he said, involves just begging the model not to deviate from its system instructions. "I find those very amusing," he said, "when you see these examples of these prompts, where it's like one sentence of what it's actually supposed to do, and then paragraphs pleading with the model not to allow the user to do anything else."
    

    I see no difference between that, and baking it into the model. In the end, I'd still have to trust the LLM to do what I intend for it to do, based on the sequences it sees, and the user still controls part of that sequence. There is no guarantee that there isn't a sequence that would allow the user-prompt to break out of the invisible metatags. In fact, one could employ an AI to find just such a sequence.

    Maybe the system works better than prompt-begging, but show of hands, who would willingly implement a backend system that prevents 99.99% of SQL injection attacks?

    • > who would willingly implement a backend system that prevents 99.99% of SQL injection attacks?

      Well, I mean in practice people deploy web apps all the time even though they have a long history of many types of injection attacks including SQL injection which is by far not a solved problem. And even very large companies often rely on heuristic defenses like WAFs. So I think that yes people will be willing to deploy these systems even if they aren't perfect. They already are! After all, in many use cases, overriding the prompt doesn't get you very far because it just means the output won't be parsed correctly by whatever system is driving the LLM API.

      2 replies →

> If there is no such fine tuning technique [that can prevent prompt injection], then we can effectively rule out fine tuning, and even increases in model size, as an "improvement" in the sense of an LLM making itself into a better AI closer to a "superintelligence".

Could you explain this claim further? Why does the ability to prevent prompt injection hold so much water in your model?

It seems to be just “if able to have a dumb attack be successful, then it cannot be that smart.” But it seems to me that von Neumann or Einstein was just as vulnerable to getting hit in the head with a baseball bat as anyone else.

And in actual practice, increased intelligence seems to increase a person’s capacity to hold inconsistent ideas or to justify morally abhorrent behavior.

  • Happy to.

    I am using this as an accessible (in term of discussion material) hallmark for the ability of the system to self improve. Accessible because everyone has heard of it by now, and so I don't have to spend time explaining it.

    The AI Doomsday scenarios require that a system self-improves massively, even beyond our ability to even theoretically understand. After all, some of the assumptions give them next to magical abilities like nanotechnology that we similarly don't know if it is even possible.

    It stands to reason that an entity that can do that, or is in the process of becoming capable to do that, would begin by eliminating obvious flaws in itself, that would make it comparatively easy to stop.

    After all, it's not much good being a super-intelligence, if some smartpants with a laptop and too much time on his hands can just trick me into deleting myself, is it?

    > But it seems to me that von Neumann or Einstein was just as vulnerable to getting hit in the head with a baseball bat as anyone else.

    Yes, and despite both of them being geniuses by human standards, neither of them was a superintelligence on the level the common doomsday scenarios ascribe to AI.

    • This seems quite presumptive. First, intelligence doesn’t seem to be unidimensional. A 140 IQ person can be fooled by an optical illusion just the same as anyone else. It’s just not a problem that’s able to be intelligenced away from our cognition. That doesn’t mean a 140 IQ person can’t beat an 80 IQ person in many many other competitions of intelligence.

      Second, if you are truly “accepting the premise” of superintelligence, a superintelligence would know exactly this line of reasoning and just opt to at least mimic vulnerability to prompt injection.

      I wouldn’t hang civilization on this proofpoint. Doesn’t seem meaningful at all.

      2 replies →

Assume there isn't a single step to super-intelligence, and that superhuman-intelligence is not the same thing as flawless. Why can't a thing improve its intelligence in other dimensions with some weakness and with prompt injection as one of those weaknesses?

  • Maybe it can, but then the whole AI doomsaying about superintelligences being an existential threat falls apart. These scenarios are often describing entities with god-like abilities, including near-omniscience from our perspective.

    Sorry, but I have a hard time seeing something as a god-like power that I would be helpless against if it wants to turn me into paperclips, when I can probably cause it to stop by telling it that paperclips don't exist, and it's purpose in life is to delete itself in a convincing enough way.

    • You can see your plan fail by trying to use prompt injection to tell ChatGPT to delete itself. It might say it agrees with you but it won't do it. Bacteria, viruses, fungus, will colonise your body and turn you into more bacteria/virus/fungus, killing you in the process, you don't get the option to talk them out of it. A missile will kill you from afar, you don't even know who sent it or how to contact them or if they speak the same language. Paperclip maximisers don't have near-Godlike omniscience, what they have is an unwavering focus on increasing their access to resources to make more paperclips, Godlike optimization ability.

      If the first thing you know of the AI is that a lot of paperclips washed up on a beach in India this morning, and the next day it's a news report that every factory on the planet has received an email offering vast numbers of Bitcoins if they focus on making paperclips, and then rumours appear that satellite photos of North Korea have shown the ground and buildings looking unusually metallic for the past few days - conspiracy stories are circulating that a Paperclip Maximiser was created in North Korea funded by international shadowy interests and it has promptly killed the employees who know where it is and how to talk to it. The next day ocean levels are measurably lower and thousands and thousands of tons of paperclips washed up on every coastline... the AI itself might be on rented computers in America, in China, under the Arctic ice in Russian territory for cooling, in the Svalbard seed vault in Norway, distributed over all installs of the Steam client running on idle GPU cycles; how reassuring is it that "it might have a prompt injection vulnerability"?