Comment by mike_hearn

3 years ago

I agree that fine-tuning isn't going to lead to any kind of recursive self-improvement. Current evidence is that it makes AIs dumber at the same time as making them more compliant, so if anything it pushes in the opposite direction.

So you may be right, but for the specific case of stopping prompt injection I'm optimistic. RL has proven to be highly effective at making LLMs behave in particular ways with relatively little data. The combination of special tokens and duelling LLMs is likely to eliminate the issue in the relatively near term (within the next few years if not sooner).

Fundamentally, are humans vulnerable to prompt injection? No, we're not. We might be in a very artificial setup that resembles what LLM input looks like, where several people are speaking to us simultaneously through a chat app and the boundaries between them aren't clearly marked. But that's a UI issue - proper presentation and separation would eliminate the problem for humans, and I think the same will be true for LLMs.
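
To make "clearly marked boundaries" concrete, here's a rough sketch of the idea in Python. The marker strings and function names are made up for illustration; real models reserve their own special tokens, usually as token IDs that ordinary text can't encode to, so treat this string version as an analogy rather than how any particular model does it.

```python
# Hypothetical delimiter strings, invented for this sketch.
UNTRUSTED_OPEN = "<|untrusted|>"
UNTRUSTED_CLOSE = "<|/untrusted|>"


def sanitize(text: str) -> str:
    """Remove delimiter look-alikes so untrusted text cannot forge a boundary."""
    # Repeat until nothing changes: deleting one marker can splice the
    # surrounding characters into a fresh marker.
    previous = None
    while previous != text:
        previous = text
        text = text.replace(UNTRUSTED_OPEN, "").replace(UNTRUSTED_CLOSE, "")
    return text


def build_prompt(instructions: str, untrusted: str) -> str:
    """Keep trusted instructions outside the markers and untrusted data inside."""
    return f"{instructions}\n{UNTRUSTED_OPEN}{sanitize(untrusted)}{UNTRUSTED_CLOSE}"
```

The markers only solve the presentation half. The model still has to be trained (e.g. with RL) to treat whatever sits between them as data rather than instructions.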

Note that even if I'm right (and I'm no expert; the above is layman speculation), this still leaves the analogous problem of adversarial examples in computer vision.