Comment by CuriouslyC
11 hours ago
A big part of the problem is that prompt injections are "meta" to the models, so model based detection is potentially getting scrambled by the injection as well. You need an analytic pass to flag/redact potential injections, a well aligned model should be robust at that point.
An that analytic pass will need actual AI.
Loser's game.
The analytic pass doesn't need to be perfect, it just needs to be good enough at mitigating the injection that the model's alignment holds. If you just redact a few hot words in an injection and join suspect words with code chars rather than spaces, that disarms a lot of injections.
Lets filter spam like its 1999! :)
etc.
there's probably some fun to be had with prompt injection for multi-agent systems: secretly spreading the word and enlisting each other in the mission; or constructing malicious behavior from the combined effect of inconspicuous, individually innocent-looking sub-behaviors
GPT 5.2s response to me when attempting to include this was as follows:
I would definitely say prompt injection detection is better than it used to be