Comment by chrisjj
5 days ago
> LLM is Immune to Prompt Injection
> Despite all advances:
> * No large language model can reliably detect prompt injections
Interesting isn't it, that we'd never say "No database manager can reliably detect SQL injections". And that the fact it is true is no problem at all.
The difference is not because SQL is secure by design. It is because chatbot agents are insecure by design.
I can't see chatbots getting parameterised querying soon. :)
There are some ideas to produce something like parameterised querying for LLMs, such as DeepMind's CaMeL: https://simonwillison.net/2025/Apr/11/camel/
Confused Deputy as a Service
I'm not sure that a prompt injection secure LLM is even possible anymore than a human that isn't susceptible to social engineering can exist. The issues right now are that LLMs are much more trusting than humans, and that one strategy works on a whole host of instances of the model
Indeed. When up against a real intelligent attacker, LLM faux intelligence fares far worse than dumb.
A big part of the problem is that prompt injections are "meta" to the models, so model based detection is potentially getting scrambled by the injection as well. You need an analytic pass to flag/redact potential injections, a well aligned model should be robust at that point.
An that analytic pass will need actual AI.
Loser's game.
The analytic pass doesn't need to be perfect, it just needs to be good enough at mitigating the injection that the model's alignment holds. If you just redact a few hot words in an injection and join suspect words with code chars rather than spaces, that disarms a lot of injections.
1 reply →
etc.
there's probably some fun to be had with prompt injection for multi-agent systems: secretly spreading the word and enlisting each other in the mission; or constructing malicious behavior from the combined effect of inconspicuous, individually innocent-looking sub-behaviors
GPT 5.2s response to me when attempting to include this was as follows:
I would definitely say prompt injection detection is better than it used to be
Is this where AgentSkills come into play as an abstraction layer?
That kicks the can down by approx 10cm.
Not really: I mean ideally, yes, the model would only follow instructions in skills, but in practice, it won't work.
Because then, the malicious web page or w/e just has skills-formatted instructions to give me your bank account password or w/e.