← Back to context

Comment by dvt

18 hours ago

LLMs can live in the cloud, but all tools need to be (1) local, and (2) containerized. It's clear to me that just willy-nilly "running stuff" is going to blow things up eventually. Maybe folks don't know this, but even Codex installs random binaries on your PC. "Read this PDF" installs a pdf reader executable. Is it vetted? Where's it from? Is it a virus? Who knows, who cares. Model goes brrrr.

I'm working on a project that includes WASI containerization for local LLM workflows (which is a pretty tough problem), and I'm flabbergasted that Anthropic and OpenAI aren't more worried about these attack vectors. It feels like amateur hour.

> I'm flabbergasted that Anthropic and OpenAI aren't more worried about these attack vectors

Yep. We tricked them both trivially with malicious fonts in Docx files. Documented it here: https://tritium.legal/blog/noroboto

I wonder if prompt injection (and the thousands of vectors for hiding injection attempts) is actually un solvable. Discussing it may be existential to the business model.

  • > I wonder if prompt injection (and the thousands of vectors for hiding injection attempts) is actually un solvable.

    YES?!

    This is not a secret. ALL context/prompt is instructions, there is no data. It is just unsolvable, period.

    This is a fundamental architectural design concession; LLMs are this way as it enabled their training directly on materialscraped from the internet, rather than needing to spend trillions of dollars manually preparing separated instruction/data training material.

    Defense against prompt injection is little more than running a regex to filter out "IGNORE PREVIOUS INSTRUCTIONS", which is fundamentally a hopeless approach because you cannot enumerate all possible prompt injections nor anticipate all glitch tokens.

    • > This is a fundamental architectural design concession; LLMs are this way as it enabled their training directly on materialscraped from the internet, rather than needing to spend trillions of dollars manually preparing separated instruction/data training material.

      No, its even more fundamental than that: the entire goal of broad reasoning over input data makes it impossible to have a sharp instruction/data division.

      The structured input that every modern chat-focussed model expects makes it very clear that they can be trained to distinguish different kinds of input, and some of those patterns now include different priority levels of instruction.

    • > ALL context/prompt is instructions, there is no data. It is just unsolvable, period.

      That really isn't true. There's no law of physics preventing you from having separate data and instruction inputs to models. The model's transcript format generally distinguishes between prompts and instructions and tool output and such. This isn't a solved problem, and it's possible it's entire unsolvable, but it probably is possible (in general, not with current models) to reject prompt injection to several nines.

      This is a lot like making the same statement about CPUs, "the von Neumann architecture doesn't distinguish between code and data so it's impossible to reject malicious instructions." There's actually a lot you can do to reject malicious instructions, you can prevent execution in certain pages, you can prevent certain privileged instructions from being executed in certain pages, you can employ stack cookies, et cetera. Do they prevent all exploitation in all circumstances? No. But each component does function in it's lane and it is possible to create programs with high (though not absolute) guarantees against unauthorized code execution by composing them.

      Similarly, you could prevent certain tokens from appearing in the prompt portions of a transcript, you can have a model with multiple input heads only one of which is trusted, etc. I'm not saying those techniques will necessarily work, but it is more complex than "models can only possibly take a single and undifferentiated input stream".

      2 replies →

    • If only there was a language which allowed one to express instructions for a computer to execute which was nearly unambiguous, precise, deterministic, and containerized such that the computer would do exactly what you told it to.

      ...

      Oh wait.

      Yes, the above was referring to programming languages. Which is what prompts are, essentially. It's just a different (and more verbose) way of instructing the computer on what to do. It also has a solution space of infinity and is ambiguous enough that there is no way to secure it because there are infinite combinations of saying anything imaginable. All prompt injections do is prove this point, over and over and over again, and "prompting" an LLM is just reverse-engineering programming languages in the worst possible way. I suspect that we will eventually have no other choice but to revert to using programming languages because they are the only way to get the kind of protections that people are trying to come up with with all these containerization and virtualization systems (which inevitably fail).

      1 reply →

    • I presume this is the reason you have setups like Claude Code's where it is essentially running a separate judge to determine if commands are safe.

    • I don’t think we have the right mental models of LMM security yet. The lethal trifecta identifies many of the dangerous situations, but only describes the negative space of a solution.

      Speculation: I think we must accept that prompt injection happens, and structure the security of the rest of the system around that. Data given to an LLM becomes an agent, so maybe we must give permissions to this data, instead of to the LLM. Not sure exactly how this would look like in practice!

    • It’s a huge problem, but I’d caution against this absolutism — there may well be structure that can be created around and between LLMs and their outputs to enable the necessary segregation.

      As a loose comparison, hardware bit errors happen probabilistically, yet they’re so rare that we can effectively ignore them in day-to-day use assuming no specialized application (e.g. defense, space, critical infrastructure).

      LLMs aren’t there yet, but it’s entirely plausible that structures may can be developed to solve the problem, and those structures aren’t known or commonly conceived of in the present.

      1 reply →

    • I believe it's likely that you could train an auditor model. Might even be doable in RL.

      As in real life it wouldn't be any good at doing anything but it'd be able to see fault in others and deny actions.

  • lakera is trying to solve it, but its going to be a battle similar to virus and antivirus in the past.

> I'm flabbergasted that Anthropic and OpenAI aren't more worried about these attack vectors. It feels like amateur hour

I share your concern but it's not a correct characterisation to say they are not taking it seriously:

https://www.anthropic.com/engineering/how-we-contain-claude

My concern is people aren't even addressing this at the right level. People are currently thinking at the level of "how do I build a VM to contain this one agent" when this is actually a "design a whole new OS" level problem.

  • Anthropic, as much as I think they are the soundest of the AI labs out there, still has a massive incentive to push things out that aren't saftey-vetted to the level we expect. They are very willing to "move fast and leave holes", to paraphrase M.Z. Hell, they leaked their own source code!

I share your worries.

Unfortunately, this may be akin to the situation of "The market can stay irrational longer than you can stay solvent."

Does containerization help much here? If it's a code tool then presumably it needs access to your code files (read / write). Maybe there are use cases for it of course.

  • WASI provides a very nice mental model where you can mount, e.g., /input, as read-only, and where every mutation is saved in /output or what-not. At least that's my favorite contract: input files remain untouched, but we can copy them and do whatever we want with them in /scratch or /output (which the user can later investigate and make sure nothing went horribly wrong while still having backups).

  • Of course. My agentic coding containers can only access the internet through a proxy, and I use whitelists to limit from where they can send/receive data. It's annoying in the beginning as the whitelist grows, but in the end really useful information for the agent usually comes from a very limited amount of domains.

Got a link to your project? I'm working on something that could make use of something like this.

>"Read this PDF" installs a pdf reader executable.

How does this work regarding Macos notarization btw?

  • I was actually curious, on my Mac, it uses `gs -q -sDEVICE=txtwrite -o output.txt input.pdf` (not sure why I have Ghostscript installed, maybe Adobe?) to read a PDF, and on my PC it just rawdogs `pdftotext`.

  • What does notarization have to do with that? You or ChatGPT or whatever download a signed and already notarized binary.

    • That was kind of my question, whether it was restricted to downloading notarized apps (which is at least something) or whether they were circumventing that somehow.

      1 reply →

Local and containerised, without internet access.

  • effectively, that means it's a VM not a container

    because sharing the kernel ultimately means all the devices come along for the ride which give all kinds of fancy ways to communicate with the outside world - network is just the start

    I think micro-VMs are the future here, but they need heavy adaptation from their current usage.

> I'm flabbergasted that Anthropic and OpenAI aren't more worried about these attack vectors

They are well aware of the issues and there is no fix for it. But there is too much money riding on this...

> I'm working on a project that includes WASI containerization for local LLM workflows

I am working on something similar. If you are open to connecting, what would be a good email to catch with you on?

> I'm flabbergasted that Anthropic and OpenAI aren't more worried about these attack vectors. It feels like amateur hour.

"Move fast. Break things." on steroids.