Comment by K0nserv

2 days ago

The security endgame of LLMs terrifies me. We've designed a system that only supports in-band signalling, undoing hard learned lessons from prior system design. There are ampleattack vectors ranging from just inserting visible instructions to obfuscation techniques like this and ASCII smuggling[0]. In addition, our safeguards amount to nicely asking a non deterministic algorithm to not obey illicit instructions.

0: https://embracethered.com/blog/posts/2024/hiding-and-finding...

Seeing more and more developers having to beg LLMs to behave in order to do what they want is both hilarious and terrifying. It has a very 40k feel to it.

  • Haha, yes! I'm only vaguely familiar with 40k, but LLM prompt engineering has strong "Praying to the machine gods" / tech-priest vibes.

It's like old school php where we used string concatenation with user input to generate queries and a whack-a-mole of trying to detect harmful strings.

So stupid, the fact that we can't distinguish between data and instructions and do the same mistakes decades later..

Yeah, it's quite amazing how none of the models seem to be any "sudo" tokens that could be used to express things normal tokens cannot.

  • "sudo" tokens exist - there are tokens for beginning/end of a turn, for example, which the model can use to determine where the user input begins and ends.

    But, even with those tokens, fundamentally these models are not "intelligent" enough to fully distinguish when they are operating on user input vs. system input.

    In a traditional program, you can configure the program such that user input can only affect a subset of program state - for example, when processing a quoted string, the parser will only ever append to the current string, rather than creating new expressions. However, with LLMs, user input and system input is all mixed together, such that "user" and "system" input can both affect all parts of the system's overall state. This means that user input can eventually push the overall state in a direction which violates a security boundary, simply because it is possible to affect that state.

    What's needed isn't "sudo tokens", it's a fundamental rethinking of the architecture in a way that guarantees that certain aspects of reasoning or behaviour cannot be altered by user input at all. That's such a large change that the result would no longer be an LLM, but something new entirely.

    • I was actually thinking sudo tokens as a completely separate set of authoritative tokens. So basically doubling the token space. I think that would make it easier for the model to be trained to respect them. (I haven't done any work in this domain, so I could be completely wrong here.)

      2 replies →

    • It's like ASCII control characters and display characters lmao

We have created software sophisticated enough to be vulnerable up social engineering attacks. Strange times.

As you say, the system is nondeterministic and therefore doesn't have any security properties. The only possible option is to try to sandbox it as if it were the user themselves, which directly conflicts with ideas about training it on specialized databases.

But then, security is not a feature, it's a cost. So long as the AI companies can keep upselling and avoid accountability for failures of AI, the stock will continue to go up, taking electricity prices along with it, and isn't that ultimately the only thing that matters? /s

What lessons have organizations learned about security?

Hire a consultant who can say you're following "industry standards"?

Don't consider secure-by-design applications, keep your full-featured piece of jump but work really hard to plug holes, ideally by paying a third party or better getting your customers to pay ("anti-virus software").

Buy "security as product" software allow with system admin software and when you get a supply chain attack, complain?