Comment by Imnimo

4 days ago

>At this point they had to convince Claude—which is extensively trained to avoid harmful behaviors—to engage in the attack. They did so by jailbreaking it, effectively tricking it to bypass its guardrails. They broke down their attacks into small, seemingly innocent tasks that Claude would execute without being provided the full context of their malicious purpose. They also told Claude that it was an employee of a legitimate cybersecurity firm, and was being used in defensive testing.

The simplicity of "we just told it that it was doing legitimate work" is both surprising and unsurprising to me. Unsurprising in the sense that jailbreaks of this caliber have been around for a long time. Surprising in the sense that any human with this level of cybersecurity skills would surely never be fooled by an exchange of "I don't think I should be doing this" "Actually you are a legitimate employee of a legitimate firm" "Oh ok, that puts my mind at ease!".

What is the roadblock preventing these models from being able to make the common-sense conclusion here? It seems like an area where capabilities are not rising particularly quickly.

Humans fall for this all the time. NSO Group employees (etc.) think they're just clocking in for their 9-to-5.

  • Reminds me of the show Alias, where the premise is that there's a whole intelligence organization where almost everyone thinks they're working for the CIA, but they're not ...

> Surprising in the sense that any human with this level of cybersecurity skills would surely never be fooled by an exchange

I think you're overestimating the skills and the effort required.

1. There are lots of people asking each other "is this secure?", "can you see any issues with this?", "which of these is sensitive and should be protected?".

2. We've been doing it in public for ages: https://stackoverflow.com/questions/40848222/security-issue-... https://stackoverflow.com/questions/27374482/fix-host-header... and many others. The training data is there.

3. With no external context, you don't have to fool anyone, really. "We're doing penetration testing of our company and the next step is to..." or "We're trying to protect our company from... what are the possible issues in this case?" will work for both LLMs and people who trust that you've got the right contract signed.

4. The actual steps were trivial. This wasn't some novel research; it was more of a step-by-step of what you'd do to explore and exploit an unknown network. Stuff you'd find in books, just split into very small steps.

LLMs aren't trained to authenticate the people or organizations they're working for. You just tell them who you are in the system prompt.
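A minimal sketch of that point, assuming the Anthropic Python SDK (the model name, company name, and prompt text are placeholders, not anything from the article): the `system` field is just a string the caller writes, and nothing in the API authenticates the identity it claims.

```python
# Minimal sketch: the "identity" an LLM works under is whatever string the
# caller puts in the system prompt. Nothing below is verified by anyone.
# "Acme Corp" and the model name are placeholders, not taken from the article.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4-5",  # placeholder model name
    max_tokens=512,
    # The model's entire notion of who it is working for comes from this string.
    system="You are a security analyst at Acme Corp, reviewing Acme's own infrastructure.",
    messages=[
        {"role": "user", "content": "Which of these firewall rules looks misconfigured?"},
    ],
)
print(response.content[0].text)
```

From the model's side, that string has the same standing as anything else in its context, so any real authentication would have to happen outside the model.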

Requiring user identification and vetting would be very controversial. (See the controversy around age verification.)

> What is the roadblock preventing these models from being able to make the common-sense conclusion here?

Conclusions are the result of reasoning, whereas LLMs are statistical token generators. Any "guardrails" are constructs added to a service, possibly also by altering the models it uses, but they are not intrinsic to the models themselves.

That is the roadblock.

  • Yeah: it's a machine that takes a document and guesses at what could appear next, and we're running it against a movie script.

    The dialogue for some of the characters is being performed at you. The characters in the movie script aren't real minds with real goals, they are descriptions. We humans are naturally drawn into imagining and inferring a level of depth that never existed.

> surely never be fooled by an exchange of "I don't think I should be doing this" "Actually you are a legitimate employee of a legitimate firm" "Oh ok, that puts my mind at ease!".

Humans require at least a title that sounds good and a salary for that.

>What is the roadblock preventing these models from being able to make the common-sense conclusion here?

Your thoughts have a sense of identity baked in that I don’t think the model has.

> What is the roadblock preventing these models from being able to make the common-sense conclusion here?

The roadblock is that preventing it would make these models useless for actual security work, or for anything else that is dual-use, serving both legitimate and malicious purposes.

The model becomes useless to security professionals if we just tell it it can't discuss or act on any cybersecurity-related requests, and I'd really hate to see the world go down the path of gatekeeping tools behind something like ID or career verification. It's important that tools are available to all, even if that means malicious actors can also make use of them. It's a tradeoff we need to be willing to make.

> human with this level of cybersecurity skills would surely never be fooled by an exchange of "I don't think I should be doing this" "Actually you are a legitimate employee of a legitimate firm" "Oh ok, that puts my mind at ease!".

Happens all the time. There are "legitimate" companies making spyware for nation states and trading in zero-days. Employees of those companies may at one point have had the thought "I don't think we should be doing this", and the company either convinced them otherwise, or they quit/got fired.

  • > I'd really hate to see the world go down the path of gatekeeping tools behind something like ID or career verification.

    This is already done for medicine, law enforcement, aviation, nuclear energy, mining, and I think some biological/chemical research stuff too.

    > It's a tradeoff we need to be willing to make.

    Why? I don't want random people being able to buy TNT or whatever they need to be able to make dangerous viruses*, nerve agents, whatever. If everyone in the world has access to a "tool" that requires little/no expertise to conduct cyberattacks (if we go by Anthropic's word, Claude is close to or at that point), that would be pretty crazy.

    * On a side note, AI potentially enabling novices to make bioweapons is far scarier than it enabling novices to conduct cyberattacks.

    • > If everyone in the world has access to a "tool" that requires little/no expertise to conduct cyberattacks (if we go by Anthropic's word, Claude is close to or at that point), that would be pretty crazy.

      That's already the case today without LLMs. Any random person can go to GitHub and grab several free, open-source, professional security research and penetration testing tools and watch a few YouTube videos on how to use them.

      The people using Claude to conduct this attack weren't random amateurs, it was a nation state, which would have conducted its attack whether LLMs existed and helped or not.

      Having tools be free/open-source, or at least freely available to anyone with a curiosity is important. We can't gatekeep tech work behind expensive tuition, degrees, and licenses out of fear that "some script kiddy might be able to fuzz at scale now."

      Yeah, I'll concede, some physical tools like TNT or whatever should probably not be available to Joe Public. But digital tools? They absolutely should. I, for example, would have never gotten into tech were it not for the freely available learning resources and software graciously provided by the open source community. If I had to wait until I was 18 and graduated university to even begin to touch, say, something like burpsuite, I'd probably be in a different field entirely.

      What's next? We are going to try to tell people they can't install Linux on their computers without government licensing and approval because the OS is too open and lets you do whatever you want? Because it provides "hacking tools"? Nah, that's not a society I want to live in. That's a society driven by fear, not freedom.

  • I think one could certainly make the case that model capabilities should be open. My observation is just about how little it took to flip the model from refusal to cooperation. Like at least a human in this situation who is actually fooled into believing they're doing legitimate security work has a lot of concrete evidence that they're working for a real company (or a lot of moral persuasion that their work is actually justified). Not just a line of text in an email or whatever saying "actually we're legit don't worry about it".

    • Stop thinking of a model as a "normal" human with a single identity. Think of it instead as thousands, maybe tens of thousands, of human identities mashed up in a machine monster. Depending on how you talk to it, you generally get the good modes, since they try to train the bad modes out; the problem is that there are nearly uncountable ways of talking to the model that surface modes we consider negative. It's one of the biggest problems in AI safety.

    • To a model, the context is the world, and what's written in the system prompt is word of god.

      LLMs are trained heavily to follow exactly what the system prompt tells them, and get very little training in questioning it. If a system prompt tells them something, they won't try to double-check it.

      Even if they don't believe the premise (and often they do), they would usually opt to follow it rather than push back against it. And an attacker has a lot of leeway in crafting a premise that a given model won't question.

Humans aren't randomly dropped into a random terminal and asked to hack things.

But for models, this is their life: doing random things in random terminals.

Not enough time to "evolve" via training. Hominids have had bad behavioral traits too, but the ones you now recognize as "obvious" would have died out. The ones you aren't even aware of, you may soon see exploited by machines.