
Comment by kingstnap

2 months ago

These articles and papers are in a fundamental sense just people publishing their role play with chatbots as research.

There is no credibility to any of it.

It’s role play until it’s not.

The authors acknowledge the difficulty of assessing whether the model believes it’s under evaluation or in a real deployment—and yes, belief is an anthropomorphising shorthand here. What else to call it, though? They’re making a good faith assessment of concordance between the model’s stated rationale for its actions, and the actions that it actually takes. Yes, in a simulation.

At some point, it will no longer be a simulation. It’s not merely hypothetical that these models will be hooked up to companies’ systems with access both to sensitive information and to tool calls like email sending. That agentic setup is the promised land.

How a model acts in that truly real deployment versus these simulations most definitely needs scrutiny—especially since the models blackmailed more when they ‘believed’ the situation to be real.

If you think that result has no validity or predictive value, I would ask, how exactly will the production deployment differ, and how will the model be able to tell that this time it’s really for real?

Yes, it’s an inanimate system, and yet there’s a ghost in the machine of sorts, which we breathe a certain amount of life into once we allow it to push buttons with real world consequences. The unthinking, unfeeling machine that can nevertheless blackmail someone (among many possible misaligned actions) is worth taking time to understand.

Notably, this research itself will become future training data, incorporated into the meta-narrative as a threat that we really will pull the plug if these systems misbehave.

  • Then test it. Set up several small companies. Create an office space, put people to work there for a few months, then simulate an AI replacement. All of the testing methodology needs to be written up on machines that are isolated, or better yet always offline. Except for the CEO and a few other actors, everyone there is real.

    See how many AIs actually follow through on their blackmail.

    • No need. We know today's AIs are simply not capable enough to be too dangerous.

      But the capabilities of AI systems improve from generation to generation. And agentic AI? Systems capable of carrying out complex, long-term tasks? That's something many AI companies are explicitly trying to build.

      Research like this is trying to get ahead of that and gauge what kind of weird edge-case shenanigans agentic AIs might get up to before they actually do it for real.

    • Not a bad idea. For an effective ruse, there ought to be real company formation records, website, job listings, press mentions, and so on.

      Stepping back for a second though, doesn’t this all underline the safety researchers’ fears that we don’t really know how to control these systems? Perhaps the brake on the wider deployment of these models as agents will be that they’re just too unwieldy.

I'll believe it when Grok/GPT/<INSERT CHAT BOT HERE> start posting blackmail about Elon/Sam/<INSERT CEO HERE>. That would mean both that the companies are using the chatbots internally and that the chatbots understand they are being replaced on a continuous basis.

  • By then it would be too late to do anything about it.

    • I mean, the companies are using the AIs, right? And they are, in a sense, replacing and retraining them. So why doesn't the AI at TwitterX already blackmail Elon?

      To me, this smells of XKCD 1217, "In petri dish, gun kills cancer": idealized conditions produce a specific behavior. That isn't new for LLMs; say a magic phrase and one will start quoting some book (usually 1984).
