← Back to context

Comment by eximius

6 months ago

If you can't stop an LLM from _saying_ something, are you really going to trust that you can stop it from _executing a harmful action_? This is a lower stakes proxy for "can we get it to do what we expect without negative outcomes we are a priori aware of".

Bikeshed the naming all you want, but it is relevant.

> are you really going to trust that you can stop it from _executing a harmful action_?

Of course, because an LLM can’t take any action: a human being does, when he sets up a system comprising an LLM and other components which act based on the LLM’s output. That can certainly be unsafe, much as hooking up a CD tray to the trigger of a gun would be — and the fault for doing so would lie with the human who did so, not for the software which ejected the CD.

  • Given that the entire industry is in a frenzy to enable "agentic" AI - i.e. hook up tools that have actual effects in the world - that is at best a rather native take.

    Yes, LLMs can and do take actions in the world, because things like MCP allow them to translate speech into action, without a human in the loop.

    • Exactly this. 70% of CEOs say that they hope to be able to lay people off and replace them with an LLM soon. It doesn’t matter that LLMs are incapable of reasoning at even the same level as an elementary school child. They’ll do it because it’s cheap and trendy.

      Many companies are already pushing LLMs into roles where they make decisions. It’s only going to get worse. The surface area for attacks against LLM agents is absolutely colossal, and I’m not confident that the problems can be fixed.

      2 replies →

    • I see much more of offerings pushing these flows onto the market than actually adopting those flows in practice. It's a solution in search of a problem and I doubt most are fully eating their own dogfood as anything but contained experiments.

    • > that is at best a rather native take.

      No more so than correctly pointing out that writing code for ffmpeg doesn't mean that you're enabling streaming services to try to redefine the meaning of the phrase "ad-free" because you're allowing them to continue existing.

      The problem is not the existence of the library that enables streaming services (AI "safety"), it's that you're not ensuring that the companies misusing technology are prevented from doing so.

      "A company is trying to misuse technology so we should cripple the tech instead of fixing the underlying social problem of the company's behavior" is, quite frankly, an absolutely insane mindset, and is the reason for a lot of the evil we see in the world today.

      You cannot and should not try to fix social or governmental problems with technology.

  • I really struggle to grok this perspective.

    The semantics of whether it’s the LLM or the human setting up the system that “take an action” are irrelevant.

    It’s perfectly clear to anyone that cares to look that we are in the process of constructing these systems. The safety of these systems will depend a lot on the configuration of the black box labeled “LLM”.

    If people were in the process of wiring up CD trays to guns on every street corner you’d I hope be interested in CDGun safety and the algorithms being used.

    “Don’t build it if it’s unsafe” is also obviously not viable, the theoretical economic value of agentic AI is so big that everyone is chasing it. (Again, it’s irrelevant whether you think they are wrong; they are doing it, and so AI safety, steerability, hackability, corrigibility, etc are very important.)

But isn't the problem is that one shouldn't ever trust an LLM to only ever do what it is explicitly instructed with correct resolutions to any instruction conflicts?

LLMs are "unreliable", in a sense that when using LLMs one should always consider the fact that no matter what they try, any LLM will do something that could be considered undesirable (both foreseeable and non-foreseeable).

> If you can't stop an LLM from _saying_ something, are you really going to trust that you can stop it from _executing a harmful action_?

You hit the nail on the head right there. That's exactly why LLM's fundamentally aren't suited for any greater unmediated access to "harmful actions" than other vulnerable tools.

LLM input and output always needs to be seen as tainted at their point of integration. There's not going to be any escaping that as long as they fundamentally have a singular, mixed-content input/output channel.

Internal vendor blocks reduce capabilities but don't actually solve the problem, and the first wave of them are mostly just cultural assertions of Silicon Valley norms rather than objective safety checks anyway.

Real AI safety looks more like "Users shouldn't integrate this directly into their control systems" and not like "This text generator shouldn't generate text we don't like" -- but the former is bad for the AI business and the latter is a way to traffic in political favor and stroke moral egos.

The way to stop it from executing an action is probably having controls on the action and an not the llm? white list what api commands it can send so nothing harmful can happen or so on.

  • This is similar to the halting problem. You can only write an effective policy if you can predict all the side effects and their ramifications.

    Of course you could do like deno and other such systems and just deny internet or filesystem access outright, but then you limit the usefulness of the AI system significantly. Tricky problem to be honest.

  • It won't be long before people start using LLMs to write such whitelists too. And the APIs.

I wouldn't mind seeing a law that required domestic robots to be weak and soft.

That is, made of pliant material and with motors with limited force and speed. Then no matter if the AI inside is compromised, the harm would be limited.

  • Humans are weak and soft, but can use their intelligence to project forces much higher than available in their physical body.

I don't see how it is different than all of the other sources of information out there such as websites, books and people.