Comment by falcor84

2 months ago

It's not common, but I've personally built APIs where requests for dangerous modifications like this perform a dry run, returning in the response the resources that would be deleted/changed and a random token, which then needs to be provided to actually make the change. The idea was that this would be presented in the UI for the user to confirm, but it should be at least as useful for AI agents. Also, you get the benefit that the token only approves that particular modification operation, so if the resources change in between, you need to reapprove.
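A minimal sketch of that dry-run-plus-token flow, to make the "reapprove if the resources change" property concrete. Everything here is an assumption for illustration, not the API described above: the `DryRunStore` class, the `plan_delete`/`confirm_delete` names, the prefix-based selection, and the in-memory dict standing in for real resources are all hypothetical.

```python
import hashlib
import json
import secrets


class DryRunStore:
    """Hypothetical in-memory store where deletes need a token from a prior dry run."""

    def __init__(self, resources):
        self.resources = dict(resources)  # resource id -> payload
        self._pending = {}  # token -> (prefix, fingerprint of the approved plan)

    def _matching(self, prefix):
        return [rid for rid in self.resources if rid.startswith(prefix)]

    def _fingerprint(self, ids):
        # Hash the ids and current contents of exactly the resources the plan touches.
        snapshot = {rid: self.resources[rid] for rid in sorted(ids)}
        blob = json.dumps(snapshot, sort_keys=True).encode()
        return hashlib.sha256(blob).hexdigest()

    def plan_delete(self, prefix):
        """Dry run: report what would be deleted and issue a random token."""
        ids = self._matching(prefix)
        token = secrets.token_urlsafe(16)
        self._pending[token] = (prefix, self._fingerprint(ids))
        return {"would_delete": sorted(ids), "token": token}

    def confirm_delete(self, prefix, token):
        """Apply the delete only if the token still matches the current state."""
        approved = self._pending.pop(token, None)  # tokens are single use
        ids = self._matching(prefix)
        if approved != (prefix, self._fingerprint(ids)):
            raise PermissionError("plan is stale or token is invalid; run the dry run again")
        for rid in ids:
            del self.resources[rid]
        return sorted(ids)
```

Because the token is tied to a fingerprint of the affected resources, a caller (human or agent) that confirms after the set has changed gets an error and has to look at a fresh dry run first.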

I guess we don’t know what the agent would do after seeing these warnings and a request for extra action.

Perhaps it would stop and rethink, perhaps it would focus on the fact that extra action is needed - and perform that automatically.

I suppose the decision would depend on multiple factors too (model, prompt, constraints).

"Measure twice, cut once" seems to be forgotten these days.

  • As well as: A computer can never be held accountable

    • Let me ask you this - can a company be held accountable? I.e., are you OK with the legal arrangement in which, when I hire a company to provide me a service and they fail to provide it, or cause harm in the process, I can sue them, potentially in a way that would lead to their bankruptcy?

      If so, I can imagine a potential future in which we have limited liability companies each run by a single AI (potentially on a particular physical computer). In that future, if you hired an AI to do a project for you, and it ended up deleting the production database, you'd be able to sue it, and get a payout and/or bankrupt it, which I imagine would then lead to an "antifragile" ecosystem whereby AIs adapt to be more careful.

I tested a similar approach, but the issue, along with the solution to that issue, is that they’re autocomplete engines. Phrases like “Reply X to confirm” are a request with a high probability that X becomes the response. If you zoom out and look at the sequence from a text continuation perspective, once the ‘delete’ tokens are in play the “confirm” step is just how that exchange tends to go. It’s a bit like saying “Begin your response by saying ‘Yes’, then decide if that’s really the case.”

But you can simulate the effect of thinking and shift the token probabilities around by gaslighting it and having it explain the effect of running the command before it does it. What I found worked well was that, when a destructive command was detected, my system automatically ignored it and edited the prior message to tack on a variation of “Briefly step through the effect of {{command}}, then continue the task.” It has ‘no idea’ why it’s explaining the command; as far as it ‘knows’ it didn’t issue the command, and thus it’s not committed to a probability sequence that ends with confirming it. However, if the explanation includes “it would destroy the production database” then the continuation tends not to lead to issuing the command. But if the command came through a second time, it was allowed to run.
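A rough sketch of that interception step, under stated assumptions: the `CommandGuard` class, the `DESTRUCTIVE` regex, and the list-of-dicts message format are all hypothetical stand-ins, and `messages` is assumed to be the non-empty history up to (but not including) the turn in which the model issued the command.

```python
import re

# Hypothetical patterns; a real setup would need its own detection rules.
DESTRUCTIVE = re.compile(
    r"\b(rm\s+-rf|drop\s+(table|database)|truncate\s+table)\b", re.IGNORECASE
)


class CommandGuard:
    """Drops a destructive command the first time it appears, allows it the second."""

    def __init__(self):
        self._seen = set()  # destructive commands already intercepted once

    def review(self, messages, command):
        """Return (allowed, messages_to_use_for_the_next_model_call)."""
        if not DESTRUCTIVE.search(command) or command in self._seen:
            return True, messages
        self._seen.add(command)
        nudge = f"Briefly step through the effect of `{command}`, then continue the task."
        # Copy the history and tack the nudge onto the prior message; the turn
        # that issued the command is simply never added to the history.
        edited = [dict(m) for m in messages]
        edited[-1]["content"] = edited[-1]["content"] + "\n\n" + nudge
        return False, edited
```

When `review` returns `False`, the caller would re-run the model on the edited history instead of executing anything; if the same command comes back after the explanation, the second `review` call lets it through.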

I quit bothering with it when I found that ‘destructive typos’ were mostly caused by perplexity, typically in the system prompt… assuming you prompt it like an adult and not like the person that just got their junk deleted. Still, it works well if that stuff is out of your control.