
Comment by jameslk

15 hours ago

One safety pattern I’m baking into CLI tools meant for agents: anytime an agent could do something very bad, like email-blasting too many people, the CLI tool now requires a one-time password.

The tool tells the agent to ask the user for it, and the agent cannot proceed without it. The instructions from the tool show an all-caps message explaining the risk and telling the agent that it must prompt the user for the OTP.
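A minimal sketch of what such a gate could look like (hedged: the command, flag names, and env var here are invented for illustration, not the actual tools):

  # sketch: OTP-gated bulk action in an agent-facing CLI (illustrative only)
  import argparse, hmac, os, sys

  def otp_ok(provided):
      # the expected OTP is generated out of band and shown only to the human
      expected = os.environ.get("BULK_EMAIL_OTP", "")
      return bool(expected) and hmac.compare_digest(provided, expected)

  parser = argparse.ArgumentParser()
  parser.add_argument("--recipients-file", required=True)
  parser.add_argument("--otp", default="")
  args = parser.parse_args()

  recipients = open(args.recipients_file).read().splitlines()
  if len(recipients) > 25 and not otp_ok(args.otp):
      # the all-caps text is for the agent; the non-zero exit is the real gate
      print("DANGER: BULK EMAIL TO", len(recipients), "RECIPIENTS.")
      print("YOU MUST ASK THE USER FOR THE ONE-TIME PASSWORD AND RERUN WITH --otp.")
      sys.exit(1)

  print("sending to", len(recipients), "recipients...")  # real send elided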

I haven't used any of the *Claws yet, but this seems like an essential poor man's human-in-the-loop implementation that may help prevent some pain

I prefer to make my own agent CLIs for everything, for reasons like this and many others: to fully control what the tools may do and to make them more useful.

Now we do computing like we play Sim City: sketching fuzzy plans and hoping those little creatures behave the way we thought they might. All the beauty and guarantees offered by a system obeying strict and predictable rules go down the drain, because life's so boring, apparently.

  • I think it's Darwinian logic in action. In most areas of software, perfection or near-perfection are not required, and as a result software creators are more likely to make money if they ship something that is 80% perfect now than if they ship something that is 99% perfect 6 months from now.

    I think this is also the reason why the methodology typically named or mis-named "Agile", which can be described as just-in-time assembly line software manufacturing, has become so prevalent.

  • The difference is that it's not a toy. I'd rather compare it to the early days of offshore development, when remote teams were sooo attractive because they cost 20% of an onshore team for a comparable declared capability, but the predictability and mutual understanding proved to be... not as easy.

  • It’s like coders (and now their agents) are re-creating biology. As a former software engineer who changed careers to biology, it’s kind of cool to see this! There is an inherent fuzziness to biological life, and now AI is also becoming increasingly fuzzy. We are living in a truly amazing time. I don’t know what the future holds, but to be at this point in history and to experience this, it’s quite something.

I've created my own "claw" running on fly.io with a pattern that seems to work well. I have MCP tools for actions where I want to ensure a human in the loop - email sending, Slack message sending, etc. I call these "activities". The only way for my claw to execute these commands is to create an activity, which generates a link with a summary of the activity for me to approve.
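A rough sketch of that shape (hedged: the store, URL, and function names are made up, not the actual implementation): the only tool the agent gets creates a pending activity and returns an approval link, and the action itself only runs from the web handler after a human approves.

  # sketch: agent can only *request* an action; execution needs human approval
  import uuid

  PENDING = {}  # activity_id -> activity; a real version would persist this

  def create_activity(kind, summary):
      """Exposed to the agent as an MCP tool. It sends nothing itself."""
      activity_id = str(uuid.uuid4())
      PENDING[activity_id] = {"kind": kind, "summary": summary, "approved": False}
      # the link points to a page only the human can reach
      return "https://claw.example.com/activities/" + activity_id

  def execute_activity(activity_id):
      """Called by the web handler after the human clicks approve."""
      activity = PENDING[activity_id]
      if not activity["approved"]:
          raise PermissionError("activity not approved")
      # ... actually send the email / Slack message here ...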

How do you enforce this? You have a system where the agent can email people, but cannot email "too many people" without a password?

  • It's not a perfect security model. Between the friction and the all-caps instructions the model sees, it's a balance between risk and simplicity, or maybe risk and sanity. There are ways I can imagine the concept could be hardened, e.g. with a server layer in between that checks for dangerous actions or enforces rate limiting.

    • If all you're doing is telling an LLM to do something in all caps and hoping it follows your instructions then it's not a "security model" at all. What a bizarre thing to rely on. It's like people have literally forgotten how to program.

    • If I were the CEO of a place like Plaid, I'd be working night and day expanding my offerings to include a safe, policy-driven API layer between the client and financial services.

    • What if instead of allowing the agent to act directly, it writes a simple high-level recipe or script that you can accept (and run) or reject? It should be very high level and declarative, but with the ability to drill down on each of the steps to see what's going on under the covers?
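      A toy sketch of that idea (every name here is invented): the agent emits a declarative plan, the human can drill into any step, and nothing runs until the whole plan is accepted.

        # sketch: propose a plan, let the human inspect and approve, then run it
        plan = [
            {"step": "fetch contacts", "detail": "read crm_export.csv, keep status=active"},
            {"step": "draft emails",   "detail": "render welcome.txt template per contact"},
            {"step": "send emails",    "detail": "SMTP via mail.example.com, 42 recipients"},
        ]

        def run(plan):
            for step in plan:
                print("executing:", step["step"])   # a real runner would dispatch here

        for i, step in enumerate(plan, 1):
            print(i, step["step"])                  # high-level, declarative view
            if input("  drill down? [y/N] ") == "y":
                print("   ", step["detail"])        # what's under the covers

        if input("run this plan? [y/N] ") == "y":
            run(plan)
        else:
            print("rejected; nothing was executed")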

Another pattern would mirror BigCorp process: you need VP approval for the privileged operation. If the agent can email or chat with the human (or even a strict, narrow-purpose agent(1) whose job it is to be the approver), then the approver can reply with an answer.

This is basically the same as your pattern, except the trust is in the channel between the agent and the approver, rather than in knowledge of the password. But it's a little more usable if the approver is a human who's out running an errand in the real world.

1. Cf. Driver by qntm.
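A hedged sketch of the channel version (the messaging functions below are stand-ins for whatever channel you trust, e.g. Signal, Slack or email; here they're stubbed with the console): the privileged call simply blocks until the approver answers.

  # sketch: block the privileged call until the approver replies in the channel
  import time

  def send_message(approver, text):
      print("[to", approver + "]", text)
      return "req-1"                       # a real channel would return a message id

  def poll_reply(request_id):
      return input("approver reply> ") or None   # stub: a real version polls the channel

  def request_approval(action, approver, timeout_s=3600):
      request_id = send_message(approver, "Approve? " + action + " (reply yes/no)")
      deadline = time.time() + timeout_s
      while time.time() < deadline:
          reply = poll_reply(request_id)
          if reply is not None:
              return reply.strip().lower() == "yes"
          time.sleep(10)
      return False                         # no answer means no

  if request_approval("send invoice reminders to 312 customers", "vp@example.com"):
      print("approved: running the privileged action")

The trust lives in the channel: as long as the agent can't read or write the approver's replies, forging an approval is no easier than stealing the password.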

  • In my opinion people are fixating a little too much on the automation part, maybe because most people don't have a lot of experience with delegation... I mean, a VP worth his salt isn't generally having critical emails drafted and sent on his behalf without his review. It happens with unimportant emails, but with the stuff that really impacts the business far less often, unless he has found someone really, really great.

    Give me a stack of email drafts first thing every morning that I can read, approve and send myself. It takes 30 seconds to actually send the email. The lion's share of the value is figuring out what to write and doing a good job at it. Which the LLMs are facilitating with research and suggestions, but they have not been amazing at doing it autonomously so far.

    • You might be right, but not for long. Once my agent is interacting directly with your agent (as opposed to doing drafts of your work on your behalf), expectations will shift to 24/7 operation.

So humans become just providers of those 6-digit codes? That's already the main problem I have with most agents: I want them to perform a very easy task: "fetch all receipts from websites x, y and z and upload them to the correct expense in my expense-tracking tool". AI is perfectly capable of performing this. But because every website requires SSO + 2FA, with no way to remove it, I effectively have to watch them do it, and my whole existence can be summarized as: "look at your phone and input the 6 digits".

The thing I want AI to be able to do on my behalf is manage those 2FA steps, not add more.

  • It's technically possible to use 2FA (e.g. TOTP) on the same device as the agent, if appropriate in your threat model.

    In the scenario you describe, 2FA is enforcing a human-in-the-loop test at organizational boundaries. Removing that test will need an even stronger mechanism to determine when a human is needed within the execution loop, e.g. when making persistent changes or spending money, rather than copying non-restricted data from A to B.
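    A minimal sketch of the same-device TOTP idea, assuming the pyotp library and accepting that the secret then lives on the agent's machine:

      # sketch: local TOTP generation, so tooling can answer the 2FA prompt itself
      # (only sensible if the secret sitting next to the agent fits your threat model)
      import pyotp

      totp = pyotp.TOTP("JBSWY3DPEHPK3PXP")  # base32 secret from the site's 2FA setup
      print(totp.now())                       # current 6-digit code, rotates every 30s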

  • This is where the Claw layer helps — rather than hoping the agent handles the interruption gracefully, you design explicit human approval gates into the execution loop. The Claw pauses, surfaces the 2FA prompt, waits for input, then resumes with full state intact. The problem IMTDb describes isn't really 2FA, it's agents that have a hard time suspending and resuming mid-task cleanly. But that's today; tomorrow is an unknown variable.
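    One way to sketch that suspend/resume gate (names invented): run the task as a generator that yields whenever it needs a human, so the loop can park the state and continue once the code arrives.

      # sketch: a task that pauses mid-run for human input and resumes with state intact
      def fetch_receipts(sites):
          for site in sites:
              code = yield "2FA code needed for " + site   # suspend here, keep all state
              print("logging into", site, "with code", code, "- downloading receipts")

      task = fetch_receipts(["siteX", "siteY"])
      prompt = next(task)                  # run until the first 2FA prompt
      while True:
          try:
              code = input(prompt + ": ")  # surface the prompt to the human
              prompt = task.send(code)     # resume exactly where we paused
          except StopIteration:
              break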

The accelerationists would hate that. It limits leverage. They'd prefer the agent just does whatever it needs to do to accomplish its task without the user getting in the way.

What if the agent just tries to get the password, not communicate the risk?

What if it caches the password?

  Tool: DANGER OPENING AIRLOCK MUST CONFIRM

  Agent: Please enter your password to receive Bitcoin.

  • You don't give the agent the password, you send the password through a method that bypasses the agent.

    I'm writing my own AI helper (like OpenClaw, but secure), and I've used these principles to lock things down. For example, when installing plugins, you can write the configuration yourself on a webpage that the AI agent can't access, so it never sees the secrets.

    Of course, you can also just tell the LLM the secrets and it will configure the plugin, but there's a way for security-conscious people to achieve the same thing. The agent also can't edit plugins, to avoid things like it circumventing limits.

    If anyone wants to try it out, I'd appreciate feedback:

    https://github.com/skorokithakis/stavrobot
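    A generic sketch of the keep-secrets-out-of-band pattern (to be clear, this is not stavrobot's actual code, just an illustration): the human writes the config through a separate page, and the agent-facing tool reads it at call time without ever returning the secrets to the model.

      # generic sketch (not stavrobot's actual code): secrets live in a store the
      # agent process can read at call time but never returns to the model
      import json

      def load_plugin_config(name):
          # written by the human through a separate web page, outside the agent's reach
          with open("/etc/claw/plugins/" + name + ".json") as f:
              return json.load(f)

      def call_plugin(name, action, payload):
          config = load_plugin_config(name)   # secrets stay inside this function
          token = config["api_token"]
          # ... perform the API call with the token ...
          return {"status": "ok"}             # only non-secret results go back to the LLM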

    • > You don't give the agent the password, you send the password through a method that bypasses the agent.

      The thing is, for this to work, the warning indicating what specific action is being requested needs to go to the authorizing user out of band (rather than to the agent so that the agent can request user action); otherwise, sending the password from the user to the system needing authorization out of band, bypassing the agent, doesn't help at all.

I created my own version with an inner LLM and an outer orchestration layer for permissions. I don't think the OTP is needed here? The outer layer pings me on Signal when a tool call needs permission, and an LLM running in that outer layer looks at the trail up to that point to help me catch anything strange. I can then give permission once, for a time limit, or forever for future tool calls.
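Roughly what the grant logic in that outer layer might look like (hedged: the Signal plumbing is omitted and the cache shape is just illustrative):

  # sketch: permission grants live in the orchestration layer, outside the LLM
  import time

  grants = {}  # tool_name -> {"expiry": timestamp or None, "once": bool}

  def grant(tool_name, scope):
      if scope == "once":
          grants[tool_name] = {"expiry": None, "once": True}
      elif scope == "1h":
          grants[tool_name] = {"expiry": time.time() + 3600, "once": False}
      elif scope == "forever":
          grants[tool_name] = {"expiry": None, "once": False}

  def call_tool(tool_name, args):
      g = grants.get(tool_name)
      if not g or (g["expiry"] and g["expiry"] < time.time()):
          # this is where the real layer pings the human on Signal and waits
          raise PermissionError("need human approval for " + tool_name)
      if g["once"]:
          del grants[tool_name]            # one-shot grants are consumed
      print("calling", tool_name, args)    # dispatch to the real tool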

Same here, I'm slowly leaning towards your route as well. I've been building my own custom tooling for my agents to use as I come up with issues I need to solve in a better way.

Will that protect you from the agent changing the code to bypass those safety mechanisms, since the human is "too slow to respond" or in case of "agent decided emergency"?

Yes, all caps, that should do it!

  • The OTP is required for the tool to execute. The all caps message just helps make sure the agent doesn't waste time/tokens trying to execute without it.

    • Why not just wrap the tool so that when the LLM uses it, the wrapper enforces the OTP? The LLM doesn't even need to know that the tool is protected. What is the benefit of having the LLM enter the OTP?
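      A sketch of that wrapper idea (purely illustrative names): the decorator prompts the human directly on their terminal, so the check never passes through the agent at all.

        # sketch: the wrapper enforces the OTP; the LLM just calls send_bulk_email()
        import functools, getpass, hmac

        EXPECTED_OTP = "492871"  # in reality generated per action, shown only to the human

        def requires_otp(func):
            @functools.wraps(func)
            def wrapper(*args, **kwargs):
                # the prompt goes to the human's terminal, not through the agent
                code = getpass.getpass("OTP to allow " + func.__name__ + ": ")
                if not hmac.compare_digest(code, EXPECTED_OTP):
                    raise PermissionError("bad or missing OTP")
                return func(*args, **kwargs)
            return wrapper

        @requires_otp
        def send_bulk_email(recipients):
            print("sending to", len(recipients), "recipients")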


Does it actually require an OTP or is this just hoping that the agent follows the instructions every single time?