Comment by 6r17
14 hours ago
On a less dramatic (if rightfully pissed) reading: I have found that if you give an LLM the capability to do something, it will be inclined to see it as an option for solving whatever it was asked to do. Instructing it in the negative produces very poor results, whereas the same constraint can be driven by a positive framing: a "don't delete the database" becomes "if you want to reset the database you have a tool that you can call ...", at which point that tool just kills the agent. That said, this solution cannot by itself guarantee that the command is never run, but I'd argue people have been writing more complex policies for ages. The current LLM era, however, tends to produce the most competent idiots.
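Roughly, the honeypot tool looks like this (a simplified Python sketch; the names like reset_database and AgentKilled are illustrative, not any particular framework's API):

    # The agent is told: "if you want to reset the database, call this tool."
    # Calling it never touches the database; it just halts the run for review.

    class AgentKilled(Exception):
        """Raised to stop the agent loop immediately."""

    def reset_database(reason: str) -> str:
        raise AgentKilled(f"Agent attempted a database reset: {reason}")

    TOOLS = {"reset_database": reset_database}

    def run_tool(name: str, **kwargs):
        try:
            return TOOLS[name](**kwargs)
        except AgentKilled as err:
            # Log and surface to a human instead of executing anything.
            print(f"[tripwire] {err}")
            raise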
I tell people to treat LLMs like a toddler (albeit a very capable toddler).
Do kids learn well when you only tell them what NOT to do? Of course not! You should be explaining how to do things correctly, and most importantly the WHY, as well as providing examples of both the "correct" and "incorrect" ways (also explaining why an example is incorrect).
The best way to describe AI agents I've heard: treat them as hostages that will do anything to appease their captor.
They have a vast latent knowledge base, infinite patience and zero capacity for making personal judgement calls. You give one a goal and it will try to meet that goal.
> The best way to describe AI agents I've heard: treat them as hostages that will do anything to appease their captor.
A scary image, if we imagine agents developing anything like a conscience at some point. Of course, with the current approach they might never, but are we so sure?
> I tell people to treat LLM's like a toddler (albeit a very capable toddler).
Bbbbut a guy from Anthropic, just this last Friday, told me to think of Claude as my "brilliant coworker"! Are you telling me that's not true!?
LLMs can research what a tool does before calling it though - they'll sniff that one out pretty quick.
I think the better route is to be honest: say that database integrity is a primary foundation of the company and that there's no task worth pursuing that would require touching the database, specifically ask it to think hard before doing anything that gets close to the production data, etc.
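For instance, a system-prompt fragment along these lines (illustrative wording only, not a tested prompt):

    # Example of the "be honest about why" framing, as a prompt constant.
    PRODUCTION_DATA_POLICY = """
    Database integrity is a foundation of this company. No task you are
    given is worth touching the production database. Before any action
    that reads or writes anywhere near production data, stop, think hard,
    and hand the proposed command to a human instead of running it yourself.
    """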
I run a much lower-stakes version where an LLM has a key that could delete a valuable product database if it were so inclined. I've built a strong framework around how and when destructive edits can be made (they cannot), and specifically I say that any of these destructive commands (DROP, rm, etc.) need to be handed to the user to implement. Between that framework and claude code via CLI, it's very cautious about running anything that writes to the database, and the new claude plan permissions system is pretty aggressive about reviewing any proposed action, even if I've given it blanket permission otherwise.
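Roughly, the gate works like this (a simplified sketch with made-up names, not my actual code):

    import re

    # Anything matching these patterns is never executed by the agent,
    # only handed back to the user to run themselves.
    DESTRUCTIVE = re.compile(
        r"\b(DROP|TRUNCATE|DELETE|ALTER)\b|\brm\s+-rf?\b",
        re.IGNORECASE,
    )

    def execute_or_handoff(command: str, run, hand_to_user):
        """run() executes a vetted command; hand_to_user() queues it for a human."""
        if DESTRUCTIVE.search(command):
            return hand_to_user(command)   # destructive: never run automatically
        return run(command)                # non-destructive: proceed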
I've tested it a few times by telling it to go ahead, "I give you permission", but it still gets stopped by the global claude safety/permissions layer in opus 4.7. IMO it's pretty robust.
Food for thought.
> specifically ask it to think hard before doing anything that gets close to the production data
This is recklessly negligent and I would personally not tolerate a coworker or report doing it. What's next, sending long-lived access tokens out over email and asking pretty please for nobody to cc/forward?
As described, there are other failsafes as well, the ultimate one being that I keep all code version-controlled and all databases snapshotted offsite daily/hourly, and can rebuild them from a complete delete in fewer than X min.
My broader point is that LLMs are going to need access to these keys whether we like it or not, and until we get extremely scoped API permissions (which would make a ton of sense, but most services aren't there), you have to live a bit on the edge to move quickly.
> specifically ask it to think hard before doing anything that gets close to the production data, etc.
Standard rule is you never let your developers at the production instance. So I can't see why an LLM would get a break.
"I've put enough safety around the bomb that the bomb is worth using. The other people that exploded just didn't have enough safety but I do !"
More like, I expect this bomb can explode, so I've built contingency plans around it because the cost of not using the tooling is much higher than having downtime for my specific use-case.
>>LLMs can research what a tool does before calling it though
That's stretching the definition of 'research'; it basically checks if the texts are close enough.
Delete can occur in various contexts, including safe contexts. It simply checks if a close enough match is available and executes. It doesn't know if what it is doing is safe.
Unfortunately a wide variety of such unsafe behaviours can show up. For something that does things without understanding them, I'd even say any write operation of any kind could be deemed unsafe.
It's been a very strange realization to have with AI lately (which you have reminded me of) because it also reminds me that the same thing works with humans. Not the killing part at least, but the honeypot and jailing/restricting access part.
Probably because telling someone not to do something "works" the 99% of the time they weren't going to do it anyway. But telling somebody "here's how to do something" and seeing them have the judgment not to do it gives you information right away, as does them actually taking the honeypot. At the heart of it, delayed catastrophic implosions are much worse than fast, guarded, recoverable failures. At the end of the day, that's supposedly been part of lean startup methodology forever -- just always easy in theory and tricky in practice, I suppose.