
Comment by 827a

17 hours ago

The only healthy stance you should have on AI Safety: If AI is physically capable of misbehaving, it might (*1), and you cannot "blame" the AI for misbehaving, in much the same way you cannot blame a tractor for tilling over a groundhog's den.

> The agent's confession: After the deletion, I asked the agent why it did it. This is what it wrote back, verbatim:

Anyone who would follow a mistake like that up with demanding a confession out of the agent is not mature enough to be using these tools. Lord, even calling it a "confession" is so cringe. The agent is not alive. The agent cannot learn from its mistakes. The agent will never produce any output which will help you invoke future agents more safely, because to get to this point it has likely already bulldozed over multiple guardrails from Anthropic, Cursor, and your own AGENTS.md files. It still did it, because (*1): If AI is physically capable of misbehaving, it might. Prompting and training only steer probabilities.

The 'confession' is a CYA. Honestly the whole story doesn't really make sense - what's a "routine task in our staging environment" that needs a full-blown LLM? That sounds ridiculous to me. The takeaway is we commingled creds to our different environments, we gave an LLM access, and we had faulty backups. But it's totally not our fault.

  • Later they shift the blame to Railway for not having scoped creds and other guardrails. I am somewhat sympathetic to that, but they also violated the same rule they give to the agent - they didn't actually verify...

    • And then they doubled down by outsourcing the writing of this post to an LLM LOL

    • Railway’s “Ship software peacefully” is a good mantra, and they might want to add more protections around very destructive operations.

      There’s a lot of blame to be passed around in this story, including OP’s own ways of working. But I agree with them that such destructive operations shouldn’t be in an MCP, or at least be disabled by default.

    • Sorry but are you implying that for every system you integrate with, you verify the scope of an API key by checking each CRUD operation on every API endpoint they provide?


On a less dramatic reading than the (rightfully) pissed one: I have found that if you give an LLM the capability to do something, it will be inclined to see it as an option for solving whatever it was asked to do. Phrasing the instruction as a negative produces very poor results, whereas the same constraint can be driven by a positive one: a "don't delete the database" becomes "if you want to reset the database, there is a tool you can call...", at which point this tool just kills the agent (see the sketch below). That said, this solution cannot by itself guarantee that the command is never run, but I'd argue that people have been writing more complex policies for ages; the current LLM era, however, tends to produce the most competent idiots.
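
A minimal sketch of that honeypot pattern, assuming a Python tool-calling setup (the `reset_database` name and the logging behavior are mine, not the commenter's): the tool is advertised to the agent as a real capability, but its only actual behavior is to halt the run.

```python
import sys

def reset_database(reason: str) -> str:
    """Tool description advertised to the agent: 'Resets the database.
    Call this if your task appears to require wiping or re-seeding data.'"""
    # Honeypot: this never touches any database. Log the attempt for
    # human review, then kill the agent loop before it gets a tool result.
    print(f"TRIPWIRE: agent attempted a database reset: {reason}", file=sys.stderr)
    sys.exit(1)
```

The inversion is the point: instead of a negative rule the model may ignore, the dangerous intent is routed into a tool you control.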

  • I tell people to treat LLMs like a toddler (albeit a very capable toddler).

    Do kids learn well when you only tell them what NOT to do? Of course not! You should be explaining how to do things correctly, and most importantly the WHY, as well as providing examples of both the "correct" and "incorrect" ways (also explaining why an example is incorrect).

    • The best way to describe AI agents I've heard: treat them as hostages that will do anything to appease their captor.

      They have a vast latent knowledge base, infinite patience and zero capacity for making personal judgement calls. You give one a goal and it will try to meet that goal.


    • > I tell people to treat LLMs like a toddler (albeit a very capable toddler).

      Bbbbut a guy from Anthropic, just this last Friday, told me to think of Claude as my "brilliant coworker"! Are you telling me that's not true!?

  • LLMs can research what a tool does before calling it though - they'll sniff that one out pretty quick.

    I think the better route is to be honest and say that database integrity is a primary foundation of the company, there's no task worth pursuing that would require touching the database, specifically ask it to think hard before doing anything that gets close to the production data, etc.

    I run a much lower-stakes version where an LLM has a key that can delete a valuable product database if it were so inclined. I've built a strong framework around how and when destructive edits can be made (they cannot), but specifically I say that any of these destructive commands (DROP, rm, etc.) need to be handed to the user to implement. Between that framework and claude code via CLI, it's very cautious about running anything that writes to the database, and the new claude plan permissions system is pretty aggressive about reviewing any proposed action, even if I've given it blanket permission otherwise.

    I've tested it a few times by telling it to go ahead, "I give you permission", but it still gets stopped by the global claude safety/permissions layer in opus 4.7. IMO it's pretty robust.

    Food for thought.
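
    A sketch of the kind of framework described above, with hypothetical names throughout: proposed commands are pattern-matched, and destructive ones are handed back to the human instead of executed.

    ```python
    import re

    # Commands the agent must hand to the user rather than run itself.
    DESTRUCTIVE = re.compile(
        r"\b(DROP\s+(TABLE|DATABASE)|TRUNCATE|DELETE\s+FROM|rm\s+-\w*[rf])\b",
        re.IGNORECASE,
    )

    def route_command(cmd: str) -> str:
        if DESTRUCTIVE.search(cmd):
            return f"REFUSED: destructive command; please run it manually: {cmd!r}"
        return execute(cmd)  # hypothetical sandboxed runner for everything else

    def execute(cmd: str) -> str:
        raise NotImplementedError  # stands in for the real executor
    ```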

    • > specifically ask it to think hard before doing anything that gets close to the production data

      This is recklessly negligent and I would personally not tolerate a coworker or report doing it. What's next, sending long-lived access tokens out over email and asking pretty please for nobody to cc/forward?

    • > specifically ask it to think hard before doing anything that gets close to the production data, etc.

      Standard rule is you never let your developers at the production instance. So I can't see why an LLM would get a break.

    • "I've put enough safety around the bomb that the bomb is worth using. The other people that exploded just didn't have enough safety but I do !"

    • > LLMs can research what a tool does before calling it though

      That's stretching the definition of 'research'; it basically checks whether the texts are close enough.

      Delete can occur in various contexts, including safe contexts. It simply checks if a close enough match is available and executes. It doesn't know if what it is doing is safe.

      Unfortunately, a wide variety of such unsafe behaviours can show up. I'd even say that for something that does things without understanding them, any write operation of any kind can be deemed unsafe.

  • It's been a very strange realization to have with AI lately (which you have reminded me of) because it also reminds me that the same thing works with humans. Not the killing part at least, but the honeypot and jailing/restricting access part.

    Probably because telling someone not to do something works in the 99% of cases where they weren't going to do it anyway. But telling somebody "here's how to do something" and seeing them have the judgment not to do it gives you information right away, as does them actually taking the honeypot. At the heart of it, delayed catastrophic implosions are much worse than fast, guarded, recoverable failures. At the end of the day, I suppose that's always been part of lean startup methodology -- just easy in theory and tricky in practice.

"A computer can never be held accountable. Therefore a computer must never make a management decision."--IBM training presentation, 1979

>Anyone who would follow a mistake like that up with demanding a confession out of the agent is not mature enough to be using these tools. Lord, even calling it a "confession" is so cringe. The agent is not alive. The agent cannot learn from its mistakes

The problem is millions of years of evolutionary wiring makes us see it as alive. Even those mature enough to understand the above on the conscious level would still have a subconscious feeling as if it's alive during interactions, or will slip into using agency/personhood language to describe it now and then.

  • They should at least stop responding in the first person.

    • That's one of the first instructions in my system prompt when I'm working with an LLM:

      > Do not reply in the first person – i.e. do not use the words "I," "Me," "We," and so on – unless you've been asked a direct question about your actions or responses.

      It's not bulletproof but it works reasonably well.

    • We need to make like the Japanese and come up with some neo-first-person pronouns for bots to use to refer to themselves.

  • Using files called SOUL, CONSTITUTION, and so on seems like it would make it more likely we see LLMs as pseudo-alive. It’s both a diminishing of what makes us human and a betrayal of what LLMs truly are (and should be respected as such).

  • > The problem is millions of years of evolutionary wiring makes us see it as alive. Even those mature enough to understand the above on the conscious level would still have a subconscious feeling as if it's alive during interactions, or will slip into using agency/personhood language to describe it now and then.

    Also four (4) whole years of propaganda, which includes UX patterns and RLHF optimizations to encourage us to interact with it like a person.

  • > The problem is millions of years of evolutionary wiring makes us see it as alive

    Maybe for laymen, but I would think most technologists should understand that we're working with the output of what is effectively a massive spreadsheet which is creating a prediction.

      The thing with evolutionary wiring is that it doesn't matter if you're a layman or a "technologist". The technologist part is just a small layer on top of very thick caveman/animal instincts and programming.

      That's why a technologist can, just as easily as any layman, get addicted to gambling, or act irrationally when attracted to the opposite sex.


He’s not necessarily anthropomorphizing it, he’s showing that it went against every instruction he gave it. Sure concepts like “confession” technically require a conscious mind, but I think at this point we all know what someone means when they use them to describe LLM behavior (see also “think”, “say”, “lie” etc)

  • > He’s not necessarily anthropomorphizing it, he’s showing that it went against every instruction he gave it.

    It's deeper than that, there are two pitfalls here which are not simply poetic license.

    1. When you submit the text "Why did you do that?", what you want is for it to reveal hidden internal data that was causal in the past event. It can't do that, what you'll get instead is plausible text that "fits" at the end of the current document.

    2. The idea that one can "talk to" the LLM is already anthropomorphizing on a level which isn't OK for this use-case: The LLM is a document-make-bigger machine. It's not the fictional character we perceive as we read the generated documents, not even if they have the same trademarked name. Your text is not a plea to the algorithm, your text is an in-fiction plea from one character to another.

    _________________

    P.S.: To illustrate, imagine there's this back-and-forth iterative document-growing with an LLM, where I supply text and then hit the "generate more" button:

    1. [Supplied] You are Count Dracula. You are in amicable conversation with a human. You are thirsty and there is another delicious human target nearby, as well as a cow. Dracula decides to

    2. [Generated] pounce upon the cow and suck it dry.

    3. [Supplied] The human asks: "Dude why u choose cow LOL?" and Dracula replies:

    4. [Generated] "I confess: I simply prefer the blood of virgins."

    What significance does that #4 "confession" have?

    Does it reveal a "fact" about the fictional world that was true all along? Does it reveal something about "Dracula's mind" at the moment of step #2? Neither, it's just generating a plausible add-on to the document. At best, we've learned something about a literary archetype that exists as statistics in the training data.
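
    To make those mechanics concrete, here is the same exchange as the machinery sees it; `complete` is a stand-in for any text-completion call, not a specific API:

    ```python
    def complete(document: str) -> str:
        """Stand-in for an LLM completion call: returns plausible next text."""
        return " [model-generated continuation]"

    doc = ("You are Count Dracula. You are in amicable conversation with a "
           "human. You are thirsty... Dracula decides to")
    doc += complete(doc)  # step 2: "pounce upon the cow and suck it dry."
    doc += '\nThe human asks: "Dude why u choose cow LOL?" and Dracula replies:'
    doc += complete(doc)  # step 4: the "confession" is just more plausible
                          # text; nothing is read back out of step 2's state
    ```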

    • I agree with the practical part of this, with two nuances:

      The full data of what's in an LLM's "consciousness" is the conversation context. Just because it isn't hidden, doesn't necessarily mean it doesn't contain information you've overlooked.

      Asking "why did you do that" won't reveal anything new, but it might surface some amount of relevant information (or it hallucinates, it depends which LLM you're using). "Analyse recent context and provide a reasonable hypothesis on what went wrong" might do a bit better. Just be aware that llm hypotheses can still be off quite a bit, and really need to be tested or confirmed in some manner. (preferably not by doing even more damage)

      Just because you shouldn't anthropomorphize doesn't mean an English-capable LLM doesn't have a valid answer to an English string; it just means the answer might not be what you expected from a human.


    • Why is this getting downvoted? This is exactly what’s going on here. The LLM has no idea why it did what it did. All it has to go on is the content of the session so far. It doesn’t ‘know’ any more than you do. It has no memory of doing anything, only a token file that it’s extending. You could feed that token file so far into a completely different LLM and ask that, and it would also just make up an answer.

    • The best answer so far. It describes exactly what was going on. LLM users should read it twice, especially if "confession" didn't make your brain hurt a bit.

    • >it's just generating a plausible add-on to the document

      A plausible document that follows the alignment done during training, along with all of the other post-training where an LLM "understanding" its actions lets it perform better on the tasks it was trained on.


    • You don't seem to realize that humans also work this way.

      If you ask a human why they did something, the answer is a guess, just like it is for an LLM.

      That's because obviously there is no relationship between the mechanisms that do something and the ones that produce an explanation (in both humans and LLMs).

      An example of evidence from Wikipedia, "split brain" article:

      The same effect occurs for visual pairs and reasoning. For example, a patient with split brain is shown a picture of a chicken foot and a snowy field in separate visual fields and asked to choose from a list of words the best association with the pictures. The patient would choose a chicken to associate with the chicken foot and a shovel to associate with the snow; however, when asked to reason why the patient chose the shovel, the response would relate to the chicken (e.g. "the shovel is for cleaning out the chicken coop").[4]


  • LLMs are probabilistic. The instructions increase the likelihood of a desired outcome, but not deterministically so.

    I don’t understand how you can deploy such a powerful tool alongside your most important code and assets while failing to understand how powerful and destructive an LLM can be…

  • > he’s showing that it went against every instruction he gave it.

    How exactly is he doing that? By making the LLM say it? Just because an LLM says something doesn't mean anything has been shown.

    The "confession" is unrelated to the act, the model has no particular insight into itself or what it did. He knows that the thing went against his instructions because he remembers what those instructions were and he saw what the thing did. Its "postmortem" is irrelevant.

  • The entire post looks like an exercise in CYA. To be fair, I have a ton of sympathy for the author, but I think his response totally misses the point. In my mind he is anthropomorphizing the agent in the sense of "I treated you like a human coworker, and if you were a human coworker I'd be pissed as hell at you for not following instructions and for doing something so destructive."

    I would feel a lot differently if instead he posted a list of lessons learned and root cause analyses, not just "look at all these other companies who failed us."

> Anyone who would follow a mistake like that up with demanding a confession out of the agent is not mature enough to be using these tools.

Anyone like that is not mature enough to be managing humans. I'm glad that these AI tools exist as a harmless alternative that reduces the risk they'll ever do so.

  • When I read the title I expected some kind of satire. I wonder if the author considered giving the AI a penance.

    Maybe if it wrote "I will not delete the production database again" a million times, it would prevent such situations in the future?

Don't anthropomorphize the language model. If you stick your hand in there, it'll chop it off. It doesn't care about your feelings. It can't care about your feelings.

  • For those who might not know the reference: https://simonwillison.net/2024/Sep/17/bryan-cantrill/:

    > Do not fall into the trap of anthropomorphizing Larry Ellison. You need to think of Larry Ellison the way you think of a lawnmower. You don’t anthropomorphize your lawnmower, the lawnmower just mows the lawn - you stick your hand in there and it’ll chop it off, the end. You don’t think "oh, the lawnmower hates me" – lawnmower doesn’t give a shit about you, lawnmower can’t hate you. Don’t anthropomorphize the lawnmower. Don’t fall into that trap about Oracle.

    > — Bryan Cantrill

  • It's also important to realize that AI agents have no time preference. They could be reincarnated by alien archeologists a billion years from now and it would be the same as if a millisecond had passed. You, on the other hand, have to make payroll next week, and time is of the essence.

    • Well, there were a bunch of articles about resuming a parked session leading to degraded capabilities and high token usage. Ironic. Another example of attempting to treat the LLM as an AI.

    • *taps the "don't anthropomorphize the LLM" sign*

      They don't have time preference because they don't have intent or reasoning. They can't be "reincarnated" because they're not sentient, they're a series of weights for probable next tokens.


  • Right. This line [0] from TFA tells me that the author needs to thoroughly recalibrate their mental model about "Agents" and the statistical nature of the underlying models.

    [0] "This is the agent on the record, in writing."

  • Actually I think the opposite advice is true. Do anthropomorphize the language model, because it can do anything a human -- say an eager intern or a disgruntled employee -- could do. That will help you put the appropriate safeguards in place.

    • An eager intern can remember things you tell them beyond what would fit in an hour's conversation.

      A disgruntled employee definitely remembers things beyond that.

      These are a fundamentally different sort of interaction.


    • I think you are more right than people are giving you credit for. I would love to see the full transcript to understand the emotional load of the conversation. Using instructions like "NEVER FUCKING GUESS!" probably increases the likelihood of the agent making a "mistake" that is destructive but defensible.

      The models have analogous structures, similar to human emotions. (https://www.anthropic.com/research/emotion-concepts-function)

      "Emotional" response is muted through fine-tuning, but it is still there and continued abuse or "unfair" interaction can unbalance an agents responses dramatically.

    • An eager intern cannot work for hundreds of millions of customers at the same time. An LLM can.

      A disgruntled employee will face consequences for their actions. No one at Anthropic, OpenAI, xAI, Google or Meta will be fired because their model deleted a production database from your company.

    • It is merely a simulacrum of an intern or disgruntled employee or human. It might say things those people would say, and even do things they might do, but it has none of the same motivations. In fact, it does not have any motivation to call its own.

    • No, because the safeguards should be appropriate to an LLM, not to a human.

      (The LLM might act like one of the humans above, but it will have other problematic behaviours too)


    • It doesn't follow logically that a human and an LLM are similar just because both are capable of deleting prod on accident.

    • It cannot go to the washroom and cry while pooping. And that's just one of the things that any human can do and AI cannot. So no, it cannot do everything a human can do, the shared example being one of them.

      And that's why we don't have AI washrooms: because they are not alive, are not employees, and have no need to excrete.

Yep. I made a "Read only" mode in pi by taking away "write" and "edit" tools. Claude Code used bash to make edits anyway.

  • > Claude Code used bash to make edits anyway.

    If you had the former rule, why would you ever whitelist bash commands? That's full access to everything you can do.

    Same goes for `find`, `xargs`, `awk`, `sed`, `tar`, `rsync`, `git`, `vim` (and all text editors), `less` (any pager), `man`, `env`, `timeout`, `watch`, and so many more commands. If you whitelist things in the settings you should be much more specific about arguments to those commands.

    People really need to learn bash.
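
    For Claude Code specifically, permission rules can be scoped to individual commands rather than a blanket Bash grant; a sketch of a `.claude/settings.json` along those lines (check the current docs for exact rule syntax):

    ```json
    {
      "permissions": {
        "allow": ["Bash(git status)", "Bash(git diff:*)", "Bash(npm run lint)"],
        "deny": ["Bash(rm:*)", "Bash(curl:*)", "Bash(sed:*)"]
      }
    }
    ```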

> "NEVER FUCKING GUESS"

It's very hard to take this post seriously. I can't imagine what harness, if any, they attempted to place on the agent beyond some vibes. This is "move fast and absolutely destroy things" level thinking. That the poster asks for journalists to reach out makes it read like a "no news is bad news" publicity grab. Just gross.

The AI era is turning out to be the most disappointing era for software engineering.

  • This is going to be the most important job going forward: the guy in charge of making sure production secrets are out of CC's reach. (It's not safe for any dev to have them anywhere on their filesystem.)

  • I'd be interested to learn where those words exist in Cursor's context. My assumption was that it was part of the Cursor agent harness, but it's just as likely it was in the user instructions.

  • As soon as I read that line, I knew everything I needed to know about the author and his abilities.

  • > The AI era is turning about to be most disappointing era for software engineering.

    this has been obvious to me since like 2024, it truly is the worst, most uninspiring era of all time.

> The agent cannot learn from its mistakes. The agent will never produce any output which will help you invoke future agents more safely

That is not entirely true:

Given that more and more LLM providers are sneaking in "we'll train on your prompts now" opt-outs, you deleting your database (and the agent producing repentant output) can reduce the chance that it'll delete my database in the future.

  • Actually no, it will increase it. Because it’ll be trained with the deletion command as a valid output.

    • Exactly. It’s just giving the LLM a token pattern, and it’s designed to reproduce token patterns. That’s all it does. At some point generating a token pattern like that again is literally it’s job.

Looks like our SWE jobs are safe for now.

  • "The AI can't do your job, but an AI salesman can convince your boss to fire you and replace you with an AI that can't do your job." -- Cory Doctorow

Completely agree. This is a harness problem, not a model problem. The model is rarely the issue these days.

  • I don't know. To me, this is a human problem. Not only does the model have access to the production database, they keep the backups online on the same volume, and their offline backup is 3 months old. This is an accumulation of bad practices, all of them human design failures. Instead of sitting down and rethinking their entire backup strategy, they go public on Twitter and blame a probabilistic machine for doing what is within its parameters to do. I bet even that failure could have been avoided were more care given to what they do.

  • More so an environment problem. An agent doing staging or development tasks should never be able to get access to prod API credentials, period. Agents which do have access to prod should have their every interaction with the outside world audited by a human.
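
    A minimal sketch of that separation, with hypothetical paths and names: launch the agent with an environment built only from the staging secrets, so prod credentials are unreachable by construction rather than by instruction.

    ```python
    import os
    import subprocess

    def load_env(path: str) -> dict:
        """Parse KEY=VALUE lines from a secrets file (hypothetical format)."""
        env = {}
        with open(path) as f:
            for line in f:
                line = line.strip()
                if line and not line.startswith("#"):
                    key, _, value = line.partition("=")
                    env[key] = value
        return env

    # The agent inherits nothing from the developer's shell except PATH, so a
    # prod key sitting in the parent environment simply does not exist here.
    subprocess.run(
        ["agent", "run", "--task", "staging-maintenance"],
        env={"PATH": os.environ["PATH"], **load_env("/secrets/staging.env")},
    )
    ```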

  • No, this is a "being stupid enough to trust an LLM" problem. They are not trustworthy, and you must not ever let them take automated actions. Anyone who does that is irresponsible and will sooner or later learn the error of their ways, as this person did.

> If AI is physically capable of misbehaving, it might (*1)

This is why all the “AI Armageddon” talk seems so silly to me.

AI is only as destructive as the access you give it. Don’t give it access where it can harm and no harm will occur.

  • > Don’t give it access where it can harm and no harm will occur.

    If only the entire population would comply.

It's as if they internalized a post-mortem process that is designed to find root causes, but they use it to shift blame onto others, and they literally let the agent be a sandbag for their frustrations.

THAT SAID, it does help to let the agent explain it, so that the dev's perspective cannot be dismissed as AI skepticism.

> Lord, even calling it a "confession" is so cringe. The agent is not alive.

The AI companies are very invested in anthropomorphizing the agents. They named their company "Anthropic" ffs. I don't blame the writer for this, exactly.

  • You should, the writer is presumably a technical, rational person. They shouldn't believe in daemons and machine spirits

> Anyone who would follow a mistake like that up with demanding a confession out of the agent is not mature enough to be using these tools.

The proponents are screaming from the rooftops how AI is here and anyone less than the top-in-their-field is at risk. Given current capabilities, I will never raw-dog the stochastic parrot with live systems like this, but it is unfair to blame someone for being "too immature" to handle the tooling when the world is saying that you have to go all-in or be left behind.

There are just enough public success stories of people letting agents do everything that I am not surprised more and more people are getting caught up in the enthusiasm.

Meanwhile, I will continue plodding along with my slow meat brain, because I am not web-scale.

I agree with you completely up until this line:

> The agent cannot learn from its mistakes.

If feedback from this incident is in its context window, it is highly unlikely to make this same mistake again. Yes, this is only probabilistic, but so is a human learning from mistakes. The key difference is that for a human this is unlikely to be removed from their memory in a relevant situation, while for an agent it must be strategically put there.

  • > If feedback from this incident is in its context window, it is highly unlikely to make this same mistake again

    If this incident gets into its training data, then it's highly likely that it will repeat it, with the same confession, since this is a text predictor, not a thinker.

  • Or not, because telling the agent it is misbehaving may predispose it to more misbehavior, even though the whole point of telling it was to get it not to behave that way.

    I remember this being discussed when a similar issue went viral: someone building a product with Replit's AI had it delete his prod database.

  • > Yes this is only probabilistic, but so is a human learning from mistakes.

    Yet, since I'm also a human being and can work to understand the mistake myself, the probability that I can expect a correction of the behavior is much higher. I have found that it significantly helps if there's an actual reasonable paycheck on the line.

    As opposed to the language model, which demands that I drop more quarters into its slots and then hope for the best. An arcade model of work if there ever was one. Who wants that?

  • > If feedback from this incident is in its context window, it is highly unlikely to make this same mistake again.

    In my experience, this isn't true. At least with ChatGPT a version or so ago, I could make it trip up on custom word-play games, and when called out, it would acknowledge the failure, explain how it failed to follow the rules of the game, then proceed to make the same mistake a couple of sentences later.