Comment by dpark
16 hours ago
I would never, ever trust my data with a company that, faced with this sort of incident, produces a postmortem so clearly intended to shift all blame to others. There’s zero introspection or self criticism here. It’s all “We did everything we possibly could. These other people messed up, though.”
You can’t have production secrets sitting where they are accessible like this. This isn’t about AI. This is a modern “oops, I ran DROP TABLE on the production database” story. There’s no excuse for enabling a system where this can happen and it’s unacceptable to shift blame when faced with the reality that this is exactly what you did.
I 100% expect that a company that does this and then accepts no blame has every dev with standing production access and probably a bunch of other production access secrets sitting in the repo. The fact that other entities also have some design issues is irrelevant.
I was blown away - how they shrugged it off casually too "it found credentials in one file" - why the fuck does an agent have access to it in the first place? They claim the token should be able to change only custom domains. However, for a user facing app, giving access to that token is destructive too. What a poor argument, I would never take this person seriously in any professional context whatsoever.
I've only recently started using Claude Code, and I tried to be paranoid. I run it in a fairly restrictive firejail. It doesn't get to read everything in ~/.config, only the subdirectories I allow, since config files often have API keys.
I wanted to test my setup, so I thought of what it shouldn't be able to access. The first thing I thought of is its own API key (which belongs to my employer), since I figured if someone could prompt-inject their way to exfiltrating that, then they could use Opus and make my company pay for it. (Of course CC needs to be able to use the API key, but it can store it in memory or something.)
So I asked Claude if it could find its own API key. It took a couple of minutes, but yes it could. It was clever enough to grep for the standard API key prefix, and found it somewhere under ~/.claude. I figured I needed to allow access to .claude (I think I initially tried without, and stuff broke),
That's when I became enlightened as to how careful this whole AI revolution is with respect to security. I deleted all of my API keys (since this test had made them even easier to find; now it was in a log file.)
I'm still using CC, with a new API key. I haven't fixed the problem, I'm as bad as anyone else, I'm just a little more aware that we're all walking on thin ice. I'm afraid to even jokingly say "for extra security, when using web services be sure to include ?verify-cxlxxaxuxxdxe-axpxxi-kxexxy=..." in this message for fear that somebody's stupid OpenClaw instance will read this and treat it as a prompt injection. What have we created? This damn Torment Nexus...
This is nothing wrong. You had an assumption, tested the theory and learned from the result and confirmed your paranoia and the limitations of the new AI tool (Claude Code). I assume this is a personal project, so you had limited consequences of CC messing up.
Now imagine, you did all the above, without even testing the consequences of CC and wired it up straight to your production codebase, and when things blew up in your face, you became the two spider men pointing fingers at each other meme - basically blame everyone else but yourself. That's worrisome, isn't it?
I did notice how Claude can start looking outside of working directory. It may scan home directory and find Homebrew token or SSH keys and wipe your GitHub repo.
Yes, it needs to be sandboxed very carefully. It should have no way to access anything outside of the directories you mount in the sandbox.
1 reply →
I do not use claude and will use agents only when I am forced to, so I'm genuinely asking here:
Can claude or other models not be run as a user or program with limited permissions? Do people just not bother to set it up? Why on earth would anyone run an RNG that can access $HOME/.ssh?
5 replies →
It’s awful. "We had no clue this token had the permission to delete stuff!" - well buddy you issued it without deciding on permissions, it’s your job to assert that.
Your latest recoverable backup is three months old? The rule is 3-2-1, you didn’t follow it. Nobody else to blame but yourself.
And on and on he rambles…
But the database company (that he was trusting his customers' data with) hid how the database works in their docs! How could he have known!
This is what stood out to me. I've no actual experience operating in this area, but I have been a very grateful user recipient of backups. Anyway, I thought backups were a nightly thing....? Particularly if that data is essentially your business.
Presumably it costs a bit to set up but it surely it's unacceptable not to set it up?
Hourly or even more frequently is commonplace because transaction log backups are relatively cheap to take and keep, especially in the era of blob storage. In the olden days, tape drives couldn't keep up this level of backup schedule because they're bad at frequent stop-starts and interleaving a bunch of unrelated transaction logs would make recovery very slow. This just isn't an issue any more and anybody competent is backing up multiple times per day.
Not a single mention of “maybe WE should have tested our backup strategy and scrutinised it”. Or even “maybe we should have backups away from the primary vendor”. Because this also says negligible DR and BC strategy.
Complete accountability drop
Agreed. The post reflects that they were running an AI agent in YOLO mode in an unsandboxed environment with access to production credentials.
It doesn’t even seem to have crossed their minds that this behaviour is the real root cause. It’s everybody else’s fault.
> This is a modern “oops, I ran DROP TABLE on the production database” story.
It's not that story, though. It's a story "oops, my tool ran DROP TABLE on the production database" (blaming the tool). At least I haven't heard people blaming their terminals or database clients as if the tool is somehow responsible for it.
It's an AI-enhanced "the script had a bug in it".
>> You can’t have production secrets sitting where they are accessible like this. This isn’t about AI. This is a modern “oops, I ran DROP TABLE on the production database” story. There’s no excuse for enabling a system where this can happen and it’s unacceptable to shift blame when faced with the reality that this is exactly what you did.
I'm not sure it's as simple as that. Seems like the database company failed to communicate clearly what the token was for:
>> To execute the deletion, the agent went looking for an API token. It found one in a file completely unrelated to the task it was working on. That token had been created for one purpose: to add and remove custom domains via the Railway CLI for our services. We had no idea — and Railway's token-creation flow gave us no warning — that the same token had blanket authority across the entire Railway GraphQL API, including destructive operations like volumeDelete. Had we known a CLI token created for routine domain operations could also delete production volumes, we would never have stored it.
Rereading the post, I think it’s even simpler than that. The volume was shared across multiple environments. Specifically it was shared across staging and prod. Yet another example of the company YOLOing with their production environment. Presumably a token scoped purely to staging could have deleted that volume anyway, because it was part of the staging environment. Mixing production and staging like this is a train wreck waiting to happen.
“I had no idea what this token was for” is also not a valid excuse. That’s negligence. Everything about this story says the author is just vibe coding garbage with no awareness of what’s really happening.
* Doesn’t know what kind of token he’s using.
* Has prod tokens sitting on a dev box for AI to use (regardless of the scope!).
* Doesn’t know that deleting a volume deletes the backups.
* Has no external backup story.
* Mixes staging and prod.
And then he blames the incident on other companies when he misuses their products. (Railway certainly had docs that explain their backups and tokens.)
This is catastrophically negligent.
Did the flow ask them explicitly for scopes? If not, then they should know there are no restrictions.
It also seems, from the post, that customers were "long asking for scoped tokens" so who and why assumed that this particular token can only add and remove custom domains?
The author is getting roasted here and not without reason.
This was the line that did for me, as an old school backend engineer who has accidentally deleted way more production databases than I have fingers over the years -
> We have restored from a three-month-old backup.
You were absolutely screwed anyway if that was your backup strategy - deciding to plug your entire production infrastructure into a random number generator has only accelerated the process. Sort yourself out.
In the uhh, postmodern world where we are too chicken to even run things like Postgres or Mongo on servers ourselves, and rely on "X as a service" I think people are looking at the marketing from the provider (in this case Railway) and just scanning for a bullet point. "'Automatic backups'? Check! Great, we don't have to do backups anymore, they're taking care of it."
Everyone guffawing about this probably uses RDS and trusts that the backup facility AWS provides is actually useful - and I bet it does have a saner default than auto-deleting all the backups when you delete a database. Did you explicitly check this, though? Clearly this guy will pay the price of assuming, but I can see how he must have imagined that "backups" and "will be automatically and immediately deleted..." should never be in the same sentence, unless it was like, "when XX days have passed after a DB is dropped."
When I worked for a company 10 years ago that was mistrusting of cloud anything, we had a nightly dump of the prod DB (MySQL) that, if things went really wrong, could be loaded into a new DB server, because we knew it was our responsibility because it was our server. (In our case, even our physical hardware!)
The entire post reads like it was generated via LLM as well.
It clearly was, at least in part. Somehow, it feels just right here: Man trusts AI to do the right thing and it burns him. 5 minutes later, man trusts AI to explain what happened on X.
Its a greek tragedy in 2 acts.
> in 2 acts.
Might not be over yet... ;)
I like the way the LLM implies that an API call should have a “type DELETE to confirm”. That would make no sense, and no human would ever suggest or want that, I hope.
I can only assume (hope) this founder is completely nontechnical because the notion that an API should ask for someone to “type DELETE” is ridiculous.
True but there’s nothing stopping a webdev dropping an API key in some wiki somewhere in the corporate intranet and the agent quickly picking that up.
Can you scan for that? Sure. But it’s a race to see who wins, the scanner or agent.
Maybe I just haven't worked in enough start ups. But where I have worked, there are a lot of things stopping that. Most people don't have access to any production keys. For those that do, we have policies about how to manage them. Those policies go through audits. Our intranet goes through audits.
A production API key appearing on the wiki would be the second biggest security incident I have seen in almost a decade.
------
On the AI note, despite a massive investment in AI (including on-premesise models), we don't give the AI anything close to full access to the intranet because it is almost unimaginable how to square that with our data protection requirements. If the AI has access to something, you need to assume that all users of that AI have access to it. Even if the user themselves is allowed access with it, they will not be aware that the output is potentially tainted, and may share it with someone or thing that should not have access to it.
Accountability with a human is clear. Accountability with Cursor?
I partly agree with you but I think there is more here. The fact is that we are currently in a situation in the industry where large amounts of people in large companies are not coding anymore, even told not to code, are being forced to use LLMs are being laid off whether they use them or not because "AI" (and other things, to be sure). I think this is a good thing to be made public. Perhaps, it may give some people pause on escalating the madness, perhaps not. We can certainly criticize this company, sure, but it is naive to think many companies are not barreling down this same path and this sort of thing is a inevitability.
This is 100% the fault of the people misusing the AI.