Comment by HenryKissinger
5 years ago
> During one of these routine maintenance jobs, a command was issued with the intention to assess the availability of global backbone capacity, which unintentionally took down all the connections in our backbone network, effectively disconnecting Facebook data centers globally
Imagine being this person.
Tomorrow on /r/tifu.
I doubt Facebook engineers are free-typing commands on Bash, so it’s probably not an individual error. More likely to be a race condition or other edge case that wasn’t considered during a review. This might be a script that’s run 1000s of times before with no problems.
Back in Ye Old Dark Ages, I caused a BIG Google outage by running a routine maintenance script that had been run dozens if not hundreds of times before.
Turns out the underlying network software had a race condition that would ONLY be hit if the script ran at the exact same time as some automated monitoring tools polled the box.
At FAANG scale, "one in a million" happens a lot more often than you'd think.
> At FAANG scale, "one in a million" happens a lot more often than you'd think.
And it happens less than you think too, sometimes, which I think is closer to the original point.
of course if one person can knock down an entire global system through a trivial mistake the problem is obviously not the person to begin with, but the architecture of the system.
Not really, there’s essentially always a button that blows everything up. Catastrophic failures usually end up being a large set of safety systems malfunctioning which would otherwise prevent the issue when that button is pressed.
But yes, for these types of problem, the ultimate fault is never “that guy Larry is an idiot”, it takes a large team of cooperating mistakes.
Or the fact that there was a bug in the tool that should have prevented this.
> Our systems are designed to audit commands like these to prevent mistakes like this, but a bug in that audit tool didn’t properly stop the command.
So this is really two peoples fault. One for issuing the command the other for introducing the bug in the audit tool.
"It was the intern" is only cute once.
Hahahahahahahaha