Comment by blowski
5 years ago
I doubt Facebook engineers are free-typing commands in Bash, so it’s probably not an individual error. It’s more likely a race condition or some other edge case that wasn’t considered during review. This might be a script that has run thousands of times before with no problems.
Back in Ye Old Dark Ages, I caused a BIG Google outage by running a routine maintenance script that had been run dozens if not hundreds of times before.
Turns out the underlying network software had a race condition that would ONLY be hit if the script ran at the exact same time as some automated monitoring tools polled the box.
At FAANG scale, "one in a million" happens a lot more often than you'd think.
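To put rough numbers on that, here is a back-of-the-envelope sketch. The trial counts are made-up illustrations, not real traffic figures from any company, and it assumes each run is an independent trial with the same one-in-a-million chance of hitting the bad timing:

```python
import math


def prob_at_least_once(p: float, trials: int) -> float:
    """Chance that an event with per-trial probability p happens at least
    once in `trials` independent trials, i.e. 1 - (1 - p) ** trials."""
    # log1p/expm1 keep the result accurate when p is tiny and trials is huge.
    return -math.expm1(trials * math.log1p(-p))


p = 1e-6  # "one in a million"

# Hypothetical volumes, chosen only to show how the odds change with scale.
for trials in (1_000, 1_000_000, 1_000_000_000):
    expected_hits = p * trials
    chance = prob_at_least_once(p, trials)
    print(f"{trials:>13,} trials: expected hits = {expected_hits:g}, "
          f"P(at least one) = {chance:.4f}")
```

At a thousand runs the bug almost never shows up, which is why a script can look safe for years. At a million runs it is more likely than not to have fired at least once, and at a billion it is effectively guaranteed.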
> At FAANG scale, "one in a million" happens a lot more often than you'd think.
Sometimes it also happens less often than you'd think, which I think is closer to the original point.