Comment by blowski

5 years ago

I doubt Facebook engineers are free-typing commands in Bash, so it’s probably not an individual error. It’s more likely a race condition or other edge case that wasn’t considered during a review. This might be a script that’s been run thousands of times before with no problems.

Back in Ye Old Dark Ages, I caused a BIG Google outage by running a routine maintenance script that had been run dozens if not hundreds of times before.

Turns out the underlying network software had a race condition that would ONLY be hit if the script ran at the exact same time as some automated monitoring tools polled the box.

At FAANG scale, "one in a million" happens a lot more often than you'd think.
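To put rough numbers on that, here's a quick back-of-the-envelope sketch. The volumes are made up for illustration (I'm not quoting anyone's real traffic), and it assumes each operation fails independently with the same probability, which real systems don't guarantee.

```python
import math


def expected_hits(events: int, p: float) -> float:
    """Expected number of times a rare event fires across `events` independent trials."""
    return events * p


def prob_at_least_one(events: int, p: float) -> float:
    """Probability the rare event fires at least once: 1 - (1 - p)^events.

    Uses log1p/expm1 so tiny probabilities don't get lost to rounding.
    """
    return -math.expm1(events * math.log1p(-p))


if __name__ == "__main__":
    one_in_a_million = 1e-6
    # Hypothetical daily operation counts, chosen only to show the trend.
    for daily_ops in (1_000, 1_000_000, 1_000_000_000):
        hits = expected_hits(daily_ops, one_in_a_million)
        chance = prob_at_least_one(daily_ops, one_in_a_million)
        print(f"{daily_ops:>13,} ops/day: ~{hits:g} expected hits, "
              f"P(at least one) = {chance:.3f}")
```

With a million operations a day you'd expect about one hit per day on average, and at a billion it's roughly a thousand. The same maintenance script run a few hundred times total would almost never trip it, which is why it looks safe right up until it isn't.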

  • > At FAANG scale, "one in a million" happens a lot more often than you'd think.

    And it happens less than you think too, sometimes, which I think is closer to the original point.