Comment by vitus

5 years ago

Agreed, but with an amendment:

If your tool is capable of draining the whole universe, period, it is too dangerous to exist.

That was one of the big takeaways: global config changes must happen slowly. (Whether we've fully internalized that lesson is a different matter.)

As FB opines at the end, at some point, it's a trade-off between power (being able to access / do everything quickly) and safety (having speed bumps that slow larger operations down).

The pure takeaway is probably that it's important to design systems where "large" operations are rarely required, and frequent ops actions are all "small."

Because otherwise, you're asking for an impossible process (quick and protected).

SREs live in a dangerous world, unfortunately. It's entirely possible the "tool" in question is a shell script that gets fed a list of bad cells, but some bug causes it to get a list of all the cells instead.

Some tools are well engineered, capable of the Sisyphean task of globally deploying updates. Others are rapid prototypes that, sure, are too dangerous to exist. But the whole point of SREs being capable programmers is that the work has problems that are most efficiently solved with one-off code that just isn't (because it can't be) rigorously tested before being used. You can bet there was some of that used in recovering from this incident. (I'm sure there were many eyes reviewing the code before it was run, but that only goes so far when you're trying to do something that you never expected, like having to revive Facebook.)

  • The other problem is scale: the standard "save me" for tools like this is a --doit and --no-really-i-mean-it and defaulting to a "this is what I would've done" mode. That falls apart the moment the list of actions is longer than the screen, but you're expecting that: after all, how can you really tell the difference unless the console scrolls for a really long time?

    There are solutions to that, but of course these sorts of tools all come into existence well before the system reaches a size where how they work becomes dangerous.
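One possible shape for such a solution, as a rough sketch rather than anything a real tool is known to do: summarize the plan as a count and a fraction of the fleet instead of a scrolling list, and refuse outright when the fraction crosses a cap, no matter which flags were passed. Every name here (`plan_drain`, `run`, `MAX_FRACTION`, the 10% limit, the `drain` callback) is hypothetical and chosen only for illustration.

```python
# Hypothetical cap: refuse to touch more than 10% of all cells in one run.
MAX_FRACTION = 0.10


def plan_drain(targets, all_cells, max_fraction=MAX_FRACTION):
    """Return (ok, summary) for a proposed drain without performing it."""
    known = set(all_cells)
    unique = sorted(set(targets))
    unknown = [c for c in unique if c not in known]
    # An empty inventory is treated as "everything", so the plan is refused.
    fraction = len(unique) / len(known) if known else 1.0
    ok = not unknown and fraction <= max_fraction
    summary = (
        f"would drain {len(unique)} of {len(known)} cells ({fraction:.0%}); "
        f"limit is {max_fraction:.0%}"
    )
    if unknown:
        summary += f"; {len(unknown)} unknown cell(s)"
    return ok, summary


def run(targets, all_cells, doit=False, drain=print):
    """Dry run by default; the cap applies even when doit=True."""
    ok, summary = plan_drain(targets, all_cells)
    print(summary)  # one line, readable even when the target list is huge
    if not ok:
        print("refusing: plan exceeds the limit or names unknown cells")
        return False
    if not doit:
        print("dry run; pass doit=True to execute")
        return False
    for cell in sorted(set(targets)):
        drain(cell)  # stand-in for whatever actually drains a cell
    return True
```

With this shape, the "bug hands the script every cell" case shows up as "would drain 20 of 20 cells (100%)" and a refusal, rather than as a list that scrolls a bit longer than usual. It does nothing for the genuinely large operation that has to happen, which is the power versus safety trade-off again.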

If your tool is capable of draining the whole universe

Why did I think of humans, when I read this. :P