
Comment by jeffbee

5 years ago

Google also had a runaway automation outage where a process went around the world "selling" all the frontend machines back to the global resource pool. Nobody was alerted until something like 95% of global frontends had disappeared.

This was an important lesson for SREs inside and outside Google because it shows the danger of the antipattern where command-line flags narrow the scope of an operation instead of expanding it. That is, if your command is supposed to be `drain -cell xx` to locally turn down a small resource pool, but `drain` without any arguments drains the whole universe, you have built a tool that is too dangerous to exist.
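A minimal sketch of the inverted, safer shape, assuming a hypothetical `drain` tool (flag names invented): scope is mandatory, omitting it is an error rather than "everything", and a global drain needs its own explicit acknowledgement flag.

```python
import argparse
import sys

def main() -> int:
    parser = argparse.ArgumentParser(
        description="Drain serving capacity (hypothetical sketch)."
    )
    # Scope must be stated explicitly; there is no implicit "everything".
    parser.add_argument("--cell", action="append", default=[],
                        help="Cell to drain; may be repeated.")
    parser.add_argument("--all-cells", action="store_true",
                        help="Drain every cell; also requires --acknowledge-global-impact.")
    parser.add_argument("--acknowledge-global-impact", action="store_true",
                        help="Separate, deliberate confirmation for global operations.")
    args = parser.parse_args()

    if not args.cell and not args.all_cells:
        parser.error("no scope given; pass --cell CELL (repeatable) or --all-cells")
    if args.all_cells and not args.acknowledge_global_impact:
        parser.error("--all-cells also requires --acknowledge-global-impact")

    targets = "ALL CELLS" if args.all_cells else ", ".join(args.cell)
    print(f"would drain: {targets}")  # a real tool would call the drain API here
    return 0

if __name__ == "__main__":
    sys.exit(main())
```

With this shape, a bare `drain` fails immediately instead of quietly selecting the universe.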

Agreed, but with an amendment:

If your tool is capable of draining the whole universe, period, it is too dangerous to exist.

That was one of the big takeaways: global config changes must happen slowly. (Whether we've fully internalized that lesson is a different matter.)
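As a rough illustration of what "slowly" can mean in practice (stage fractions, bake time, and function names are all invented), a staged rollout that pauses between waves and halts on any regression, so a bad change stops at the canary instead of reaching the whole fleet:

```python
import time

# Fraction of the fleet touched at each stage; only the last stage is global.
STAGES = [0.01, 0.05, 0.25, 1.00]
BAKE_SECONDS = 30 * 60  # dwell time between stages to let regressions surface

def apply_config(fraction: float, config: dict) -> None:
    """Placeholder: push `config` to `fraction` of the fleet."""
    print(f"applying config to {fraction:.0%} of the fleet")

def fleet_healthy() -> bool:
    """Placeholder: check monitoring for error-rate or latency regressions."""
    return True

def rollout(config: dict) -> bool:
    for fraction in STAGES:
        apply_config(fraction, config)
        time.sleep(BAKE_SECONDS)
        if not fleet_healthy():
            print(f"regression detected at {fraction:.0%}; halting rollout")
            return False  # a real tool would also trigger a rollback here
    return True
```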

  • As FB opines at the end, at some point, it's a trade-off between power (being able to access / do everything quickly) and safety (having speed bumps that slow larger operations down).

    The pure takeaway is probably that it's important to design systems where "large" operations are rarely required, and frequent ops actions are all "small."

    Because otherwise, you're asking for an impossible process (quick and protected).

  • SREs live in a dangerous world, unfortunately. It's entirely possible the "tool" in question is a shell script that gets fed a list of bad cells, but some bug causes it to get a list of all the cells instead.

    Some tools are well engineered and capable of the Sisyphean task of globally deploying updates; others are rapid prototypes that, sure, are too dangerous to exist. But the whole point of SREs being capable programmers is that the work has problems that are most efficiently solved with one-off code that just isn't (because it can't be) rigorously tested before being used. You can bet some of that was used in recovering from this incident. (I'm sure many eyes reviewed the code before it was run, but that only goes so far when you're trying to do something you never expected, like having to revive Facebook.)

    • The other problem is scale: the standard "save me" for tools like this is a `--doit` plus a `--no-really-i-mean-it`, with the default being a "this is what I would've done" mode. That falls apart the moment the list of actions is longer than the screen and a long list is exactly what you're expecting: how can you really tell the difference unless the console scrolls for a suspiciously long time?

      There are solutions to that (see the sketch after this list), but of course these sorts of tools all come into existence well before the system reaches a size where the way they work becomes dangerous.

  • If your tool is capable of draining the whole universe

    Why did I think of humans when I read this? :P
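Picking up the scale point from the thread above: one mitigation (sketched here with invented names and thresholds) is to have the dry run summarize per cell instead of enumerating every action, and to refuse an unexpectedly large target set outright unless a further override is given, so "the console scrolled for a long time" is never the only warning sign.

```python
from collections import Counter

# Invented safety threshold: touching more cells than this in one run is
# presumed to be a bug unless explicitly overridden.
MAX_CELLS_PER_RUN = 5

def plan_summary(actions: list[tuple[str, str]]) -> str:
    """Summarize (cell, task) pairs per cell instead of dumping every line."""
    per_cell = Counter(cell for cell, _task in actions)
    lines = [f"  {cell}: {count} tasks" for cell, count in sorted(per_cell.items())]
    return f"{len(actions)} actions across {len(per_cell)} cells\n" + "\n".join(lines)

def execute(actions, doit: bool = False, override_size_limit: bool = False) -> None:
    print(plan_summary(actions))
    cells = {cell for cell, _task in actions}
    if len(cells) > MAX_CELLS_PER_RUN and not override_size_limit:
        raise SystemExit(
            f"refusing to touch {len(cells)} cells (> {MAX_CELLS_PER_RUN}); "
            "rerun with override_size_limit=True if this is really intended"
        )
    if not doit:
        print("dry run only; pass doit=True to apply")
        return
    for cell, task in actions:
        ...  # a real tool would call the drain/restart API for each action here
```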

I feel like this explains so much about why the `gcloud` command works the way it does. It sometimes feels overly complicated for minor things, but given this logic, I get it.