← Back to context

Comment by fragmede

5 years ago

SREs live in a dangerous world, unfortunately. It's entirely possible the "tool" in question is a shell script that gets fed a list of bad cells but some bug causes it to get a list of all the cells instead.

Some tools are well engineered, capable of the Sisyphean task of globally deploying updates but others are rapid prototypes that, sure, are too dangerous to exist, but the whole point of SREs being capable programmers is that the work has problems that are most efficiently solved with one-off code that just isn't (because it can't be) rigorously tested before being used. You can bet there was some of that used in recovering from this incident. (I'm sure there were many eyes reviewing the code before being run, but that only goes so far when you're trying to do something that you never expected, like having to revive Facebook.)

The other problem is scale: the standard "save me" for tools like this is a --doit and --no-really-i-mean-it and defaulting to a "this is what I would've done" mode. That falls apart the moment the list of actions is longer then the screen but you're expecting that: after all how can you really tell the difference unless the console scrolls for a really long time?

There's solutions to that, but of course these sorts of tools all come into existence well before the system reaches a size where how they work becomes dangerous.