Comment by jacquesm

9 days ago

Ah ok, clear. Thank you for the clarification. Some more interesting details: the initial fault was triggered by a test of a fire suppression system, that would have been recoverable. But someone thought they were exceedingly clever and they were going to fix this without any downtime and that's when a small problem became a much larger one, more so when they found out that their backups were incomplete. I still wonder if they ever did RCA/PM on this and what their lessons learned were. It should be a book sized document given how much went wrong. I got the call after their own efforts failed by one of their customers and after hearing them out I figured this is not worth my time because it just isn't going to work.

Thanks in turn for the details, always fascinating (and useful for lessons... even if not always for the party in question dohoho) to hear a touch of inside baseball on that kind of incident.

>But someone thought they were exceedingly clever and they were going to fix this without any downtime and that's when a small problem became a much larger one

The sentence "and that's when a small problem became a big problem" comes up depressingly frequently in these sorts of post mortems :(. Sometimes sort of feels like, along all the checklists and training and practice and so on, there should also simply be the old Hitchhiker's Guide "Don't Panic!" sprinkled liberally around along with a dabbing of red/orange "...and Don't Be Clever" right after it. We're operating in alternate/direct law here folks, regular assumptions may not hold. Hit the emergency stop button and take a breath.

But of course management and incentive structures play a role in that too.