Comment by zaphar

4 months ago

Ironically, many of those procedure documents probably lived on that drive...


zaphar

Here's a 2024 incident:

> "The outage also hit servers that host procedures meant to overcome such an outage... Company officials had no paper copies of backup procedures, one of the people added, leaving them unable to respond until power was restored."

https://www.reuters.com/technology/space/power-failed-spacex...

ksec 4 months ago

I don't know why, but I can't stop laughing. And the great thing is that they will get paid again to write the same thing.

comprev 4 months ago

You jest, but I once had a client whose IaC provisioning code was - you guessed it - stored on the very infrastructure that got destroyed.

__turbobrew__ 4 months ago

If you are one of the big players (FAANG and other large companies that run physical infra), you will have this problem as well. The infra systems run and replace themselves, so if something fundamental breaks you can end up stuck in a circular dependency. For example: your deployment system requires DNS, your DNS servers are broken, and you cannot deploy a fix for them because the deploy service requires DNS.
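To make the DNS example concrete, here is a minimal sketch of how you might catch that kind of loop ahead of time. It assumes you can write down which service needs which other service in order to be deployed or repaired; the `deps` map and the service names are made up for illustration and not taken from any real system.

```python
from typing import Dict, List, Optional


def find_cycle(deps: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one dependency cycle as a list of service names, or None.

    `deps` is a hypothetical map: service -> services it needs in order
    to be deployed or repaired.
    """
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on current path, finished
    state = {name: WHITE for name in deps}
    path: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        state[node] = GREY
        path.append(node)
        for dep in deps.get(node, []):
            if state.get(dep, WHITE) == GREY:
                # dep is already on the current path, so we found a loop
                return path[path.index(dep):] + [dep]
            if state.get(dep, WHITE) == WHITE:
                found = visit(dep)
                if found:
                    return found
        path.pop()
        state[node] = BLACK
        return None

    for name in deps:
        if state[name] == WHITE:
            cycle = visit(name)
            if cycle:
                return cycle
    return None


# Hypothetical graph: deploy needs DNS, and fixing DNS needs deploy.
example = {"deploy": ["dns"], "dns": ["deploy"], "auth": ["dns"]}
cycle = find_cycle(example)  # ["deploy", "dns", "deploy"]
```

This only finds loops you have declared. The nasty ones in practice tend to be implicit dependencies nobody wrote down.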
From what I have seen, a lot of the time the playbooks to fix these issues are just pushing files around by hand with rsync. Ideally you deploy your infrastructure in cells, where rollouts proceed cell by cell so you can catch issues sooner, and you also implement failover to bootstrap broken cells (in my DNS example, clients could talk to DNS servers in the closest non-broken cell using BGP-based routing). It is hard to test, and some services are still global (that big Google outage a few months ago was due to the global auth service being down).
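A rough sketch of the cell idea, under some loud assumptions: `deploy` and `healthy` are hypothetical callbacks you would supply, the cell names are invented, and `closest_healthy_cell` is an application-level stand-in for what BGP-based routing would do at the network layer. It is not how any particular company does it.

```python
from typing import Callable, List, Optional


def rollout(
    cells: List[str],
    deploy: Callable[[str], None],
    healthy: Callable[[str], bool],
) -> List[str]:
    """Deploy to cells one at a time and stop at the first unhealthy cell.

    Returns the cells that were updated and passed the health check, so a
    bad change is contained to a single cell instead of going global.
    """
    verified: List[str] = []
    for cell in cells:
        deploy(cell)
        if not healthy(cell):
            break  # halt the rollout; later cells keep the old version
        verified.append(cell)
    return verified


def closest_healthy_cell(
    cells_by_distance: List[str],
    healthy: Callable[[str], bool],
) -> Optional[str]:
    """Pick the nearest healthy cell (list is assumed sorted nearest first)."""
    for cell in cells_by_distance:
        if healthy(cell):
            return cell
    return None


# Hypothetical run: the change breaks cell-b, so cell-c is never touched.
broken = {"cell-b"}
touched: List[str] = []
ok = rollout(["cell-a", "cell-b", "cell-c"], touched.append,
             lambda c: c not in broken)
# ok == ["cell-a"], touched == ["cell-a", "cell-b"]
fallback = closest_healthy_cell(["cell-b", "cell-a", "cell-c"],
                                lambda c: c not in broken)
# fallback == "cell-a"
```

The global services are exactly the ones this pattern does not cover, since there is no second cell to fall back to.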