Comment by zaphar

9 days ago

Ironically many of those documents for procedures probably lived on that drive...

I don't know why, but I can't stop laughing. And the great thing is that they will get paid again to write the same thing.

You jest, but I once had a client whose IaC provisioning code was, you guessed it, stored on the very infrastructure that got destroyed.

  • If you are one of the big boys (FAANG and other large companies that run physical infra), you will have this problem as well. The infra systems run and replace themselves, so if something fundamental breaks you can end up in a circular dependency: for example, your deployment system requires DNS, your DNS servers are broken, and you cannot deploy a fix because the deploy service itself requires DNS.

    From what I have seen, a lot of the time the playbooks to fix these issues are just rawdogging files using rsync manually. Ideally you deploy your infrastructure in cells, with rollouts proceeding cell by cell so you can catch issues sooner, and you implement failover to bootstrap broken cells (in my DNS example, clients could talk to DNS servers in the closest non-broken cell using BGP-based routing). It is hard to test, though, and some services are still global (that big Google outage a few months ago was due to the global auth service being down).
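
To make the cell-by-cell idea concrete, here is a minimal sketch of a rollout loop that stops at the first unhealthy cell, so a bad change never reaches the remaining cells. The names `rollout_by_cell`, `deploy`, and `is_healthy` are hypothetical placeholders for illustration, not any real deployment tool's API.

```python
from typing import Callable, Iterable, List, Optional, Tuple


def rollout_by_cell(
    cells: Iterable[str],
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> Tuple[List[str], Optional[str]]:
    """Deploy to one cell at a time and halt on the first failed health check.

    Returns (cells that passed, the cell that failed or None).
    `deploy` and `is_healthy` are caller-supplied, hypothetical hooks.
    """
    passed: List[str] = []
    for cell in cells:
        deploy(cell)
        if not is_healthy(cell):
            # Stop here: later cells keep the old, known-good version and
            # can serve as failover targets while this cell is repaired.
            return passed, cell
        passed.append(cell)
    return passed, None


if __name__ == "__main__":
    touched: List[str] = []
    passed, failed = rollout_by_cell(
        ["cell-a", "cell-b", "cell-c"],
        deploy=touched.append,                    # stand-in for a real deploy step
        is_healthy=lambda cell: cell != "cell-b",  # pretend cell-b breaks
    )
    print("passed:", passed, "failed:", failed, "touched:", touched)
```

In this sketch the untouched cells are what make the bootstrap story possible: the broken cell can lean on a healthy neighbor for something like DNS while it is being fixed. It does nothing for truly global services, which is the hard part mentioned above.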