Comment by __turbobrew__

9 days ago

If you are one of the big boys (FAANG and other large companies that run physical infra) you will have this problem as well. The infra systems run and replace themselves, so if something fundamental breaks you can end up in a circular dependency: for example, your deployment system requires DNS, your DNS servers are broken, and you cannot deploy a fix because the deploy service itself requires DNS.

From what I have seen, a lot of the time the playbooks to fix these issues are just manually rawdogging files with rsync. Ideally you deploy your infrastructure in cells, with rollouts proceeding cell by cell so you can catch issues sooner, and you implement failover to bootstrap broken cells (in my DNS example, clients could talk to DNS servers in the closest non-broken cell using BGP-based routing). This is hard to test, and some services are still global (that big Google outage a few months ago was due to the global auth service being down).
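
To make the cell idea concrete, here is a rough Python sketch, not anyone's real deploy system. `Cell`, `rollout`, `nearest_healthy`, and the `distance` field are all made-up names for illustration; in practice the "nearest healthy cell" choice would be made by BGP routing rather than application code, and the health check would be whatever probe you trust.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Cell:
    name: str
    distance: int          # hypothetical routing cost from the client to this cell
    healthy: bool = True
    version: str = "v1"


def rollout(
    cells: List[Cell],
    new_version: str,
    health_check: Callable[[Cell], bool],
) -> Tuple[List[str], Optional[Cell]]:
    """Deploy one cell at a time and stop at the first cell that fails its check.

    Returns the names of cells updated successfully and the cell that failed
    (or None if every cell passed).
    """
    updated: List[str] = []
    for cell in cells:
        cell.version = new_version
        cell.healthy = health_check(cell)
        if not cell.healthy:
            # Halt here: the remaining cells keep running the old version.
            return updated, cell
        updated.append(cell.name)
    return updated, None


def nearest_healthy(cells: List[Cell]) -> Optional[Cell]:
    """Pick the closest cell that is still healthy (stand-in for BGP failover)."""
    healthy = [c for c in cells if c.healthy]
    return min(healthy, key=lambda c: c.distance) if healthy else None


# Example: the new version breaks the first cell it lands on.
cells = [Cell("cell-a", 1), Cell("cell-b", 2), Cell("cell-c", 3)]
updated, failed = rollout(cells, "v2", lambda c: c.version != "v2")
fallback = nearest_healthy(cells)
```

The point of the sketch is that the bad release only ever reaches `cell-a`, and clients in `cell-a` still have `cell-b` to talk to while you fix it. It does nothing for the global-service case, which is exactly the part that stays hard.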