Comment by harshreality

5 years ago

You're talking about backend network connections to Facebook's datacenters as if that's the only thing that matters. I'm talking about the overall network connection, including the internet-facing part.

Facebook's infrastructure at their peering points loses all contact with its respective Facebook datacenter(s).

Their response is to automatically withdraw routes to themselves. I suppose they assumed that the datacenters would never all go down at the same time, so that client DNS redundancy would lead clients to use other DNS servers that could still contact Facebook datacenters. It's unclear how those routes could be restored without on-site intervention. Even if they automatically detect when the datacenters are reachable again, getting to that point still requires on-site intervention, since after the routes are withdrawn FB's ops tools can't do anything to the relevant peering points or datacenters.

But even without the catastrophic case of all datacenter connections going down, you don't need to be a Facebook ops engineer to realize that there are problems that need to be carefully thought through when ops tools depend on the same (public) network routes and DNS entries that the DNS servers are capable of autonomously withdrawing.
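
To make the failure mode concrete, here's a toy sketch in Python. It is not Facebook's actual implementation. `EdgeSite`, `run_health_check`, and `ops_tool_can_connect` are made-up names, and the policy "announce only while a datacenter is reachable" is my assumption about how the withdrawal works.

```python
from dataclasses import dataclass


@dataclass
class EdgeSite:
    """Hypothetical DNS site at a peering point; announces a route while healthy."""
    name: str
    can_reach_datacenter: bool = True
    announcing: bool = True

    def run_health_check(self) -> None:
        # Assumed policy: keep announcing only while a datacenter is reachable.
        self.announcing = self.can_reach_datacenter


def reachable_sites(sites):
    """Names of the sites that still have a route announced to the public internet."""
    return [s.name for s in sites if s.announcing]


def ops_tool_can_connect(sites):
    """Assumption: ops tools use the same public routes and DNS as clients do."""
    return len(reachable_sites(sites)) > 0


sites = [EdgeSite("edge-a"), EdgeSite("edge-b"), EdgeSite("edge-c")]

# Partial failure: one site loses its datacenter link and withdraws itself.
sites[0].can_reach_datacenter = False
for s in sites:
    s.run_health_check()
partial = reachable_sites(sites)  # the other sites keep serving clients

# Total failure: every site loses its datacenter link at the same time.
for s in sites:
    s.can_reach_datacenter = False
    s.run_health_check()
total = reachable_sites(sites)  # nothing is announced any more

locked_out = not ops_tool_can_connect(sites)
```

In the partial case the redundancy works as designed. In the total case every site withdraws itself, and a tool that depends on those same routes has no path left to fix anything.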
