Comment by harshreality
5 years ago
> To ensure reliable operation, our DNS servers disable those BGP advertisements if they themselves can not speak to our data centers, since this is an indication of an unhealthy network connection.
No, it's (clearly) not a guaranteed indication of that. Logic fail. Infrastructure tools at that scale need to handle all possible causes of test failures. "Is the internet down or only the few sites I'm testing?" is a classic network monitoring script issue.
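To make the monitoring point concrete, here's a rough sketch of the distinction I mean. Everything in it is hypothetical (the `classify` function, the state names, the idea of "reference" probes that don't share a path with the targets); it is not how Facebook's health checks work, just the general shape of a check that doesn't treat "I can't reach my targets" as proof that my own connection is bad.

```python
from typing import Mapping


def classify(targets: Mapping[str, bool], references: Mapping[str, bool]) -> str:
    """Classify probe results (name -> reachable?).

    targets:    the things we actually care about (e.g. backend datacenters).
    references: independent endpoints that do NOT share a path with the
                targets; assumed to exist purely for this illustration.
    """
    if all(targets.values()):
        return "healthy"
    if any(targets.values()):
        return "partial"
    # No target is reachable. Decide whether that is about them or about us.
    if any(references.values()):
        return "targets_down"        # our own connectivity still works
    return "local_network_down"      # nothing is reachable from here
```

The point is only that "targets_down" and "local_network_down" are different states and probably deserve different automatic responses.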
I think you're misunderstanding. The DNS servers (at Facebook peering points) had zero access to Facebook datacenters because the backbone was down. That is as unhealthy as the network connection can get, so they (correctly) stopped advertising the routes to the outside world.
By that point, the Facebook backbone was already gone. The DNS servers stopping BGP advertisements to the outside world did not cause that.
You're talking about backend network connections to Facebook's datacenters as if they're the only thing that matters. I'm talking about the overall network connection, including the internet-facing part.
Facebook's infrastructure at their peering points loses all contact with their respective Facebook datacenter(s).
Their response is to automatically withdraw routes to themselves. I suppose they assumed that all datacenters would never go down at the same time, so that client DNS redundancy would lead clients to other DNS servers that could still contact Facebook datacenters. It's unclear how those routes could be restored without on-site intervention. Even if the servers automatically detect when the datacenters are reachable again, getting to that point also requires on-site intervention, since after the routes are withdrawn FB's ops tools can't do anything to the relevant peering points or datacenters.
But even without the catastrophic case of all datacenter connections going down, you don't need to be a Facebook ops engineer to realize that there are problems that need to be carefully thought through when ops tools depend on the same (public) network routes and DNS entries that the DNS servers are capable of autonomously withdrawing.
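As a toy illustration of that kind of dependency audit (the graph, the component names, and both functions are made up for this sketch and say nothing about Facebook's real tooling): walk everything a recovery tool transitively depends on and flag anything that automation is allowed to withdraw on its own.

```python
from typing import Dict, Iterable, List, Set


def transitive_deps(graph: Dict[str, List[str]], start: str) -> Set[str]:
    """Return everything `start` depends on, directly or indirectly."""
    seen: Set[str] = set()
    stack = [start]
    while stack:
        node = stack.pop()
        for dep in graph.get(node, ()):
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return seen


def recovery_risks(graph: Dict[str, List[str]], tool: str,
                   withdrawable: Iterable[str]) -> Set[str]:
    """Dependencies of `tool` that automation can take away by itself."""
    return transitive_deps(graph, tool) & set(withdrawable)


# Hypothetical dependency graph: component -> things it needs in order to work.
EXAMPLE_GRAPH = {
    "ops_tool": ["internal_dns", "vpn"],
    "vpn": ["public_dns"],
    "internal_dns": ["backbone"],
    "public_dns": ["edge_bgp_routes"],
}
```

If `recovery_risks(EXAMPLE_GRAPH, "ops_tool", {"edge_bgp_routes"})` comes back non-empty, the tool you'd use to fix an outage can be cut off by the same automation that reacts to the outage.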