Comment by nonbirithm
5 years ago
DNS seems to be a massive point of failure everywhere, even taking out the tools to help deal with outages themselves. The same thing happened to Azure multiple times in the past, causing complete service outages. Surely there must be some way to better mitigate DNS misconfiguration by now, given the exceptional importance of DNS?
> DNS seems to be a massive point of failure everywhere
Emphasis on the "seems". DNS gets blamed a lot because it's the very first step in the process of connecting. When everything is down, you will see DNS errors.
And since you can't get past the DNS step, you never see the other errors that you would get if you could try later steps. If you knew the web server's IP address to try to make a TCP connection to it, you'd get connection timed out errors. But you don't see those errors because you didn't get to the point where you got an IP address to connect to.
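To make the ordering concrete, here is a minimal sketch of a check that reports only the first step that fails. The `diagnose` helper and its `resolve`/`connect` parameters are my own names for illustration, not anything from the outage report; they default to the standard library's `socket.getaddrinfo` and `socket.create_connection`.

```python
import socket

def diagnose(host, port, resolve=socket.getaddrinfo,
             connect=socket.create_connection, timeout=3.0):
    """Return the first step that fails: 'dns', 'tcp', or 'ok'.

    `resolve` and `connect` are injectable (hypothetical parameters)
    so the steps can be exercised without touching the network.
    """
    try:
        infos = resolve(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        # Name resolution failed, so the TCP step is never attempted
        # and any TCP-level problem stays invisible.
        return "dns"

    # The socket address is the last element of each getaddrinfo tuple.
    address = infos[0][4][:2]
    try:
        conn = connect(address, timeout=timeout)
    except OSError:
        # Covers timeouts, refused connections, and unreachable networks.
        return "tcp"
    conn.close()
    return "ok"
```

If the resolver step raises, the function returns before it ever tries to connect, which is why a client in that situation only ever reports a DNS error.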
It's like if you go to a friend's house but their electricity is out. You ring the doorbell and nothing happens. Your first thought is that the doorbell is messed up. And you're not wrong: it is, but so is everything else. If you could ring it and get their attention to let you inside their house, you'd see that their lights don't turn on, their TV doesn't turn on, their refrigerator isn't running, etc. But those things are hidden from you because you're stuck on the front porch.
But DNS didn't actually fail. Their design says DNS must go offline if the rest of the network is offline. That's exactly what DNS did.
Sounds like their design was wrong, but you can't just blame DNS. DNS worked 100% here as per the task that it was given.
> To ensure reliable operation, our DNS servers disable those BGP advertisements if they themselves can not speak to our data centers, since this is an indication of an unhealthy network connection.
I'm not sure the design was even wrong, since the DNS servers being down didn't meaningfully contribute to the outage. The entire Facebook backbone was gone, so even if the DNS servers continued giving out cached responses clients wouldn't be able to connect anyway.
DNS being down instead of returning an unreachable destination did increase load for other DNS resolvers, though, since empty results cannot be cached and clients continued to retry. This made the outage affect others.
Exactly. And it would actually be worse, because the clients would have to wait for a timeout instead of simply getting a name error right away.
DNS was very much a proximate cause. In most cases you want your anycast DNS servers to shoot themselves in the head if they detect that their connection to origin is interrupted. This would have been a big outage anyway, just at a different layer.
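A rough sketch of that self-withdrawal logic, assuming each anycast node periodically probes its data centers and then decides whether to keep its BGP announcement up. The function names, the `min_healthy` threshold, and the action strings are all hypothetical; this is not Facebook's actual implementation, and the real BGP plumbing is left out.

```python
def should_advertise(probe_results, min_healthy=1):
    """Decide whether this DNS node should keep announcing its anycast prefix.

    `probe_results` maps a data center name to True/False (probe succeeded).
    `min_healthy` is an assumed threshold, not a value from the source.
    """
    healthy = sum(1 for ok in probe_results.values() if ok)
    return healthy >= min_healthy


def next_action(currently_advertising, probe_results, min_healthy=1):
    """Return 'announce', 'withdraw', or 'hold' for the routing daemon."""
    want = should_advertise(probe_results, min_healthy)
    if want and not currently_advertising:
        return "announce"
    if not want and currently_advertising:
        # The node cannot reach origin, so it takes itself out of anycast.
        return "withdraw"
    return "hold"
```

With every probe failing, every node returns "withdraw" at once, which is the situation where the whole DNS fleet disappears even though each node did exactly what it was told.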
Oddly enough, one could consider that behavior something that was put in place to "mitigate DNS misconfiguration"
Seems like the simplest solution would be to just move recovery tooling to their own domain / DNS?