Comment by yuliyp

5 years ago

It's a trade-off.

Imagine you have some DNS servers at a POP. They're connected to a peering router there, which is connected to a bunch of ISPs. The POP is connected via a couple of independent fiber links to the rest of your network. What happens if both of those links fail?

Ideally the rest of your service can detect that this POP is disconnected, and adjust DNS configuration to point users toward POPs which are not disconnected. But you still have that DNS server, which can't see the config change (since it's disconnected from the rest of your network) but is still reachable from a bunch of local ISPs. That DNS server will continue to direct traffic to the POP, which can't handle it.

What if that DNS server were to mark itself unavailable? In that case, DNS traffic from ISPs near that POP would instead find another DNS server at a different POP, and get a response which pointed toward some working POP instead. How would the DNS server mark itself unavailable? One way is for it to check whether it can still communicate with the source of truth, and take itself offline if it can't.
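A rough sketch of that last idea, not how any particular operator implements it: each DNS server periodically probes the source of truth and marks itself unavailable after a few consecutive failures. The names `SelfHealthCheck`, `probe`, and `failure_threshold` are hypothetical, and the `available` flag stands in for whatever mechanism actually takes the server out of rotation.

```python
from typing import Callable


class SelfHealthCheck:
    """Decides whether a DNS server should keep serving, based only on
    whether it can still reach the source of truth (hypothetical sketch)."""

    def __init__(self, probe: Callable[[], bool], failure_threshold: int = 3):
        self.probe = probe                      # returns True if the source of truth answered
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0
        self.available = True

    def tick(self) -> bool:
        """Run one probe and return the resulting availability."""
        if self.probe():
            # Any successful probe resets the counter and brings the server back.
            self.consecutive_failures = 0
            self.available = True
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                self.available = False
        return self.available
```

Requiring several failures in a row is an assumption made here to avoid flapping on a single dropped probe. Note that the decision uses only what this one server can observe.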

Yesterday all of the DNS servers stopped being able to communicate with the source of truth, so they all marked themselves offline. This code is written for the case of a network partition, where a server may not be able to reach its peers either, so it can't really rely on consensus to decide what to do.
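To make the failure mode concrete, here is a toy model under the assumption that every server applies the same purely local rule. The POP names and the `servers_still_serving` helper are made up for illustration.

```python
from typing import Dict, List


def servers_still_serving(can_reach_source: Dict[str, bool]) -> List[str]:
    """Each server serves only if it can reach the source of truth.
    No server consults any other server before deciding."""
    return [name for name, reachable in can_reach_source.items() if reachable]


# One POP partitioned: its server goes offline, the others keep serving.
partition = {"pop-a": False, "pop-b": True, "pop-c": True}
print(servers_still_serving(partition))       # ['pop-b', 'pop-c']

# Source of truth unreachable from everywhere: every server goes offline.
global_failure = {"pop-a": False, "pop-b": False, "pop-c": False}
print(servers_still_serving(global_failure))  # []
```

From the point of view of any single server, the two cases look identical, which is why the rule that helps with a partition takes everything down when the failure is global.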