Comment by jaywalk

5 years ago

I'm not sure the design was even wrong, since the DNS servers being down didn't meaningfully contribute to the outage. The entire Facebook backbone was gone, so even if the DNS servers had continued giving out cached responses, clients wouldn't have been able to connect anyway.

DNS being down, rather than returning an unreachable destination, did increase load on other DNS resolvers, though, since empty results cannot be cached and clients kept retrying. That is how the outage came to affect others.

  • Source?

    DNS errors are actually still cached; DJB debunked the claim that they aren't a couple of decades ago, give or take:

    http://cr.yp.to/djbdns/third-party.html

    > RFC 2182 claims that DNS failures are not cached; that claim is false.

    Here are some more recent details and the fuller explanation:

    https://serverfault.com/a/824873

    Note that FB.com currently expires its records in 300 seconds, which is 5 minutes.

    PowerDNS (used by ordns.he.net) caches servfail for 60s by default (packetcache-servfail-ttl), which isn't very far from the 5min that you get when things aren't failing.

    Personally, I do agree with DJB: I think it's a better user experience to get a DNS resolution error right away than to wait many minutes for the TCP timeout when the host is down anyway.

Exactly. And it would actually be worse, because clients would have to wait for a timeout instead of simply getting a name error right away.

  • How would it have been worse? Waiting for a timeout is a good thing, as it prevents a thundering herd of refresh-smashing (both automated and manual).

    I don't know BGP well, but it seems easier for peers to just drop FB's packets on the floor than deal with a DNS stampede.

    • An average webpage today is several megabytes in size.

      How would a few bytes over a couple of UDP packets for DNS have any meaningful impact on anyone's network? If anything, things fail faster, so, there's less data to transmit.

      For example, I often use ordns.he.net as an open recursive resolver. They run PowerDNS, whose packetcache-servfail-ttl defaults to 60s. OTOH, the fb.com A response currently has a TTL of 300s, i.e. 5 minutes. So, basically, FB's DNS is cached for roughly the same time whether or not they're actually online.
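
To put rough numbers on the caching argument, here is a minimal Python sketch of a resolver-style cache. It is only a toy model, not how PowerDNS is implemented: the `TtlCache` class, the `count_upstream_queries` helper, and the one-query-per-second client are all made up for illustration. The only figures taken from the thread are the 60s servfail TTL and the 300s record TTL. It counts how often a single resolver would have to go upstream over an hour in each case.

```python
class TtlCache:
    """Toy resolver cache (hypothetical): maps a name to (answer, expiry time)."""

    def __init__(self):
        self._entries = {}

    def get(self, name, now):
        entry = self._entries.get(name)
        if entry is None:
            return None
        answer, expires_at = entry
        if now >= expires_at:
            # Entry has outlived its TTL, so treat it as a miss.
            del self._entries[name]
            return None
        return answer

    def put(self, name, answer, ttl, now):
        self._entries[name] = (answer, now + ttl)


def count_upstream_queries(ttl_seconds, duration_seconds, answer):
    """Simulate one client query per second; return the number of cache misses."""
    cache = TtlCache()
    upstream = 0
    for now in range(duration_seconds):
        if cache.get("fb.com", now) is None:
            # Cache miss: the resolver has to ask upstream, then caches the
            # result, whether it is a real record or a failure.
            upstream += 1
            cache.put("fb.com", answer, ttl_seconds, now)
    return upstream


SERVFAIL_TTL = 60   # packetcache-servfail-ttl default quoted in the thread
RECORD_TTL = 300    # fb.com A record TTL quoted in the thread
ONE_HOUR = 3600

failing = count_upstream_queries(SERVFAIL_TTL, ONE_HOUR, "SERVFAIL")
healthy = count_upstream_queries(RECORD_TTL, ONE_HOUR, "A record")

print(f"client queries per hour:         {ONE_HOUR}")
print(f"upstream queries while failing:  {failing}")
print(f"upstream queries while healthy:  {healthy}")
```

Under these assumptions the resolver goes upstream 60 times an hour while the name is failing versus 12 times while it is healthy, against 3600 client queries either way. A real resolver would also retry across several authoritative servers on each miss, which this sketch ignores.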
