Comment by toast0

5 years ago

> Right, I was hoping FB's DNS servers would be smarter than usual: let's say, when the DNS at Seattle-1 cannot reach the backbone, it'd respond with the IP of perhaps NYC/SF before it starts the BGP withdrawal.

The problem there is coordination. The PoPs don't generally communicate amongst themselves (and may not have been able to after the FB backbone was broken; technically they could have over transit connectivity, but it may not be configured to work that way), so when a PoP loses its connection to the FB datacenters, it also loses its source of truth for which PoPs are available and healthy. I think this is likely a classic distributed systems problem: the desired behavior when an individual node becomes unhealthy is different from when all nodes become unhealthy, but the nature of distributed systems is that a node can't tell whether it's the only unhealthy node or all nodes became unhealthy together. Each individual PoP did the right thing by dropping out of the anycast, but because they all did it, it was the wrong thing.
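
To make that concrete, here's a rough sketch of the kind of per-PoP health check being described. This is not FB's actual tooling; the probe targets and the announce/withdraw hooks are invented for illustration. The key point is that each PoP only probes from its own vantage point, so a backbone-wide failure looks exactly like a local one.

  import socket
  import time

  # Assumed datacenter targets; purely illustrative.
  BACKBONE_TARGETS = [("dc1.internal.example", 443), ("dc2.internal.example", 443)]

  def backbone_reachable(timeout=2.0):
      # True if this PoP can open a TCP connection to any datacenter target.
      # The PoP only has its own local view; it cannot tell whether every
      # other PoP is failing the same probe at the same moment.
      for host, port in BACKBONE_TARGETS:
          try:
              with socket.create_connection((host, port), timeout=timeout):
                  return True
          except OSError:
              continue
      return False

  def health_loop(announce, withdraw, interval=5.0):
      announced = True
      while True:
          if backbone_reachable():
              if not announced:
                  announce()        # rejoin the anycast
                  announced = True
          elif announced:
              withdraw()            # the right call if only this PoP is broken,
              announced = False     # the wrong call when every PoP does it at once
          time.sleep(interval)

The same loop, run independently at every PoP, is what turns one correct local decision into a global withdrawal.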

You're precise and to the point. This is exactly the problem.

  Each individual PoP did the right thing by dropping out of the anycast, but because they all did it, it was the wrong thing.

Somehow I feel the design is flawed because it overloads the DNS server's status a bit. I mean, "DNS server down, so withdraw the BGP route for the DNS server" is a perfect combination; "connectivity between DNS and the backend down, DNS still up, but withdraw the BGP route for the DNS server anyway" is not. DNS did not fail, and DNS should just fall back to some other operational DNS, perhaps a regional/global default one.
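
For illustration, here's roughly what that fallback might look like; the names and addresses are all invented, and the catch toast0 raises above is noted on the last branch: without backbone connectivity, the PoP has no fresh view of which other PoPs are actually healthy.

  LOCAL_POP_ADDRS = ["203.0.113.10"]
  OTHER_POP_ADDRS = ["198.51.100.10", "192.0.2.10"]   # e.g. NYC / SF frontends

  def dns_answer(local_backbone_ok):
      if local_backbone_ok:
          return LOCAL_POP_ADDRS
      # Keep answering instead of withdrawing, but hand out other PoPs.
      # Problem: this list is only as fresh as the last time we could reach
      # the backbone, so in a backbone-wide outage these "fallback" PoPs are
      # likely just as cut off as we are.
      return OTHER_POP_ADDRS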

  • I think this is not necessarily a flaw of the design. It's a fundamental weakness of the real world.

    Either you can take the backbone being unavailable as a signal that the PoP is broken, and kill the PoP; or you can take the backbone being unavailable as a signal that the backbone is broken, and do your best.

    When either interpretation turns out to be wrong, you'll need humans to come around and intervene. It's much more common that only the PoP is broken, so having that case require intervention would result in more effort.

    The flaw here is more that the intervention required to get the backbone back was hard, because the internal tools to bring back the backbone relied on DNS, which relied on the backbone being up. There were also reports that physical security relied on the backbone being up, and that restoring the backbone needed physical access.

    This isn't the first large-scale FB outage where the root cause was a bad configuration pushed globally, quickly. It's really something they need to learn not to do. But even without that, getting key things running again (the backbone, DNS, the configuration system(s), centralized authentication) needs to be doable without those key systems running. I suspect at least some of that will be improved on, and hopefully regularly practiced.
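
    As one hedged illustration of that principle (all names and addresses here are invented), recovery tooling can carry a pinned host map so that resolving the things it needs doesn't depend on the DNS it may be trying to restore:

      import socket

      # Reviewed out-of-band and shipped with the break-glass tooling.
      PINNED_HOSTS = {
          "oob-console.backbone.example": "192.0.2.50",
          "config-push.example": "198.51.100.7",
      }

      def resolve_for_recovery(host):
          try:
              return socket.gethostbyname(host)   # normal path: internal DNS
          except OSError:
              # DNS (or the backbone it depends on) is down -- exactly the
              # circular dependency described above -- so use the pinned map.
              return PINNED_HOSTS[host]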

    • I'm not going to claim to be a BGP expert, but as I understand it, the way BGP propagation works makes it an inherently global thing just in terms of how the routers handle it, which makes it unusually tricky to avoid.

      I don't disagree about the general problem, mind you; I just have a feeling that fixing "don't push configs globally" for BGP specifically is unusually complicated.

    •   internal tools to bring back the backbone relied on DNS which relied on the backbone being up

      So are you referring to the same DNS servers sitting outside the backbone at the various PoPs? I'd imagine some internal DNS servers that stay within the backbone are in use here, unless of course the FB engineers themselves were disconnected from those internal DNS servers too.
