Comment by pm2222

5 years ago

You are to the point and precise. This is exactly the problem.

  Each individual PoP did the right thing by dropping out of the anycast, but because they all did it, it was the wrong thing.

Somehow I feel the design is flawed because it abuses the DNS server status a bit. I mean, a DNS server being down plus a BGP withdrawal for that server is a perfect combination; however, connectivity between the DNS server and the backend being down, while DNS itself is up, should not mean a BGP withdrawal for the DNS server. DNS did not fail, and resolution should just fall back to some other operational DNS server, perhaps a regional/global default one.
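The failure mode being discussed can be sketched in a few lines. This is a hypothetical illustration; the `Pop` class, field names, and the health-check logic are invented for this sketch and are not Facebook's actual implementation:

```python
# Hypothetical sketch of an anycast PoP health check that conflates
# "my local DNS is broken" with "the shared backbone is unreachable".
from dataclasses import dataclass

@dataclass
class Pop:
    name: str
    dns_healthy: bool         # local DNS daemon is up
    backbone_reachable: bool  # PoP can reach the backend over the backbone
    announcing: bool = True   # currently announcing the anycast prefix via BGP

def naive_health_check(pop: Pop) -> None:
    """Withdraw the anycast prefix if *anything* looks unhealthy.

    Because a backbone outage is shared by every PoP, this rule makes
    all PoPs withdraw at once, even though each local DNS is fine.
    """
    if not (pop.dns_healthy and pop.backbone_reachable):
        pop.announcing = False

# A backbone outage: DNS is healthy everywhere, backbone is down everywhere.
pops = [Pop(f"pop{i}", dns_healthy=True, backbone_reachable=False)
        for i in range(5)]
for pop in pops:
    naive_health_check(pop)

# Each PoP made the locally "right" call, but globally no DNS is reachable.
print(sum(p.announcing for p in pops))  # prints 0 -- no PoP still announcing
```

Distinguishing the two failure causes in the check (e.g., only withdrawing when the *local* DNS daemon fails) would avoid the correlated global withdrawal, at the cost of sometimes serving stale answers during a real backbone outage.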

I think this is not necessarily a flaw of the design. It's a fundamental weakness of the real world.

Either you can take the backbone being unavailable as a signal that the PoP is broken, and kill the PoP; or you can take it as a signal that the backbone is broken, and do your best to keep serving.

Whichever interpretation you pick, when it turns out to be wrong you'll need humans to come around and intervene. It's much more common that only the PoP is broken, so having that case require intervention would result in more total effort.

The flaw here is more that the intervention required to get the backbone back was hard to carry out, because the internal tools for bringing the backbone back relied on DNS, which relied on the backbone being up. There were also reports that physical security relied on the backbone being up, and that restoring the backbone needed physical access.

This isn't the first large-scale FB outage where the root cause was a bad configuration pushed globally too quickly. It's really something they need to learn not to do. But even without that, getting key things running again, like the backbone, DNS, the configuration system(s), and centralized authentication, needs to be doable without those key systems running. I suspect at least some of that will be improved, and hopefully regularly practiced.
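One standard mitigation for "bad config pushed globally" is a staged (canary) rollout that halts as soon as a health check fails. A minimal sketch, with invented stage names and a placeholder health check standing in for real monitoring:

```python
# Hypothetical staged-rollout sketch: apply a config stage by stage and
# stop at the first failed health check, limiting the blast radius.
def staged_rollout(stages, is_healthy):
    """Apply each stage in order; return (ok, sites_touched)."""
    touched = []
    for stage in stages:
        touched.extend(stage)          # push the config to this stage's sites
        if not is_healthy():           # check before widening the blast radius
            return False, touched
    return True, touched

# One canary PoP, then a region, then the rest of the world.
stages = [["canary-pop"], ["region-us"], ["region-eu", "region-apac"]]

# A config so broken that every health check fails:
ok, touched = staged_rollout(stages, is_healthy=lambda: False)
print(ok, touched)  # prints: False ['canary-pop'] -- only the canary was hit
```

The trade-off is rollout latency: a config that genuinely must go out everywhere fast (a security fix, say) fights against the dwell time each stage needs for the health check to be meaningful.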

  • I'm not going to claim to be a BGP expert, but as I understand it, the way BGP propagation works makes announcements an inherently global thing just in terms of how router hardware handles them, which makes this unusually tricky to avoid.

    I don't disagree about the general problem, mind; I just have a feeling that fixing "don't push configs globally" for BGP specifically is unusually complicated.

  •   internal tools to bring back the backbone relied on DNS which relied on the backbone being up
    

    So are you referring to the same DNS servers sitting outside the backbone at the various PoPs? I'd imagine some internal DNS servers that stay inside the backbone were in use here, unless of course the FB engineers themselves were disconnected from those internal DNS servers.

    • I don't recall how internal DNS was set up (and determining that from the outside isn't really possible), but there were comments in the incident report that DNS being unavailable made recovery harder.