Comment by louwrentius

5 years ago

> Our primary and out-of-band network access was down

Don't create circular dependencies.

With something as fundamental as the network, no way around it.

- Okay, we'll set up a separate maintenance network in case we can't get to the regular network.

- Wait, but we need a maintenance network for the maintenance network...

  • Two is One, One is None. There are absolutely ways around this, it's called redundancy. The marginal cost of laying an extra pair during physical plant installation is basically $0, which is why you'd never say "well, we need a backup for the backup, so there's no point in having two pairs". Similarly, the marginal cost of having a second UPS and PDU in a rack is effectively $0 at scale, so nobody would argue this is unnecessary to deal with possible UPS failure or accidentally unplugging a cable.

    In this case, there are likely several things that can be changed systemically to mitigate or prevent similar failures in the future, and I have every faith that Facebook's SRE team is capable of identifying and implementing those changes. There is no such thing as "no way around it", unless you're dealing with a law of physics.

    • By "no way around it" I mean you're going to need to create a circular dependency at some point, whether it's a maintenance network that's used to manage itself, or the prod network for managing the maintenance network.

      I absolutely agree that installing a maintenance network is a good idea. One of the big challenges, though, is making sure that all your tooling can and will run exclusively on the maintenance network if needed.

      (Also, while the marginal cost of laying an extra pair of fiber during physical installation may be low, the cost of making sure that you have fully independent failure domains is much higher, whether that's leased fiber, power, etc.)

  • "Okay, we'll pull in a DSL line from a completely separate ISP for the out-of-band access." (guess what else is in that manhole/conduit?)

    "Okay, we'll use LTE for out-of-band!" (oops, the backhaul for the cell tower goes under the same bridge as the real network)

    True diversity is HARD (not unsolvable, just hard, especially at scale)!

    • heh i toured a large data center here in dallas and listened to them brag about all the redundant connectivity they had while standing next to the conduit where they all entered the building. One person, a pair of wire cutters, and 5 seconds and that whole datacenter is dark.

    • Although the difference here is that losing the primary connection and out-of-band access for a single data center shouldn't be as catastrophic for Facebook, so your examples would be tolerable?
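The shared-conduit and shared-bridge examples above boil down to a shared-risk check: list the physical things each access path traverses and look for overlap. A minimal sketch, assuming a made-up inventory (every path and element name below is hypothetical, not anything from the actual incident):

```python
from itertools import combinations

# Hypothetical inventory: each access path maps to the physical
# elements (conduits, bridges, carriers) it passes through.
PATHS = {
    "primary_fiber": {"conduit_a", "bridge_1", "isp_x"},
    "dsl_oob": {"conduit_a", "isp_y"},
    "lte_oob": {"cell_tower_7", "bridge_1", "carrier_z"},
}

def shared_risks(paths):
    """Return {(path_a, path_b): shared elements} for every overlapping pair."""
    overlaps = {}
    for (a, elems_a), (b, elems_b) in combinations(sorted(paths.items()), 2):
        common = elems_a & elems_b
        if common:
            overlaps[(a, b)] = common
    return overlaps

for pair, common in shared_risks(PATHS).items():
    print(pair, "share", sorted(common))
```

The hard part in practice is the inventory itself, since carriers rarely tell you which conduit or bridge their backhaul uses; the check is only as good as that data.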


How do you avoid circular dependencies on an out-of-band-network? Seems like the choice is between a circular dependency, or turtles all the way down.

  • How do you go from "have a separate access method that doesn't depend on your main system" to "turtles all the way down"? The secondary access is allowed to have dependencies, just not on your network.
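One way to make "allowed to have dependencies, just not on your network" concrete is to walk the secondary access's dependency graph and check whether it ever reaches the production network. A rough sketch, assuming a hand-maintained dependency map (all component names here are hypothetical):

```python
# Hypothetical dependency map: each component lists what it needs to work.
DEPS = {
    "oob_console": ["oob_switch", "auth_service"],
    "oob_switch": ["lte_modem"],
    "lte_modem": [],
    "auth_service": ["prod_network"],
    "prod_network": ["dns"],
    "dns": ["prod_network"],
}

def reaches(deps, start, target):
    """Return one dependency chain from start to target, or None if there is none."""
    stack = [(start, [start])]
    seen = set()
    while stack:
        node, path = stack.pop()
        if node == target:
            return path
        if node in seen:
            continue  # already explored; also guards against cycles
        seen.add(node)
        for dep in deps.get(node, []):
            stack.append((dep, path + [dep]))
    return None

print(reaches(DEPS, "oob_console", "prod_network"))
print(reaches(DEPS, "oob_switch", "prod_network"))
```

In this toy map the console looks independent at the network layer but still reaches the production network through its login path, which is the kind of hidden dependency the tooling comment above is worried about.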