← Back to context

Comment by pclmulqdq

3 years ago

I'll give you a few examples, and maybe I'll run a casual poll next time we get a beer. No venmo.

I will point out that leader election generally has very long timeouts (seconds), but a common theme here is that you do lots of things that are not leader election but have short timeouts which can cause systems to reconfigure because the system generally wants to run in a useful configuration, not just a safe configuration.

In a modern datacenter, 100 milliseconds is ample time to determine whether a server is "available" and retry on a different server - servers and network switches can get slow, and when they get THAT slow, something is clearly wrong, so it is better to drain them and move that data somewhere else. When the control plane hears about these timeout failures from clients, it dutifully assumes that a server is offline and drains it.

Usually, this works well: The machine to machine latency within a datacenter has way less than 100 microseconds, and if you include the OS stack under heavy load, it might get all the way to 1 millisecond. Something almost always is wrong if a very simple server can't respond within 10-100 milliseconds. This results in 10-100 millisecond response times meaning "not available" at the lower layers of the stack. As I mentioned before, enough reports of "unavailable" results in a machine being drained, and a critical number of these results in an outage.

Attack of the killer microseconds is a good paper that addresses the issue here (albeit obliquely): https://dl.acm.org/doi/10.1145/3015146

Here are a couple of examples:

* There is a very important 10 ms timeout in Colossus (distributed filesystem) to determine the liveness of a disk server - I have seen one instance where enough of a cell broke this timeout due to a software change, and made the entire cell go read-only. In another instance, a small test cell went down due to this timeout under one experiment.

* Another cell went down under load due to a different 10 ms liveness timeout thanks to the misapplication of Nagle's algorithm (although not to networking) - I forget if it was a test cell or something customer-facing.

* Bigtable (NoSQL database) has a similar timeout under 100 ms (but greater than 10 ms) for its tablet servers. I'm sure Spanner (NewSQL database) has the same.