Comment by sillysaurusx

3 years ago

I’m not sure what else to say other than “this isn’t true.” 6.824’s labs have been Paxos-based for at least the better part of a decade, and at no point did they emphasize latency as a key factor in the reliability of distributed systems. If anything, it’s the opposite.

Dismissing rtm as “academic” seems like a bad bet. He’s rarely mistaken. If something were so fundamental to real-world performance, it certainly wouldn’t be missing from his course.

I'll be sure to tell my former colleagues (who build distributed storage systems at Google) that they are wrong about network latency being an important factor in the reliability of their distributed systems because an MIT course said so.

I'm not insinuating that your professor doesn't know the whole picture - I'm sure he does research in the area, which means he is very familiar with the properties of datacenter networks and with how to make distributed systems fast. I'm suggesting he may not be telling you this because it would complicate his course beyond the point where it is useful for your learning.

  • Tell you what. If you ask your colleague “Do you feel that a 100ms delay will cause our distributed storage system to become less reliable?” and they answer yes, I’ll venmo you $200. If you increase it to 200ms and they say yes, I’ll venmo you $100. No strings attached, and I’ll take you at your word. But you have to actually ask them, and the phrasing should be as close to that as possible.

    If we were talking >1s delays, I might agree. But from what I know about distributed systems, it seems $200-unlikely that a Googler whose primary role is distributed systems would claim such a thing.

    The other possibility is that we’re talking past each other, so maybe framing it as a bet will highlight any diffs.

    Note that the emphasis here is “reliability,” not performance. That’s why it’s worth it to me to learn a $200 lesson if I’m mistaken. I would certainly agree as a former gamedev that a 100ms delay degrades performance.

    • Raft can easily be tuned to handle any amount of latency; the paper even discusses this. The issue is “how long are you willing to be in a leaderless state,” and for some applications that window is very tight. If your application needs a leader to make a distributed decision but no leader is currently available, it might not know how to handle that, or it might block until one is elected, causing throughput issues (a rough sketch of the trade-off follows at the end of this comment).

      However, you shouldn’t be using TCP for latency-sensitive applications IMHO. First, TCP requires a handshake on every new connection, which takes time. Second, if three packets are sent and the first one is lost, the application won’t see any of that data for a couple hundred ms anyway (the default retransmission timeout), so you’re better off using something like UDP. Put another way: if you need the properties of TCP, you aren’t doing anything latency-sensitive.
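
      To make that Raft point concrete, here is a minimal, self-contained Go sketch of the trade-off. The sample RTTs and the 10x multiplier over the round-trip time are illustrative assumptions on my part, not numbers from the Raft paper or from any particular implementation:

      ```go
      package main

      import (
          "fmt"
          "math/rand"
          "time"
      )

      // electionTimeout picks a randomized timeout in [base, 2*base), in the
      // spirit of the Raft paper's advice: randomize timeouts to avoid split
      // votes, and keep them well above the typical heartbeat round trip so a
      // healthy leader isn't deposed by ordinary network jitter.
      func electionTimeout(base time.Duration) time.Duration {
          return base + time.Duration(rand.Int63n(int64(base)))
      }

      func main() {
          // Illustrative assumption: base the timeout on 10x the round-trip time.
          // ~1 ms RTT inside a datacenter vs. ~100 ms across regions.
          for _, rtt := range []time.Duration{time.Millisecond, 100 * time.Millisecond} {
              base := 10 * rtt
              fmt.Printf("rtt=%-6v election timeout=%v (roughly how long followers wait before starting a new vote)\n",
                  rtt, electionTimeout(base))
          }
      }
      ```

      The higher the latency you tune for, the longer the worst-case window in which the cluster sits leaderless after a real crash - which is exactly the “how long are you willing to be leaderless” question above.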

    • I'll give you a few examples, and maybe I'll run a casual poll next time we get a beer. No venmo.

      I will point out that leader election generally has very long timeouts (seconds). But a common theme here is that you do lots of things that are not leader election yet have short timeouts, and those can cause systems to reconfigure, because the system generally wants to run in a useful configuration, not just a safe one.

      In a modern datacenter, 100 milliseconds is ample time to determine whether a server is "available" and, if not, retry on a different one - servers and network switches can get slow, and when they get THAT slow, something is clearly wrong, so it is better to drain them and move that data somewhere else. When the control plane hears about these timeout failures from clients, it dutifully assumes that the server is offline and drains it.

      Usually, this works well: machine-to-machine latency within a datacenter is well under 100 microseconds, and even including the OS stack under heavy load it might reach 1 millisecond. Something is almost always wrong if a very simple server can't respond within 10-100 milliseconds, so at the lower layers of the stack a 10-100 millisecond response time means "not available." As I mentioned before, enough reports of "unavailable" result in a machine being drained, and a critical number of drains results in an outage (a rough sketch of this loop follows at the end of this comment).

      "Attack of the Killer Microseconds" is a good paper that addresses the issue here (albeit obliquely): https://dl.acm.org/doi/10.1145/3015146

      Here are a few examples:

      * There is a very important 10 ms timeout in Colossus (distributed filesystem) to determine the liveness of a disk server - I have seen one instance where enough of a cell broke this timeout due to a software change that the entire cell went read-only. In another instance, a small test cell went down due to this timeout under one experiment.

      * Another cell went down under load due to a different 10 ms liveness timeout thanks to the misapplication of Nagle's algorithm (although not to networking) - I forget if it was a test cell or something customer-facing.

      * Bigtable (NoSQL database) has a similar timeout under 100 ms (but greater than 10 ms) for its tablet servers. I'm sure Spanner (NewSQL database) has the same.
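
      To make the shape of that loop concrete, here is a minimal, self-contained Go sketch of the pattern: treat a slow reply as "unavailable," retry on another replica, and report the slow server so a control plane can drain it. The server names, the 100 ms per-try deadline, and the three-report drain threshold are illustrative assumptions, not how Colossus or Bigtable actually do it:

      ```go
      package main

      import (
          "context"
          "errors"
          "fmt"
          "math/rand"
          "time"
      )

      // Per-attempt deadline: inside a datacenter, "slower than this" means something is wrong.
      const perTryTimeout = 100 * time.Millisecond

      // fakeRead stands in for an RPC to one storage server; latency is simulated.
      func fakeRead(ctx context.Context, server string) error {
          latency := time.Duration(rand.Intn(5)) * time.Millisecond
          if server == "server-2" {
              latency = 300 * time.Millisecond // pretend this replica is badly overloaded
          }
          select {
          case <-time.After(latency):
              return nil
          case <-ctx.Done():
              return ctx.Err()
          }
      }

      // controlPlane counts timeout reports; enough of them and it "drains" the server.
      type controlPlane struct{ reports map[string]int }

      func (cp *controlPlane) reportSlow(server string) {
          cp.reports[server]++
          if cp.reports[server] >= 3 { // illustrative drain threshold
              fmt.Println("control plane: draining", server)
          }
      }

      func main() {
          cp := &controlPlane{reports: map[string]int{}}

          // A handful of reads that all happen to try the slow replica first.
          for i := 0; i < 3; i++ {
              for _, s := range []string{"server-2", "server-1"} {
                  ctx, cancel := context.WithTimeout(context.Background(), perTryTimeout)
                  err := fakeRead(ctx, s)
                  cancel()
                  if errors.Is(err, context.DeadlineExceeded) {
                      fmt.Printf("read %d: %s missed the %v deadline; reporting it and retrying elsewhere\n", i, s, perTryTimeout)
                      cp.reportSlow(s)
                      continue
                  }
                  fmt.Printf("read %d: served by %s\n", i, s)
                  break
              }
          }
      }
      ```

      An outage like the Colossus examples above is what you get when enough servers trip that threshold at once and too much of the cell ends up drained.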

    • At 200 ms you start to assume the other end is dead and retry. You don't get sub-second consumer response times with random 200 ms delays on datacenter-to-datacenter calls.
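
      A minimal Go sketch of that policy as a hedged request: if the first replica hasn't answered within 200 ms, write it off for this request and send a backup call to another one. The replica names and latencies are made up for illustration:

      ```go
      package main

      import (
          "fmt"
          "time"
      )

      // Past this cutoff, assume the other end is dead for this request.
      const hedgeAfter = 200 * time.Millisecond

      // call simulates a cross-datacenter RPC that takes a fixed amount of time.
      func call(replica string, latency time.Duration, out chan<- string) {
          time.Sleep(latency)
          out <- replica
      }

      func main() {
          out := make(chan string, 2) // buffered so a late reply never blocks

          go call("dc-east", 450*time.Millisecond, out) // pretend this path is stalled

          select {
          case winner := <-out:
              fmt.Println("answered by", winner)
          case <-time.After(hedgeAfter):
              // No reply within the cutoff: send a backup call elsewhere
              // instead of letting one stalled path eat the latency budget.
              go call("dc-west", 5*time.Millisecond, out)
              fmt.Println("hedged after", hedgeAfter, "- answered by", <-out)
          }
      }
      ```

      The extra call costs a little capacity, but it keeps one stalled path from dragging the user-facing response past its sub-second budget.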