Comment by pclmulqdq

3 years ago

That is all well and good in an academic setting. Many real-world distributed systems want time bounds under 200 ms for certain things, like Paxos consensus within a datacenter. It turns out that latency, at some level, is equivalent to reliability, and 200 milliseconds is almost always well beyond that level.

I’m not sure what to say other than “this isn’t true.” 6.824’s labs have been Paxos-based for at least the better part of a decade, and at no point did they emphasize latency as a key factor in the reliability of distributed systems. If anything, it’s the opposite.

Dismissing rtm as “academic” seems like a bad bet. He’s rarely mistaken. If something were so fundamental to real-world performance, it certainly wouldn’t be missing from his course.

  • I'll be sure to tell my former colleagues (who build distributed storage systems at Google) that they are wrong about network latency being an important factor in the reliability of their distributed systems because an MIT course said so.

    I'm not insinuating that your professor doesn't know the whole picture - I'm sure he does research in the area, is very familiar with the properties of datacenter networks, and likely works on making distributed systems fast. I'm suggesting that he may simply not be telling you this because it would complicate his course beyond the point where it is useful for your learning.

    • Tell you what. If you ask your colleague “Do you feel that a 100ms delay will cause our distributed storage system to become less reliable?” and they answer yes, I’ll venmo you $200. If you increase it to 200ms and they say yes, I’ll venmo you $100. No strings attached, and I’ll take you at your word. But you have to actually ask them, and the phrasing should be as close as possible.

      If we were talking >1s delays, I might agree. But from what I know about distributed systems, it seems $200-unlikely that a Googler whose primary role is distributed systems would claim such a thing.

      The other possibility is that we’re talking past each other, so maybe framing it as a bet will highlight any diffs.

      Note that the emphasis here is “reliability,” not performance. That’s why it’s worth it to me to learn a $200 lesson if I’m mistaken. I would certainly agree as a former gamedev that a 100ms delay degrades performance.


It seems to me that systems like these are the exception rather than the rule. You can always turn off Nagle's algorithm if you have something really latency-sensitive, but it should not be off by default.
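For reference, turning Nagle off is a per-socket opt-out rather than a system-wide switch. A minimal sketch in Python, assuming a hypothetical latency-sensitive service (the host and port are placeholders):

    import socket

    # Connect to a hypothetical latency-sensitive service (placeholder address).
    sock = socket.create_connection(("example.internal", 9000))

    # Disable Nagle's algorithm on this socket only: small writes go out
    # immediately instead of waiting for the previous segment to be ACKed.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)

    sock.sendall(b"small latency-sensitive message")
    sock.close()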

200 ms is not the end of the world in most cases; it's far better than relying on every application to do its own buffering correctly and taking a massive performance hit when something inevitably doesn't.
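To make "doing its own buffering" concrete: the usual failure mode is issuing many tiny writes and expecting the network layer to clean up after them. A rough sketch of the two patterns, assuming an already-connected socket (the field names are made up):

    import socket

    FIELDS = (b"id=1;", b"name=foo;", b"value=42\n")  # made-up example payload

    def send_unbuffered(sock: socket.socket) -> None:
        # Each tiny write can end up as its own segment once Nagle is off.
        for field in FIELDS:
            sock.sendall(field)

    def send_buffered(sock: socket.socket) -> None:
        # Coalesce in the application so the record goes out as one write,
        # regardless of whether TCP_NODELAY is set.
        sock.sendall(b"".join(FIELDS))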

  • I have to disagree: 200 ms is usually most of your latency budget, in my experience. 200 ms delays randomly kill your p99 numbers and hurt your customers. Most internet traffic is within the data center, not to the edge. And I assume Fastly, Akamai, and Cloudflare all know how to tune for slow last miles.
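
    To illustrate the p99 point with made-up numbers: if requests normally take a few milliseconds but a couple percent of them hit a 200 ms stall (say, from a Nagle/delayed-ACK interaction), the tail is dominated by the stall even though the median barely moves. A rough simulation:

        import random

        random.seed(0)

        def request_latency_ms() -> float:
            latency = random.uniform(3.0, 8.0)   # typical in-datacenter request
            if random.random() < 0.02:           # ~2% of requests hit a 200 ms stall
                latency += 200.0
            return latency

        samples = sorted(request_latency_ms() for _ in range(100_000))
        p50 = samples[len(samples) // 2]
        p99 = samples[int(len(samples) * 0.99)]
        print(f"p50 = {p50:.1f} ms, p99 = {p99:.1f} ms")
        # The median stays around 5-6 ms while the p99 jumps past 200 ms.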