
Comment by sillysaurusx

3 years ago

I don't think this is correct. 6.824 emphasizes reliability over latency. They mention it in several places: https://pdos.csail.mit.edu/6.824/labs/guidance.html

> It should be noted that tweaking timeouts rarely fixes bugs, and that doing so should be a last resort. We frequently see students willing to keep making arbitrary tweaks to their code (especially timeouts) rather than following a careful debugging process. Doing this is a great way to obscure underlying bugs by masking them instead of fixing them; they will often still show up in rare cases, even if they appear fixed in the common case.

> In particular, in Raft, there are wide ranges of timeouts that will let your code work. While you CAN pick bad timeout values, it won't take much time to find timeouts that are functional.
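
To make "wide ranges of timeouts" concrete, here's a minimal Go sketch of the randomized election timeout the labs push you toward. The 100ms heartbeat and the 3x band are my own assumptions, not numbers from the course; the point is that any band a few multiples above the heartbeat interval works:

    package raft

    import (
        "math/rand"
        "time"
    )

    const heartbeatInterval = 100 * time.Millisecond // assumed, not from 6.824

    // Pick a fresh timeout each election cycle; the randomness breaks
    // ties between candidates, and the exact band barely matters.
    func randomElectionTimeout() time.Duration {
        base := 3 * heartbeatInterval // well above one missed heartbeat
        return base + time.Duration(rand.Int63n(int64(base)))
    }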

Their unit tests are quite worthwhile to read, if only to absorb how many ways latency assumptions can bite you.

It's true that in the normal case, it's good to have low latency. But correctly engineered distributed systems won't reorganize themselves due to a ~200ms delay.

To put it another way, if a random 200ms fluctuation causes service disruptions, your system probably wasn't going to work very well to begin with. Blaming it on Nagle's algorithm is a punt.

In my decades of experience in telco, capital markets, and core banking, unexplained latency spikes of hundreds of ms are usually analyzed to death, because they can have ripple effects. I’ve had 36-hour severity-1 incidents, with multiple VPs taking notes on 24/7 conference calls, when a distributed system started showing latency spikes in the 400ms range.

No, the system isn’t going haywire, but 200-400ms is concerning inside a datacenter for core apps.

But let’s forget IT apps and talk about the network. In a network, 200ms is catastrophic.

Presumably you know BGP is the very popular distributed system that converges Internet routes?

Inside a datacenter, Bidirectional Forwarding Detection (BFD) is used to drop BGP convergence times to sub-second if you’re using BGP as an IGP. BFD is also useful with other protocols, but anyway. It typically runs heartbeats at 100-300ms intervals. If the network stalls for 3x that interval (the detect multiplier), BFD drops the session and triggers a round of convergence. This is essential in core networks and telco 4G/5G transport networks.

Of course, flapping can be the consequence of setting too low an interval. Tradeoffs.
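
The detection math is just interval times multiplier. A toy Go sketch of that logic (this is the shape of BFD-style detection, not actual BFD, and the 300ms/3x values are assumptions):

    package bfd

    import "time"

    const (
        txInterval = 300 * time.Millisecond // assumed heartbeat interval
        detectMult = 3                      // typical detect multiplier
    )

    // monitor declares the peer down after detectMult missed intervals,
    // i.e. roughly 900ms of silence at these settings.
    func monitor(heartbeats <-chan struct{}, down chan<- struct{}) {
        timer := time.NewTimer(detectMult * txInterval)
        for {
            select {
            case <-heartbeats:
                timer.Reset(detectMult * txInterval) // peer is alive
            case <-timer.C:
                down <- struct{}{} // silence exceeded the budget
                return
            }
        }
    }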

Back to the original point: I’ve contributed to the code of equity and bond trading apps, telco apps, core banking systems, and cloud/Kubernetes systems. All RPC-based distributed systems. Every. Single. One. that performed well, over 30 years, has enabled TCP_NODELAY, except when serving up large amounts of streaming data. And the reason, fundamentally, is that most of the time you have less control over client settings (delayed TCP ACKs), so it’s easier to control the server.
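
In Go terms, that server-side control point looks like this (a sketch; the port and handler are placeholders, and note that Go’s net package actually enables TCP_NODELAY by default, so the explicit call mostly documents intent):

    package main

    import (
        "log"
        "net"
    )

    func main() {
        ln, err := net.Listen("tcp", ":9000") // example port
        if err != nil {
            log.Fatal(err)
        }
        for {
            conn, err := ln.Accept()
            if err != nil {
                log.Fatal(err)
            }
            // Disable Nagle on our side; we can't fix the client's
            // delayed-ACK settings from here.
            conn.(*net.TCPConn).SetNoDelay(true)
            go handle(conn)
        }
    }

    func handle(conn net.Conn) { defer conn.Close() /* serve RPCs */ }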

That is all well and good in an academic setting. Many distributed systems in the real world want time bounds under 200 ms for certain things, like Paxos consensus within a datacenter. It turns out that latency, at some level, is equivalent to reliability, and 200 milliseconds is almost always well beyond that level.

  • I’m not sure what else to say than “this isn’t true.” 6.824’s labs have been Paxos-based for at least the better part of a decade, and at no point did they emphasize latency as a key factor in the reliability of distributed systems. If anything, it’s the opposite.

    Dismissing rtm as “academic” seems like a bad bet. He’s rarely mistaken. If something were so fundamental to real-world performance, it certainly wouldn’t be missing from his course.

    • I'll be sure to tell my former colleagues (who build distributed storage systems at Google) that they are wrong about network latency being an important factor in the reliability of their distributed systems because an MIT course said so.

      I'm not insinuating that your professor doesn't know the whole picture - I'm sure he does research in the area, which would mean that he is very familiar with the properties of datacenter networks, and he likely does research into how to make distributed systems fast. I'm suggesting that he may not be telling it to you because it would complicate his course beyond the point where it is useful for your learning.


  • It seems to me that systems like these are the exception rather than the rule. You can always turn off Nagle's algorithm if you have something really latency-sensitive, but it should not be off by default.

    200 ms is not the end of the world in most cases, and it's far better than relying on every application to do its own buffering correctly and suffering a massive performance penalty when something inevitably doesn't.
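
    What "doing its own buffering" means in practice, sketched in Go (bufio stands in for whatever batching layer the app should have; the function name is made up):

        package rpc

        import (
            "bufio"
            "net"
        )

        // With Nagle off, every small Write can become its own packet
        // unless the application batches writes itself.
        func send(conn net.Conn, msgs [][]byte) error {
            w := bufio.NewWriter(conn) // 4KB buffer by default
            for _, m := range msgs {
                if _, err := w.Write(m); err != nil {
                    return err
                }
            }
            return w.Flush() // forget this and the tail never goes out
        }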

    • I have to disagree: 200 ms is usually most of your latency budget, in my experience. 200 ms delays randomly kill your p99 numbers and harm the customers. Most internet traffic is in the data center, not to the edge. And I assume Fastly, Akamai, and Cloudflare are all aware of how to tune for slow last miles.
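
      To see why a handful of 200 ms stalls dominates p99, here's a toy Go calculation (the numbers are made up for illustration):

        package main

        import (
            "fmt"
            "sort"
        )

        func p99(ms []float64) float64 {
            sort.Float64s(ms)
            return ms[len(ms)*99/100]
        }

        func main() {
            samples := make([]float64, 1000)
            for i := range samples {
                samples[i] = 2 // fast path: 2ms
            }
            for i := 0; i < 10; i++ {
                samples[i] = 200 // just 1% of requests stall
            }
            fmt.Println(p99(samples)) // prints 200; the median is still 2
        }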