Comment by andrewxdiamond
3 years ago
> the network is probably a relatively good datacenter network (high bandwidth, low packet loss/retransmission)
The first lesson I learned about Distributed Systems Engineering is the network is never reliable. A system (or language) designed with the assumption the network is reliable will tank.
But I also don’t agree that Go was written with that assumption. Google has plenty of experience in distributed systems, and their networks are just as fundamentally unreliable as any.
“Relatively” may have needed some emphasis here, but in general, networking done by mostly the same boxes, operated by the same people, in the same climate-controlled building, is going to be far more reliable than home networks, ISPs running across countries, regional phone networks, etc.
Obviously nothing is perfect, but applications deployed in data centres should probably make the trade-offs that give better performance on “perfect” networks, at the cost of poorer performance on bad networks. Those deployed on mobile devices or in home networks may be better served by the opposite trade-offs.
> The first lesson I learned about Distributed Systems Engineering is the network is never reliable
Yep, and it's a good rule. It's the one Google applies across datacenters.
... but within a datacenter (i.e. where most Go servers are speaking to each other, and speaking to the world-accessible endpoint routers, which are not written in Go), the fabric is assumed to be very clean. If the fabric is not clean, that's a hardware problem that SRE or HwOps needs to address; it's not generally something addressed by individual servers.
(In other words, were the kind of unreliability the article author describes here on their router to occur inside a Google datacenter, it might be detected by the instrumentation on the service made of Go servers, but the solution would be "If it's SRE-supported, SRE either redistributes load or files a ticket to have someone in the datacenter track down the offending faulty switch and smash it with a hammer.")
Relatively reliable. Not "shitty". If you've got a datacenter network that can be described as "shitty", fix your network rather than blaming Go.
This is an embarrassing response. The second lesson you should’ve learned as a systems engineer, long before any distributed stuff, is “turn off Nagle’s algorithm.” (The first being “it’s always DNS”.)
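A minimal Go sketch of making that choice explicit on a TCP connection (Go's net package already sets TCP_NODELAY by default on new TCP connections, so the call below mostly documents intent; the address is a placeholder):

    package main

    import (
        "log"
        "net"
    )

    func main() {
        conn, err := net.Dial("tcp", "example.com:9000") // placeholder address
        if err != nil {
            log.Fatal(err)
        }
        defer conn.Close()

        // Go already disables Nagle (sets TCP_NODELAY) on TCP connections;
        // calling SetNoDelay(true) just makes that explicit in the code.
        if tcp, ok := conn.(*net.TCPConn); ok {
            if err := tcp.SetNoDelay(true); err != nil {
                log.Fatal(err)
            }
        }
    }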
When the network is unreliable, larger TCP packets ain’t gonna fix it.
Usually you have control over one of them only. If you run the whole network, sure, fix that instead. But if you don't, sending fewer larger packets can actually improve the situation even if it doesn't fix it.
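As a rough illustration of what "fewer, larger packets" looks like from the application side, here's a sketch that buffers small writes in userspace and flushes them as one big write to the socket (the address and message loop are made up for the example):

    package main

    import (
        "bufio"
        "fmt"
        "log"
        "net"
    )

    func main() {
        conn, err := net.Dial("tcp", "example.com:9000") // placeholder address
        if err != nil {
            log.Fatal(err)
        }
        defer conn.Close()

        // Coalesce many small logical writes into one large write to the
        // socket, instead of relying on Nagle to batch them in the kernel.
        w := bufio.NewWriterSize(conn, 64<<10)
        for i := 0; i < 1000; i++ {
            fmt.Fprintf(w, "msg %d\n", i)
        }
        if err := w.Flush(); err != nil {
            log.Fatal(err)
        }
    }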
Fewer packets, yes, but I've been on several networks where sending large packets ends up with bad reordering and dropping behavior.
But it will at least let it get out of slow-start.
It's strange you're getting hammered for this. Everyone in 6.824 would probably agree with you. https://pdos.csail.mit.edu/6.824/
Let's weigh the engineering tradeoffs. If someone is using Go for high-performance networking, does the gain from enabling TCP_NODELAY by default outweigh the pain caused to end users?
Defaults matter; doubly so for a popular language like Go.
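Whatever default the language ships, individual programs can still override it per connection; here's a sketch of a server opting its accepted connections back into Nagle (the listen address and the io.Copy handler are stand-ins):

    package main

    import (
        "io"
        "log"
        "net"
    )

    func main() {
        ln, err := net.Listen("tcp", ":9000") // placeholder listen address
        if err != nil {
            log.Fatal(err)
        }
        for {
            conn, err := ln.Accept()
            if err != nil {
                log.Fatal(err)
            }
            // Turn Nagle back on for this connection, overriding Go's
            // TCP_NODELAY-by-default behaviour.
            if tcp, ok := conn.(*net.TCPConn); ok {
                tcp.SetNoDelay(false)
            }
            go io.Copy(io.Discard, conn) // trivial stand-in for a real handler
        }
    }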
I have worked on networked projects ranging from modern datacenters to ca. 2005 consumer-grade ADSL in Ohio to cellular networks in rural South Asia.
There are situations where you want Nagle's algorithm on: stable connections but noisy transmission, streams of data with no ability to buffer, and no application-level latency requirements. There are not many such situations; this is not one of them, and you certainly won't find one inside a datacenter.
Nagle's algorithm also really screws with distributed systems - you are going to be sending quite a few packets with time bounds, and you REALLY don't want them getting Nagled.
In fact, Nagle's algorithm is a big part of why a lot of programmers writing distributed systems think that datacenter networks are unreliable.
I don't think this is correct. 6.824 emphasizes reliability over latency. They mention it in several places: https://pdos.csail.mit.edu/6.824/labs/guidance.html
> It should be noted that tweaking timeouts rarely fixes bugs, and that doing so should be a last resort. We frequently see students willing to keep making arbitrary tweaks to their code (especially timeouts) rather than following a careful debugging process. Doing this is a great way to obscure underlying bugs by masking them instead of fixing them; they will often still show up in rare cases, even if they appear fixed in the common case.
> In particular, in Raft, there are wide ranges of timeouts that will let your code work. While you CAN pick bad timeout values, it won't take much time to find timeouts that are functional.
Their unit tests are quite worthwhile to read, if only to absorb how many ways latency assumptions can bite you.
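For a concrete sense of what "wide ranges of timeouts" means in practice, here's a sketch of the usual randomized election-timeout pattern from the Raft labs (the 300–600ms window is illustrative, not something the course prescribes):

    package main

    import (
        "fmt"
        "math/rand"
        "time"
    )

    // electionTimeout picks a random timeout in a generous window so that
    // peers rarely time out at the same moment; the bounds are illustrative.
    func electionTimeout() time.Duration {
        return time.Duration(300+rand.Intn(300)) * time.Millisecond
    }

    func main() {
        // In a real Raft peer this timer is reset whenever a heartbeat
        // arrives; here it simply fires once to show the shape of the pattern.
        <-time.After(electionTimeout())
        fmt.Println("no heartbeat within the timeout; start an election")
    }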
It's true that in the normal case, it's good to have low latency. But correctly engineered distributed systems won't reorganize themselves due to a ~200ms delay.
To put it another way, if a random 200ms fluctuation causes service disruptions, your system probably wasn't going to work very well to begin with. Blaming it on Nagle's algorithm is a punt.