Golang disables Nagle's Algorithm by default

3 years ago (withinboredom.info)

388 comments

withinboredom

If you trace this all the way back it's been in the Go networking stack since the beginning with the simple commit message of "preliminary network - just Dial for now " [0] by Russ Cox himself. You can see the exact line in the 2008 our repository here [1].

As an aside it was interesting to chase the history of this line of code as it was made with a public SetNoDelay function, then with a direct system call, then back to an abstract call. Along the way it was also broken out into a platform specific library, then back into a general library and go other with a pass from gofmt, all over a "short" 14 years.

0 - https://github.com/golang/go/commit/e8a02230f215efb075cccd41...

1 - https://github.com/golang/go/blob/e8a02230f215efb075cccd4146...

rsc 3 years ago
That code was in turn a loose port of the dial function from Plan 9 from User Space, where I added TCP_NODELAY to new connections by default in 2004 [1], with the unhelpful commit message "various tweaks". If I had known this code would eventually be of interest to so many people maybe I would have written a better commit message!
I do remember why, though. At the time, I was working on a variety of RPC-based systems that ran over TCP, and I couldn't understand why they were so incredibly slow. The answer turned out to be TCP_NODELAY not being set. As John Nagle points out [2], the issue is really a bad interaction between delayed acks and Nagle's algorithm, but the only option on the FreeBSD system I was using was TCP_NODELAY, so that was the answer. In another system I built around that time I ran an RPC protocol over ssh, and I had to patch ssh to set TCP_NODELAY, because at the time ssh only set it for sessions with ptys [3]. TCP_NODELAY being off is a terrible default for trying to do anything with more than one round trip.
When I wrote the Go implementation of net.Dial, which I expected to be used for RPC-based systems, it seemed like a no-brainer to set TCP_NODELAY by default. I have a vague memory of discussing it with Dave Presotto (our local networking expert, my officemate at the time, and the listed reviewer of that commit) which is why we ended up with SetNoDelay as an override from the very beginning. If it had been up to me, I probably would have left SetNoDelay out entirely.
As others have pointed out at length elsewhere in these comments, it's a completely reasonable default.
I will just add that it makes no sense at all that git-lfs (lf = large file!) should be sending large files 50 bytes at a time. That's a huge number of system calls that could be avoided by doing larger writes. And then the larger writes would work better for the TCP stack anyway.
And to answer the question in the article:
> Much (all?) of Kubernetes is written Go, and how has this default affected that?
I'm quite confident that this default has greatly improved the default server latency in all the various kinds of servers Kubernetes has. It was the right choice for Go, and it still is.
[1] http://publications.csail.mit.edu/lcs/pubs/pdf/MIT-LCS-TM-65...
- Aissen 3 years ago
  
  > I will just add that it makes no sense at all that git-lfs (lf = large file!) should be sending large files 50 bytes at a time. That's a huge number of system calls that could be avoided by doing larger writes. And then the larger writes would work better for the TCP stack anyway.
  FWIW, at least one git-lfs contributor agrees with you: https://github.com/git-lfs/git-lfs/issues/5242#issuecomment-...
  > I think the first thing we should probably look at here is whether Git LFS (and the underlying Go libraries) are optimizing TCP socket writes or not. We should be avoiding making too many small writes where we can instead make a single larger one, and avoiding the "write-write-read" pattern if it appears anywhere in our code, so we don't have reads waiting on the final write in a sequence of writes. Regardless of the setting of TCP_NODELAY, any such changes should be a net benefit.
  My 2ct: this type of low-hanging fruit optimization is often found even in largely-used software, so it shouldn't really be a surprise. It's always frustrating when you're the first to find those, though.
- silisili 3 years ago
  
  As one on the 'supports this decision' side, thanks for taking time from your day to give us the history.
  It would be really nice if such context existed elsewhere other than a rather ephemeral forum. It would be awesome to somehow have annotations around certain decisions in a centralized place, though I have no idea how to do that cleanly.
  
  11 replies →
- francislavoie 3 years ago
  
  Thanks for the explanation, Russ!
  As a maintainer of Caddy, I was wondering if you have an opinion on whether it makes sense to have on for a general purpose HTTP server. Do you think it makes sense for us to change the default in Caddy?
  Also, would there be appetite for making it easier to change the mode in an http.Server? It feels like needing to reach too deep to change that when using APIs at a higher level than TCP (although I may have missed some obvious way to set it more easily). For HTTP clients it can obviously be changed easily in the dialer where we have access to the connection early on.
  
  1 reply →
- allanrbo 3 years ago
  
  Thanks for the insight and history brief, Russ!
- withinboredom 3 years ago
  
  Thanks for the history!

Philip-J-Fry 3 years ago

In my opinion, I think it's correct to be disabled by default.

I think Nagle's algorithm does more harm than good if you're unaware of it. I've seen people writing C# applications and wondering why stuff is taking 200ms. Some people don't even realise it's Nagle's algorithm (edit: interacting with Delayed ACKs) and think it's network issues or a performance problem they're introduced.

I'd imagine most Go software is deployed in datacentres where the network is high quality and it doesn't really matter too much. Fast data transfer is probably preferred. I think Nagle's algorithm should be an optimisation you can optionally enable (which you can) to more efficiently use the network at the expense of latency. Being more "raw" seems like the sensible default to me.

Animats 3 years ago
The basic problem, as I've written before[1][2], is that, after I put in Nagle's algorithm, Berkeley put in delayed ACKs. Delayed ACKs delay sending an empty ACK packet for a short, fixed period based on human typing speed, maybe 100ms. This was a hack Berkeley put in to handle large numbers of dumb terminals going in to time-sharing computers using terminal to Ethernet concentrators. Without delayed ACKs, each keystroke sent a datagram with one payload byte, and got a datagram back with no payload, just an ACK, followed shortly thereafter by a datagram with one echoed character. So they got a 30% load reduction for their TELNET application.
Both of those algorithms should never be on at the same time. But they usually are.
Linux has a socket option, TCP_QUICKACK, to turn off delayed ACKs. But it's very strange. The documentation is kind of vague, but apparently you have to re-enable it regularly.[3]
Sigh.
[1] https://stackoverflow.com/questions/46587168/when-during-the...
- rjbwork 3 years ago
  
  Gotta love HN. The man himself shows up to explain.
  
  10 replies →
- Matthias247 3 years ago
  
  > The documentation is kind of vague, but apparently you have to re-enable it regularly.[3]
  This is correct. And in the end it means more or less that setting the socket option is more of a way of sending an explicit ACK from userspace than a real setting.
  It's not great for common use-cases, because making userspace care about ACKs will obviously degrade efficiency (more syscalls).
  However it can make sense for some use-cases. E.g. I saw the s2n TLS library using QUICKACK to avoid the TLS handshake being stuck [1]. Maybe also worthwhile to be set in some specific RPC scenarios where the server might not immediately send a response on receiving the request, and where the client could send additional frames (e.g. gRPC client side streaming, or in pipelined HTTP requests if the server would really process those in parallel and not just let them sit in socket buffers).
  [1] https://github.com/aws/s2n-tls/blob/46c47a71e637cabc312ce843...
- sph 3 years ago
  
  Any kernel engineer reading that can explain why TCP_QUICKACK isn't enabled by default? Maybe it's time to turn it on by default, if it was just a workaround for old terminals.
  
  1 reply →
- renox 3 years ago
  
  Thanks for this reply. What I find specially annoying is that the TCP client and the servers starts by a synchronization round-trip which is supposed to be used to synchronise options and this isn't the case here! Why can't the client and the servers agree on a sensible set of options (no delayed ack if the client is using the Nagle algorithm)??
- silisili 3 years ago
  
  Is this referring to Nagle on the server, and delayed ACK on the client?
- wtarreau 3 years ago
  
  TCP_QUICKACK is mostly used to send initial data along with the first ACK upon establishing a connection, or to make sure to merge the FIN with the last segment.
- nextaccountic 3 years ago
  
  How it's possible that delayed acks and nagle's algorithms are both defaults, anywhere? Isn't this a matter of choosing one, or another?
- emmelaich 3 years ago
  
  Did the move from line oriented input to character input also occur around then?
  I remember as a student, vi was installed and we all went from using ed to vi.
  There was much gnashing and wailing from the admins of the VAX.
  
  1 reply →
zamalek 3 years ago
From the bottom of the article:
> Most people turn to TCP_NODELAY because of the “200ms” latency you might incur on a connection. Fun fact, this doesn’t come from Nagle’s algorithm, but from Delayed ACKs or Corking. Yet people turn off Nagle’s algorithm … :sigh:
- Philip-J-Fry 3 years ago
  
  Yeah but Nagle's Algorithm and Delayed ACKs interaction is what causes the 200ms.
  Servers tend to enable Nagle's algorithm by default. Clients tend to enabled Delayed ACK by default, and then you get this horrible interaction all because they're trying to be more efficient but stalling eachother.
  I think Go's behavior is the right default because you can't control every server. But if Nagle's was off by default on servers then we wouldn't need to disabled Delayed ACKs on clients.
  
  23 replies →
iofiiiiiiiii 3 years ago
> I've seen people writing C# applications and wondering why stuff is taking 200ms
I observe that in the most recent generation of its HTTP client (SocketsHttpHandler), .NET also sets NoDelay by default.
https://github.com/dotnet/runtime/blob/main/src/libraries/Sy...
- kcartlidge 3 years ago
  
  TIL - thanks!
silisili 3 years ago

Agreed. The post should be titled 'Go enables TCP_NODELAY by default', and a body may or may not even be needed. It's documented, even https://pkg.go.dev/net#TCPConn.SetNoDelay
To know why would be interesting, I guess. But you should be buffering writes anyways in most cases. And if you refuse to do that, just turn it back off on the socket. This is on the code author.
sneak 3 years ago
> I'd imagine most Go software is deployed in datacentres where the network is high quality
The problem is that those datacenters are plugged into the Internet, where the network is not always high quality. TFA mentions the Caddy webserver - this is "datacenter" software designed to talk to diverse clients all over the internet. The stdlib should not tamper with the OS defaults unless the OS defaults are pathological.
- tptacek 3 years ago
  
  That doesn't make much sense. There are all sorts of socket and file descriptor parameters with defaults that are situational; NDELAY is one of them, as is buffer size, nonblockingness, address reuse, &c. Maybe disabling Nagle is a bad default, maybe it isn't, but the appeal to "OS defaults" is a red herring.
  
  17 replies →
- drpixie 3 years ago
  
  Also, for small packets, disabling consolidation means adding LOTS of packet overhead. You're not sending 1 million * 50 bytes of data, you're sending 1 million * (50 bytes of data + about 80 bytes of TCP+ethernet header).
  Disabling Nagle makes sense for tiny request/replys (like RPC calls) but it's counterproductive for bulk transfers.
  I'm not the only one who don't like the thought of a standard library quietly changing standard system behaviour ... so know I have to know the standard routines and their behaviour AND I have to know which platforms/libraries silently reverse things :(
  
  3 replies →

tptacek 3 years ago

This isn't a defect, which makes the whole comment kind of strange. I blame the post title, which should be "Golang disables Nagle's Tinygram Algorithm By Default"; then we could just debate Nagle vs. Delayed ACK, which would be 100x more interesting than subthreads like this.

nightpool 3 years ago
Certainly you'd agree that this is a bug in git lfs though, correct? And users doing "git push" with their 500MB files shouldn't have to think about tinygrams or delayed ack?
It's reasonable to think about what other programs might have been affected by this default choice (I'm sure I used one myself two weeks ago—a Dropbox API client with inexplicably awful throughput) and what a better API design that could have avoided this problems might look like
- anonymoushn 3 years ago
  
  Maybe golang should default to panicking if the application repeatedly calls send() with tiny amounts of data :)
- tptacek 3 years ago
  
  I don't know enough about git-lfs to say. Things that need buffering should deliberately buffer, I guess?
dang 3 years ago

Ok, I've replaced the title with that. Thanks!
though I kind of liked "This adventure starts with git-lfs" (the old use-first-sentence-as-title trick) which was the replacement before this
AaronFriel 3 years ago

I think it's a false dichotomy. Delayed ACK and Nagle's algorithm each improve the network in different ways, but Nagle's specifically allows applications to be written without knowledge of the underlying network socket's characteristics.
But there's another way, a third path not taken: Nagle's algorithm plus a syscall (such as fsync()) to immediately clear the buffer.
I believe virtually all web applications - and RPC frameworks - would benefit from this over setting TCP_NODELAY.
It would also be more elegant than TCP_CORK, which has a tremendous pitfall: failing to uncork can result in never sending the last packet. And it's easy to implement by adding a syscall at the end of each request and response. Applications almost always know when they're done writing to a stream.
maxbond 3 years ago
Why isn't this a defect? It brought OP's transfer speed over Ethernet to 2.5MB/s.
- jonas21 3 years ago
  
  Because it's a tradeoff. The author touches on this in the last sentence:
  > Here’s the thing though, would you rather your user wait 200ms, or 40s to download a few megabytes on an otherwise gigabit connection?
  Though I'd phrase it as "would you rather add 200ms of latency to every request, or take 40s to download a few megabytes when you're on an extremely unreliable wifi network and the application isn't doing any buffering?"
  In the use cases that Go was designed for, it probably makes sense to set the default to do poorly in the latter case in order to get the latency win. And if that's not the case for a given application, it can set the option to the other value.
- tptacek 3 years ago
  
  It's an option, with a default. Arguably (I mean, I'd argue it, other reasonable people would disagree), Go's default is the right one for most circumstances. That's not a "defect"; it's a design decision people disagree with.
  
  1 reply →
- loeg 3 years ago
  
  If there is a defect, it's in git-lfs. Picking a reasonable default is not a defect.
  
  20 replies →
stonemetal 3 years ago

Delayed ACK seems like the better default to me, whether it is telnet or web servers, network programming is almost always request response. Delaying the ACK so that part of that response is ready seems like the correct choice. In today's network programming how often is tinygram really an issue?
In this case I would consider the bug to be git lfs. Even if Nagle's was enabled I would still consider it a bug, because of the needless syscall overhead of doing 50 byte writes.

jchw 3 years ago

Actually if you're sending a file or something, do you really need Nagle's algorithm? It seems like the real mistake might be not using a large enough buffer for writing to the socket, but I could be speaking out my ass.

There's actually a lot of prevailing wisdom that suggests disabling Nagle's algorithm is (often) a good idea. While the problem with latency is caused by delayed ACKs, the sender can't do anything about that, because it's the receiver side that controls this.

Not saying that it's good the standard library defaults this necessarily... But this post paints the decision in an oddly uncharitable light. That said, I can't find the original thread where this was discussed, if there ever was one, so I have no idea why they chose to do this, and perhaps it shouldn't be this way by default.

AaronFriel 3 years ago

It's often a good idea when the application has its own buffering, as is common in many languages and web frameworks which implement some sort of 'reader' interface which can alternate symbols of "chunks" and "flushes" or only emit entire payloads (a single chunk). With scatter-gather support for IO, it's generally OK for the application to produce small chunks followed by a flush. Those application layer frameworks want Nagle's algorithm turned off at the TCP layer to avoid double-buffering and incurring extra latency.
Go however is disabling Nagle's by default as opposed to letting it be a framework level decision.
scaredginger 3 years ago
This is a great point. Why is Git LFS uploading a large file in 50 byte chunks?
- rsaxvc 3 years ago
  
  Ideally large files would upload in MTU sized packets, which Nagle's algorithm will often give you, otherwise you may have a small amount of additional overhead at the boundary where the larger chunk may not be divisible into MTU sized packets.
  Edit: I mostly work in embedded (systems that don't run git-lfs), perhaps my view is isn't sensible here.
  
  3 replies →
- jesprenj 3 years ago
  
  I do not know Go. But what if there are so many high level abstractions in the Go language that it operates on streams directly?
  
  2 replies →
Karrot_Kream 3 years ago
The footnote has a brief note about delayed ACKs but it's not like the creator of the socket can control whether the remote is delaying ACKs or not. If ACKs are delayed from the remote, you're eating the bad Nagle's latency.
The TCP_NODELAY behavior is settable and documented here [1]. It might be better to more prominently display this behavior, but it is there. Not sure what's up with the hyperbolic title or what's so interesting about this article. Bulk file transfers are far from the most common use of a socket and most such implementations use application-level buffering.
[1]: https://pkg.go.dev/net#TCPConn.SetNoDelay
- burnished 3 years ago
  
  The title is hyperbolic because a real person got frustrated and wrote about it, the article is interesting because a real person got frustrated at something many of us can imagine encountering but not so many successfully dig into and understand.
  “Mad at slow, discovers why slow” is a timeless tale right up there with “weird noise at night, discovers it was a fan all along”, I think it’s just human nature to appreciate it.
morelisp 3 years ago

> There's actually a lot of prevailing wisdom that suggests disabling Nagle's algorithm is (often) a good idea.
Because even in mediocre networks it is a good idea.
Don’t write a small amount of data if you want (or in this case even need) to send a large amount of data!

leighmcculloch 3 years ago

Some prior discussion about why turn on TCP_NODELAY: https://jvns.ca/blog/2015/11/21/why-you-should-understand-a-...

John Nagle's comments about it: https://news.ycombinator.com/item?id=10608356

amluto 3 years ago

IMO the real problem is that the socket API is insufficient, and the Nagle algorithm is a kludge around that.

When sending data, there are multiple logical choices:

1. This is part of a stream of data but more is coming soon (once it gets computed, once there is buffer space, or simply once the sender loops again).

2. This is the end of a logical part of the stream, and no more is coming right now.

3. This is latency-sensitive.

For case 1, there is no point in sending a partially full segment. Nagle may send a partial segment, which is silly. For case 2, Nagle is probably reasonable, but may be too conservative. For case 3, Nagle is wrong.

But the socket API is what it is, no one seems to want to fix this, and we’re stuck with a lousy situation.

pclmulqdq 3 years ago
I'm pretty convinced that every foundational OS abstraction that we use today, most of which were invented in the 70's or 80's, is wrong for modern computing environments. It just sucks less for some people than for other people.
I do think Golang's choice of defaulting to TCP_NODELAY is probably right - they expect you to have some understanding that you should probably send large packets if you want to send a lot of stuff, and you likely do not want packets being Nagled if you have 20 bytes you want to send now. TCP_QUICKACK also seems wrong in a world with data caps - the unnecessary ACKs are going to add up.
Issuing a SEND syscall every 50 bytes is also horrendously CPU-inefficient, and certainly should be expected to trigger pathological cases.
At this point, the OS is basically expected to guess what you actually want to do from how you incant around their bad abstractions, so it's not surprising that sending megabytes of data 50 bytes at a time would trigger some weird slowdowns.
- klabb3 3 years ago
  
  > Issuing a SEND syscall every 50 bytes is also horrendously CPU-inefficient
  This is the real crime here. The fact that it maxed out at 2.5Mb/s might be quite literally due to CPU limit.
  If you are streaming a large amount of data, you should use a user space buffer anyway, especially if you have small chunks. In Golang, buffers are standard practice and a one-liner to add.
  
  1 reply →
- nextaccountic 3 years ago
  
  > Issuing a SEND syscall every 50 bytes is also horrendously CPU-inefficient
  io_uring is supposed to help with that
jandrese 3 years ago
This seems like it should be very simple to fix without having to do much to the API. Just implement a flush() function for TCP sockets that tells the stack to kick the current buffer out to the wire immediately. It seems so obvious that I think I must be missing something. Why didn't this appear in the 80s?
- erik_seaberg 3 years ago
  
  It’s not portable but Linux has a TCP_CORK socket option that does this.
  
  2 replies →
AaronFriel 3 years ago

It's a downside of the "everything is a file" mindset. As all abstractions are, it's leaky.
Nagle's algorithm is elegant because it allows poorly written applications to saturate a PHY.
Disabling it requires the application layer to implement its own buffer.
If I had a time machine and access to the early *nixes, I'd extend Nagle's algorithm and the kernel to treat fsync() as a signal to flush immediately.
blibble 3 years ago
> But the socket API is what it is, no one seems to want to fix this, and we’re stuck with a lousy situation.
Linux/FreeBSD/... have had the TCP corking API for what, 20 years?
- amluto 3 years ago
  
  IMO MSG_MORE is a substantially better interface. Sadly it seems to be rarely used.
  
  2 replies →
rubatuga 3 years ago

Latency can be affected by both CPU load and network congestion, so it's possible that Nagle's algorithm can help in Case 3. It's really trial and error to see what works best in practice.

Matthias247 3 years ago

The article is way too opinionated about „golang is doing it wrong“ for a decision that neither has a right or wrong.

Nagle can make sense for some applications, but also has drawbacks for others - as countless articles about the interaction with delayed acks and 40ms pauses (which are pretty huge in the days of modern internet) describe.

If one uses application side buffering and syscalls which transmit all available data at once, enabling NODELAY seems like a valid choice. And that pattern is the one that is used by GOs http libraries, all TLS libraries (you want to encrypt a 16kB record anyway), and probably most other applications using TCP. It’s are seeing anything doing direct syscalls with tiny payloads.

The main question should be why LFS has this behavior - which also isn’t great from an efficiency standpoint. But that question is best discussed in a bug report, and not a blog post of this format.

withinboredom 3 years ago
I prefer reliability over latency, always. The world won’t fall apart in 200ms, let alone 40ms. If you’re doing something where latency does matter (like stocks) then you probably shouldn’t be using TCP, honestly (avoid the handshake!)
When it comes to code, readability and maintainability are more important. If your code is reading chunks of a file then sending it to a packet, you won’t know the MTU or changes to the MTU along the path. Send your chunk and let Nagle optimize it.
Further, principle of least surprise always applies. The OS default is for Nagle to be enabled. For a language to choose a different default (without providing a reason), and one that actively is harmful in poor network conditions at that, was truly surprising.
- Matthias247 3 years ago
  
  TCP is always reliable, the choice of this algorithm will never impact this - it will only impact performance (bandwidth/latency) and efficiency.
  Enabling nagle by default will lead to elevated latencies with some protocols that don't require the peer to send a response (and thereby a piggybacked ACK) after each packet. Even a "modern" TLS1.3 0RTT handshake might fall into that category. This is a performance degradation.
  The scenario that is described in the blog post where too many small packets due to nothing aggregating them causing elevated packet loss is a different performance degradation, and nothing else.
  Both of those can be fixed - the former only by enabling TCP_NODELAY (since the client won't have control over servers), the second by either keeping TCP_NODELAY disabled *or* by aggregating data in userspace (e.g. using a BufferedWriter - which a lot of TLS stacks might integrate by default).
  > The world won’t fall apart in 200ms, let alone 40ms.
  You might be underestimating the the latency sensitivity of the modern internet. Websites are using CDNs to get to a typical latency in the 20ms range. If this suddenly increases to 40ms, the internet experience of a lot of people might get twice as bad as it is at the moment. 200ms might directly push the average latency into what is currently the P99.9 percentile.
  And it would get even worse for intra datacenter use-cases, where the average is in the 1ms range - and where accumulated latencies would still end up being user-experiencable (the latency of any RPC call is the accumulated latency of upstream calls).
  > If your code is reading chunks of a file then sending it to a packet, you won’t know the MTU or changes to the MTU along the path
  Sure - you don't have to. As mentioned, you would just read into an intermediate application buffer of a reasonable size (definitely bigger than 16kB or 10 MTUs) and let the OS deal with it. A loop along `n = read(socket, buffer); write(socket, buffer[0..n])` will not run into the described issue if the buffer is reasonably sized and will be a lot more CPU efficient than doing tiny syscalls and expecting all aggregation to happen in TCP send buffers.
- lanstin 3 years ago
  
  Much of the world is doing ok with TCP and TLS but with session resumption and long lived connections. Many links will be marked bad in 200 ms and retries or new links issues. Imagine you are doing 20k / second / CPU. That is four thousand backed up calls for no reason, just randomness.
- Ferret7446 3 years ago
  
  > I prefer reliability over latency, always.
  I imagine all the engineers who serve millions/billions of requests per second disagree with adding 200ms to each request, especially since their datacenter networks are reliable.
  > Send your chunk and let Nagle optimize it.
  Or you could buffer yourself and save dozens/hundreds of expensive syscalls. If adding buffering makes your code unreadable, your code has bigger maintainability problems.
  
  1 reply →
throwdbaaway 3 years ago
> And that pattern is the one that is used by GOs http libraries
I don't think that is correct. In https://github.com/git-lfs/git-lfs/issues/5242 can be resolved.
In the mean time, #2b can actually be achieved with a "SRE approach" by patching the kernel to remove delayed ack and patching the Go library to remove the `setNoDelay` call. Something for OP to try?
- throwdbaaway 3 years ago
  
  I just learnt about "ip route change ROUTE quickack 1" from https://news.ycombinator.com/item?id=10662061, so we don't even need to patch the kernel. This makes 2b a really attractive option.

ialad 3 years ago

I'm using Go's default HTTP client to make a few requests per second. I set a context timeout of a few seconds for each request. There are random 16 minute intervals where I only get the error `context deadline exceeded`.

From what I found, Go's default client uses HTTP/2 by default. When a TCP connection stops working, it relies on the OS to decide when to time out the connection. Over HTTP/1.1, it closes the connection itself [1] on timeout and makes a new connection.

In Linux, I guess the timeout for a TCP connection depends on `tcp_retries2` which defaults to 15 and corresponds to a time of ~15m40s [2].

This can be simulated by making a client and some requests and then blocking traffic with an `iptables` rule [3]. My solution for now is to use a client that only uses HTTP/1.1.

[1] https://github.com/golang/go/issues/36026#issuecomment-56902...

[2] https://blog.cloudflare.com/when-tcp-sockets-refuse-to-die/

[3] https://github.com/golang/go/issues/30702

iampims 3 years ago
You can configure the HTTP/2 client to use a timeout + heartbeat.
https://go.googlesource.com/net/+/master/http2/transport.go
- francislavoie 3 years ago
  
  That's a big file. Mind pointing to a specific line number?
  
  1 reply →
benmmurphy 3 years ago

that sounds like there is pooling going on and not invalidating the pooled connection when a timeout happens. I've actually seen a lot of libraries in other languages do a similar thing (my experience is some of the elixir libraries don't have good pool invalidation for http connections). having a default invalidation policy that handles all situations is a bit difficult but I think a default policy that invalidates on any timeout is much better than a default policy that never invalidates on a timeout. as long as invalidation means just evicting it from the pool and not tearing down other channels on the HTTP/2 connection. for example you could have a timeout on a HTTP/2 connection that is just on an individual channel but there is still data flowing through the other channels.
rattray 3 years ago
Wow. Can you easily change the tcp connection timeout?
- iampims 3 years ago
  
  You can. It’s trivial once you know it’s possible. Not sure why it’s not set by default. https://go.googlesource.com/net/+/master/http2/transport.go
  
  2 replies →

PathOfEclipse 3 years ago

As a counter-argument, I've ran into serious issues that were caused by TCP delay being enabled by default, so I ended up disabling it. I actually think having it disabled by default is the right choice, assuming you have the control to re-enable it if you need to.

Also, in my opinion, if you want to buffer your writes, then buffer them in the application layer. Don't rely on the kernel to do it for you.

erik_seaberg 3 years ago
The kernel has to buffer everything you send in a sliding window, to retry missed acks. Userspace buffering only reduces syscalls.
A lot of people with strong preferences about segment boundaries and timing are arguing with TCP and probably shouldn’t be using it.
- Ferret7446 3 years ago
  
  > Userspace buffering only reduces syscalls.
  "only". The kernel also buffers disk writes, but god help you if you're writing files to disk byte by byte.
withinboredom 3 years ago
I talked a bit about that in the post. Use your own buffers if possible, but there are times you can’t do that reliably (proxies come to mind) where you’d have to basically implement an application specific Nagles algorithm. If you find yourself writing something similar, it’s probably better to let the kernel do it and keep your code simpler to reason about.
- morelisp 3 years ago
  
  If you are writing a serious proxy you should be working at either a much lower level (eg splice) or a much higher level (ReadFrom, Copy). If you’re messing around with TCPConn parameters and manual buffer sizes you’ve already lost.
  
  2 replies →
- PathOfEclipse 3 years ago
  
  I haven't thought about this hard, but, would a proxy not serve it's clients best by being as transparent as possible, meaning to forward packets whenever it receives them from either side? I think this would imply setting no_delay on all proxies by default. If either side of the connection has a delay, then the delay will be honored because the proxy will receive packets later than it would otherwise.
to11mtm 3 years ago

IFF you are LAN->LAN or even DC->DC, NoDelay is usually better nowadays. If you are having to retransmit at that level you have far larger problems somewhere else.
If you're buffering at the abstracted transport level, Same.

danpalmer 3 years ago

Go was explicitly designed for writing servers. This means two things are normally true:

- latency matters, for delivering a response to a client

- the network is probably a relatively good datacenter network (high bandwidth, low packet loss/retransmission)

Between these things, I think the default is reasonable, even if not what most would choose. As long as it’s documented.

The fact that other languages have other defaults, or the fact that people use Go for all sorts of other things like system software, doesn’t invalidate the decision made by the designers.

andrewxdiamond 3 years ago
> the network is probably a relatively good datacenter network (high bandwidth, low packet loss/retransmission)
The first lesson I learned about Distributed Systems Engineering is the network is never reliable. A system (or language) designed with the assumption the network is reliable will tank.
But I also I don’t agree that Go was written with that assumption. Google has plenty of experience in distributed systems, and their networks are just as fundamentally unreliable as any
- danpalmer 3 years ago
  
  “Relatively” may have needed some emphasis here, but in general, networking done by mostly the same boxes operated by the same people, in the same climate controlled building, are going to be far more reliable than home networks, ISPs running across countries, regional phone networks, etc.
  Obviously nothing is perfect, but applications deploying in data centres should probably make the trade offs that give better performance on “perfect” networks, at the cost of poorer performance on bad networks. Those deploying on mobile devices or in home networks may better suit the opposite trade offs.
- shadowgovt 3 years ago
  
  > The first lesson I learned about Distributed Systems Engineering is the network is never reliable
  Yep, and it's a good rule. It's the one Google applies across datacenters.
  ... but within a datacenter (i.e. where most Go servers are speaking to each other, and speaking to the world-accessible endpoint routers, which are not written in Go), the fabric is assumed to be very clean. If the fabric is not clean, that's a hardware problem that SRE or HwOps needs to address; it's not generally something addressed by individual servers.
  (In other words, were the kind of unreliability the article author describes here on their router to occur inside a Google datacenter, it might be detected by the instrumentation on the service made of Go servers, but the solution would be "If it's SRE-supported, SRE either redistributes load or files a ticket to have someone in the datacenter track down the offending faulty switch and smash it with a hammer.")
- AnimalMuppet 3 years ago
  
  Relatively reliable. Not "shitty". If you've got a datacenter network that can be described as "shitty", fix your network rather than blaming Go.
- morelisp 3 years ago
  
  This is an embarrassing response. The second lesson you should’ve learned as a systems engineer, long before any distributed stuff, is “turn off Nagle’s algorithm.” (The first being “it’s always DNS”.)
  When the network is unreliable larger TCP packets ain’t gonna fix it.
  
  3 replies →
- sillysaurusx 3 years ago
  
  It's strange you're getting hammered for this. Everyone in 6.824 would probably agree with you. https://pdos.csail.mit.edu/6.824/
  Let's weigh the engineering tradeoffs. If someone is using Go for high-performance networking, does the gain from enabling NDELAY by default outweigh the pain caused by end users?
  Defaults matter; doubly so for a popular language like Go.
  
  24 replies →
iainmerrick 3 years ago
It was originally designed as a C/C++ replacement, not necessarily for servers. If I remember right the first major application it was used for was log processing (displacing Google’s in-house language Sawzall) rather than servers.
- nine_k 3 years ago
  
  Everything developed at Google is intended for transforming protobufs. And how are you going to get some protobufs in the first place? /s
  
  1 reply →

Ameo 3 years ago

Networking is the place where I notice how tall modern stacks are getting the most.

Debugging networking issues inside of Kubernetes feels like searching for a needle in a haystack. There are so, so many layers of proxies, sidecars, ingresses, hostnames, internal DNS resolvers, TLS re/encryption points, and protocols that tracking down issues can feel almost impossible.

Even figuring out issues with local WiFi can be incredibly difficult. There are so many failure modes and many of them are opaque or very difficult to diagnose. The author here resorted to WireShark to figure out that 50% of their packets were re-transmissions.

I wonder how many of these things are just inherent complexity that comes with different computers talking to each other and how many are just side effects of the way that networking/the internet developed over time.

parasubvert 3 years ago

Kubernetes has no inherent or required proxies or sidecars or ingresses, or TLS re-encryption points.
Those are added by “application architects”, or “security architects” and existed long before Kubernetes, for the same debatable reasons: they read about it in a book or article and thought it was a neat idea to solve a problem. Unfortunately, they may not understand the tradeoffs deeply, and may have created more problems than were solved.
imglorp 3 years ago

There's been a highly annoying kubectl port-forward heisenbug open for several years which smells an awful lot like one of these dark Go network layer corners. You get a good connection establish and some data flows, but at some random point it decides to drop. It's not annoying enough for any wizards to fix. I immediately thought of this bug when Nagle in Go came up here.
https://github.com/kubernetes/kubernetes/issues/74551
anthk 3 years ago
Wireshark exists since forever.
- Beltalowda 3 years ago
  
  Wireshark doesn't tell you anything about what's wrong with your code. It just tells you "yup, the code is doing something wrong!"
  Figuring that out in Kubernetes ... yeah, good luck with that.
- lanstin 3 years ago
  
  And that or tcpdump should be the first thing you grab to diagnose a network issue.
  
  1 reply →

bradfitz 3 years ago

Because you're supposed to have buffering at a different layer.

davewritescode 3 years ago

Golang has burned me more than once with bizarre design decisions that break things in a user hostile way.

The last one we ran into was a change in Go 1.15 where servers that presented a TLS certificate with the hostname encoded into the CN field instead of the more appropriate SAN field always fail validation.

The behavior could be disabled however that functionality was removed in 1.18 with no way to opt back into the old behavior. I understand why SAN is the right way to do it but in this case I didn’t control the server.

Developers at Google probably never have to deal with 3rd parties with shitty infrastructure but a lot of us do.

Here’s a bug in rke that’s related https://github.com/rancher/rke2/issues/775

yakaccount4 3 years ago
The x509 package has unfortunately burned me several times, this one included. It is too anal about non-fatal errors, that Google themselves forked it (and asn1) to improve usability.
https://github.com/google/certificate-transparency-go
- davewritescode 3 years ago
  
  Sorry for the late response but thank you so much much for showing me this

ninkendo 3 years ago

It also doesn’t play well with split tunnel VPN’s on macOS that are configured for particular DNS suffixes. If you have a VPN that is only active for connections in a particular domain, git-lfs (and I think any go software, by default) will try to use your non-VPN connection for connections that should be on the VPN.

I don’t know why it is, exactly… but I think it’s related to Golang intentionally avoiding using the system libc and implementing its own low-level TCP/IP functions, leading to it not using the system configuration which tells it which interface to use for which connections.

Edit: now that I think about it, I think the issue is with DNS… macOS can be configured such that some subdomains (like on a VPN) are resolved with different DNS servers than others, which helps isolate things so that you only use your VPN’s DNS server for connections that actually need it. Go’s DNS resolution ignores this configuration system and just uses the same server for all DNS resolution, hence the issue.

SomaticPirate 3 years ago
Go’s choice to default to its own TCP/IP implementation has bitten me personally to the level of requiring a machine restart.
The Go IPv6 DNS resolution on MacOS can cause all DNS requests on the system to begin to fail until a restart.
https://github.com/golang/go/issues/52839
- alecthomas 3 years ago
  
  Not to understate the impact of the bug, but this is not the default for Go. It is used if CGo is disabled, as the issue you linked to describes.
  
  1 reply →
- EdiX 3 years ago
  
  The OS network stack is crashing and this is Go's fault? Is Go holding the network stack wrong?
LaLaLand122 3 years ago

To be fair, "getaddrinfo is _the_ path" is a shitty situation.
- It's a synchronous interface. Things like getaddrinfo_a are barely better. It has forced people to do stuff like https://c-ares.org/ for ages, which has suffered from "is not _the_ path" issues for as long
- It's a less featured interface than, for example, https://wiki.freedesktop.org/www/Software/systemd/writing-re...

jeroenhd 3 years ago

This explains why several of my Go programs needed the occasional restart because of terribly slow transfers over mobile networks.

These weird decisions that go against the norm are exactly why I hate writing Go. There are hidden footguns everywhere and the only way to prevent them is to role play as a Google dev backend dev in a hurry.

Philip-J-Fry 3 years ago

>This explains why several of my Go programs needed the occasional restart because of terribly slow transfers over mobile networks.
It doesn't explain that. Why would this cause you to need to restart your applications? At most it will just decrease performance of that transfer.

forrestthewoods 3 years ago

Meanwhile almost every project I work on is latency sensitivity and I’ve lost track of how many times the fix to bad performance was “disable Nagles algorithm”.

Honestly the correct solution here is probably “there is no default value, the user must explicitly specify on or off”. Some things just warrant a coder to explicitly think about it upfront.

Asmod4n 3 years ago
It’s delayed ack on the client side which adds that slowdown. The spec allows the client to wait up to 500 ms to send it.
- withinboredom 3 years ago
  
  Delayed ACKs send an ACK every-other-packet. So you have to wait at least 200ms for the first ACK. So if you have enough data for two packets then you won’t even notice a delay (probably most data these days unless you have jumbo frames all the way to the client).
  If you control the client, you can turn on quick ACKs and still use Nagle’s algorithm to batch packets.

tlamponi 3 years ago

From my experience most TCP using projects existing for a longer time disable Nagle's Algorithm sooner or later, we did so at Proxmox VE in 2013:

https://git.proxmox.com/?p=pve-manager.git;a=commitdiff;h=fd...

Most of the time it just makes things worse nowadays, so yes, having it disabled by default makes IMO sense.

Flowdalic 3 years ago

The problem does not seem to be that TCP_NODELAY is on, but that the packets are sent carry only 50 bytes of payload. If you send a large file, then I would expect that you invoke send() with page-sized buffers. This should give the TCP stack enough opportunity to fill the packets with an reasonable amount of payload, even in the absence of Nagel's algorithm. Or am I missing something?

kevincox 3 years ago
Even if the application is making 50 byte sends why aren't these getting coalesced once the socket's buffer is full? I understand that Nagle's algorithm will send the first couple packets "eagerly" but I would have expected that onced the transmit window is full they start getting coalesced since they are being buffered anyways.
Disabling Nagle's algorithm should be trading network usage for latency. But it shouldn't reduce throughput.
- Flowdalic 3 years ago
  
  > Even if the application is making 50 byte sends why aren't these getting coalesced once the socket's buffer is full?
  Because maybe the 50 bytes are latency sensitive and need to be at the recipient as soon as possible?
  > I understand that Nagle's algorithm will send the first couple packets "eagerly" […] Disabling Nagle's algorithm should be trading network usage for latency
  No, Nagle's algorithm will delay outgoing TCP packets in the hope that more data will be provided to the TCP connection, that can be shoved into the delayed packet.
  The issue here is not Go's default setting of TCP_NODELAY. There is an use case for TCP_NODELAY. Just like there is a use case for disabling TCP_NODELAY, i.e., Nagle's algorithm (see RFC 869). So any discussion about the default behavior appears to be pointless.
  Instead, I believe the application or a underlying library is to blame. Because I don’t see why applications performing a bulk transfer of data by using “small” (a few bytes) write is anything but a bad design. Not writing large (e.g., page-sized) chunks of data into the file descriptor of the socket, especially when you know that there multiple more of this chunks are to come, just kills performance on multiple levels.
  If I understand the situation the blog post describes correctly, then git-lfs is sending a large (50 MiB?) file in 50 bytes chunks. I suspect this is because git-lfs (or something between git-lfs and the Linux socket, e.g., a library) issues writes to the socket with 50 bytes of data from the file.
  
  1 reply →

cat_plus_plus 3 years ago

Modern programming does buffering on class level rather than system call level. Even if NAGLE solves the problem of sending lots of tiny packets, it doesn't solve the problem of making many inefficient system calls. Plus, best size of buffers and flash policy can only be determined by application logic. If I want smart lights to pulse in sync with music heard by a microphone, delaying to optimize network bandwidth makes no sense. So providing raw interface with well defined behavior by default and taking care of things like buffering in wrapper classes is the right thing to do.

viraptor 3 years ago
> best size of buffers and flash policy can only be determined by application logic
That's not really true. The best result can be obtained by the OS, especially if you can use splice instead of explicit buffers. Or sendfile. There's way too much logic in this to expect each app to deal with this, or even things it doesn't really know about like current IO pressure, or the buffering and caching for a given attached device.
Then there are things you just can't know about. You know about your MTU for example, but won't be monitoring the changes for the given connection. The kernel knows how to scale the buffers appropriately already so it can do the flushes in a better way than the app. (If you're after throughout not latency)
- PeterisP 3 years ago
  
  > The kernel knows how to scale the buffers appropriately already so it can do the flushes in a better way than the app. (If you're after throughout not latency)
  Well, how can the OS know if I'm after throughput or latency? It would be very wrong to simply assume that all or even most apps would prioritize throughput; at modern network speeds throughput often is sufficient and user experience is dominated by latency (both on consumer and server side), so as the parent post says, this policy can only be determined by application logic, since OS doesn't know about what this particular app needs with respect to throughput vs latency tradeoffs.
  
  1 reply →
jandrese 3 years ago
I kind of wonder if these applications are forced to do their own buffering because they have disabled Nagle's algorithm?
The old adage about people who attempt to attempt to avoid TCP end up reinventing TCP and re-learning the lessons from the 70s...
- yencabulator 3 years ago
  
  You missed the part about many inefficient system calls. You want buffering to happen before the thing that has a relatively high per-call overhead.
Too 3 years ago

If you want smart lights to pulse in sync with your microphone you shouldn’t be using TCP in the first place, here UDP is a lot more suitable.
TCP is reconstructing the order, meaning a glitch of a single packet will propagate as delay for following packets, in worst case accumulate into a big congestion.
withinboredom 3 years ago

I talked a bit about that in the post. When you know the network is reliable, it’s a non-issue. When you need to send a few small packets, disable Nagles. When you need to send a bunch of tiny packets across an unknown network (aka the internet) use Nagles.

teleforce 3 years ago

Those who want more fundamental background on the matter can check this excellent seminal paper by Van Jacobson and Michael Karels [1].

In one of the Computerphile's podcasts on the history of Internet congestion, it's claimed as the most influential paper about the Internet and apparently it has more than 9000 citations as of today [2].

Some trivia, based on this research work Van, together with Steve McCane also created the BPF, Berkeley Packet Filter while he's in Berkeley Uni. This is later adopted by the Linux community as eBPF, and the rest is history [3].

[1]Congestion Avoidance and Control:

https://ee.lbl.gov/papers/congavoid.pdf

[2]Internet Congestion Collapse - Computerphile:

https://youtu.be/edUN8OabWCQ

[3]Berkeley Packet Filter:

https://en.m.wikipedia.org/wiki/Berkeley_Packet_Filter

parhamn 3 years ago

First URL has an extra `&l` on it, that 404s. Thanks for the links!
andikleen3 3 years ago

You just have to be very careful with the algorithms in that paper, they had some serious problems (apart from their basic inability to deal with faster links). I like this old but fairly damning analysis from an early Linux TCP developer:
https://ftp.gwdg.de/pub/linux/tux/net/ip-routing/README.rto

karmanyaahm 3 years ago

I ran into a similar phantom-traffic problem from Go ignoring the Linux default for TCP keepalives and sending them every 15 seconds, very wasteful for mobile devices. While I quite like the rest of Go, I don't see why they have to be so opinionated and ignore the OS in their network defaults.

My PR fixing that in Caddy: https://github.com/caddyserver/caddy/pull/4865

Too 3 years ago
To be fair, the linux defaults of 2h are not working in most enterprise or cloud environments. One frequently encounter load balancers, firewalls and other proxies that drop connections after around 5-15 minutes. 15 seconds sounds very aggressive though.
- kccqzy 3 years ago
  
  The default of 2h is not just a Linux default; it's straight up from the RFC.
  https://www.rfc-editor.org/rfc/rfc9293.html#name-tcp-keep-al...
  > Keep-alive packets MUST only be sent when no sent data is outstanding, and no data or acknowledgment packets have been received for the connection within an interval (MUST-26). This interval MUST be configurable (MUST-27) and MUST default to no less than two hours (MUST-28).
mholt 3 years ago

Thanks for that PR!! We greatly appreciate it.

bakul 3 years ago

What has this to do with the Go language? Runtime defaults don't always work for every possible situation, particularly when the runtime provides much more over a kernel interface. Investigate performance issues and if some default doesn't work for you, you can always change it.

withinboredom 3 years ago
Principle of least surprise. Nagle’s is disabled in Go, except in Windows. The OS default is to have it enabled. I thought this was probably some weird accidental configuration in git-lfs. Then it turned into “aha, this is the source of all my problems on my shitty wifi”
- kccqzy 3 years ago
  
  It reminded me of the time when Rust ignored SIGPIPE (obviously a good choice for servers) but did it universally. That's of course also violating the principle of least surprise when interrupting a pipe suddenly causes Rust to spew some exceptions.

svnpenn 3 years ago

OP didn't link to the issue, so here it is:

https://github.com/caddyserver/caddy/issues/5276

also, OP didn't mention that its extremely easy to configure this, with Go itself:

https://godocs.io/net#TCPConn.SetNoDelay

lights0123 3 years ago
> OP didn't mention that its extremely easy to configure this
Maybe not explicitly, but it was definitely mentioned:
> From there, I went into the git-lfs codebase. I didn’t see any calls to setNoDelay
- svnpenn 3 years ago
  
  That's not the same function...

pmontra 3 years ago

Sshfs sets the nodelay tcp flag to off by default precisely because it's designed to transfer files and not interactive traffic, that is single keystrokes in a terminal.

This thread from 2006 could be interesting. It's about the different performances of scp and sftp https://openssh-unix-dev.mindrot.narkive.com/proARDEN/sftp-p...

Meta: the negative in nodelay makes it hard to follow some comments sometimes because of double negatives. The general best practice is to refrain from using negatives in names. This might have been TCP_GROUP_PACKETS?

willmacdonald 3 years ago

In the article the author talks of WiFi interference.

Try using MAC filtering. In previous experiments it drastically improved through put.

I know the mac address can be spoofed, provides no security and can be a pain to set up when everything is WiFi enabled, but it really helps.

All those other WiFi gadgets that belong to your neighbours are continuously try to login, and being rejected, all the time!

kccqzy 3 years ago

While you are at it, probably downgrade those ARP broadcasts to unicasts. Your home Wi-Fi router probably already knows all the IP address MAC address mapping; so no need for devices to send those stupid ARP broadcasts to everything.

btown 3 years ago

Ironically, I imagine one of the side effects of remote work will be that choices like this don't happen as much... because it's much less likely that all your in-house language devs will do all of their performance testing on your corporate WiFi, and at least some will use congested home networks and catch this sooner, or never write it at all.

There's no such thing as a perfect language for all situations - but given that Go was not designed to run solely on low-latency clusters, one wishes it had been further tested in other environments.

schimmy_changa 3 years ago

This is a bit of a hyperbolic title and post, but it does seem like a real issue that the Golang devs should address. Letting the socket do its thing seems like the right way to go, although I'm not an expert in networking.

Any ideas from the devs or other networking experts here in HN?

blibble 3 years ago

I suspect the current behaviour will have to stay as it is because the universe of stuff that could break as a result of changing it is completely unknowable
LeoNatan25 3 years ago
So it’s not hyperbolic, and actually describes things as they are?
- joemi 3 years ago
  
  Calling it "evil" is hyperbolic.
  
  1 reply →
akdor1154 3 years ago

I dunno.. I am not a networking expert by any stretch, but it does seem consistent with Golang's philosophy that devs should have a deep understanding of the various levels of the stack they're working in.
Though TFA does make a fair point that in reality this doesn't happen, and there is slow software abound as a result.

rowanG077 3 years ago

Disabling naggle by default is definitely the right decision. Git LFS does the wrong thing by sending out a file in 50 byte chunks. It should be sending MTU sized chunks.

stusmall 3 years ago

EDIT: I originally linked to the wrong review. It's been there since the initial commit of networking: https://github.com/golang/go/commit/e8a02230f215efb075cccd41...

Game_Ender 3 years ago
This is actually the review for adding back the ability to turn NODELAY on and off, it was actually in the networking code from the start https://github.com/golang/go/blob/e8a02230/src/lib/net/net.g...
- stusmall 3 years ago
  
  Thanks! I noticed that right after I posted it. Unfortunately my non-procrastination setting kicked in and I couldn't delete it before anyone saw it.
withinboredom 3 years ago

Lol “back by popular demand”
At least it wasn’t for my initial thoughts when seeing PRs around that code “to speed up unit tests”. I’d love to see the discussions though.

Aeolun 3 years ago

Is this a problem in Go itself? Isn’t this something the Git-lfs should be changing in only lfs?

It seems reasonable to prefer a short delay by default, but when you are sending multi-magabyte files (lfs’s entire use case) it seems like it would be better to make the connection more reliable (e.g. nobody cares about 200ms extra delay).

morelisp 3 years ago

git-lfs authors agree and point out regular git also disables Nagle.
https://github.com/git-lfs/git-lfs/issues/5242

herpderperator 3 years ago

> Once that was fixed, I saw 600MB per second on my internal network and outside throughput was about the same as wired.

Is the author talking about megabits or really megabytes? 112MB/s is the fastest real speed you will get on a gigabit network. I feel like the author meant to write Mbit instead of MB/s everywhere?

withinboredom 3 years ago

Good find. Yeah 800Mbits.

kyrra 3 years ago

Btw, here's the line on github: https://github.com/golang/go/blob/fbf763fd1d6be3c162ea5ff3c8...

Piezoid 3 years ago

It's been in the code base from the start: https://github.com/golang/go/blob/e8a02230/src/lib/net/net.g...

bogomipz 3 years ago

Relatedly there was a previous HN post and discussion about Delayed ACKs and TCP_NODELAY where John Nagle himself chimed in:

https://news.ycombinator.com/item?id=10608356

donalhunt 3 years ago

Thanks for this.

I've been troubleshooting a nasty issue with RTSP streams and while I'm fairly confident golang is not responsible, this has highlighted a potential root cause for the behaviour we've been seeing (out of order packets, delayed acks).

dboreham 3 years ago

I have an email from Nagle himself, c. 1997 telling me that it was probably a bad idea.

And I've disabled it in every server I've written since.

tptacek 3 years ago
You can just ask him here; he's the 12th busiest user on HN ('Animats, the name of his ragdoll physics engine).
- latency-guy2 3 years ago
  
  He's even here on an adjacent thread!

deathanatos 3 years ago

I'm not following.

Let's say the socket is set to TCP_NODELAY, and the transfer starts at 50 KiB/s. After a couple seconds, shouldn't the application have easily outpaced the network, and buffered enough data in the kernel such that the socket's send buffer is full, and subsequent packets are able to be full? What causes the small packets to persist?

kevincox 3 years ago

This is the question I had from the start and I'm surprised that I had to scroll this far down.
Nagle's algorithm is about what do to when the send buffer isn't full. It is supposed to improve network efficiency in exchange for some latency. Why is it affecting throughput?
Is Linux remembering the size of the send calls in the out buffer and for some reason insisting on sending packets of those sizes still? I can't imagine why it would do that. If anything it sounds like a kernel bug to me.
For large transfers it still likely makes sense to always send full packets (until the end) like TCP_CORK but it seems that it should be unnecessary in most cases.

pelorat 3 years ago

Because of this post I looked up how I disable Nagle's algorithm on Windows. I've now done it (according to the instructions at least). Let's see how it goes. I'm in central Europe on gigabit ethernet and fiber, with more than 50% of my traffic going over IPv6 and most European sites under 10ms away.

throwdbaaway 3 years ago

> not to mention nearly 50% of every packet was literally packet headers

I was just looking at a similar issue with grpc-go, where it would somehow send a HEADERS frame, a DATA frame, and a terminal HEADERS frame in 3 different packets. The grpc server is a golang binary (lightstep collector), which definitely disables Nagle's algorithm as shown by strace output, and the flag can't be flipped back via the LD_PRELOAD trick (e.g. with a flipped version of https://github.com/sschroe/libnodelay) as the binary is statically linked.

I can't reproduce this with a dummy grpc-go server, where all 3 frames would be sent in the same packet. So I can't blame Nagle's algorithm, but I am still not sure why the lightstep collector behaves differently.

throwdbaaway 3 years ago
Found the root cause from https://github.com/grpc/grpc-go/commit/383b1143 (original issue: https://github.com/grpc/grpc-go/issues/75):
// Note that ServeHTTP uses Go's HTTP/2 server implementation which is // totally separate from grpc-go's HTTP/2 server. Performance and // features may vary between the two paths.
The lightstep collector serves both gRPC and HTTP traffic on the same port, using the ServeHTTP method from the comment above. Unfortunately, Go's HTTP/2 server doesn't have the improvements mentioned in https://grpc.io/blog/grpc-go-perf-improvements/#reducing-flu.... The frequent flushes mean it can suffer from high latency with Nagle enabled, or from high packet overhead with Nagle disabled.
tl;dr: blame bradfitz instead :)

fideloper 3 years ago

One specific thing I wonder about is how this setting effects Docker, specifically when pushing/pulling images around.

In both GitHub Docker and Moby organizations, "SetNoDelay" doesn't return any results. I wonder if performance could be improved making connections with `connection.SetNoDelay(false)`

SergeAx 3 years ago

I have a hypothesis here. Go is a language closely curated by Google, and the primary use of Go in Google is to write concurrent Protobuf microservices, which is exactly the case of exchanging lots of small packets on a very reliable networks.

manv1 3 years ago

Nagle's algorithm is designed to stop packlets.

If you're not sending a lot of packlets you shouldn't be using Nagle's algorithm. It's on by default in systems because without it interactive shells get weird, and there are few things more annoying to sysadmins than weird terminal behavior, especially when shit is hitting the fan.

kevincox 3 years ago

But it seems that it shouldn't be limiting packets to 50 bytes (which is apparently the size of buffers used by the application in send/write). Once the send buffer is full the Kernel should be sending full packets.
elromulous 3 years ago

What's a packlet?

Jenda_ 3 years ago

I don't know Golang, but how does the function in git-lfs that writes to the socket look like? Is it writing in 50-byte chunks? Why?

Because I guess even with TCP_NODELAY, if I submit reasonably huge chunks of data (e.g. 4K, 64K...) to the socket, they will get split into reasonably-sized packets.

microtherion 3 years ago
The code in question seems to be this portion of SendMessageWithData in ssh/protocol.go [1]:
buf := make([]byte, 32768) for { n, err := data.Read(buf) if n > 0 { err := conn.pl.WritePacket(buf[0:n]) if err != nil { return err } } if err != nil { break } }
The write packet size seems to be determined by how much data the reader returns at a time. That could backfire if the reader were e.g. something like line at a time (no idea if something like that exists in Golang), but that does not seem to be the case here.
[1] https://github.com/git-lfs/git-lfs/blob/d3716c9024083a45771c...

andrewfromx 3 years ago

makes me wonder about SACKS again: https://www.reddit.com/r/networking/comments/yf3d6u/how_comm...

jsnell 3 years ago

SACKs are the second most important/useful TCP extension after window scaling. SACKs have had basically universal support for more than a decade (like, 95% of the traffic on the public internet negotiated SACKs in 2012). Anyone writing a new production TCP stack without SACKs is basically committing malpractice.

sireat 3 years ago

I've learned the hard way to avoid git-lfs at all costs.

Main issue is that git-lfs is NOT "it just works".

The migration process if you mistakenly in/excluded a file is quite painful and bug prone.

I'd rather just exclude big blobs from git if possible.

judge2020 3 years ago

Side-note: I wonder why the author has decided to include overflow:hidden in an effort to hide the page scroll bar.

withinboredom 3 years ago

Must be the theme. It’s a shitty theme but I can’t be bothered to get a better one atm. It’s pretty far down the todo list.

benmmurphy 3 years ago

when using TCP_NODELAY do you need to ensure your writes are a multiple of the maximum segment size? for example if the MSS is 1400 and you are doing writes of 1500 bytes does this mean you will be sending packets of size 1400 and 100?

withinboredom 3 years ago
What about if there are jumbo frames all the way to the client. You are throwing away a lot of bandwidth. What about if there is vxlan like in k8s, you’ll be sending two packets, one tiny and one full. Use Nagle and send what you have when you have it. Let the TCP stack do it’s job. Work on optimization when it is actually impactful to do so. Sending a packet is cheaper than reading a db.
- benmmurphy 3 years ago
  
  the big reason for no-delay is the really bad interaction between nagle's algorithm and delayed ACK for request-response protocols like the start of a TLS connection. its possible the second handshake packet the client/server sends to be delayed significantly because one of the parties has delayed ack enabled.
  Ideally, the application could just signal to the OS that the data needs to be flushed at a certain points. TCP_NODELAY almost lets you do this but the problem is it applies to all writes() including ones that don't need to be flushed. for example if you are a http server sending a 250MB response then only the last write needs to be 'flushed'. linux has some non-posix options that you give more control like TCP_CORK using setsockopt which lets you signal these boundaries explicitly or MSG_MORE which is a bit more convenient to use.

zikohh 3 years ago

Please add links to the GitHub issues in the blog

j3s 3 years ago

this has been known forever, very inflammatory article imo.

JeremyBanks 3 years ago

[dead]

patriciajenner 3 years ago

[dead]

oh3 3 years ago

interesting article

metadat 3 years ago

> I would absolutely love to discover the original code review for this and why this was chosen as a default. If the PRs from 2011 are any indication, it was probably to get unit tests to pass faster. If you know why this is the default, I’d love to hear about it!

Please hold while I pick my fallen jaw up off the floor.

The parents of the Internet work at Google. How could this defect make it to production and live for 12+ years in the wild? I guess nothing fixes itself, but this shatters the myth of Google(r) superiority. It turns out people are universally entities comprised of sloppy, error-prone wetware.

At the very least there should be a comment in caps and in the documentation describing why this default was chosen and in what circumstances it's ill-advised. I'm not claiming to be remarkably exceptional and even I bundle such information on the first pass when writing the initial code (my rule: to ensure a good future, any unusual or non-standard defaults deserve at least a minimal explanation) (Full-Disclosure: I was rejected after round 1 of Google code screens 3 times, though have been hired to other FAANG/like companies).

Yeesh.

p.s. Be sure to brace yourself before reading https://news.ycombinator.com/item?id=34179426#34180015

to11mtm 3 years ago

> It turns out people are universally entities comprised of sloppy, error-prone wetware.
The line from Agent K in 'Men In Black' comes to mind here.
More jobs than not, I left with at least one 3+ month old PR of changes for stability I was 'not allowed to merge because we didn't have the bandwidth to regression (or do cross-ecosystem-update-on-lib)'. Yes I made sure to explain to my colleagues why I did them and why I was mentioning them before I left.
Most eventually got applied.
> (I've been rejected after round 1 of Google code screens 3 times, though have been hired to other FAANG-like companies). Sheesh.
I've found that the companies that hire based on quality-of-bullshitting sometimes pay more, but are far less satisfying than companies that hire on quality-of-language-lawyering (i.e. you understand the caveats of a given solution rather than sugar coating them).
platinumrad 3 years ago

> Please hold while I pick my fallen jaw up off the floor.
> p.s. Be sure to brace yourself before reading https://news.ycombinator.com/item?id=34179426#34180015
Both of these snide comments assume that the speculative explanations are correct, which they very well may not be.
xiphias2 3 years ago

Google's interview level is set to not needing to fire too many bad people, it's not about being superior (err on the side of caution when hiring).
This might change now in this downturn, but when I was working at Google in 2008, we were the only tech company where nobody was fired because of the recession (there were offices closed, and people had the option to relocate, although not everybody took that option).
If you compare it with Facebook, they just fired a lot of people.
In short: you probably just didn't have luck, you should try again when you can.
carom 3 years ago
Google designs for Google. In their world everyone uses a latest gen MacBook with maxed out RAM on gigabit fiber.
- free652 3 years ago
  
  The default is glinux, most of the company are using chromebooks.
  
  7 replies →
- kevinmchugh 3 years ago
  
  Google has more end users on slow networks and old devices than almost anyone. Throttle your browser with the browser tools and see what loads quicker, google.com or a website of your choice. Once you've loaded google.com, do a search.
  
  1 reply →
tourist2d 3 years ago
How can you call it a defect when it might have been a deliberate decision? Your whole post sounds like you're upset Google didn't hire you lmao
- fsdjkflsjfsoij 3 years ago
  
  The entire post is embarrassing and makes me think that Google made the correct decision. Also, it seems that people that want to change the default behaviour can simply use the TCPConn.SetNoDelay function.
- metadat 3 years ago
  
  Decisions deserve documentation (because a footgun warning is preferable to spontaneous unintended penetration).
  
  4 replies →
parasubvert 3 years ago

It’s not a defect, and it’s not unusual to enable TCP_NODELAY.
As a default, it’s a design decision. It’s documented in the Golang Net library.
I remember learning all of this stuff in 1997 in my first Java job and witnessing same shock and horror at TCP_NODELAY being disabled (!) by default when most server developers had to enable it to get any reasonable latency for their RPC type apps, because most clients had delayed TCP ACKs on by default. Which should never be used with Nagle’s algorithm!
This Internet folklore gets relearned by every new generation. Golang’s default has decades of experience in building server software behind the decision to enable it. As many other threads here have explained, including Nagle himself.
dragonwriter 3 years ago

> The parents of the Internet work at Google. How could this defect make it to production and live for 12+ years in the wild?
Google is a big company; the “parents of the internet”, insofar as they work at Google, probably work nowhere near this, in terms of scope of work.
oldgregg 3 years ago
Would be naive to think corporate incentives are not influencing code and protocols:
> Http/3 standardized 6 months ago and Google has been widely using it for years-- but not supported by Go.
> Webtransport originally did P2P/Ice component but no longer.
> Http/3 doesn't even have option to work without certificate authorities.
- nine_k 3 years ago
  
  > Http/3 doesn't even have option to work without certificate authorities.
  Unencrypted HTTP is dead for any serious purpose. Any remaining use is legacy, like code written in Basic.
  With Letsencrypt on one hand, and single-binary utilities to run your own local CA on the other hand, this should pose no problem.
  
  10 replies →
quotemstr 3 years ago
> this shatters the myth of Google(r) superiority. It turns out people are universally entities comprised of sloppy, error-prone wetware.
Golang was created with the specific goal of sidestepping what had become a bureaucratic C++ "readability" process within Google, so yes. Goodhart's law in action.
- nine_k 3 years ago
  
  The problem with C++ is not getting readability, but footguns! footguns everywhere! Plus the compile time.
- lokar 3 years ago
  
  That’s not at all true. Go has readability as well.
  
  1 reply →
fomine3 3 years ago

Googler's network environment would be extremely good so it's not weird.
robswc 3 years ago
I think one of the most insightful things I've learned in life is that books, movies, articles, etc. have warped my perception of the "elites." When you split hairs, there is certainly a difference in skill/knowledge _but_ at the end of the day, everyone will make mistakes. (error-prone wetware, haha)
I totally get it though. I mean, as a recent example, look at FTX. I knew SBF and was close to working for Alameda (didn't want to go to Hong Kong tho). Over the years I thought that I was an idiot for missing out and that everyone there was a genius. Turns out they weren't and not only that _everyone_ got taken for a ride. VCs throwing money, celebrities signing to say anything, politicians shaking hands, etc.
Funny, I did see a leaked text when Elon was trying to buy Twitter, SBF was trying to be part of it and someone didn't actually think he had the money, so maybe someone saw the BS.
All that aside tho, yea, this is something I forget and "re-learn" all the time. A bit concerning if you think about it too much! I wonder if that's the same for other fields of work. I mean, if there was an attack on a power grid, how many people in the US would even know _how_ to fix it? Are the systems legacy? I've seen some code bases where one file could be deleted and it would take tons of hours to even figure out what went wrong, lol.
- beebmam 3 years ago
  
  There's nothing elite about being a programmer at any of the big tech companies. It's software engineering and design. It's the same everywhere, just different problem domains.
  I've worked with some of the highest ranking people in multiple large tech companies. The truth is there is no "elite". CTOs of the biggest companies in the world are just like you and me.
  
  1 reply →

vbezhenar 3 years ago

TLDR: Golang uses TCP_NODELAY by default on sockets. Seems wild. I guess it's time to disable TCP_NODELAY in Linux to fix bad software.

vasqw 3 years ago
Yeah, let's just remove TCP_NODELAY and fuck all latency-sensitive applications.
- perryizgr8 3 years ago
  
  Actual latency sensitive apps can always use SOCK_RAW and implement their own TCP. In fact, for serious low latency you need to bypass the entire kernel stack too, like DPDK.
  
  3 replies →

RugnirViking 3 years ago

my goodness. It (git-lfs, which triggered thus investigation) essentially insists on sending each packet as a tiny individual packet (resulting in umpteen thousands) instead of using the internet's built-in packet batching system (nagle's algorithm)

scharman 3 years ago

I believe it just emits at least one packet on each system 'write' call. As long as your 'write' invocations are larger blocks then I'd expect you'd see very little difference with O_NDELAY enabled or disabled. I've always assumed you want to limit system calls so I'd always assumed it to be better practice to encode to a buffer and invoke 'write' on larger blocks. So this feels like a combination of issues.
Regardless, overriding a socket parameter like this should be well documented by Golang if that's the desired intent.
morelisp 3 years ago

If you want to buffer, you can still buffer. There’s no advantage letting the OS do it, and decades of documented disadvantages.
taneq 3 years ago
Whether this is the right or wrong thing depends 100% on what you’re trying to do. For many applications you want to send your message immediately because your next message depends on the response.
- withinboredom 3 years ago
  
  Very rarely this is the case. From the application’s perspective yes. From a packet perspective… no. The interface is going to send packets and they’ll end up in a few buffers after going through some wires. If something goes wrong along the way, they’ll be retransmitted. But the packets don’t care about the response, except an acknowledgment the packets were received. If you send 4000 byte messages when the MTU is 9000, you’re wasting perfectly good capacity. If you had Nagle’s turned on, you’d send one 8040 byte packet. With Nagle’s you don’t have to worry about the MTU, you write your data to the kernel and the rest is magically handled for you.

nottorp 3 years ago

They're really in a bubble at Google.

xiwenc 3 years ago

Nice finding. It would help also to suggest a workaround. Perhaps “overloading“ the function? Not a Golang expert here. But providing a solution (other than waiting for upstream) would be beneficial for others.

morelisp 3 years ago
There is a public, documented API to turn Nagle back on.
(Please don’t.)
- kiwijamo 3 years ago
  
  Can you elaborate? Your suggestion not to turn it back on would result in the OP having to suffer slow upload speeds despite having available bandwidth order of magnitude larger. How is that a good outcome?
  
  1 reply →

neonsunset 3 years ago

It is the correct default and anyone who states otherwise has not spent sufficient amount of hours on debugging obscure network latency issues, especially when they interact with any kind of complex software stack on top of them.