It's always TCP_NODELAY

2 years ago (brooker.co.za)

I've fixed latency issues caused by Nagle's algorithm multiple times in my career. It's the first thing I jump to. I feel like the logic behind it is sound, but it just doesn't work for some workloads. It should be something an engineer is forced to set explicitly while creating a socket, instead of letting the OS choose a default. I think that's the main issue: not that it's a good or bad option, but that there is a setting people might not know about that manipulates how data is sent over the wire so aggressively.
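
For reference, a minimal sketch of what opting out looks like on a POSIX system (plain blocking socket, error handling trimmed):

    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>

    /* Sketch: create a TCP socket with Nagle's algorithm disabled. */
    int make_low_latency_socket(void) {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        if (fd < 0)
            return -1;
        int one = 1;
        /* TCP_NODELAY turns Nagle off for this socket only. */
        setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));
        return fd;
    }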

  • Same here. I have a hobby: on any RPC framework I encounter, I file a GitHub issue asking "did you think of TCP_NODELAY, or can this framework only do 20 calls per second?"

    So far, it's found a bug every single time.

    Some examples: https://cloud-haskell.atlassian.net/browse/DP-108 or https://github.com/agentm/curryer/issues/3

    I disagree on the "not a good / bad option" though.

    It's a kernel-side heuristic for "magically fixing" badly behaved applications.

    As the article states, no sensible application does 1-byte network write() syscalls. Software that does that should be fixed.

    It makes sense only in the case when you are the kernel sysadmin and somehow cannot fix the software that runs on the machine, maybe for team-political reasons. I claim that's pretty rare.

    For all other cases, it makes sane software extra complicated: You need to explicitly opt-out of odd magic that makes poorly-written software have slightly more throughput, and that makes correctly-written software have huge, surprising latency.

    John Nagle says here and in linked threads that Delayed Acks are even worse. I agree. But the Send/Send/Receive pattern that Nagle's Algorithm degrades is a totally valid and common use case, including anything that does pipelined RPC over TCP.

    Both Delayed Acks and Nagle's Algorithm should be opt-in, in my opinion. It should be called TCP_DELAY, which you can opt into if you can't be bothered to implement basic userspace buffering (a sketch of which follows below).

    People shouldn't /need/ to know about these. Make the default case be the unsurprising one.
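
    To be concrete, the basic userspace buffering mentioned above can be as small as the sketch below (buffer size and struct name are hypothetical, error handling omitted):

      #include <string.h>
      #include <unistd.h>

      #define OUT_BUF_SIZE 4096              /* hypothetical; tune per workload */

      struct out_buf { char data[OUT_BUF_SIZE]; size_t len; };

      /* Append to the in-memory buffer instead of calling write() per field. */
      static void buf_append(struct out_buf *b, const void *p, size_t n) {
          memcpy(b->data + b->len, p, n);    /* real code must check that n fits */
          b->len += n;
      }

      /* One syscall, one (or a few) full-sized packets on the wire. */
      static ssize_t buf_flush(int fd, struct out_buf *b) {
          ssize_t r = write(fd, b->data, b->len);
          b->len = 0;
          return r;
      }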

    • "As the article states, no sensible application does 1-byte network write() syscalls." - the problem that this flag was meant to solve was that when a user was typing at a remote terminal, which used to be a pretty common use case in the 80's (think telnet), there was one byte available to send at a time over a network with a bandwidth (and latency) severely limited compared to today's networks. The user was happy to see that the typed character arrived to the other side. This problem is no longer significant, and the world has changed so that this flag has become a common issue in many current use cases.

      Was terminal software poorly written? I don't feel comfortable making that judgment. It was designed for a constrained environment with different priorities.

      Anyway, I agree with the rest of your comment.

    • > As the article states, no sensible application does 1-byte network write() syscalls. Software that does that should be fixed.

      Yes! And worse, those that do aren't going to be "fixed" by delays either. In this day and age of fast networks, a syscall per byte will bottleneck the CPU way before it saturates the network path. When I've been tuning buffers, the CPU limit has been somewhere in the 4k-32k range for ~10 Gbps.

      > Both Delayed Acks and Nagle's Algorithm should be opt-in, in my opinion.

      Agreed, it causes more problems than it solves and is very outdated. Now, the challenge is rolling out such a change as smoothly as possible, which requires coordination and a lot of trivia about legacy systems. Migrations are never trivial.

    • The problem with making it opt-in is that the point of the algorithm was to fix apps that, while they perform fine on the developer's LAN, would be hell on internet routers. So the people who benefit are exactly the ones who don't know what they are doing and only use the defaults.

    • It's very easy to end up with small writes. E.g.

        1. Write four bytes (length of frame)
        2. Write the frame itself
      

      The easiest fix in C code, with the least chance of introducing a buffer overflow or bad performance, is to keep these two pieces of information in separate buffers and use writev; there's a sketch below. (How portable is that compared to send?)

      If you have to combine the two into one flat frame, you're looking at allocating and copying memory.

      Linux has something called corking: you can "cork" a socket (so that it doesn't transmit), write some stuff to it multiple times and "uncork". It's extra syscalls though, yuck.

      You could use a buffered stream where you control flushes: basically another copying layer.
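
      A rough sketch of the writev() variant (POSIX; assumes the frame already sits in its own buffer):

        #include <arpa/inet.h>   /* htonl */
        #include <stdint.h>
        #include <sys/types.h>
        #include <sys/uio.h>     /* writev */

        /* Send the 4-byte length prefix and the frame in a single syscall,
           so Nagle never sees a lone 4-byte write. */
        static ssize_t send_frame(int fd, const void *frame, uint32_t len) {
            uint32_t be_len = htonl(len);
            struct iovec iov[2] = {
                { .iov_base = &be_len,       .iov_len = sizeof(be_len) },
                { .iov_base = (void *)frame, .iov_len = len            },
            };
            return writev(fd, iov, 2);   /* may still return a short write */
        }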

    • Thanks for the reminder to set this on the new framework I’m working on. :)

    •      I have a hobby that on any RPC framework I encounter, I file a Github issue "did you think of TCP_NODELAY or can this framework do only 20 calls per second?".
      

      So true. Just last month we had to apply the TCP_NODELAY fix to one of our libraries. :)

    • Would one not also get clobbered by all the sys calls for doing many small packets? It feels like coalescing in userspace is a much better strategy all round if that's desired, but I'm not super experienced.

  • > It should be something that an engineer needs to be forced to set while creating a socket, instead of letting the OS choose a default.

    If the intention is mostly to fix applications with bad `write` behavior, this would make setting TCP_DELAY a pretty exotic option - you would need a software engineer smart enough to know to set this option, but not smart enough to batch their write calls well and/or to write their own (probably better-fitting) application-specific version of Nagle's.

  • I agree; disabling Nagle's Algorithm has been fairly well known in HFT/low-latency trading circles for quite some time now (15+ years). It's one of the first things I look for.

  • What you really want is for the delay to be n microseconds, but there’s no good way to do that except putting your own user space buffering in front of the system calls (user space works better, unless you have something like io_uring amortizing system call times)

  • The logic is really for things like Telnet sessions. IIRC that was the whole motivation.

    • And for block writes!

      The Nagler turns a series of 4KB pages over TCP into a stream of MTU sized packets, rather than a short packet aligned to the end of each page.

  • > that an engineer needs to be forced to set while creating a socket

    Because there aren't enough steps in setting up sockets! Haha.

    I suspect that what would happen is that many of the programming language run-times in the world which have easier-to-use socket abstractions would pick a default and hide it from the programmer, so as not to expose an extra step.

  • > I feel like the logic behind it is sound, but it just doesn't work for some workloads.

    The logic is only sound for interactive plaintext typing workloads. It should have been turned off by default 20 years ago, let alone now.

    • Remember that IPv4's original "target replacement date" (as it was only an "experimental" protocol) was 1990...

      And a common thing in many more complex/advanced protocols was to explicitly delineate "messages", which avoids the issue of Nagle's algorithm altogether.

  • Same here. My first job out of college was at a database company. Queries at the client side of the client-server based database were slow. It was thought the database server was slow as hardware back then was pretty pathetic. I traced it down to the network driver and found out the default setting of TCP_NODELAY was off. I looked like a hero when turning on that option and the db benchmarks jumped up.

  • Not when creating a socket - when sending data. When sending data, you should indicate whether this data block prefers high throughput or low latency.
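
    Linux gets part of the way there with the MSG_MORE send flag, which acts roughly like a per-call cork; a sketch (function and argument names are made up):

      #include <stddef.h>
      #include <sys/socket.h>

      /* Sketch (Linux-specific): MSG_MORE hints that more data follows, so the
         kernel holds the first piece; the final plain send() flushes it all. */
      static void send_request(int fd, const void *hdr, size_t hdr_len,
                               const void *body, size_t body_len) {
          send(fd, hdr, hdr_len, MSG_MORE);   /* "prefer throughput for this piece" */
          send(fd, body, body_len, 0);        /* "send it all now" */
      }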

  • You’re right re: making the delay explicit, but crappy userspace networking tools also don't show whether NODELAY is enabled on sockets.

    Last time I had to do some Linux stuff, maybe 10 years ago, you had to write a SystemTap program. I guess it's eBPF now. But I bet the userspace tools still suck.

The takeaway is odd. Clearly Nagle's Algorithm was an attempt at batched writes. It doesn't matter what your hardware or network or application or use-case or anything is; in some cases, batched writes are better.

Lots of computing today uses batched writes. Network applications benefit from it too. Newer higher-level protocols like QUIC do batching of writes, effectively moving all of TCP's independent connection and error handling into userspace, so the protocol can move as much data into the application as fast as it can, and let the application (rather than a host tcp/ip stack, router, etc) worry about the connection and error handling of individual streams.

Once our networks become saturated the way they were in the old days, Nagle's algorithm will return in the form of a QUIC modification, probably deeper in the application code, to wait to send a QUIC packet until some criteria is reached. Everything in technology is re-invented once either hardware or software reaches a bottleneck (and they always will as their capabilities don't grow at the same rate).

(the other case besides bandwidth where Nagle's algorithm is useful is if you're saturating Packets Per Second (PPS) from tiny packets)

  • The difference between QUIC and TCP is the original sin of TCP (and its predecessor) - that of emulating an async serial port connection, with no visible messaging layer.

    It meant that you could use a physical teletypewriter to connect to services (simplified description - slap a modem on a serial port, dial into a TIP, write the host address and port number, voila), but it also means that TCP has no idea of message boundaries, and while you can layer some of that knowledge on top now, the early software didn't.

    In comparison, QUIC and many other non-TCP protocols (SCTP, TP4) explicitly provide for messaging boundaries - your interface to the system isn't based on emulated serial ports but on messages that might at most get reassembled.

    • It's kind of incredible to think how many things in computers and electronics turn out to just be a serial port.

      One day, some future engineer is going to ask why their warp core diagnostic port runs at 9600 8n1.

  • Yes but it seems this particular implementation is using a heuristic for how to batch that made some assumptions that didn't pan out.

  • Batching needs to be application-controlled rather than protocol-controlled. The protocol doesn't have enough context to batch correctly.

What about the opposite: disabling delayed ACKs?

The problem is the pathological behavior when tinygram prevention interacts with delayed ACKs. There is an exposed option to turn off tinygram prevention (TCP_NODELAY); how would you turn off delayed ACKs instead? Say you wanted to benchmark all four combinations and see what works best.

Doing a little research, I found:

Linux has the TCP_QUICKACK socket option, but you have to set it every time you receive (see the sketch below). There is also /proc/sys/net/ipv4/tcp_delack_min and /proc/sys/net/ipv4/tcp_ato_min.

FreeBSD has net.inet.tcp.delayed_ack and net.inet.tcp.delacktime.
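
A rough sketch of re-arming TCP_QUICKACK on Linux: the kernel clears the option on its own, so it goes inside the read loop.

    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>
    #include <unistd.h>

    /* Sketch: keep delayed ACKs disabled by re-setting TCP_QUICKACK after
       every read, since the option does not stick. */
    static ssize_t read_quickack(int fd, void *buf, size_t len) {
        ssize_t n = read(fd, buf, len);
        int one = 1;
        setsockopt(fd, IPPROTO_TCP, TCP_QUICKACK, &one, sizeof(one));
        return n;
    }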

  • TCP_QUICKACK does fix the worst version of the problem, but doesn't fix the entire problem. Nagle's algorithm will still wait for up to one round-trip time before sending data (at least as specified in the RFC), which is extra latency with nearly no added value.

  • > linux has the TCP_QUICKACK socket option but you have to set it every time you receive

    Right. What were they thinking? Why would you want it off only some of the time?

  • Apparently you have time to "do a little research" but not to read the entire article you're reacting to? It specifically mentions TCP_QUICKACK.

In a world where bandwidth was limited, and the packet size minimum was 64 bytes plus an inter-frame gap (it still is for most Ethernet networks), sending a TCP packet for literally every byte wasted a huge amount of bandwidth. The same goes for sending empty acks.

On the other hand, my general position is: it's not TCP_NODELAY, it's TCP.

  • I'd just love a protocol that has a built in mechanism for realizing the other side of the pipe disconnected for any reason.

    • That's possible in circuit switched networking with various types of supervision, but packet switched networking has taken over because it's much less expensive to implement.

      Attempts to add connection monitoring usually make things worse. If you need to reroute a cable, and one or both ends of the cable detect the disconnection and close user sockets, that's not great: instead of a quick change with a small period of data loss but otherwise minor interruption, all of the established connections get dropped.

    • That's really, really hard. For a full, guaranteed way to do this we'd need circuit switching (or circuit-switching emulation). It's pretty expensive to do in packet networks - each flow would need to be tracked by each middle box, so a lot more RAM at every hop, and probably a lot more processing power. If we go with circuit establishment, it's also kind of expensive and breaks the whole "distributed, decentralized, self-healing network" property of the Internet.

      It's possible to do better than TCP these days, bandwidth is much much less constrained than it was when TCP was designed, but it's still a hard problem to do detection of pipe disconnected for any reason other than timeouts (which we already have).

    • Several of the "reliable UDP" protocols I have worked on in the past have had a heartbeat mechanism that is specifically for detecting this. If you haven't sent a packet down the wire in 10-100 milliseconds, you will send an extra packet just to say you're still there.

      It's very useful to do this in intra-datacenter protocols.

    • These types of keepalives are usually best handled at the application protocol layer where you can design in more knobs and respond in different ways. Otherwise you may see unexpected interactions between different keepalive mechanisms in different parts of the protocol stack.

    • If a socket is closed properly there'll be a FIN and the other side can learn about it by polling the socket.

      If the network connection is lost due to external circumstances (say your modem crashes) then how would that information propagate from the point of failure to the remote end on an idle connection? Either you actively probe (keepalives) and risk false positives or you wait until you hear again from the other side, risking false negatives.
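
      For the "actively probe" route, TCP does have an optional keepalive mechanism; a Linux-flavoured sketch of enabling it with aggressive timers (the TCP_KEEP* knobs are Linux-specific, values are illustrative):

        #include <netinet/in.h>
        #include <netinet/tcp.h>
        #include <sys/socket.h>

        /* Sketch: notice a dead peer in roughly 15 seconds instead of the
           default couple of hours. */
        static void enable_keepalive(int fd) {
            int on = 1, idle = 10, interval = 1, count = 5;
            setsockopt(fd, SOL_SOCKET,  SO_KEEPALIVE,  &on,       sizeof(on));
            setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE,  &idle,     sizeof(idle));     /* probe after 10s idle */
            setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &interval, sizeof(interval)); /* then every 1s */
            setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT,   &count,    sizeof(count));    /* give up after 5 misses */
        }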

  • Shouldn't QUIC (https://en.wikipedia.org/wiki/QUIC) solve the TCP issues like latency?

    • As someone who needed high throughput and looked to QUIC because of control of buffers, I recommend against it at this time. It’s got tons of performance problems depending on impl and the API is different.

      I don’t think QUIC is bad, or even overengineered, really. It delivers useful features, in theory, that are quite well designed for the modern web centric world. Instead I got a much larger appreciation for TCP, and how well it works everywhere: on commodity hardware, middleboxes, autotuning, NIC offloading etc etc. Never underestimate battletested tech.

      In that sense, the lack of TCP_NODELAY is an exception to the rule that TCP performs well out of the box (golang is already doing this by default). As such, I think it’s time to change the default. Not using buffers correctly is a programming error, imo, and can be patched.

    • The specific issues that this article discusses (eg Nagle's algorithm) will be present in most packet-switched transport protocols, especially ones that rely on acknowledgements for reliability. The QUIC RFC mentions this: https://datatracker.ietf.org/doc/html/rfc9000#section-13

      Packet overhead, ack frequency, etc are the tip of the iceberg though. QUIC addresses some of the biggest issues with TCP such as head-of-line blocking but still shares the more finicky issues, such as different flow and congestion control algorithms interacting poorly.

    • QUIC is mostly used between the client and the data center, not between two datacenter computers. TCP is the better choice once inside the datacenter.

      Reasons:

      Security Updates

      Phones run old kernels and new apps, so it makes a lot of sense to put something that needs frequent updates, like the network stack, into user space, and QUIC does well here.

      Data center computers run older apps on newer kernels, so it makes sense to put the network stack into the kernel where updates and operational tweaks can happen independent of the app release cycle.

      Encryption Overhead

      The overhead of TLS is not always needed inside a data center, whereas it is always needed on a phone.

      Head of Line Blocking

      Super important on a throttled or bad phone connection, not a big deal when all of your datacenter servers have 10G connections to everything else.

      In my opinion TCP is a battle hardened technology that just works even when things go bad. That it contains a setting with perhaps a poor default is a small thing in comparison to its good record for stability in most situations. It's also comforting to know I can tweak kernel parameters if I need something special for my particular use case.

I don't buy the reasoning that Nagle is never needed anymore. Sure, telnet isn't a thing today, but I bet there are still plenty of apps which do the equivalent of:

     write(fd, "Host: ")
     write(fd, hostname)
     write(fd, "\r\n")
     write(fd, "Content-type: ")
     etc...

This may not be 40x overhead, but it'd still be 5x or so.
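
The usual fix is to assemble the header in userspace first and issue one write(); a sketch (hypothetical fd and hostname, made-up Content-Type):

     #include <stdio.h>
     #include <unistd.h>

     /* Sketch: format the whole header into one buffer, then one write() call. */
     static void send_headers(int fd, const char *hostname) {
         char buf[1024];
         int n = snprintf(buf, sizeof(buf),
                          "Host: %s\r\n"
                          "Content-Type: text/plain\r\n"
                          "\r\n", hostname);
         if (n > 0)
             write(fd, buf, n);
     }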

  • Fix the apps. Nobody expects magical perf if you do that when writing to files, even though the OS also has its own buffers. There is no reason to expect otherwise when writing to a socket, and Nagle doesn't save you from syscall overhead anyway.

    • Nagle doesn't save the derpy side from syscall overhead, but it would save the other side.

      It's not just apps doing this stuff, it also lives in system libraries. I'm still mad at the Android HTTPS library for sending chunked uploads as so many tinygrams. I don't remember exactly, but I think it's reasonable packetization for the data chunk (if it picked a reasonable size anyway), then one packet for \r\n, one for the size, and another for another \r\n. There's no reason for that, but it doesn't hurt the client enough that I can convince them to avoid the system library so they can fix it and the server can manage more throughput. Ugh. (It might be that it's just the TLS packetization that was this bogus and the TCP packetization was fine, it's been a while)

      If you take a pcap for some specific issue, there's always so many of these other terrible things in there. </rant>

    • I agree that such code should be fixed, but I have a hard time persuading developers to fix their code. Many of them don't know what a syscall is, how making a syscall triggers sending an IP packet, how a library call translates to a syscall, etc. Worse, they don't want to know: they write, say, Java code (or some other high-level language) and argue that the libraries/JDK/kernel should handle all the 'low-level' stuff.

      To get optimal performance for request-response protocols like HTTP, one should send the full request (request line, all headers, and the POST body) with a single write syscall, unless the POST body is large and it makes sense to write it in chunks. Unfortunately not all HTTP libraries work this way, and a library user cannot fix this without switching libraries, which is 1. not always easy and 2. it is not widely known which libraries are efficient and which are not. Even if you have your own HTTP library it's not always trivial to fix: e.g. in Java, a way to fix this while keeping the code readable and idiomatic is to wrap the socket in a BufferedOutputStream, which adds one more memory-to-memory copy on top of at least one copy you already have without a buffered stream; so it's not an obvious performance win for an application that already saturates memory bandwidth.

    • > Fix the apps. Nobody expect magical perf if you do that when writing to files,

      We write to files line-by-line or even character-by-character and expect the library or OS to "magically" buffer it into fast file writes. Same with memory. We expect multiple small mallocs to be smartly coalesced by the platform.

    • Everybody expects magical perf if you do that when writing files. We have RAM buffers and write caches for a reason, even on fast SSDs. We expect it so much that macOS doesn't flush to disk even when you call fsync() (files get flushed to the disk's write buffer instead).

      There's some overhead to calling write() in a loop, but it's certainly not as bad as when a call to write() would actually make the data traverse whatever output stream you call it on.

    • Those are the apps that are quickly written and do not care if they unnecessarily congest the network. The ones that do get properly maintained can set TCP_NODELAY. Seems like a reasonable default to me.

    • We actually have similar behavior when writing to files: contents are buffered in the page cache and written to disk later in batches, unless the user explicitly calls "sync".

    • Apps can always misbehave, you never know what people implement, and you don't always have source code to patch. I don't think the role of the OS is to let apps do whatever they wish, but it should give them the possibility if it's needed. So I'd rather say: if you know you're doing things properly and you're latency sensitive, just set TCP_NODELAY on all your sockets and you're fine, and nobody will blame you for it.

    • I would love to fix the apps, can you point me to the github repo with all the code written the last 30 years so I can get started?

  • The comment about telnet had me wondering what openssh does, and it sets TCP_NODELAY on every connection, even for interactive sessions. (Confirmed by both reading the code and observing behaviour in 'strace').

  • I don't think that's actually super common anymore: with asynchronous I/O, the only sane approach is to put data into a buffer rather than blocking on every small write(2).

    And asynchronous I/O is usually necessary both on the server (otherwise you don't scale well) and on the client (because blocking on network calls is a terrible experience, especially in today's world of frequent network changes, falling out of network range, etc.).

  • And they really shouldn't do this. Even disregarding the network aspect of it, this is still bad for performance because syscalls are kinda expensive.

  • Marc addresses that: “That’s going to make some “write every byte” code slower than it would otherwise be, but those applications should be fixed anyway if we care about efficiency.”

  • Does this matter? Yes, there's a lot of waste. But you also have a 1Gbps link. Every second that you don't use the full 1Gbps is also waste, right?

  • Those aren't the ones you debug, so they won't be seen by OP. Those are the ones you don't need to debug because Nagle saves you.

  • Even if you do nothing 'fancy' like Nagle, corking, or building up the complete buffer in userspace before writing, at the very least the above should be using a vectored write (writev()).

  • Shouldn’t that go through some buffer? Unless you fflush() between each write?

  • I imagine the write calls show up pretty easily as a bottleneck in a flamegraph.

    • They don't. Maybe if you're really good you notice the higher overhead, but you expect to be spending time writing to the network. The actual impact shows up as bandwidth consumption being way up due to packet and TCP headers, which won't show on a flamegraph that easily.

  • The discussion here mostly seems to miss the point. The argument is to change the default, not to eliminate the behavior altogether.

Does anyone know of a good way to enable TCP_NODELAY on sockets when you don't have access to the source for that application? I can't find any kernel settings to make it permanent, or commands to change it after the fact.

I've been able to disable delayed acks using `quickack 1` in the routing table, but it seems particularly hard to enable TCP_NODELAY from outside the application.

I've been having exactly the problem described here lately, when communicating between an application I own and a closed source application it interacts with.

  • Would some kind of LD_PRELOAD interception for socket(2) work? Call the real function, then do setsockopt or whatever, and return the modified socket.
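
    A rough sketch of such a shim, assuming the target is dynamically linked against libc (file names and build flags are illustrative; build with `cc -shared -fPIC nodelay.c -o nodelay.so -ldl` and run the program with LD_PRELOAD=./nodelay.so):

      #define _GNU_SOURCE
      #include <dlfcn.h>
      #include <netinet/in.h>
      #include <netinet/tcp.h>
      #include <sys/socket.h>

      /* Sketch: wrap socket(2) and force TCP_NODELAY on every TCP socket the
         target program creates. */
      int socket(int domain, int type, int protocol) {
          static int (*real_socket)(int, int, int);
          if (!real_socket)
              real_socket = (int (*)(int, int, int))dlsym(RTLD_NEXT, "socket");

          int fd = real_socket(domain, type, protocol);
          /* mask off SOCK_NONBLOCK/SOCK_CLOEXEC flags before comparing the type */
          if (fd >= 0 && (type & 0xFF) == SOCK_STREAM &&
              (domain == AF_INET || domain == AF_INET6)) {
              int one = 1;
              setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));
          }
          return fd;
      }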

    • > Would some kind of LD_PRELOAD interception for socket(2) work?

      That would only work if the call goes through libc, and it's not statically linked. However, it's becoming more and more common to do system calls directly, bypassing libc; the Go language is infamous for doing that, but there's also things like the rustix crate for Rust (https://crates.io/crates/rustix), which does direct system calls by default.

  • Depending on the specifics, you might be able to add socat in the middle.

    Instead of: your_app —> server

    you’d have: your_app -> localhost_socat -> server

    socat has command-line options for setting TCP_NODELAY. You'd need to convince your closed-source app to connect to localhost, though. But if it's doing a DNS lookup, you could probably convince it to connect to localhost with an /etc/hosts entry.

    Since your app would be talking to socat over a local socket, the app’s tcp_nodelay wouldn’t have any effect.

  • opening `/proc/<pid>/fd/<fd number>` and setting the socket option may work (not tested)

  • Is it possible to set it as a global OS setting, inside a container?

  • You could try eBPF and hook the socket syscall. Might be harder than LD_PRELOAD as suggested by other commenters, though.

~15 years ago I played an MMO that was very real-time, and yet all of the communication was TCP. Literally you'd click a button, and you would not even see your action play out until a response packet came back.

All of the kids playing this game (me included) eventually figured out you could turn on TCP_NODELAY to make the game buttery smooth - especially for those in California close to the game servers.

  • Not sure if you're talking about WoW, but around that time an update to the game made exactly this change (and possibly more).

    An interesting side-effect of this was that before the change if something stalled the TCP stream, the game would hang for a while then very quickly replay all the missed incoming events (which was very often you being killed). After the change you'd instead just be disconnected.

    • I think I have a very vague memory of the "hang, hang, hang, SURPRISE! You're dead" thing happening in Diablo II but it's been so long I wouldn't bet on having remembered correctly.

Not if you use a modern language that enables TCP_NODELAY by default, like Go. :-)

Not every time. Sometimes it's DNS.

  • One time for me it was: the glass was dirty.

    Some router near a construction site had dust settle into the gap between the laser and the fiber, and it attenuated the signal enough to see 40-50% packet loss.

    We figured out where the loss was and had our NOC email the relevant transit provider. A day later we got an email back from the tech they dispatched with the story.

  • Once every 50 years and 2 billion kilometers, it's a failing memory chip. But you can usually just patch around them, so no big deal.

  • When it fails, it's DNS. When it just stops moving, it's either TCP_NODELAY or stream buffering.

    Really complex systems (the Web) also fail because of caching.

  • I chuckle whenever I see this meme, because in my experience, the issue is usually DHCP.

    • But it's usually DHCP that sets the wrong DNS servers.

      It's funny that some folks claim DNS outage is a legitimate issue in systems where they control both ends. I get it; reimplementing functionality is rarely a good sign, but since you already know your own addresses in the first place, you should also have an internal mechanism for sharing them.

John Nagle has posted insightful comments about the historical background for this many times, for example https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que...

  • The sending pattern matters. Send/Receive/Send/Receive won't trigger the problem, because the request will go out immediately and the reply will provide an ACK and allow another request. Bulk transfers won't cause the problem, because if you fill the outgoing block size, there's no delay.

    But Send/Send/Receive will. This comes up a lot in game systems, where most of the traffic is small events going one way.

  • I love it when Nagle's algorithm comes up on HN. Inevitably someone, not knowing "Animats" is John Nagle, responds to a comment from Animats with a "knowing better" tone. >smile<

    (I also really like Animats' comments, too.)

    • I have to confess that when I saw this post, I quickly skimmed the threads to check if someone was trying to educate Animats on TCP. Think I've only seen that happen in the wild once or twice, but it absolutely made my day when it did.

    • I always check if the man himself makes an appearance every time I see that. He has posted a few comments in here already.

    • It is like when someone here accused Andres Freund (PostgreSQL core dev who recently became famous due to the xz backdoor) of Dunning–Kruger when he had commented on something related to PostgreSQL's architecture which he had spent many many hours working on personally (I think it was pluggable storage).

      Maybe you just tried to educate the leading expert in the world on his own expertise. :D

This is an interesting thing that points out why abstraction layers can be bad without proper message passing mechanisms.

This could be fixed if there was a way for the application at L7 to tell the TCP stack at L4 "hey, I'm an interactive shell so I expect to have a lot of tiny packets, you should leave TCP_NODELAY on for these packets" so that it can be off by default but on for that application to reduce overhead.

Of course nowadays it's probably an unnecessary optimization anyway, but back in '84 it would have been super handy.

  • "I'm an interactive shell so I expect to have a lot of tiny packets" is what the delay is for. If you want to turn it off for those, you should turn it off for everything.

    (If you're worried about programs that buffer badly, then you could compensate with a 1ms delay. But not this round trip stuff.)

I do wish that TCP_NODELAY was the default, and there was a TCP_DELAY option instead. That'd be a world in which people who want the batch-style behavior (optimizing for throughput and fewer packets at the expense of latency) could still opt into it.

  • So do I, but I wish there was a new one, TCP_RTTDELAY. It would take a byte specifying what fraction (in 128ths) of the RTT to use for Nagle instead of one full RTT or a full* buffer. 0 would be the default, behaving as you and I prefer.

    * "Given the vast amount of work a modern server can do in even a few hundred microseconds, delaying sending data for even one RTT isn’t clearly a win."

    I don't think that's such an issue anymore either: if the server produces so much data that it fills the output buffer quickly, the data is sent immediately, before the delay runs its course.

We used to call them "packlets."

His "tinygrams" is pretty good too, but that sort of implies UDP (D -> datagrams)

  • > We used to call them "packlets."

    setsockopt(fd, IPPROTO_TCP, TCP_MAKE_IT_GO, &go, sizeof(go));

Can't it have an "if the payload is 1 byte (or less than X) then wait, otherwise don't" condition?

  • Some network stacks like those in Solaris and HP/UX let you tune the "Nagle limit" in just such a fashion, up to disabling it entirely by setting it to 1. I'm not aware of it being tunable on Linux, though you can manually control the buffering using TCP_CORK. https://baus.net/on-tcp_cork/ has some nice details.

  • How is what you're describing not just Nagle's algorithm?

    If you mean TCP_NODELAY, you should use it with TCP_CORK, which prevents partial frames. TCP_CORK the socket, do your writes to the kernel via send, and then once you have an application level "message" ready to send out — i.e., once you're at the point where you're going to go to sleep and wait for the other end to respond, unset TCP_CORK & then go back to your event loop & sleep. The "uncork" at the end + nodelay sends the final partial frame, if there is one.
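
    A rough sketch of that cork/uncork pattern on Linux (TCP_CORK is Linux-specific; the message pieces are hypothetical):

      #include <netinet/in.h>
      #include <netinet/tcp.h>
      #include <stddef.h>
      #include <sys/socket.h>

      /* Sketch: hold partial frames while a message is being assembled,
         then flush everything once the message is complete. */
      static void send_message(int fd, const void *hdr, size_t hdr_len,
                               const void *body, size_t body_len) {
          int on = 1, off = 0;
          setsockopt(fd, IPPROTO_TCP, TCP_CORK, &on, sizeof(on));    /* cork */
          send(fd, hdr, hdr_len, 0);
          send(fd, body, body_len, 0);
          setsockopt(fd, IPPROTO_TCP, TCP_CORK, &off, sizeof(off));  /* uncork: flush */
      }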

  • There is a socket option, SO_SNDLOWAT. It's not implemented on Linux, according to the manual page. The descriptions in UNIX Network Programming and TCP/IP Illustrated conflict, too. So it's probably not useful.

  • You can buffer in userspace. Don't do small writes to the socket and no bytes will be sent. Don't do two consecutive small writes and nagle won't kick in.

  • FreeBSD has accept filters, which let you do something like wait for a complete HTTP header (inaccurate from memory summary.) Not sure about the sending side.

Nagle and no delay are like 90+% of the latency bugs I’ve dealt with.

Two reasonable ideas that mix terribly in practice.

I just ran into this this week implementing a socket library in CLIPS. I used Berkeley sockets, and before that I had only worked with higher-level languages/frameworks that abstract a lot of these concerns away. I was quite confused when Firefox would show a "connection reset by peer." It didn't occur to me it could be an issue "lower" in the stack. `tcpdump` helped me observe the port, and I saw that the server never sent anything before my application closed the connection.

The real issue in modern data centers is TCP. Of course, at present we need to know about these little annoyances at the application layer, but what we really need is innovation in the data center at layer 4. And yes, I know that many people have been looking into this for years, but the economic motivation clearly has not yet been strong enough. That may change if the public's appetite for LLM-based tooling causes data centers to grow 10x (which seems likely).

I was curious whether I had to change anything in my applications after reading this, so I did a bit of research.

Both Node.js and curl have used TCP_NODELAY by default for a long time.

As a counterpoint, here's the story of how for me it wasn't TCP_NODELAY: for some reason my Node.js TCP service was taking a few seconds to reply to my requests on localhost (Windows machine). After the connection was established everything was pretty normal, but it consistently took a few seconds to establish the connection.

I even downloaded netcat for Windows to go as bare-bones as possible... and the exact same thing happened.

I rewrote a POC service in Rust and... oh wow, the same thing happens.

It took me a very long time of not finding anything on the internet (and getting yelled at on Stack Overflow, or rather one of its sister sites) and painstakingly debugging (including writing my own tiny client with tons of debug statements) until I realized "localhost" was resolving first to the IPv6 loopback on Windows and, only after quietly timing out there (because I was only listening on the IPv4 loopback), did it fall back to IPv4 and connect instantly.

  • I've seen this too, but luckily someone on the internet gave me a pointer to the exact problem, so I didn't have to dig deep to figure it out.

Not sure if this is a bit off topic, but I recently encountered a problem where my program continuously calls write() on a socket in a loop that runs N times, each iteration sending a few hundred bytes of data representing an application-level message. The loop can be understood as sending "batched messages" to the server. After that, the program tries to receive data from the server and do some processing.

The problem is that if N is above a certain limit (e.g. 4), the server reports an error saying the data is truncated somehow. I want to make N larger because the round-trip latency is already high enough, so being blocked by this is pretty annoying. Eventually, I found an answer on Stack Overflow saying that setting TCP_NODELAY can fix this, and it did magically let me increase N to a larger number like 64 or 128 without causing issues. I'm still not sure why TCP_NODELAY fixes this, or why the problem happens in the first place.

  • My guess would be that the server assumes that every call to recv() terminates on a message boundary.

    With TCP_NODELAY and small messages, this works out fine. Every message is contained in a single packet, and the userspace buffer being read into is large enough to contain it. As such, whenever the kernel has any data to give to userspace, it has an integer number of messages to give. Nothing requires the kernel to respect that, but it will not go out of its way to break it.

    In contrast, without TCP_NODELAY, messages get concatenated and then fragmented based on where packet boundaries occur. Now, the natural end point for a call to recv() is not the message boundary, but the packet boundary.

    The server is supposed to see that it is in the middle of a message, and make another call to recv() to get the rest of it; but clearly it does not do that.

    • Otherwise known as the "TCP is a stream-based abstraction, not a packet-based abstraction" bug.

      A related one is failing to process the second of two complete commands that happen to arrive in the same recv() call.
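
      A rough sketch of the fix, assuming a 4-byte length-prefix framing just as an example (the handle_message callback is hypothetical): keep accumulating bytes, then peel off as many complete messages as have arrived, ignoring recv() boundaries entirely.

        #include <arpa/inet.h>   /* ntohl */
        #include <stdint.h>
        #include <string.h>
        #include <sys/socket.h>
        #include <sys/types.h>

        void handle_message(const char *msg, uint32_t len);   /* hypothetical */

        /* Sketch: `buf`/`used` persist across calls; `cap` is the buffer size. */
        static void drain_messages(int fd, char *buf, size_t cap, size_t *used) {
            ssize_t n = recv(fd, buf + *used, cap - *used, 0);
            if (n <= 0)
                return;                      /* real code: handle EOF and errors */
            *used += (size_t)n;

            for (;;) {
                if (*used < 4)
                    break;                   /* length prefix incomplete */
                uint32_t len;
                memcpy(&len, buf, 4);
                len = ntohl(len);
                if (*used < 4 + len)
                    break;                   /* body incomplete, wait for more */
                handle_message(buf + 4, len);
                memmove(buf, buf + 4 + len, *used - 4 - len);
                *used -= 4 + len;
            }
        }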

  • > The problem is that if N is above a certain limit (e.g. 4), the server reports an error saying the data is truncated somehow.

    Maybe your server expects full application-level messages from a single "recv" call? That assumption is not correct: a message may be split across multiple recv buffers.

What if we changed the kernel TCP stack to hold on to the packet for only a short, configurable time before sending it out? That would let you balance latency against the network cost of many small packets. The TCP stack could even adjust the delay dynamically if needed.

If you are so worried about latency, maybe TCP is a bad choice to start with… I hate to see people using TCP for everything, without a minimal understanding of which problems TCP is meant to solve, and especially which it isn't.

  • TCP solves for "when I send a message I want the other side to actually receive it", which is… fairly common.

    • TCP enforces a much stricter ordering than desirable (head-of-line blocking). QUIC does a much better job of emulating a stream of independent tasks.

This sounds like the root of the VNC viewer/server interaction bugs I experience with some VNC viewer/server combos between Ubuntu Linux and FreeBSD… (tight/tiger)

Here is the thing. Nagle's and the delayed ACK may suck for individual app performance, but fewer packets on the network is better for the entire network.

Is this something I should also adjust on my personal Ubuntu machine for better network performance?

Use UDP ;)

  • Too many applications end up reinventing TCP or SCTP in user-space. Also, network-level QoS applied to unrecognized UDP protocols typically means it gets throttled before TCP. Use UDP when nothing else will work, when the use-case doesn't need a persistent connection, and when no other messaging or transport library is suitable.

>To make a clearer case, let’s turn back to the justification behind Nagle’s algorithm: amortizing the cost of headers and avoiding that 40x overhead on single-byte packets. But does anybody send single byte packets anymore?

That is a bit of a strawman. While he uses single-byte packets as the worst-case example, the issue as stated is any non-full packet.

Apropos repost from 2015:

> That still irks me. The real problem is not tinygram prevention. It's ACK delays, and that stupid fixed timer. They both went into TCP around the same time, but independently. I did tinygram prevention (the Nagle algorithm) and Berkeley did delayed ACKs, both in the early 1980s. The combination of the two is awful. Unfortunately by the time I found about delayed ACKs, I had changed jobs, was out of networking, and doing a product for Autodesk on non-networked PCs.

> Delayed ACKs are a win only in certain circumstances - mostly character echo for Telnet. (When Berkeley installed delayed ACKs, they were doing a lot of Telnet from terminal concentrators in student terminal rooms to host VAX machines doing the work. For that particular situation, it made sense.) The delayed ACK timer is scaled to expected human response time. A delayed ACK is a bet that the other end will reply to what you just sent almost immediately. Except for some RPC protocols, this is unlikely. So the ACK delay mechanism loses the bet, over and over, delaying the ACK, waiting for a packet on which the ACK can be piggybacked, not getting it, and then sending the ACK, delayed. There's nothing in TCP to automatically turn this off. However, Linux (and I think Windows) now have a TCP_QUICKACK socket option. Turn that on unless you have a very unusual application.

> Turning on TCP_NODELAY has similar effects, but can make throughput worse for small writes. If you write a loop which sends just a few bytes (worst case, one byte) to a socket with "write()", and the Nagle algorithm is disabled with TCP_NODELAY, each write becomes one IP packet. This increases traffic by a factor of 40, with IP and TCP headers for each payload. Tinygram prevention won't let you send a second packet if you have one in flight, unless you have enough data to fill the maximum sized packet. It accumulates bytes for one round trip time, then sends everything in the queue. That's almost always what you want. If you have TCP_NODELAY set, you need to be much more aware of buffering and flushing issues.

> None of this matters for bulk one-way transfers, which is most HTTP today. (I've never looked at the impact of this on the SSL handshake, where it might matter.)

> Short version: set TCP_QUICKACK. If you find a case where that makes things worse, let me know.

> John Nagle

(2015)

https://man.netbsd.org/NetBSD-8.0/tcp.4