
Comment by theamk

2 years ago

I don't buy the reasoning for never needing Nagle anymore. Sure, telnet isn't a thing today, but I bet there are still plenty of apps which do the equivalent of:

     write(fd, "Host: ", 6);
     write(fd, hostname, strlen(hostname));
     write(fd, "\r\n", 2);
     write(fd, "Content-type: ", 14);
     /* etc... */

This may not be 40x overhead, but it'd still be 5x or so.

Fix the apps. Nobody expects magical perf if you do that when writing to files, even though the OS has its own buffers there too. There is no reason to expect otherwise when writing to a socket, and Nagle already doesn't save you from syscall overhead anyway.
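(To be concrete, "fix the apps" is a sketch like the following, assuming a plain blocking TCP socket; `build_request` is a made-up helper for illustration, not from any library:)

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: assemble the header fragments into one buffer
   so the caller can send them with a single write(fd, buf, len)
   instead of one write() per fragment. Returns 0 on truncation. */
static size_t build_request(char *buf, size_t cap, const char *hostname)
{
    int n = snprintf(buf, cap,
                     "GET / HTTP/1.1\r\n"
                     "Host: %s\r\n"
                     "Content-type: text/plain\r\n"
                     "\r\n",
                     hostname);
    return (n < 0 || (size_t)n >= cap) ? 0 : (size_t)n;
}
```

One buffer, one write(), and the Nagle vs. TCP_NODELAY question no longer matters for this path.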

  • Nagle doesn't save the derpy side from syscall overhead, but it would save the other side.

    It's not just apps doing this stuff, it also lives in system libraries. I'm still mad at the Android HTTPS library for sending chunked uploads as so many tinygrams. I don't remember exactly, but I think it was reasonable packetization for the data chunk (if it picked a reasonable size, anyway), then one packet for the \r\n, one for the size, and another for the trailing \r\n. There's no reason for that, but it doesn't hurt the client enough that I can convince them to avoid the system library so they can fix it and the server can manage more throughput. Ugh. (It might be that only the TLS packetization was this bogus and the TCP packetization was fine; it's been a while.)

    If you take a pcap for some specific issue, there are always so many of these other terrible things in there. </rant>

  • I agree that such code should be fixed, but I have a hard time persuading developers to fix it. Many of them don't know what a syscall is, how making a syscall triggers sending an IP packet, how a library call translates to a syscall, etc. Worse, they don't want to know this: they write, say, Java code (or some other high-level language) and argue that the libraries/JDK/kernel should handle all the 'low level' stuff.

    To get optimal performance for request-response protocols like HTTP, one should send the full request (request line, all headers, and the POST body) with a single write syscall (unless the POST body is large and it makes sense to write it in chunks). Unfortunately, not all HTTP libraries work this way, and a library user cannot fix this problem without switching libraries, which is: 1. not always easy, and 2. it is not widely known which libraries are efficient and which are not. Even if you have your own HTTP library, it's not always trivial to fix: e.g. in Java, a way to fix this while keeping the code readable and idiomatic is to wrap the socket in a BufferedOutputStream, which adds one more memory-to-memory copy for all the data you send, on top of the at least one memory-to-memory copy you already have without a buffered stream; so it's not an obvious performance win for an application that already saturates memory bandwidth.

  • > Fix the apps. Nobody expects magical perf if you do that when writing to files,

    We write to files line-by-line or even character-by-character and expect the library or OS to "magically" buffer it into fast file writes. Same with memory. We expect multiple small mallocs to be smartly coalesced by the platform.

    • If you expect a POSIX-y OS to buffer write(2) calls, you're sadly misguided. Whether or not that happens depends on the nature of the file or device you're writing to.

      OTOH, if you're using fwrite(3), as you likely should be for actual file I/O, then your expectation is entirely reasonable.

      Similarly with memory. If you expect brk(2) to handle multiple small allocations "sensibly" you're going to be disappointed. If you use malloc(3) then your expectation is entirely reasonable.

      3 replies →

    • True to a degree. But that is a singular platform wholly controlled by the OS.

      Once you put packets out into the world you're in a shared space.

      I assume every conceivable variation of the argument has been made both for and against Nagle's at this point, but it essentially revolves around a shared networking resource and what policy is in place for fair use.

      Nagle's fixes a particular case but interferes overall. If you fix the particular-case app, the issue goes away.

    • Yes, your libraries should fix that. The OS (as in the kernel) should not try to do any abstraction.

      Alas, kernels really like to offer abstractions.

  • Everybody expects magical perf if you do that when writing files. We have RAM buffers and write caches for a reason, even on fast SSDs. We expect it so much that macOS doesn't flush to disk even when you call fsync() (files get flushed to the disk's write buffer instead).

    There's some overhead to calling write() in a loop, but it's certainly not as bad as it would be if every call to write() actually pushed the data all the way down whatever output stream you call it on.

  • Those are the apps that are quickly written and don't care if they unnecessarily congest the network. The ones that do get properly maintained can set TCP_NODELAY. Seems like a reasonable default to me.

  • We actually have similar behavior when writing to files: contents are buffered in the page cache and written to disk later in batches, unless the user explicitly calls "sync".

  • Apps can always misbehave; you never know what people implement, and you don't always have source code to patch. I don't think the role of the OS is to let apps do whatever they wish, but it should give them the possibility of doing so when needed. So I'd rather say: if you know you're doing things properly and you're latency sensitive, just set TCP_NODELAY on all your sockets and you're fine, and nobody will blame you for doing it.

  • I would love to fix the apps, can you point me to the github repo with all the code written the last 30 years so I can get started?

The comment about telnet had me wondering what openssh does, and it sets TCP_NODELAY on every connection, even for interactive sessions. (Confirmed by both reading the code and observing behaviour in 'strace').
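(For reference, disabling Nagle is a one-line socket option; a minimal POSIX sketch, with error handling left to the caller:)

```c
#include <netinet/in.h>
#include <netinet/tcp.h>  /* TCP_NODELAY */
#include <sys/socket.h>

/* Disable Nagle's algorithm on a TCP socket, as OpenSSH does for
   every connection. Returns 0 on success, -1 on error (see errno). */
static int set_nodelay(int fd)
{
    int one = 1;
    return setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof one);
}
```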

  • Especially for interactive sessions, it absolutely should! :)

    • Ironic since Nagle's Algorithm (which TCP_NODELAY disables) was invented for interactive sessions.

      It's hard to imagine interactive sessions making more than the tiniest of blips on a modern network.

      6 replies →

I don't think that's actually super common anymore when you consider that, with asynchronous I/O, the only sane approach is to put the data into a buffer rather than blocking on every small write(2).

Then consider that asynchronous I/O is usually necessary both on the server (otherwise you don't scale well) and the client (because blocking on network calls is a terrible experience, especially in today's world of frequent network changes, falling out of network range, etc.)

And they really shouldn't do this. Even disregarding the network aspect of it, this is still bad for performance because syscalls are kinda expensive.

Marc addresses that: “That’s going to make some “write every byte” code slower than it would otherwise be, but those applications should be fixed anyway if we care about efficiency.”

Does this matter? Yes, there's a lot of waste. But you also have a 1Gbps link. Every second that you don't use the full 1Gbps is also waste, right?

  • This is why I always pad out the end of my html files with a megabyte of &nbsp;. A half empty pipe is a half wasted pipe.

    • Just be sure HTTP Compression is off though, or you're still half-wasting the pipe.

      Better to just dump randomized uncompressible data into html comments.

    • I think that's an unfair comparison. By using Nagle's algorithm for interactive work, you save bytes, but the software you're interacting with is that much less responsive. (If the client was responsible for echoing typed characters, then it wouldn't matter. But ssh and telnet don't work like that, unfortunately.)

      So by saving bytes and leaving your pipe empty, you just suffer in user experience. Why not use something you're already paying for to make your life better?

      (In the end, it seems like SSH agrees with me, and just wastes the bytes by enabling TCP_NODELAY.)

Those aren't the ones you debug, so they won't be seen by OP. Those are the ones you don't need to debug because Nagle saves you.

Even if you do nothing 'fancy' like Nagle, corking, or building up the complete buffer in userspace before writing, at the very least the code above should be using a vectored write (writev()).
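For example, the fragments from the snippet at the top could go out in one syscall with no intermediate copy; a sketch, assuming a blocking socket and ignoring partial writes:

```c
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>  /* writev, struct iovec */

/* Send the header fragments with one scatter-gather syscall: the
   kernel sees a single write, so Nagle vs. TCP_NODELAY no longer
   decides how many packets these fragments become. */
static ssize_t send_host_header(int fd, const char *hostname)
{
    struct iovec iov[3] = {
        { .iov_base = "Host: ",         .iov_len = 6 },
        { .iov_base = (void *)hostname, .iov_len = strlen(hostname) },
        { .iov_base = "\r\n",           .iov_len = 2 },
    };
    return writev(fd, iov, 3);
}
```

(A real implementation would loop on short writes; this just shows the shape of the call.)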

Shouldn’t that go through some buffer? Unless you fflush() between each write?

I imagine the write calls show up pretty easily as a bottleneck in a flamegraph.

  • They don't. Maybe if you're really good you'll notice the higher overhead, but you expect to be spending time writing to the network. The actual impact shows up when bandwidth consumption is way up due to packet and TCP headers, which won't show on a flamegraph that easily.

The discussion here mostly seems to miss the point. The argument is to change the default, not to eliminate the behavior altogether.