Comment by amluto

3 years ago

IMO the real problem is that the socket API is insufficient, and the Nagle algorithm is a kludge around that.

When sending data, there are multiple logical choices:

1. This is part of a stream of data but more is coming soon (once it gets computed, once there is buffer space, or simply once the sender loops again).

2. This is the end of a logical part of the stream, and no more is coming right now.

3. This is latency-sensitive.

For case 1, there is no point in sending a partially full segment. Nagle may send a partial segment, which is silly. For case 2, Nagle is probably reasonable, but may be too conservative. For case 3, Nagle is wrong.

But the socket API is what it is, no one seems to want to fix this, and we’re stuck with a lousy situation.

I'm pretty convinced that every foundational OS abstraction that we use today, most of which were invented in the '70s or '80s, is wrong for modern computing environments. It just sucks less for some people than for other people.

I do think Golang's choice of defaulting to TCP_NODELAY is probably right - they expect you to have some understanding that you should probably send large packets if you want to send a lot of stuff, and you likely do not want packets being Nagled if you have 20 bytes you want to send now. TCP_QUICKACK also seems wrong in a world with data caps - the unnecessary ACKs are going to add up.
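(For reference, both of these are per-socket options set with setsockopt(); a minimal sketch, assuming a connected TCP socket fd and Linux for TCP_QUICKACK, which the kernel may clear again on its own:)

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Sketch: disable Nagle (TCP_NODELAY) and request immediate ACKs
 * (TCP_QUICKACK, Linux-only and not sticky). Assumes fd is a
 * connected TCP socket. */
static int tune_socket(int fd)
{
    int one = 1;

    if (setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one)) < 0)
        return -1;
#ifdef TCP_QUICKACK
    if (setsockopt(fd, IPPROTO_TCP, TCP_QUICKACK, &one, sizeof(one)) < 0)
        return -1;
#endif
    return 0;
}
```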

Issuing a SEND syscall every 50 bytes is also horrendously CPU-inefficient, and certainly should be expected to trigger pathological cases.

At this point, the OS is basically expected to guess what you actually want to do from how you incant around its bad abstractions, so it's not surprising that sending megabytes of data 50 bytes at a time would trigger some weird slowdowns.

  • > Issuing a SEND syscall every 50 bytes is also horrendously CPU-inefficient

    This is the real crime here. The fact that it maxed out at 2.5Mb/s might quite literally be due to a CPU limit.

    If you are streaming a large amount of data, you should use a user space buffer anyway, especially if you have small chunks. In Golang, buffers are standard practice and a one-liner to add.

    • *pedantry warning*

      In practice, buffers are more than a one-liner, as you probably want to deal with flushing them at some out-of-band moment (+1 line) as well as handling the error from that (+3 lines).
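      To make that concrete, here's a rough C sketch of the same pattern (not Go's bufio, just the idea): collect small writes in user space, flush at message boundaries, and actually check the error from the flush. Assumes a blocking, connected socket fd.

      ```c
      #include <errno.h>
      #include <string.h>
      #include <sys/socket.h>
      #include <sys/types.h>

      #define BUF_CAP 8192

      struct wbuf {
          int    fd;
          size_t len;
          char   data[BUF_CAP];
      };

      /* Push everything currently buffered to the kernel. Returns 0 or -1;
       * the caller still has to handle the error (the "+3 lines" above). */
      static int wbuf_flush(struct wbuf *b)
      {
          size_t off = 0;

          while (off < b->len) {
              ssize_t n = send(b->fd, b->data + off, b->len - off, 0);
              if (n < 0) {
                  if (errno == EINTR)
                      continue;
                  return -1;
              }
              off += (size_t)n;
          }
          b->len = 0;
          return 0;
      }

      /* Append to the buffer, flushing whenever it fills up, so the kernel
       * sees a few large send()s instead of one per 50-byte chunk. */
      static int wbuf_write(struct wbuf *b, const void *p, size_t n)
      {
          const char *src = p;

          while (n > 0) {
              if (b->len == BUF_CAP && wbuf_flush(b) < 0)
                  return -1;
              size_t chunk = BUF_CAP - b->len;
              if (chunk > n)
                  chunk = n;
              memcpy(b->data + b->len, src, chunk);
              b->len += chunk;
              src    += chunk;
              n      -= chunk;
          }
          return 0;
      }
      ```

      The caller still has to remember to call wbuf_flush() at the end of each logical message, which is exactly the out-of-band flush (plus error handling) mentioned above.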

  • > Issuing a SEND syscall every 50 bytes is also horrendously CPU-inefficient

    io_uring is supposed to help with that
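    Roughly: you queue many send operations and submit them with a single syscall. A hedged sketch using liburing (assumes Linux 5.6+, at most 64 chunks, and skips most error handling):

    ```c
    #include <liburing.h>
    #include <stddef.h>
    #include <string.h>

    /* Sketch: queue one SQE per chunk and submit them all with a single
     * io_uring_enter() call instead of one send() syscall per chunk.
     * IOSQE_IO_LINK keeps the sends ordered on the TCP stream. */
    static int send_chunks(int sockfd, const char *chunks[], size_t nchunks)
    {
        struct io_uring ring;

        if (io_uring_queue_init(64, &ring, 0) < 0)   /* assumes nchunks <= 64 */
            return -1;

        for (size_t i = 0; i < nchunks; i++) {
            struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
            io_uring_prep_send(sqe, sockfd, chunks[i], strlen(chunks[i]), 0);
            if (i + 1 < nchunks)
                io_uring_sqe_set_flags(sqe, IOSQE_IO_LINK);
        }
        io_uring_submit(&ring);                      /* one syscall for all sends */

        for (size_t i = 0; i < nchunks; i++) {       /* reap completions */
            struct io_uring_cqe *cqe;
            if (io_uring_wait_cqe(&ring, &cqe) < 0)
                break;
            io_uring_cqe_seen(&ring, cqe);           /* short/failed sends ignored */
        }
        io_uring_queue_exit(&ring);
        return 0;
    }
    ```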

This seems like it should be very simple to fix without having to do much to the API. Just implement a flush() function for TCP sockets that tells the stack to kick the current buffer out to the wire immediately. It seems so obvious that I think I must be missing something. Why didn't this appear in the 80s?

  • It’s not portable, but Linux has a TCP_CORK socket option that does this.

    • Here's how to emulate TCP_CORK using TCP_NODELAY, from [0]:

      - Unset the TCP_NODELAY flag on the socket

      - Call send() zero or more times to add your outgoing data into the Nagle-queue

      - Set the TCP_NODELAY flag on the socket

      - Call send() with the number-of-bytes argument set to zero, to force an immediate send of the Nagle-queued data

      [0] https://stackoverflow.com/a/22118709
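      In C, those steps come out to roughly this sketch (following the answer above; on Linux you would more likely just set and clear TCP_CORK itself; error checks omitted):

      ```c
      #include <netinet/in.h>
      #include <netinet/tcp.h>
      #include <sys/socket.h>
      #include <sys/uio.h>

      /* Sketch of the TCP_CORK emulation described above: let Nagle queue the
       * pieces, then turn TCP_NODELAY back on and do a zero-byte send to push
       * whatever is still buffered out to the wire. Error checks omitted. */
      static void send_batch(int fd, const struct iovec *iov, int iovcnt)
      {
          int off = 0, on = 1;

          setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &off, sizeof(off)); /* 1: unset */
          for (int i = 0; i < iovcnt; i++)                             /* 2: queue */
              send(fd, iov[i].iov_base, iov[i].iov_len, 0);
          setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &on, sizeof(on));   /* 3: set   */
          send(fd, "", 0, 0);                                          /* 4: flush */
      }
      ```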


It's a downside of the "everything is a file" mindset. Like all abstractions, it's leaky.

Nagle's algorithm is elegant because it allows poorly written applications to saturate a PHY.

Disabling it requires the application layer to implement its own buffer.

If I had a time machine and access to the early *nixes, I'd extend Nagle's algorithm and the kernel to treat fsync() as a signal to flush immediately.

> But the socket API is what it is, no one seems to want to fix this, and we’re stuck with a lousy situation.

Linux/FreeBSD/... have had the TCP corking API for what, 20 years?

  • IMO MSG_MORE is a substantially better interface. Sadly it seems to be rarely used.

    • My colleague added MSG_MORE support throughout libnbd[1]. It proved quite an elegant way to solve a common problem: you want to assemble a message in whatever protocol you're using, but it's probably being assembled across many functions (or, in the case of libnbd, states in a complicated state machine), and using expanding buffers or whatever is a pain. So instead we let the kernel assemble it, or allow the kernel to make the decision to group the data or send it. The downside is multiple socket calls, but combining it with io_uring is one way to avoid that.

      [1] https://gitlab.com/search?search=MSG_MORE&nav_source=navbar&...

    • Oh that is truly elegant, I didn't know about that.

      Basically you set the MSG_MORE flag when you call `send` if you know you will have more data to send very soon, so the kernel is free to wait to form an optimally-sized packet instead of sending many small packets every time you run that syscall.
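      In plain C it's just an extra flag on send(); a small sketch sending a header and body as one logical message without assembling them in a user-space buffer first (MSG_MORE is Linux-specific, error checks omitted):

      ```c
      #include <stddef.h>
      #include <sys/socket.h>

      /* Sketch: MSG_MORE tells the kernel more data for this stream is coming,
       * so it can coalesce the header and body into full-sized segments. */
      static void send_message(int fd, const void *hdr, size_t hdr_len,
                               const void *body, size_t body_len)
      {
          send(fd, hdr, hdr_len, MSG_MORE);  /* more coming: don't send a runt yet */
          send(fd, body, body_len, 0);       /* last piece: kernel may flush now   */
      }
      ```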

Latency can be affected by both CPU load and network congestion, so it's possible that Nagle's algorithm can help in Case 3. It's really trial and error to see what works best in practice.