Comment by ironman1478

2 years ago

I've fixed latency issues caused by Nagle's algorithm multiple times in my career. It's the first thing I jump to. I feel like the logic behind it is sound, but it just doesn't work for some workloads. It should be something that an engineer needs to be forced to set while creating a socket, instead of letting the OS choose a default. I think that's the main issue: not that it's a good / bad option, but that there is a setting that people might not know about that manipulates how data is sent over the wire so aggressively.

Same here. I have a hobby: on any RPC framework I encounter, I file a GitHub issue asking "did you think of TCP_NODELAY, or can this framework do only 20 calls per second?".

So far, it's found a bug every single time.

Some examples: https://cloud-haskell.atlassian.net/browse/DP-108 or https://github.com/agentm/curryer/issues/3
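
For what it's worth, the fix itself is tiny once you know to look for it. A minimal sketch (helper name mine, error handling omitted) of disabling Nagle's algorithm on a connected socket:

    /* Disable Nagle's algorithm so small writes go out immediately. */
    #include <netinet/in.h>   /* IPPROTO_TCP */
    #include <netinet/tcp.h>  /* TCP_NODELAY */
    #include <sys/socket.h>   /* setsockopt */

    int disable_nagle(int sockfd)
    {
        int one = 1;
        return setsockopt(sockfd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));
    }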

I disagree on the "not a good / bad option" though.

It's a kernel-side heuristic for "magically fixing" badly behaved applications.

As the article states, no sensible application does 1-byte network write() syscalls. Software that does that should be fixed.

It makes sense only in the case when you are the kernel sysadmin and somehow cannot fix the software that runs on the machine, maybe for team-political reasons. I claim that's pretty rare.

For all other cases, it makes sane software extra complicated: You need to explicitly opt-out of odd magic that makes poorly-written software have slightly more throughput, and that makes correctly-written software have huge, surprising latency.

John Nagle says here and in linked threads that Delayed Acks are even worse. I agree. But the Send/Send/Receive pattern that Nagle's Algorithm degrades is a totally valid and common use case, including anything that does pipelined RPC over TCP.

Both Delayed Acks and Nagle's Algorithm should be opt-in, in my opinion. It should be called TCP_DELAY, which you can opt into if you can't be bothered to implement basic userspace buffering.

People shouldn't /need/ to know about these. Make the default case be the unsurprising one.

  • "As the article states, no sensible application does 1-byte network write() syscalls." - the problem that this flag was meant to solve was that when a user was typing at a remote terminal, which used to be a pretty common use case in the 80's (think telnet), there was one byte available to send at a time over a network with a bandwidth (and latency) severely limited compared to today's networks. The user was happy to see that the typed character arrived to the other side. This problem is no longer significant, and the world has changed so that this flag has become a common issue in many current use cases.

    Was terminal software poorly written? I don't feel comfortable making such a judgment. It was designed for a constrained environment with different priorities.

    Anyway, I agree with the rest of your comment.

    • > when a user was typing at a remote terminal, which used to be a pretty common use case in the 80's

      Still is for some. I’m probably working in a terminal on an ssh connection to a remote system for 80% of my work day.

    • It was not just a bandwidth issue. I remember my first encounter with the Internet was on an HP workstation in Germany connected to South Africa with telnet. The connection went over a Datex-P (X.25) 2400 baud line. The issue with X.25 nets was that they were expensive: the monthly rent was around 500 DM, and each packet sent also cost a few cents. You would really try to optimize the use of the line, and interactive rsh or telnet traffic was definitely not ideal.

  • > As the article states, no sensible application does 1-byte network write() syscalls. Software that does that should be fixed.

    Yes! And worse, those that do are not gonna be “fixed” by delays either. In this day and age with fast internets, a syscall per byte will bottleneck the CPU way before it'll saturate the network path. The CPU limit when I've been tuning buffers has been somewhere in the 4k-32k range for 10 Gbps-ish.

    > Both Delayed Acks and Nagle's Algorithm should be opt-in, in my opinion.

    Agreed, it causes more problems than it solves and is very outdated. Now, the challenge is rolling out such a change as smoothly as possible, which requires coordination and a lot of trivia knowledge of legacy systems. Migrations are never trivial.

  • The problem with making it opt in is that the point of the protocol was to fix apps that, while they perform fine for the developer on his LAN, would be hell on internet routers. So the people who benefit are the ones who don't know what they are doing and only use the defaults.

  • It's very easy to end up with small writes. E.g.

      1. Write four bytes (length of frame)
      2. Write the frame itself
    

    The easiest fix in C code, with the least chance of introducing a buffer overflow or bad performance, is to keep these two pieces of information in separate buffers and use writev. (How portable is that compared to send?)
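
    A minimal sketch of that approach (the helper name and the big-endian length prefix are my assumptions; a real version must loop on short writes):

      #include <arpa/inet.h>  /* htonl */
      #include <stdint.h>
      #include <sys/types.h>
      #include <sys/uio.h>    /* writev */

      /* Send the 4-byte length prefix plus the frame in one syscall,
         without copying the frame into a combined buffer. */
      ssize_t send_frame(int fd, const void *frame, uint32_t len)
      {
          uint32_t be_len = htonl(len);
          struct iovec iov[2] = {
              { .iov_base = &be_len,       .iov_len = sizeof(be_len) },
              { .iov_base = (void *)frame, .iov_len = len },
          };
          return writev(fd, iov, 2);
      }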

    If you have to combine the two into one flat frame, you're looking at allocating and copying memory.

    Linux has something called corking: you can "cork" a socket (so that it doesn't transmit), write some stuff to it multiple times and "uncork". It's extra syscalls though, yuck.
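
    Roughly, assuming a header+body pair (helper name mine, error handling and short-write loops omitted):

      #include <netinet/in.h>
      #include <netinet/tcp.h>  /* TCP_CORK */
      #include <sys/socket.h>
      #include <unistd.h>       /* write */

      /* Queue small writes in the kernel while corked, then uncork so
         they go out as full-sized packets. */
      void send_corked(int fd, const void *hdr, size_t hdr_len,
                       const void *body, size_t body_len)
      {
          int on = 1, off = 0;
          setsockopt(fd, IPPROTO_TCP, TCP_CORK, &on, sizeof(on));
          write(fd, hdr, hdr_len);    /* held back, not yet sent */
          write(fd, body, body_len);  /* still held back */
          setsockopt(fd, IPPROTO_TCP, TCP_CORK, &off, sizeof(off));  /* uncork: flush */
      }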

    You could use a buffered stream where you control flushes: basically another copying layer.
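
    For example, a sketch using stdio (the 16 KB buffer size is an arbitrary choice):

      #include <stdio.h>

      /* Wrap a connected socket fd in a fully buffered stdio stream;
         small fwrite()s accumulate in userspace until fflush() pushes
         them to the kernel in one write(). */
      FILE *buffered_socket(int sockfd)
      {
          FILE *out = fdopen(sockfd, "w");
          if (out)
              setvbuf(out, NULL, _IOFBF, 16 * 1024);
          return out;
      }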

  • Thanks for the reminder to set this on the new framework I’m working on. :)

  •      I have a hobby that on any RPC framework I encounter, I file a Github issue "did you think of TCP_NODELAY or can this framework do only 20 calls per second?".
    

    So true. Just last month we had to apply the TCP_NODELAY fix to one of our libraries. :)

  • Would one not also get clobbered by all the sys calls for doing many small packets? It feels like coalescing in userspace is a much better strategy all round if that's desired, but I'm not super experienced.

> It should be something that an engineer needs to be forced to set while creating a socket, instead of letting the OS choose a default.

If the intention is mostly to fix applications with bad `write`-behavior, this would make setting TCP_DELAY a pretty exotic option: you would need a software engineer smart enough to know to set this option, yet not smart enough to distribute their write calls well and/or to write their own (probably better-fitted) application-specific version of Nagle's.

I agree; disabling Nagle's Algorithm has been fairly well known in HFT/low-latency trading circles for quite some time now (like > 15 years). It's one of the first things I look for.

  • Surely serious HFT systems bypass TCP altogether nowadays. In that world, every millisecond of latency can potentially cost a lot of money.

    These are the guys that use microwave links to connect to exchanges because fibre-optics have too much latency.

    • They still need to send their orders to an exchange, which is often done with FIX protocol over TCP (some exchanges have binary protocols which are faster than FIX, but the ones I'm aware of still use TCP)

What you really want is for the delay to be n microseconds, but there’s no good way to do that except putting your own user space buffering in front of the system calls (user space works better, unless you have something like io_uring amortizing system call times)
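
A sketch of what that userspace layer can look like (names and the size threshold are mine; the caller's event loop is assumed to arm a timer, e.g. a timerfd or an epoll timeout, for the n-microsecond deadline):

    #include <stdint.h>
    #include <string.h>
    #include <time.h>     /* clock_gettime */
    #include <unistd.h>   /* write */

    /* Coalesce small writes and flush either when the buffer is big
       enough or when the oldest unsent byte is older than
       flush_after_us. Short writes are not handled here. */
    struct outbuf {
        int      fd;
        uint8_t  buf[16 * 1024];
        size_t   used;
        int64_t  oldest_us;      /* timestamp of first unflushed byte */
        int64_t  flush_after_us; /* the "n microseconds" knob */
    };

    static int64_t now_us(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (int64_t)ts.tv_sec * 1000000 + ts.tv_nsec / 1000;
    }

    static void out_flush(struct outbuf *o)
    {
        if (o->used) {
            write(o->fd, o->buf, o->used);
            o->used = 0;
        }
    }

    void out_append(struct outbuf *o, const void *p, size_t n)
    {
        if (n > sizeof(o->buf)) {   /* oversized: bypass the buffer */
            out_flush(o);
            write(o->fd, p, n);
            return;
        }
        if (o->used + n > sizeof(o->buf))
            out_flush(o);           /* size-based flush */
        if (o->used == 0)
            o->oldest_us = now_us();
        memcpy(o->buf + o->used, p, n);
        o->used += n;
    }

    void out_maybe_flush(struct outbuf *o)  /* call on each timer tick */
    {
        if (o->used && now_us() - o->oldest_us >= o->flush_after_us)
            out_flush(o);
    }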

The logic is really for things like Telnet sessions. IIRC that was the whole motivation.

  • And for block writes!

    The Nagler turns a series of 4KB pages over TCP into a stream of MTU sized packets, rather than a short packet aligned to the end of each page.

> that an engineer needs to be forced to set while creating a socket

Because there aren't enough steps in setting up sockets! Haha.

I suspect that what would happen is that many of the programming language run-times in the world which have easier-to-use socket abstractions would pick a default and hide it from the programmer, so as not to expose an extra step.

> I feel like the logic behind it is sound, but it just doesn't work for some workloads.

The logic is only sound for interactive plaintext typing workloads. It should have been turned off by default 20 years ago, let alone now.

  • Remember that IPv4's original "target replacement date" (as it was only an "experimental" protocol) was 1990...

    And a common thing in many more complex/advanced protocols was to explicitly delineate "messages", which avoids the issue of Nagle's algorithm altogether.

Same here. My first job out of college was at a database company. Queries at the client side of the client-server database were slow. It was thought the database server was slow, as hardware back then was pretty pathetic. I traced it down to the network driver and found out that TCP_NODELAY was off by default. I looked like a hero when I turned on that option and the db benchmarks jumped up.

Not when creating a socket - when sending data. When sending data, you should indicate whether this data block prefers high throughput or low latency.
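
Linux has something in that direction already: the MSG_MORE flag on send() is a per-call hint that more data is coming, so the kernel may batch it. A rough sketch (helper name mine, error handling and short-send loops omitted):

    #include <stddef.h>
    #include <sys/socket.h>

    /* Per-send hint instead of a per-socket default (Linux). */
    void send_message(int fd, const void *hdr, size_t hdr_len,
                      const void *body, size_t body_len)
    {
        send(fd, hdr,  hdr_len,  MSG_MORE);  /* prefers throughput: may batch */
        send(fd, body, body_len, 0);         /* last piece: push it out now */
    }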

You’re right re: making the delay explicit, but also, crappy userspace networking tools don’t show whether TCP_NODELAY is enabled on sockets.

Last time I had to do some Linux stuff, maybe 10 years ago, you had to write a SystemTap program. I guess it’s eBPF now. But I bet the userspace tools still suck.
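
At least within your own process you can query it directly; a small sketch (helper name mine):

    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <stdio.h>
    #include <sys/socket.h>

    /* Print whether TCP_NODELAY is set on a socket we own. */
    void report_nodelay(int fd)
    {
        int on = 0;
        socklen_t len = sizeof(on);
        if (getsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &on, &len) == 0)
            printf("fd %d: TCP_NODELAY is %s\n", fd, on ? "on" : "off");
    }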