Comment by voxic11

2 years ago

Keepalives are an optional TCP feature, so not every TCP implementation supports them, and they therefore default to off even where they are supported.


dilyevsky 2 years ago

Where is it off? Most Linux distros have it on; it's just that the default kickoff timer is ridiculously long (like 2 hours iirc). Besides, TCP keepalives won't help with the issue at hand and were put in for a totally different purpose (gc'ing idle connections). Most of the time you don't even need them because the other side will send an RST packet if it has already closed the socket.

halter73 2 years ago
AFAIK, all Linux distros plus Windows and macOS have TCP keepalives off by default, as mandated by RFC 1122 [0]. Even when they are turned on with SO_KEEPALIVE, the interval defaults to two hours because that is the minimum default interval allowed by the spec. It can then be reduced with something like /proc/sys/net/ipv4/tcp_keepalive_time (system wide) or TCP_KEEPIDLE (per socket).
By default, completely idle TCP connections will stay alive indefinitely from the perspective of both peers even if their physical connection is severed.
> Implementors MAY include "keep-alives" in their TCP implementations, although this practice is not universally accepted. If keep-alives are included, the application MUST be able to turn them on or off for each TCP connection, and they MUST default to off. Keep-alive packets MUST only be sent when no data or acknowledgement packets have been received for the connection within an interval. This interval MUST be configurable and MUST default to no less than two hours.
[0]: https://datatracker.ietf.org/doc/html/rfc1122#page-101
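
For anyone who wants to see the per-socket knobs in practice, here is a minimal Python sketch. The helper name `enable_keepalive` and the 60s/10s/5-probe values are my own illustrative assumptions, not recommendations from the RFC or any OS. The timer options are platform-specific (TCP_KEEPIDLE, for example, is not exposed everywhere), so the sketch only sets the ones Python's `socket` module actually provides on the current platform.

```python
import socket

def enable_keepalive(sock, idle_s=60, interval_s=10, probes=5):
    """Turn on TCP keepalive for one socket and shorten its timers where supported.

    The default values here are illustrative assumptions only.
    """
    # SO_KEEPALIVE is the per-socket on/off switch; it is off unless requested.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The timer options are platform-specific, so set only those that exist.
    for name, value in (("TCP_KEEPIDLE", idle_s),      # idle time before first probe
                        ("TCP_KEEPINTVL", interval_s),  # gap between probes
                        ("TCP_KEEPCNT", probes)):       # unanswered probes before giving up
        opt = getattr(socket, name, None)
        if opt is not None:
            sock.setsockopt(socket.IPPROTO_TCP, opt, value)
    # Rough time to notice a dead peer on an otherwise idle connection.
    return idle_s + interval_s * probes

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
default_on = sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE)  # off for a fresh socket
detect_after_s = enable_keepalive(sock)
now_on = sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE)      # nonzero once enabled
sock.close()
```

With these assumed values an idle connection to a vanished peer would be noticed after roughly 60 + 10 * 5 = 110 seconds, instead of the two-hour default.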
- dilyevsky 2 years ago
  
  OK you're right - it's coming back to me now. I've been spoiled by software that enables keep-alive on sockets.

mort96 2 years ago

So we need a protocol with some kind of non-optional default-enabled keepalive.

josefx 2 years ago
Now your connections start to randomly fail in production because the implementation defaults to 20ms and your local tests never caught that.
- mort96 2 years ago
  
  I'm sure there's some middle ground between "never time out" and "time out after 20ms" that works reasonably well for most use cases.
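
  To make "middle ground" concrete, here is a rough sketch of what a protocol-level keepalive with default-on timers might track. Everything in it is hypothetical: the `IdleMonitor` class, the 15s ping and 45s dead thresholds, and the "wait"/"ping"/"close" actions are illustrative assumptions, not taken from any real protocol.

```python
import time

class IdleMonitor:
    """Tracks when a peer was last heard from and decides what to do next.

    Hypothetical sketch: the thresholds are assumed defaults, not a standard.
    """

    def __init__(self, ping_after_s=15.0, dead_after_s=45.0, clock=time.monotonic):
        self.ping_after_s = ping_after_s
        self.dead_after_s = dead_after_s
        self.clock = clock              # injectable so tests need not sleep
        self.last_seen = clock()

    def heard_from_peer(self):
        # Call on any received data, including replies to pings.
        self.last_seen = self.clock()

    def action(self):
        idle = self.clock() - self.last_seen
        if idle >= self.dead_after_s:
            return "close"              # peer presumed gone
        if idle >= self.ping_after_s:
            return "ping"               # send a keepalive probe
        return "wait"
```

  The thresholds stay configurable, but the defaults are on and sit in the tens of seconds, which is far from both "never" and "20ms".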