Comment by gsliepen

20 hours ago

If you start with the problem of how to create a reliable stream of data on top of an unreliable datagram layer, then the solution that comes out will look virtually identical to TCP. It just is the right solution for the job.

The three drawbacks of the original TCP algorithm were the window size (the maximum value is just too small for today's speeds), poor handling of missing packets (addressed by extensions such as selective ACK), and the fact that it only manages one stream at a time, while some applications want multiple streams that don't block each other. You could use multiple TCP connections, but that adds its own overhead, so SCTP and QUIC were designed to address those issues.

The congestion control algorithm is not part of the on-the-wire protocol; it's just some code on each side of the connection that decides when to (re)send packets to make the best use of the available bandwidth. Anything that implements a reliable stream on top of datagrams needs to implement such an algorithm. The original ones (Reno, Vegas, etc.) were very simple but already did a good job, although back then network equipment didn't have large buffers. A lot of research is going into better algorithms that handle large buffers, large round-trip times, and varying bandwidth, and that stay fair when multiple connections share the same bandwidth.
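
To make the mechanism concrete, here is a minimal sketch of the additive-increase/multiplicative-decrease (AIMD) rule behind Reno-style behaviour; the constants and structure are illustrative only, not any particular stack's implementation:

```python
# Illustrative AIMD (additive increase, multiplicative decrease) window logic.
# Real stacks add slow start, fast retransmit/recovery, RTT estimation, etc.

MSS = 1460.0  # assumed maximum segment size, in bytes

def on_ack(cwnd: float) -> float:
    """Congestion avoidance: applied per ACK, grows cwnd by roughly 1 MSS per RTT."""
    return cwnd + MSS * MSS / cwnd

def on_loss(cwnd: float) -> float:
    """On detecting loss, halve the window (never below one segment)."""
    return max(cwnd / 2.0, MSS)
```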

> it only manages one stream at a time

I'll take flak for saying it, but I feel web developers are partially at fault for laziness on this one. I've often seen them trigger a swath of connections (e.g. for uncoordinated async events), when carefully managed multiplexing over one or a handful will do just fine.

E.g., in prehistoric times I wrote a JavaScript library that let you queue up several downloads over one stream, with control over prioritization and cancelability.

It was used in a GreaseMonkey script on a popular dating website, to fetch thumbnails and other details of all your matches in the background. Hovering over a match would bring up all their photos, and if some hadn't been retrieved yet they'd immediately move to the top of the queue. I intentionally wanted to limit the number of connections, to avoid oversaturating the server or the user's bandwidth. Idle time was used to prefetch all matches on the page (IIRC in a sensible order responsive to your scroll location). If you picked a large enough pagination, then stepped away to top up your coffee, by the time you got back you could browse through all of your recent matches instantly, without waiting for any server roundtrip lag.

It was pretty slick. I realize modern stacks give you multiplexing for free these days, but to put it in context, this was created in the era before even jQuery was well-known.
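
The original library is long gone, but a rough sketch of the same idea with today's tooling might look something like this; fetch stands in for whatever async HTTP call you use, and every name here is invented for illustration:

```python
# Rough sketch (not the original library): a prioritized prefetch queue that
# keeps only a couple of downloads in flight and lets the UI bump an item to
# the front of the line. fetch is a placeholder for an async HTTP call.
import asyncio
import heapq

class PrefetchQueue:
    def __init__(self, fetch, concurrency: int = 2):
        self._fetch = fetch
        self._heap: list[tuple[int, str]] = []   # (priority, url); lower runs sooner
        self._kick = asyncio.Event()
        self._concurrency = concurrency

    def enqueue(self, url: str, priority: int = 10) -> None:
        heapq.heappush(self._heap, (priority, url))
        self._kick.set()

    def bump(self, url: str) -> None:
        """Called e.g. on hover: move this URL to the front of the queue."""
        self._heap = [(0 if u == url else p, u) for p, u in self._heap]
        heapq.heapify(self._heap)

    async def _worker(self) -> None:
        while True:
            while not self._heap:
                self._kick.clear()
                await self._kick.wait()
            _, url = heapq.heappop(self._heap)
            await self._fetch(url)              # cancel by cancelling the task

    async def run(self) -> None:
        await asyncio.gather(*(self._worker() for _ in range(self._concurrency)))
```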

Funny story, I shared it with one of my matches and she found it super useful but was a bit surprised that, in a way, I was helping my competition. Turned out OK... we're still together nearly two decades later and now she generously jokes I invented Tinder before it was a thing.

  • This is wonderful to hear. I have a naive question. Is this the reason most websites/web servers absolutely need CDNs (apart from their edge capabilities): because they understand caching much better than a web developer does? But I would think the person closer to the user access pattern would know the optimal caching strategy.

    • Most websites do not need CDNs.

      CDNs became popular back in the old days, when some people thought that if two websites both used jquery-1.2.3.min.js, a CDN could cache it and the second site would load quicker. These days, browsers don't do that: they ignore cached assets from other websites, because partitioning the cache helps protect user privacy, and they value privacy over performance in this case.

      There are some reasons CDNs might be helpful. Edge capability is probably the most important one. Another reason is that serving lots of static data might be a complicated task for a small website, so it makes sense to offload it to a specialised service. These days, CDNs have gone beyond static data. They can hide your backend, so the public won't know its address and can't DDoS it. They can handle TLS for you. They can filter bots, Tor, and people from countries you don't like. All in a few clicks in the dashboard, no need to implement complicated solutions.

      But nothing you couldn't write yourself in a few days, really.

    • Generally, by default, CDNs don't necessarily cache anything. They just (try to) respect the cache headers that the developer provides in the response from the origin web server.

      So it's still up to the developer to provide the correct headers; otherwise you don't get much of a benefit.

      That said, some of them will do some default caching if the response is recognized as a static file, etc.
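
      For illustration (header values here are arbitrary examples, not a recommendation), "providing the correct headers" mostly means setting Cache-Control on the origin's responses so the CDN and the browser know what they may cache and for how long:

```python
# Illustrative only: an origin that marks fingerprinted static assets as
# cacheable for a long time and keeps dynamic pages out of shared caches.
from http.server import BaseHTTPRequestHandler, HTTPServer

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = b"hello"
        self.send_response(200)
        if self.path.startswith("/static/"):
            # Long-lived shared caching for fingerprinted assets.
            self.send_header("Cache-Control", "public, max-age=31536000, immutable")
        else:
            # Dynamic pages: force revalidation instead of blind caching.
            self.send_header("Cache-Control", "private, no-cache")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

HTTPServer(("", 8080), Handler).serve_forever()
```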

  • [Not a web dev but] I thought each site gets a handful of connections (4) to each host and more requests would have to wait to use one of them. That's pretty close to what I'd want with a reasonably fast connection.

    • That's basically right. Back when I made this, many servers out there still limited you to just 2 (or sometimes even 1) concurrent connections. As sites became more media-heavy that number trended up. HTTP/2 can handle many concurrent streams on one connection, I'm not sure if you get as fine-grained control as with the library I wrote (maybe!).

> If you start with the problem of how to create a reliable stream of data on top of an unreliable datagram layer, then the solution that comes out will look virtually identical to TCP.

I'll add that at the time of TCP's writing, the telephone people far outnumbered everyone else in the packet switching vs circuit switching debate. TCP gives you a virtual circuit over a packet-switched network, as a pair of reliable-enough independent byte streams over IP. This idea, that the endpoints could implement reliability through retransmission, came from an earlier French network, Cyclades, and it ends up being a core principle of IP networks.

  • We're still "suffering" from the latency and jitter effects of the packet switching victory. (The debate happened before my time and I don't know if I would have really agreed with circuit switching.) Latency and jitter on the modern Internet are very much best effort, emphasis on "effort".

    • True, but with circuit switching, we'd probably still be paying by the minute, so most of these jittery/bufferbloated connections would not exist in the first place.


    • As someone who at one point was working with people who were trying to keep an ATM network reliable, there is a reason packet switching won.

  • The telephone people were basically right with their criticisms of TCP/IP such as:

    What about QoS? Jitter, bandwidth, latency, fairness guarantees? What about queuing delay? What about multiplexing and tunneling? Traffic shaping and engineering? What about long-haul performance? Easy integration with optical circuit networks? etc. ATM addressed these issues, but TCP/IP did not.

    All of these things showed up again once you tried to do VOIP and video conferencing, and in core ISPs as well as access networks, and they weren't (and in many cases still aren't) easy to solve.

    • If that is true, then why did the telcos rapidly move the entire backbone of the telephone network to IP in the 1990s?

      And why are they trying to persuade regulators to let them get rid of the remaining (peripheral) part of the old circuit-switched network, i.e., to phase out old-school telephone hardware, requiring all customers to have IP phone hardware?


TCP has another unfixable flaw: it cannot be properly secured. Writing a security layer on top of TCP can at most detect, not avoid, attacks.

It is very easy for a malicious actor anywhere in the network to inject data into a connection. By contrast, it is much harder for a malicious actor to break the legitimate traffic flow ... except for the fact that TCP RST grants any rando the power to upgrade "inject" to "break". This is quite common in the wild for any traffic that does not look like HTTP, even when both endpoints are perfectly healthy.

Blocking TCP RST packets with your firewall will significantly improve reliability, but this still does not protect you from more advanced attackers, who can cause a desynchronization by forging sequence numbers with a nonempty payload.

As a result, it is mandatory for every application to support a full-blown "resume on a separate connection" operation, which is complicated and hairy and also immediately runs into the additional flaw that TCP is very slow to start.

---

While not an outright flaw, I also think it has become clear by now that it is highly suboptimal for "address" and "port" to be separate notions.

  • > While not an outright flaw, I also think it has become clear by now that it is highly suboptimal for "address" and "port" to be separate notions.

    If we fixed that issue, what would be different? The operating system still needs to assign different addresses to different sockets. So now... we have 48-bit addresses, and instead of reserving one "address" and giving each socket a 16-bit port, we reserve a /32 block of 65536 addresses and give each socket its own address with a unique 16-bit suffix?

"... some applications want multiple streams that don't block each other. You could use multiple TCP connections, but that adds its own overhead, so SCTP and QUIC were designed to address those issues."

Other applications work just fine with a single TCP connection

If I am using TCP for DNS, for example, and I am retrieving data from a single host such as a DNS cache, I can send multiple queries over a single TCP connection and receive multiple responses over the same single TCP connection, out of order. No blocking.^1 If the cache (application) supports it, this is much faster than receiving answers sequentially, and it's more efficient and polite than opening multiple TCP connections.

1. I do this every day outside the browser with DNS over TLS (DoT), using something like streamtcp from NLnet Labs. I'm not sure that QUIC is faster; server support for QUIC is much more limited, but QUIC may have other advantages.

I also do it with DNS over HTTPS (DoH), outside the browser, using HTTP/1.1 pipelining, but there I receive answers sequentially. I'm still not convinced that HTTP/2 is faster for this particular use case, i.e., downloading data from a single host using multiple HTTP requests (as opposed to something like integrating online advertising into websites).
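
To make the pipelining above concrete: RFC 1035 frames each DNS message over TCP with a 2-byte length prefix, so nothing stops a client from writing several queries before reading any responses. A hedged sketch (192.0.2.53 is a placeholder resolver; a real DoT client would additionally wrap the socket in TLS on port 853):

```python
# Sketch: pipeline two DNS queries over one TCP connection, using the RFC 1035
# framing where each message is preceded by a 2-byte length.
import secrets
import socket
import struct

def build_query(name: str) -> bytes:
    header = struct.pack("!HHHHHH", secrets.randbits(16), 0x0100, 1, 0, 0, 0)  # RD=1, QDCOUNT=1
    qname = b"".join(bytes([len(p)]) + p.encode() for p in name.split(".")) + b"\x00"
    return header + qname + struct.pack("!HH", 1, 1)  # QTYPE=A, QCLASS=IN

def recv_exact(sock: socket.socket, n: int) -> bytes:
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("connection closed")
        buf += chunk
    return buf

with socket.create_connection(("192.0.2.53", 53)) as s:
    for q in (build_query("example.com"), build_query("example.org")):
        s.sendall(struct.pack("!H", len(q)) + q)        # write both queries up front
    for _ in range(2):                                  # answers may arrive in either order
        (length,) = struct.unpack("!H", recv_exact(s, 2))
        print(recv_exact(s, length)[:2].hex())          # transaction ID tells the answers apart
```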

  • > I can send multiple queries over a single TCP connection and receive multiple responses over the same single TCP connection, out of order.

    This is because DoT allows the DNS server to resolve queries concurrently and send query responses out of order.

    However, this is an application layer feature, not a transport layer one. The underlying TCP stream still has to be delivered in order, and is therefore subject to head-of-line blocking.

  • > I can send multiple queries over a single TCP connection and receive multiple responses over the same single TCP connection, out of order. No blocking.

    You're missing the point. You have one TCP connection, and the server sends you response1 and then response2. Now if response1 gets lost or delayed due to network conditions, you must wait for response1 to be retransmitted before you can read response2. That is blocking, no way around it. It has nothing to do with advertising(?), and the other protocols mentioned don't have this drawback.

    • I work on an application that does a lot of high frequency networking in a TCP-like custom framework. Our protocol guarantees ordering per “channel”, so you can send request1 on channel 1 and request2 on channel 2 and receive the responses in any order. (But if you send request 1 and then request 2 on the same channel, you’ll get them back in order.)

      It’s a trade-off, and there’s a surprising amount of application code involved on the receiving side, waiting for state to be updated on both channels. I definitely prefer it, but it’s not without its tradeoffs.


Yeah, the fact that the congestion control algorithm isn’t part of the wire protocol was very ahead of its time and gave the protocol flexibility that’s much needed in retrospect. OTOH, a lot of college courses about TCP don’t really emphasize this fact, and many people I’ve interacted with still thought that TCP had a single defined congestion control algorithm.
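
On Linux that pluggability is even visible per socket: you can request a specific congestion control module with the TCP_CONGESTION socket option. A small sketch (whether bbr or any other module is available depends on the kernel configuration):

```python
# Sketch: select a congestion control algorithm per socket on Linux via the
# TCP_CONGESTION option (value 13). Only works if the kernel has the requested
# module (e.g. "bbr") available; otherwise the kernel default stays in effect.
import socket

TCP_CONGESTION = getattr(socket, "TCP_CONGESTION", 13)  # Linux constant

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    s.setsockopt(socket.IPPROTO_TCP, TCP_CONGESTION, b"bbr")
    algo = s.getsockopt(socket.IPPROTO_TCP, TCP_CONGESTION, 16)
    print("congestion control:", algo.split(b"\x00", 1)[0].decode())
except OSError:
    print("bbr module not available; the kernel default stays in effect")
```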

> how to create a reliable stream of data on top of an unreliable datagram layer, then the solution that comes out will look virtually identical to TCP. It just is the right solution for the job

A stream of bytes made sense in the 1970s for remote terminal emulation. It still sort of makes sense for email, where a partial message is useful (though downloading headers in bulk followed by full message on demand probably makes more sense.)

But in 2025 much of communication involves messages that aren't useful if you only get part of them. It's also a pain to have to serialize messages into a byte stream and then deserialize the byte stream back into messages (see: gRPC etc.), and the byte stream ordering is costly, doesn't work well with multipathing, and doesn't provide much benefit if you are only delivering complete messages.

TCP without congestion control isn't particularly useful. As you note, traditional TCP congestion control doesn't respond well to reordering. Also, TCP's congestion control traditionally doesn't distinguish between intentional packet drops (e.g. due to buffer overflow) and packet loss due to corruption. This means, for example, that it can't be used directly over networks with wireless links (which is why Wi-Fi has its own link layer retransmission).

TCP's traditional congestion control is designed to fill buffers up until packets are dropped, leading to undesirable buffer bloat issues.

TCP's traditional congestion control algorithms (additive increase/multiplicative decrease on drop) also have the poor property that your data rate tends to drop as RTT increases.
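
For intuition, the well-known Mathis et al. approximation for Reno-style throughput makes that RTT dependence explicit (p is the packet loss probability):

    throughput ≈ (MSS / RTT) · sqrt(3 / (2p))

so at a given loss rate, doubling the RTT roughly halves the achievable rate.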

TCP wasn't designed for hardware offload, which can lead to software bottlenecks and/or increased complexity when you do try to offload it to hardware.

TCP's three-way handshake is costly for one-shot RPCs, and slow start means that short flows may never make it out of slow start, neutralizing benefits from high-speed networks.

TCP is also poor for mobility. A connection breaks when your IP address changes, and there is no easy way to migrate it. Most TCP APIs expose IP addresses at the application layer, which causes additional brittleness.

Additionally, TCP is poorly suited for optical/WDM networks, which support dedicated bandwidth (signal/channel bandwidth as well as data rate), and are becoming more important in datacenters and as interconnects for GPU clusters.

etc.

> If you start with the problem of how to create a reliable stream of data on top of an unreliable datagram layer

> poor handling of missing packets

so it was poor at the exact thing it was designed for?

  • Poor for high speed connections (*) or very unreliable connections.

    (*) compared to when TCP was invented.

    When I started at university, the FTP speed from the US during daytime was 500 bytes per second! You don't have many unacknowledged packets in such a connection.

    Back then even a 1 megabits/sec connection was super high speed and very expensive.

Might be obvious in hindsight, but it was not at all clear back then that congestion is manageable this way. There were legitimate concerns that it would all just melt down.

There are a lot of design alternatives possible to TCP within the "create a reliable stream of data on top of an unreliable datagram layer" space:

• Full-duplex connections are probably a good idea, but certainly are not the only way, or the most obvious way, to create a reliable stream of data on top of an unreliable datagram layer. TCP's predecessor NCP was half-duplex.

• TCP itself also supports a half-duplex mode—even if one end sends FIN, the other end can keep transmitting as long as it wants. This was probably also a good idea, but it's certainly not the only obvious choice.

• Sequence numbers on messages or on bytes?

• Wouldn't it be useful to expose message boundaries to applications, the way 9P, SCTP, and some SNA protocols do?

• If you expose message boundaries to applications, maybe you'd also want to include a message type field? Protocol-level message-type fields have been found to be very useful in Ethernet and IP, and in a sense the port-number field in UDP is also a message-type field.

• Do you really need urgent data?

• Do servers need different port numbers? TCPMUX is a straightforward way of giving your servers port names, like in CHAOSNET, instead of port numbers. It only creates extra overhead at connection-opening time, assuming you have the moral equivalent of file descriptor passing on your OS. The only limitation is that you have to use different client ports for multiple simultaneous connections to the same server host. But in TCP everyone uses different client ports for different connections anyway. TCPMUX itself incurs an extra round-trip time delay for connection establishment, because the requested server name can't be transmitted until the client's ACK packet, but if you incorporated it into TCP, you'd put the server name in the SYN packet. If you eliminate the server port number in every TCP header, you can expand the client port number to 24 or even 32 bits.

• Alternatively, maybe network addresses should be assigned to server processes, as in Appletalk (or IP-based virtual hosting before HTTP/1.1's Host: header, or, for TLS, before SNI became widespread), rather than assigning network addresses to hosts and requiring port numbers or TCPMUX to distinguish multiple servers on the same host?

• Probably SACK was actually a good idea and should have always been the default? SACK gets a lot easier if you ack message numbers instead of byte numbers.

• Why is acknowledgement reneging allowed in TCP? That was a terrible idea.

• It turns out that measuring round-trip time is really important for retransmission, and TCP has no way of measuring RTT on retransmitted packets, which can pose real problems for correcting a ridiculously low RTT estimate, which results in excessive retransmission.

• Do you really need a PUSH bit? C'mon.

• A modest amount of overhead in the form of erasure-coding bits would permit recovery from modest amounts of packet loss without incurring retransmission timeouts, which is especially useful if your TCP-layer protocol requires a modest amount of packet loss for congestion control, as TCP does (see the sketch after this list).

• Also you could use a "congestion experienced" bit instead of packet loss to detect congestion in the usual case. (TCP did eventually acquire CWR and ECE, but not for many years.)

• The fact that you can't resume a TCP connection from a different IP address, the way you can with a Mosh connection, is a serious flaw that seriously impedes nodes from moving around the network.

• TCP's hardcoded timeout of 5 minutes is also a major flaw. Wouldn't it be better if the application could set that to 1 hour, 90 minutes, 12 hours, or a week, to handle intermittent connectivity, such as with communication satellites? Similarly for very-long-latency datagrams, such as those relayed by single LEO satellites. Together this and the previous flaw have resulted in TCP largely being replaced for its original session-management purpose with new ad-hoc protocols such as HTTP magic cookies, protocols which use TCP, if at all, merely as a reliable datagram protocol.

• Initial sequence numbers turn out not to be a very good defense against IP spoofing, because that wasn't their original purpose. Their original purpose was preventing the erroneous reception of leftover TCP segments from a previous incarnation of the connection that have been bouncing around routers ever since; this purpose would be better served by using a different client port number for each new connection. The ISN namespace is far too small for current LFNs anyway, so we had to patch over the hole in TCP with timestamps and PAWS.
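
To illustrate the erasure-coding bullet above with the simplest possible scheme: send k data packets plus one parity packet that is the XOR of the group, and the receiver can rebuild any single lost packet without waiting for a retransmission. This is only a sketch (real schemes such as Reed-Solomon or RaptorQ tolerate more loss per group, and a real protocol would also carry the original packet lengths):

```python
# Minimal erasure-coding sketch: one XOR parity packet per group of k packets
# lets the receiver rebuild any single lost packet without a retransmission.
from functools import reduce

def xor_parity(packets: list[bytes]) -> bytes:
    size = max(len(p) for p in packets)
    padded = [p.ljust(size, b"\x00") for p in packets]
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), padded)

def recover(received: dict[int, bytes], parity: bytes, k: int) -> dict[int, bytes]:
    """Rebuild a single missing packet (index 0..k-1) from the others plus parity."""
    missing = [i for i in range(k) if i not in received]
    if len(missing) == 1:
        received[missing[0]] = xor_parity(list(received.values()) + [parity])
    return received

group = [b"pkt-one", b"pkt-two", b"pkt-three"]
parity = xor_parity(group)
got = {0: group[0], 2: group[2]}             # packet 1 was lost in transit
print(recover(got, parity, k=3)[1])          # b'pkt-two\x00\x00' (padded to group size)
```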

  • • Full-duplex connections are probably a good idea, but certainly are not the only way, or the most obvious way, to create a reliable stream of data on top of an unreliable datagram layer. TCP itself also supports a half-duplex mode—even if one end sends FIN, the other end can keep transmitting as long as it wants. This was probably also a good idea, but it's certainly not the only obvious choice.

    Much of that comes from the original applications being FTP and TELNET.

    • Sequence numbers on messages or on bytes?

    Bytes, because the whole TCP message might not fit in an IP packet. This is the MTU problem.

    • Wouldn't it be useful to expose message boundaries to applications, the way 9P, SCTP, and some SNA protocols do?

    Early on, there were some message-oriented, rather than stream-oriented, protocols on top of IP. Most of them died out. RDP was one such. Another was QNet.[2] Both still have assigned IP protocol numbers, but I doubt that an RDP packet would get very far across today's internet.

    This was a lack. TCP is not a great message-oriented protocol.

    • Do you really need urgent data?

    The purpose of urgent data is so that when your slow Teletype is typing away, and the recipient wants it to stop, there's a way to break in. See [1], p. 8.

    • It turns out that measuring round-trip time is really important for retransmission, and TCP has no way of measuring RTT on retransmitted packets, which can pose real problems for correcting a ridiculously low RTT estimate, which results in excessive retransmission.

    Yes, reliable RTT is a problem.

    • Do you really need a PUSH bit? C'mon.

    It's another legacy thing to make TELNET work on slow links. Is it even supported any more?

    • A modest amount of overhead in the form of erasure-coding bits would permit recovery from modest amounts of packet loss without incurring retransmission timeouts, which is especially useful if your TCP-layer protocol requires a modest amount of packet loss for congestion control, as TCP does.

    • Also you could use a "congestion experienced" bit instead of packet loss to detect congestion in the usual case. (TCP did eventually acquire CWR and ECE, but not for many years.)

    Originally, there was ICMP Source Quench for that, but Berkeley didn't put it in BSD, so nobody used it. Nobody was sure when to send it or what to do when it was received.

    • The fact that you can't resume a TCP connection from a different IP address, the way you can with a Mosh connection, is a serious flaw that seriously impedes nodes from moving around the network.

    That would require a security system to prevent hijacking sessions.

    [1] https://archive.org/stream/rfc854/rfc854.txt_djvu.txt

    [2] https://en.wikipedia.org/wiki/List_of_IP_protocol_numbers

  • AppleTalk didn't get much love for its broadcast- (or possibly multicast-) based service discovery protocol, but of course that is what inspired mDNS. I believe AppleTalk's LAN addresses were always dynamic (like 169.254.x.x link-local IP addresses), simplifying administration and deployment.

    I tend to think that one of the reasons Linux containers are needed for network services is that DNS traditionally only returns an IP address (rather than address + port), so each service process needs to have its own IP address, which in Linux requires a container or at least a network namespace.

    AppleTalk also supported a reliable transaction (basically request-response RPC) protocol (ATP) and a session protocol, which I believe were used for Mac network services (printing, file servers, etc.) Certainly easier than serializing/deserializing byte streams.

    • Does "session protocol" mean that it provided packet retransmission and reordering, like TCP? How does that save you serializing and deserializing byte streams?

      I agree that, given the existing design of IP and TCP, you could get much of the benefit of first-class addresses for services by using, for example, DNS-SD, and that is what ZeroConf does. (It is not a coincidence that the DNS-SD RFC was written by a couple of Apple employees.) But, if that's the way you're going to be finding endpoints to initiate connections to, there's no benefit to having separate port numbers and IP addresses. And IP addresses are far scarcer than just requiring a Linux container or a network namespace: there are only 2³² of them. But it is rare to find an IP address that is listening on more than 64 of its 2¹⁶ TCP ports, so in an alternate history where you moved those 16 bits from the port number to the IP address, we would have one thousandth of the IP-address crunch that we do.

      Historically, possibly the reason that it wasn't done this way is that port numbers predated the DNS by about 10 years.


  • > The fact that you can't resume a TCP connection from a different IP address, the way you can with a Mosh connection, is a serious flaw that seriously impedes nodes from moving around the network

    This 100%!! And it's basically the reason mosh had to be created in the first place (and it probably wasn't easy). Unfortunately mosh only solves the problem for ssh. Exposing fixed IP addresses to the application layer probably doesn't help either.

    So annoying that TCP tends to break whenever you switch Wi-Fi networks or switch from Wi-Fi to cellular. (On iPhones at least you have MPTCP, but that requires server-side support.)

I was excited about SCTP over 10 years ago, but getting it to work was hard.

The Linux kernel supports it, but at least when I tried this, those modules were disabled on most distros.
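
For anyone curious whether that is still the case on a given machine, a quick probe is simply to try creating an SCTP socket; on Linux this may auto-load the sctp module, and an error usually means the module is missing or blacklisted. A sketch:

```python
# Quick probe: does this kernel let us create an SCTP socket?
# (One-to-one style SCTP socket shown; an error usually means the sctp
# module is absent or blacklisted on this system.)
import socket

try:
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM, socket.IPPROTO_SCTP)
except (AttributeError, OSError) as e:
    print("no SCTP support here:", e)
else:
    print("SCTP socket created:", s)
    s.close()
```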