Comment by Veserv

17 days ago

While their improvements are real and necessary for actual high speed (100 Gb/s and up), 4 Gb/s is not fast. That is only 500 MB/s. Something somewhere, likely not in their code, is terribly slow. I will explain.

As the author cited, kernel context switch is only on the order of 1 us (which seems too high for a system call anyways). You can reach 500 MB/s even if you still call sendmsg() on literally every packet as long as you average ~500 bytes/packet which is ~1/3 of the standard 1500 bytes MTU. So if you average MTU sized packets, you get 2 us of processing in addition to a full system call to reach 4 Gb/s.

The old number of 1 Gb/s could be reached with a average of ~125 bytes/packet, ~1/12 of the MTU or ~11 us of processing.

“But there are also memory copies in the network stack.” A trivial 3 instruction memory copy will go ~10-20 GB/s, 80–160 Gb/s. In 2 us you can drive 20-40 KB of copies. You are arguing the network stack does 40-80(!) copies to put a UDP packet, a thin veneer over a literal packet, into a packet. I have written commercial network drivers. Even without zero-copy, with direct access you can shovel UDP packets into the NIC buffers at basically memory copy speeds.

“But encryption is slow.” Not that slow. Here is some AES-128 GCM performance done what looks like over 5 years ago. [1] The Intel i5-6500, a midline processor from 8 years ago, averages 1729 MB/s. It can do the encryption for a 500 byte packet in 300 ns, 1/6 of the remaining 2 us budget. Modern processors seem to be closer to 3-5 GB/s per core, or about 25-40 Gb/s, 6-10x the stated UDP throughput.

[1] https://calomel.org/aesni_ssl_performance.html

29 comments

Veserv

raggi 17 days ago

> which seems too high for a system call anyways

spectre & meltdown.

> you get 2 us of processing in addition to a full system call to reach 4 Gb/s

TCP has route binding, UDP does not (connect(2) helps one side, but not both sides).

> “But encryption is slow.” Not that slow.

Encryption _is slow_ for small PDUs, at least the common constructions we're currently using. Everyone's essentially been optimizing for and benchmarking TCP with large frames.

If you hot loop the state as the micro-benchmarks do you can do better, but you still see a very visible cost of state setup that only starts to amortize decently well above 1024 byte payloads. Eradicate a bunch of cache efficiency by removing the tightness of the loop and this amortization boundary shifts quite far to the right, up into tens of kilobytes.

---

All of the above, plus the additional framing overheads come into play. Hell even the OOB data blocks are quite expensive to actually validate, it's not a good API to fix this problem, it's just the API we have shoved over bsd sockets.

And we haven't even gotten to buffer constraints and contention yet, but the default UDP buffer memory available on most systems is woefully inadequate for these use cases today. TCP buffers were scaled over time, but UDP buffers basically never were, they're still conservative values from the late 90s/00s really.

The API we really need for this kind of UDP setup is one where you can do something like fork the fd, connect(2) it with a full route bind, and then fix the RSS/XSS challenges that come from this splitting. After that we need a submission queue API rather than another bsd sockets ioctl style mess (uring, rio, etc). Sadly none of this is portable.

On the crypto side there are KDF approaches which can remove a lot of the state cost involved, it's not popular but some vendors are very taken with PSP for this reason - but PSP becoming more well known or used was largely suppressed by its various rejections in the ietf and in linux. Vendors doing scale tests with it have clear numbers though, under high concurrency you can scale this much better than the common tls or tls like constructions.

ori_b 17 days ago
> spectre & meltdown.
I just measured. On my Ryzen 7 9700X, with Linux 6.12, it's about 50ns to call syscall(__NR_gettimeofday). Even post-spectre, entering the kernel isn't so expensive.
- whatevaa 17 days ago
  
  Are you sure that system call actually enters the kernel mode? It might be one of the special ones where kernel serves it from user space, forgot their name.
  
  4 replies →
- menaerus 17 days ago
  
  If it isn't a vDSO call, I think 50ns figure shouldn't be possible.
  
  17 replies →
Veserv 17 days ago
I think you are just agreeing with me?
You are basically saying: “It is slow because of all these system/protocol decisions that mismatch what you need to get high performance out of the primitives.”
Which is my point. They are leaving, by my estimation, 10-20x performance on the floor due to external factors. They might be “fast given that they are bottlenecked by low performance systems”, which is good as their piece is not the bottleneck, but they are not objectively “fast” as the primitives can be configured to solve a substantially similar problem dramatically faster if integrated correctly.
- raggi 17 days ago
  
  > I think you are just agreeing with me?
  sure, i mean i have no goal of alignment or misalignment, i'm just trying to provide more insights into what's going on based on my observations of this from having also worked on this udp path.
  > Which is my point. They are leaving, by my estimation, 10-20x performance on the floor due to external factors. They might be “fast given that they are bottlenecked by low performance systems”, which is good as their piece is not the bottleneck, but they are not objectively “fast” as the primitives can be configured to solve a substantially similar problem dramatically faster if integrated correctly.
  yes, though this basically means we're talking about throwing out chunks of the os, the crypto design, the protocol, and a whole lot of tuning at each layer.
  the only vendor in a good position to do this is apple (being the only vendor that owns every involved layer in a single product chain), and they're failing to do so as well.
  the alternative is a long old road, where folks make articles like this from time to time, we share our experiences and hope that someone is inspired enough reading it to be sniped into making incremental progress. it'd be truly fantastic if we sniped a group with the vigor and drive that the mptcp folks seem to have, as they've managed to do an unusually broad and deep push across a similar set of layered challenges (though still in progress).

vlovich123 17 days ago

There is no indication what class the CPU they're benchmarking on. Additionally, this is presumably including the overhead of managing the QUIC protocol as well given they mention encryption which isn't relevant for raw UDP. And QUIC is known to not have a good story of NIC offload for encryption at the moment the way you can do kTLS offload for TCP streams.

Veserv 17 days ago

Encryption is unlikely to be relevant. As I pointed out, doing it on any modern desktop CPU with no offload gets you 25-40 Gb/s, 6-10x faster than the benchmarked throughput. It is not the bottleneck unless it is being done horribly wrong or they do not have access to AES instructions.
“It is slow because it is being layered over QUIC.” Then why did you layer over a bottleneck that slows you down by 25x. Second of all, they did not used to do that and they still only got 1 Gb/s previously which is abysmal.
Third of all, you can achieve QUIC feature parity (minus encryption which will be your per-core bottleneck) at 50-100 Gb/s per core, so even that is just a function of using a slow protocol.
Finally, CPU class used in benchmarking is largely irrelevant because I am discussing 20x per-core performance bottlenecks. You would need to be benchmarking on a desktop CPU from 25 years ago to get that degree of single-core performance difference. We are talking iPhone 6, a decade old phone, territory for a efficient implementation to bottleneck on the processor at just 4 Gb/s.
But again, it is probably not a problem with their code. It is likely something else stupid happening on the network stack or protocol side of which they are merely a client.