Comment by metadat
17 days ago
The key takeaway is hidden in the middle:
> In extreme cases, on purely CPU bound benchmarks, we’re seeing a jump from < 1Gbit/s to 4 Gbit/s. Looking at CPU flamegraphs, the majority of CPU time is now spent in I/O system calls and cryptography code.
A more than 4x increase in throughput, which should translate to a proportionate reduction in CPU utilization per byte for UDP network activity. That's pretty cool, especially for better power efficiency on portable clients (mobile and notebook).
I found this presentation refreshing. Too often, claims about transitioning to "modern" stacks are treated as inherently good and don't come with the data to back them up.
Any guesses on whether they have other cases where they got more than 4 Gbps but weren't CPU bound, or was this the fastest they got?
_Author here_.
4 Gbit/s is on our rather dated benchmark machines. If you run the command below on a modern laptop, you'll likely reach higher throughput. (Consider disabling PMTUD to use a realistic Internet-like MTU. We do the same on our benchmark machines.)
https://github.com/mozilla/neqo
cargo bench --features bench --bench main -- "Download"
I wonder if we'll ever see hardware-accelerated cross-context message passing for user and system programs.
Shared ring buffers for I/O exist in Linux, but I don't think we'll ever see them extend to DMA for the NIC, given the security rearchitecture that would require. Though if the NIC is smart enough and the rules simple, maybe.
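For reference, a minimal sketch of that shared-ring model with liburing (io_uring), assuming liburing is installed. The submission and completion rings are mapped into both user space and the kernel, but note the packet payload is still copied out of kernel socket buffers, which is the part that would need the security rearchitecture. Illustrative only:

```c
#include <liburing.h>
#include <stdio.h>
#include <sys/socket.h>

int main(void) {
    struct io_uring ring;
    if (io_uring_queue_init(64, &ring, 0) < 0)   /* 64-entry SQ/CQ rings shared with the kernel */
        return 1;

    int fds[2];
    socketpair(AF_UNIX, SOCK_STREAM, 0, fds);    /* stand-in for a real network socket */

    char buf[64];
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_recv(sqe, fds[0], buf, sizeof(buf), 0);  /* queue a recv on the submission ring */

    send(fds[1], "ping", 4, 0);                  /* produce some data to receive */
    io_uring_submit(&ring);                      /* hand queued SQ entries to the kernel */

    struct io_uring_cqe *cqe;
    io_uring_wait_cqe(&ring, &cqe);              /* wait on the completion ring */
    printf("recv completed with %d bytes\n", cqe->res);
    io_uring_cqe_seen(&ring, cqe);

    io_uring_queue_exit(&ring);
    return 0;
}
```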
There are systems that move NIC control to user space entirely. For example, Snabb has an Intel 10G Ethernet controller driver that appears to use a ring buffer in DMA memory.
https://github.com/snabbco/snabb/blob/master/src/apps/intel/...
RDMA offers that. The NIC can directly access user-space buffers. It does require that the buffers are "registered" first, but applications usually aim to do that once up front.
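Roughly what that one-time registration looks like with libibverbs (illustrative sketch; device selection and error handling are trimmed, and the access flags are just examples):

```c
#include <infiniband/verbs.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    int num;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs || num == 0) { fprintf(stderr, "no RDMA devices\n"); return 1; }

    struct ibv_context *ctx = ibv_open_device(devs[0]);
    struct ibv_pd *pd = ibv_alloc_pd(ctx);          /* protection domain */

    size_t len = 1 << 20;
    void *buf = malloc(len);                        /* ordinary user-space buffer */

    /* One-time registration: pins the pages and gives the NIC the mapping,
     * so later transfers can DMA into/out of buf with no further setup. */
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_WRITE |
                                   IBV_ACCESS_REMOTE_READ);
    printf("registered %zu bytes, lkey=0x%x rkey=0x%x\n", len, mr->lkey, mr->rkey);

    /* ...post work requests that reference mr->lkey / mr->rkey... */

    ibv_dereg_mr(mr);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    free(buf);
    return 0;
}
```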
There is AMD's Onload: https://github.com/Xilinx-CNS/onload. It works with Solarflare and Xilinx NICs, but also has generic NIC support via AF_XDP.
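Not Onload's code, but a generic sketch of what AF_XDP socket setup looks like with the xsk helpers (assuming libxdp, or older libbpf with <bpf/xsk.h>; "eth0" and queue 0 are placeholders, and it needs root/CAP_NET_ADMIN). The UMEM is plain user-space memory, and with a supporting driver the NIC DMAs frames straight into it:

```c
#include <stdlib.h>
#include <unistd.h>
#include <xdp/xsk.h>

#define NUM_FRAMES 4096
#define FRAME_SIZE XSK_UMEM__DEFAULT_FRAME_SIZE

int main(void) {
    void *bufs;
    struct xsk_umem *umem;
    struct xsk_ring_prod fq;     /* fill ring: frames lent to the kernel for RX */
    struct xsk_ring_cons cq;     /* completion ring: finished TX frames */

    posix_memalign(&bufs, getpagesize(), (size_t)NUM_FRAMES * FRAME_SIZE);
    if (xsk_umem__create(&umem, bufs, (size_t)NUM_FRAMES * FRAME_SIZE,
                         &fq, &cq, NULL))
        return 1;

    struct xsk_socket *xsk;
    struct xsk_ring_cons rx;     /* descriptors of received frames */
    struct xsk_ring_prod tx;     /* descriptors of frames to transmit */
    if (xsk_socket__create(&xsk, "eth0", 0 /* queue id */, umem,
                           &rx, &tx, NULL))
        return 1;

    /* ...post frames to fq, poll rx, reuse frames without copies... */

    xsk_socket__delete(xsk);
    xsk_umem__delete(umem);
    free(bufs);
    return 0;
}
```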
Sure, but what about some kind of generalized cross-context IPC primitive: a zero-copy messaging mechanism for high-performance multiprocessing microkernels?