Fast UDP I/O for Firefox in Rust

4 months ago (max-inden.de)

118 comments

Bender

The key takeaway is hidden in the middle:

> In extreme cases, on purely CPU bound benchmarks, we’re seeing a jump from < 1Gbit/s to 4 Gbit/s. Looking at CPU flamegraphs, the majority of CPU time is now spent in I/O system calls and cryptography code.

That is a 4x increase in throughput, which should translate to a proportionate reduction in CPU utilization for UDP network activity. That's pretty cool, especially for better power efficiency on portable clients (mobile and notebook).

I found this presentation refreshing. Too often, claims about transition to "modern" stacks are treated as being inherently good and do not come with the data to back it up.

fulafel 4 months ago
Any guesses on whether they have other cases where they got more than 4 Gbps but weren't CPU bound, or was this the fastest they got?
- mxinden 4 months ago
  
  _Author here_.
  4 Gbit/s is on our rather dated benchmark machines. If you run the below command on a modern laptop, you likely reach higher throughput. (Consider disabling PMTUD to use a realistic Internet-like MTU. We do the same on our benchmark machines.)
  https://github.com/mozilla/neqo
  cargo bench --features bench --bench main -- "Download"
a-dub 4 months ago
i wonder if we'll ever see hardware accelerated cross-context message passing for user and system programs.
- wbl 4 months ago
  
  Shared ring buffers for IO exist in Linux. I don't think we'll ever see them extend to DMA for the NIC, due to the rearchitecture of security that would require. However, if the NIC is smart enough and the rules are simple, maybe.
  

Veserv 4 months ago

While their improvements are real and necessary for actual high speed (100 Gb/s and up), 4 Gb/s is not fast. That is only 500 MB/s. Something somewhere, likely not in their code, is terribly slow. I will explain.

As the author cited, kernel context switch is only on the order of 1 us (which seems too high for a system call anyways). You can reach 500 MB/s even if you still call sendmsg() on literally every packet as long as you average ~500 bytes/packet which is ~1/3 of the standard 1500 bytes MTU. So if you average MTU sized packets, you get 2 us of processing in addition to a full system call to reach 4 Gb/s.

The old number of 1 Gb/s could be reached with an average of ~125 bytes/packet, ~1/12 of the MTU, or with ~11 us of processing per MTU-sized packet.
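The arithmetic in the two paragraphs above can be checked with a short sketch. It assumes, as the comment does, one system call per packet costing roughly 1 us; the function names are made up for illustration.

```python
def per_packet_budget_us(throughput_gbit_s, packet_bytes, syscall_us=1.0):
    """Microseconds left per packet after paying one syscall per packet."""
    bytes_per_second = throughput_gbit_s * 1e9 / 8
    packets_per_second = bytes_per_second / packet_bytes
    time_per_packet_us = 1e6 / packets_per_second
    return time_per_packet_us - syscall_us


def break_even_packet_bytes(throughput_gbit_s, syscall_us=1.0):
    """Average packet size at which the syscall alone uses the whole budget."""
    bytes_per_second = throughput_gbit_s * 1e9 / 8
    return bytes_per_second * syscall_us / 1e6


for gbit in (1, 4):
    print(
        f"{gbit} Gb/s: break-even {break_even_packet_bytes(gbit):.0f} B/packet, "
        f"{per_packet_budget_us(gbit, 1500):.1f} us spare per 1500 B packet"
    )
```

With these assumptions, 4 Gb/s leaves about 2 us per MTU-sized packet and 1 Gb/s about 11 us, matching the figures quoted in the comment.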

“But there are also memory copies in the network stack.” A trivial 3 instruction memory copy will go ~10-20 GB/s, 80–160 Gb/s. In 2 us you can drive 20-40 KB of copies. You are arguing the network stack does 40-80(!) copies to put a UDP packet, a thin veneer over a literal packet, into a packet. I have written commercial network drivers. Even without zero-copy, with direct access you can shovel UDP packets into the NIC buffers at basically memory copy speeds.

“But encryption is slow.” Not that slow. Here are some AES-128-GCM performance numbers measured what looks like over 5 years ago. [1] The Intel i5-6500, a midline processor from 8 years ago, averages 1729 MB/s. It can do the encryption for a 500 byte packet in 300 ns, 1/6 of the remaining 2 us budget. Modern processors seem to be closer to 3-5 GB/s per core, or about 25-40 Gb/s, 6-10x the stated UDP throughput.

[1] https://calomel.org/aesni_ssl_performance.html
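For anyone who wants to redo the copy and encryption estimates, here is a rough sketch. The rates are the ones quoted in the comment (10-20 GB/s memory copy, 1729 MB/s AES-GCM on the i5-6500); nothing is measured here, and the helper names are hypothetical.

```python
def encrypt_time_ns(packet_bytes, rate_bytes_per_s):
    """Time to encrypt one packet at a given bulk AES-GCM rate."""
    return packet_bytes / rate_bytes_per_s * 1e9


def copies_in_budget(budget_ns, copy_rate_bytes_per_s, packet_bytes):
    """Number of whole-packet copies that fit in the time budget."""
    bytes_copyable = (budget_ns * copy_rate_bytes_per_s) // 1_000_000_000
    return bytes_copyable // packet_bytes


# Quoted figure: 1729 MB/s for AES-128-GCM on an i5-6500.
print(f"encrypt 500 B: {encrypt_time_ns(500, 1729e6):.0f} ns")

# Quoted figure: a plain memory copy runs at 10-20 GB/s.
for rate in (10_000_000_000, 20_000_000_000):
    print(f"{rate // 10**9} GB/s: {copies_in_budget(2000, rate, 500)} copies of 500 B in 2 us")
```

Note that bulk rates ignore per-packet setup cost, which is exactly the objection raised in the reply below.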

raggi 4 months ago
> which seems too high for a system call anyways
spectre & meltdown.
> you get 2 us of processing in addition to a full system call to reach 4 Gb/s
TCP has route binding, UDP does not (connect(2) helps one side, but not both sides).
> “But encryption is slow.” Not that slow.
Encryption _is slow_ for small PDUs, at least the common constructions we're currently using. Everyone's essentially been optimizing for and benchmarking TCP with large frames.
If you hot loop the state as the micro-benchmarks do you can do better, but you still see a very visible cost of state setup that only starts to amortize decently well above 1024 byte payloads. Eradicate a bunch of cache efficiency by removing the tightness of the loop and this amortization boundary shifts quite far to the right, up into tens of kilobytes.
---
All of the above, plus the additional framing overheads come into play. Hell even the OOB data blocks are quite expensive to actually validate, it's not a good API to fix this problem, it's just the API we have shoved over bsd sockets.
And we haven't even gotten to buffer constraints and contention yet, but the default UDP buffer memory available on most systems is woefully inadequate for these use cases today. TCP buffers were scaled over time, but UDP buffers basically never were, they're still conservative values from the late 90s/00s really.
The API we really need for this kind of UDP setup is one where you can do something like fork the fd, connect(2) it with a full route bind, and then fix the RSS/XSS challenges that come from this splitting. After that we need a submission queue API rather than another bsd sockets ioctl style mess (uring, rio, etc). Sadly none of this is portable.
On the crypto side there are KDF approaches which can remove a lot of the state cost involved, it's not popular but some vendors are very taken with PSP for this reason - but PSP becoming more well known or used was largely suppressed by its various rejections in the ietf and in linux. Vendors doing scale tests with it have clear numbers though, under high concurrency you can scale this much better than the common tls or tls like constructions.
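To see what the buffer point looks like on a given machine, here is a small sketch that reads the default UDP socket buffer sizes and asks for a bigger receive buffer. It only uses the standard `SO_RCVBUF`/`SO_SNDBUF` socket options; the 4 MiB request is an arbitrary value picked for illustration, and what the kernel actually grants depends on system limits (on Linux, `net.core.rmem_max`).

```python
import socket


def default_udp_buffer_sizes():
    """Return (receive, send) buffer sizes of a fresh UDP socket, in bytes."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        rcv = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
        snd = sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)
    return rcv, snd


def request_receive_buffer(wanted_bytes):
    """Ask for a receive buffer size and return what the kernel granted."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, wanted_bytes)
        # The kernel may clamp the request (and Linux reports double the value).
        return sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)


rcv, snd = default_udp_buffer_sizes()
print(f"default SO_RCVBUF={rcv} B, SO_SNDBUF={snd} B")
print(f"asked for 4 MiB, got {request_receive_buffer(4 * 1024 * 1024)} B")
```

At 4 Gb/s (500 MB/s), a buffer of a few hundred kilobytes holds well under a millisecond of traffic, which gives a feel for why the defaults are called inadequate above.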
- ori_b 4 months ago
  
  > spectre & meltdown.
  I just measured. On my Ryzen 7 9700X, with Linux 6.12, it's about 50ns to call syscall(__NR_gettimeofday). Even post-spectre, entering the kernel isn't so expensive.
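A crude way to repeat this kind of measurement without writing C is sketched below. It times `os.getppid()`, which should enter the kernel on each call, but the result includes Python interpreter overhead, so treat it as a loose upper bound rather than a comparable number. The original measurement called `syscall(__NR_gettimeofday)` directly, presumably to bypass the vDSO fast path; this sketch does not reproduce that.

```python
import os
import time


def mean_call_ns(fn, iterations=200_000):
    """Average wall-clock time per call of fn, in nanoseconds."""
    start = time.perf_counter_ns()
    for _ in range(iterations):
        fn()
    elapsed = time.perf_counter_ns() - start
    return elapsed / iterations


# Upper bound only: loop and call overhead of the interpreter are included.
print(f"os.getppid(): ~{mean_call_ns(os.getppid):.0f} ns per call")
```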
  
- Veserv 4 months ago
  
  I think you are just agreeing with me?
  You are basically saying: “It is slow because of all these system/protocol decisions that mismatch what you need to get high performance out of the primitives.”
  Which is my point. They are leaving, by my estimation, 10-20x performance on the floor due to external factors. They might be “fast given that they are bottlenecked by low performance systems”, which is good as their piece is not the bottleneck, but they are not objectively “fast” as the primitives can be configured to solve a substantially similar problem dramatically faster if integrated correctly.
  
vlovich123 4 months ago
There is no indication of what class of CPU they're benchmarking on. Additionally, this presumably includes the overhead of managing the QUIC protocol as well, given they mention encryption, which isn't relevant for raw UDP. And QUIC is known to not have a good story for NIC encryption offload at the moment, the way you can do kTLS offload for TCP streams.
- Veserv 4 months ago
  
  Encryption is unlikely to be relevant. As I pointed out, doing it on any modern desktop CPU with no offload gets you 25-40 Gb/s, 6-10x faster than the benchmarked throughput. It is not the bottleneck unless it is being done horribly wrong or they do not have access to AES instructions.
  “It is slow because it is being layered over QUIC.” Then why did you layer over a bottleneck that slows you down by 25x? Second of all, they did not used to do that and they still only got 1 Gb/s previously, which is abysmal.
  Third of all, you can achieve QUIC feature parity (minus encryption which will be your per-core bottleneck) at 50-100 Gb/s per core, so even that is just a function of using a slow protocol.
  Finally, the CPU class used in benchmarking is largely irrelevant because I am discussing 20x per-core performance bottlenecks. You would need to be benchmarking on a desktop CPU from 25 years ago to get that degree of single-core performance difference. We are talking iPhone 6, a decade old phone, territory for an efficient implementation to bottleneck on the processor at just 4 Gb/s.
  But again, it is probably not a problem with their code. It is likely something else stupid happening on the network stack or protocol side of which they are merely a client.

philipallstar 4 months ago

I really liked this. All Mozilla content should be like this. Technical content written by literate engineers. No alegria.

znpy 4 months ago

It’s crazy that sendmmsg/recvmmsg are considered “modern”… I mean, they’ve been around for quite a while.

I was expecting to see io_uring mentioned somewhere in the linux section of the article.

Cloudef 4 months ago
io_uring doesn't really have an equivalent[1]. It can't batch multiple UDP datagrams; the best it can do is batch multiple sendmsgs and recvmsgs. GSO/GRO is the way to go. sendmmsg/recvmmsg are indeed very old, and some kernel devs wish they could sunset them :)
1: https://github.com/axboe/liburing/discussions/1346
- LtdJorge 4 months ago
  
  Will ZCRX help here? I’m not sure it supports UDP. It should provide great speed-ups but it requires hardware support which is very scarce for now.
  

jcranmer 4 months ago

> After many hours of back and forth with the reporter, luckily a Mozilla employee as well, I ended up buying the exact same laptop, same color, in a desperate attempt to reproduce the issue.

Glad to know that networking still produces insanity trying to reproduce issues à la https://xkcd.com/2259/.

3form 4 months ago
For that matter, a fun read in the "The map download struggle, part 2 (Technical)" section at https://www.factorio.com/blog/post/fff-176 (end of the document).
- Analemma_ 4 months ago
  
  Factorio's dev blog is a great deal of fun. It's on pause at the moment after the release of 2.0, but if you go through the archives there's great stuff in there. A lot of it is about optimizations which only matter once you're building 10,000+ SPM gigafactories, which casual players will never even come close to, but since crazy excess is practically what defines hardcore Factorio players it's cool to see the devs putting in the work to make the experience shine for their most devoted fans.
  
- bobmcnamara 4 months ago
  
  Could be related to UDP checksum offload.
  0x0000 is a special value for some NICs meaning please calculate for me.
  One NIC years ago would set 0xFFFF for a bad checksum. At first we thought this was horrifyingly broken. But really you can just fall back to software verification for the handful of packets, legitimate and bad, that arrive with that checksum.
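For context on why 0x0000 and 0xFFFF are special, here is a minimal sketch of the 16-bit one's-complement Internet checksum and of the software verification fallback mentioned above. It is a simplification: a real UDP checksum also covers a pseudo-header with the IP addresses, which is left out here, and the function names are made up for illustration.

```python
def ones_complement_sum16(data: bytes) -> int:
    """Fold data into a 16-bit one's-complement sum (big-endian words)."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:  # end-around carry
        total = (total & 0xFFFF) + (total >> 16)
    return total


def internet_checksum(data: bytes) -> int:
    """Checksum value a sender would compute over data."""
    return ~ones_complement_sum16(data) & 0xFFFF


def udp_checksum_field(computed: int) -> int:
    """RFC 768: a computed zero is sent as 0xFFFF; zero means 'no checksum'."""
    return 0xFFFF if computed == 0 else computed


def verifies(data_including_checksum: bytes) -> bool:
    """Software verification: the sum over data plus checksum is all ones."""
    return ones_complement_sum16(data_including_checksum) == 0xFFFF


payload = bytes.fromhex("0001f203f4f5f6f7")
checksum = internet_checksum(payload)
print(hex(checksum), verifies(payload + checksum.to_bytes(2, "big")))
```

Because a legitimately computed zero goes on the wire as 0xFFFF, a NIC that reuses 0xFFFF as an error marker creates an ambiguity that only a software check like `verifies` can resolve.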
Joel_Mckay 4 months ago

It is funnier if you've ever dealt with mystery packet runts, as most network appliances still do not handle them very cleanly.
UDP/QUIC can DoS any system not based on a cloud deployment large enough to soak up the peak traffic. It is silly, but it pushes out any hosting operation that can't reach a disproportionate bandwidth asymmetry with the client traffic. i.e. fine for FAANG, but a death knell for most other small/medium organizations.
This is why many LANs still drop most UDP traffic, and rate-limit the parts needed for normal traffic. Have a nice day =3

Too 4 months ago

Why are they supporting Android 5? It’s over 10 years old, the devices running it after updates even older. Mobile devices from that era must have a real tough time to browse the modern bloated web. It shouldn’t even be possible to publish to Play store when targeting such an old API level. Who is the user base? Hackers who refurbished their old OnePlus, run it with charger always plugged in, didn’t upgrade to a newer LineageOS, and installed an alternative App Store, just for the sake of it? While novel, it’s a steep price to pay, as we see here it is slowing down development for the rest of us.

mxinden 4 months ago

Note that I (author) made a mistake. We (Mozilla) recently raised the minimum Android version off of 5. See https://blog.mozilla.org/futurereleases/2025/09/15/raising-t... for details.

brycewray 4 months ago

https://bugzilla.mozilla.org/show_bug.cgi?id=1979683

Still seeing this in Firefox with Cloudflare-hosted sites on both macOS and Fedora.

mxinden 4 months ago
Author here. Thanks for raising this. I posted a comment. Maybe you can help us reproduce.
https://bugzilla.mozilla.org/show_bug.cgi?id=1979683#c3
- brycewray 4 months ago
  
  I was the one who filed the original webcompat issue :-) ...
  https://github.com/webcompat/web-bugs/issues/168913
  Although the form result made it sound like a macOS-only issue, I actually have observed (and continue to observe) it on both macOS and Fedora.
  EDIT: In the thread, am seeing the reference to how Firefox-on-QUIC works if one has IPv6. My ISP (Frontier FiOS) infamously doesn't support IPv6, so I'm out of luck there where Firefox is concerned.

Cloudef 4 months ago

Interesting, I was not aware of GSO/GRO equivalents on Windows and macOS, though it's unfortunate that they seem buggy.

Avamander 4 months ago

I wonder why Microsoft and Apple do not care about the proper functioning of their network stacks.
Pretty sure GSO/GRO aren't the only buggy parts either.

Arcuru 4 months ago

> Instead of starting from scratch, we built on top of quinn-udp, the UDP I/O library of the Quinn project, a QUIC implementation in Rust. This sped up our development efforts significantly. Big thank you to the Quinn project.

Awesome, so you sponsored them right?

https://opencollective.com/quinn-rs

dochtman 4 months ago
When I asked about financial support, the Senior Principal Software Engineer from Mozilla I talked to said "Mozilla has no money".
To be fair, we've gotten a great amount of code contributions from the Mozilla folks, so it's not like they haven't contributed anything.
(I am one of the Quinn maintainers.)
- sethev 4 months ago
  
  It's always interesting how these large organizations can bring in tens of millions of dollars in excess of expenses, yet still manage to "have no money"
  Source: https://assets.mozilla.net/annualreport/2024/b200-mozilla-fo...
- LtdJorge 4 months ago
  
  It is true, Mozilla has no money (except for paying execs)
kouteiheika 4 months ago

> Awesome, so you sponsored them right?
Why bother sponsoring any open source projects when they can throw a few extra million into their CEO's salary, while that CEO is running their flagship product (Firefox) into the ground?
Avamander 4 months ago

They contributed in other ways?

riobard 4 months ago

Can someone explain how UDP GSO/GRO works in detail? Since UDP packets can arrive out-of-order, how is a single large QUIC packet split into multiple smaller UDP packets without any header sequence number, and how does the receiving side know the order of the UDP packets to merge?

mxinden 4 months ago
Author here.
QUIC does not depend on UDP datagrams to be delivered in order. Re-ordering happens on the QUIC layer. Thus, when receiving, the kernel passes a batch (i.e. segmented super datagram) of potentially out-of-order datagrams to the QUIC layer. QUIC reorders them.
Maybe https://blog.cloudflare.com/accelerating-udp-packet-transmis... brings some clarity.
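A toy model of the "QUIC reorders them" part: QUIC stream frames carry a byte offset, so the receiver can put data back in order no matter how the datagrams arrived. The class below is a sketch under simplifying assumptions (frames never partially overlap, no flow control, no loss recovery) and is not how neqo or any real stack is structured.

```python
class StreamReassembler:
    """Reorders stream frames by byte offset and releases contiguous data."""

    def __init__(self):
        self._pending = {}     # offset -> data not yet deliverable
        self._next_offset = 0  # next byte the application is waiting for

    def on_frame(self, offset: int, data: bytes) -> bytes:
        """Accept one frame; return whatever is now deliverable in order."""
        if offset + len(data) <= self._next_offset:
            return b""  # duplicate of data that was already delivered
        self._pending[offset] = data
        delivered = bytearray()
        while self._next_offset in self._pending:
            chunk = self._pending.pop(self._next_offset)
            delivered += chunk
            self._next_offset += len(chunk)
        return bytes(delivered)


stream = StreamReassembler()
# Frames arrive as first, fourth, third, second.
for offset, data in [(0, b"aa"), (6, b"dd"), (4, b"cc"), (2, b"bb")]:
    print(offset, stream.on_frame(offset, data))
```

Nothing is released for the out-of-order frames until the gap at offset 2 is filled, at which point everything buffered so far is delivered at once.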
- riobard 4 months ago
  
  Thanks! The Cloudflare blog article explained GSO pretty well: application must send a contiguous data buffer with a fixed segment size (except for the tail of the buffer) for GSO to split into smaller packets. But how does GRO work on the receiving side?
  For example, GSO might split a 3.5KB data buffer into 4 UDP datagrams: U1, U2, U3, and U4, with U1/U2/U3 being 1KB and U4 being 512B. When U1~4 arrive on the receiving host, how does GRO deal with the different permutations of ordering of the four packets (assuming no loss) and pass them to the QUIC layer? Like if U1/U2/U3/U4 come in the original sending order GRO can batch nicely. But what if they come in the order U1/U4/U3/U2? How does GRO deal with the fact that U4 is shorter?
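The 3.5KB example can be modelled in a few lines. This is only a user-space model of the buffer layout (fixed segment size, shorter tail), not a call into the kernel's GSO/GRO machinery, and the function name is invented. The same slicing is what an application would do with a coalesced receive buffer and a reported segment size. How the kernel batches reordered arrivals is not answered by this sketch; my assumption is that it only coalesces datagrams that arrive back to back, so reordering would mean smaller batches rather than reordered data.

```python
from typing import List


def split_segments(buffer: bytes, segment_size: int) -> List[bytes]:
    """Split a contiguous buffer into fixed-size segments plus a shorter tail."""
    if segment_size <= 0:
        raise ValueError("segment_size must be positive")
    return [
        buffer[start:start + segment_size]
        for start in range(0, len(buffer), segment_size)
    ]


# Send side: one 3.5 KiB buffer, 1 KiB segment size -> U1..U4.
buffer = bytes(range(256)) * 14  # 3584 bytes of sample data
datagrams = split_segments(buffer, 1024)
print([len(d) for d in datagrams])

# Receive side: a coalesced buffer is cut back into datagrams the same way.
assert b"".join(split_segments(b"".join(datagrams), 1024)) == buffer
```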
  
jiehong 4 months ago
I think as an application, when receiving packets you never really see coalesced UDP datagrams when GRO is active.
It’s more like the kernel puts multiple datagrams into a single structure and passes that around between layers, maintaining the boundaries between them in that structure (sk_buff data fragments?)
Not an expert, but I tried looking at how this works and stumbled upon [0].
[0]: https://lwn.net/Articles/768995/
- wheezle 4 months ago
  
  You definitely see the coalesced datagram as an application. That is kind of the whole point: pass a big buffer to the syscall and segment it in user space to minimize the syscall overhead per packet.

NooneAtAll3 4 months ago

idk if author reads this, but

> The combination of the two did cost me a couple of days, resulting in this (basically single line) change in quinn-udp.

2 hyper-links here probably were meant to be different, but got copy pasted the same link

mxinden 4 months ago

Fixed. Thank you!

pabs3 4 months ago

Wonder if this will lead to BitTorrent in the browser.

nefarious_ends 4 months ago

[flagged]

singron 4 months ago

It's true, but since this is a Firefox project, it is relevant: Rust was largely developed for years specifically for (re)writing exactly this kind of code in Firefox.
kibwen 4 months ago
Except for, you know, the majority of Rust projects which reach the HN front page and don't, like the stories on PopOS, Redox, and the Wild linker from the past day.
- MisterTea 4 months ago
  
  > Redox
  Any project whose name alludes to oxidation or crustaceans is a Rust project, so it's already in the title by default.
  
timeon 4 months ago

How do you know someone is bothered by headline? They will write comment!
quotemstr 4 months ago
Yeah. Rust, good or bad, affords no special performance advantage for IO performance.
- acdha 4 months ago
  
  Not innately, no, but the kinds of optimizations they’re talking about (batching operations and avoiding copies) are certainly safer to make using a memory safe language.
- cultofmetatron 4 months ago
  
  correct, stable, fast <- rust's whole deal is giving normal people a chance of building something that gets you all 3.
  
a456463 4 months ago
Agreed. This could have been done in C or anything else for that matter.
- kstrauser 4 months ago
  
  The people who actually wrote it seem to disagree.

bergheim 4 months ago

[flagged]

timeon 4 months ago

Yeah about that... https://chromium.googlesource.com/chromium/src/+/refs/heads/...

ajsnigrutin 4 months ago

[flagged]

metaltyphoon 4 months ago

[dupe]

superkuh 4 months ago

Wow! Does this mean that Firefox can re-enable self-signed certs for its HTTP/3 stack since it's using a custom implementation and not someone else's big QUIC lib and default build flags anymore? That'd be a huge win for human people and their typical LAN use cases. Even if the corporate use cases don't want it for 'security' reasons.

wolrah 4 months ago
You can still have self-signed certs, you just have to actually set up your own CA and import it as trusted in the relevant trust store so it can be verified.
You can't just have some random router, printer, NAS, etc. generate its own cert out of thin air and tell the browser to ignore the fact that it can't be verified.
IMO this is a good thing. The way browsers handle HTTPS on older protocols is a result of the number of legacy badly configured systems there are out there which browser vendors don't want to break. Anywhere someone's supporting HTTP/3 they're doing something new, so enforcing a "do it right or don't do it at all" policy is possible.
- superkuh 4 months ago
  
  Which also means it's impossible to host a visitable webserver for random people on HTTP/3 without the continued permission of a third party corporation. Do it "right" means "Do it for the corps' use cases only" to most people it seems.
  
ekr____ 4 months ago
Certificate verification in Firefox happens at a layer way above HTTP and TLS (for those who care, it's in PSM), so which QUIC library is used is basically not relevant.
The reason that Firefox -- and other major browsers -- make self-signed certs so difficult to use is that allowing users to override certificate checks weakens the security of HTTPS, which otherwise relies on certificates being verifiable against the trust anchor list. It's true that this makes certain cases harder, but the judgement of the browser community was that that wasn't worth the security tradeoff. In other words, it's a policy decision, not a technical one.
- rcxdude 4 months ago
  
  It's a pretty bad one, though. It massively undermines the security of connections to local devices for a slight improvement in security on the open internet. It's very frustrating how browser vendors don't even seem to consider it something worth solving, even if e.g. the way it is presented to the user is different. At the moment if you just use plain HTTP then things do mostly work (apart from some APIs which are somewhat arbitrarily locked to 'secure contexts', which means very little about the trustworthiness of the code that does or does not have access to those APIs), but if you try to use HTTPS then you get a million 'this is really insecure' warnings. There's no 'use HTTPS but treat it like HTTP' option.
  
jeroenhd 4 months ago

I think self-signed certs should be possible on principle, but is there a reason to use HTTP/3 on LAN use cases? In low-latency situations, there's barely any advantage to using HTTP/3 over HTTP/2, and even HTTP/1.1 is good enough for most use cases (and will outperform the other options in terms of pure throughput).
mxinden 4 months ago

Author here. You can find details on why we disable HTTP/3 on self-signed certs here: https://bugzilla.mozilla.org/show_bug.cgi?id=1985341#c7