QUIC for the kernel

1 day ago (lwn.net)

I recently had to add `ssl_preread_server_name` to my NGINX configuration in order to `proxy_pass` requests for certain domains to another NGINX instance. In this setup, the first instance simply forwards the raw TLS stream (with `proxy_protocol` prepended), while the second instance handles the actual TLS termination.
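
The first instance's stream block looks roughly like this (hostnames and addresses here are placeholders, not the actual config):

    stream {
        # Route on the server name sent in the plaintext TLS ClientHello.
        map $ssl_preread_server_name $backend {
            example.com   192.0.2.10:443;   # terminated on the other instance
            default       127.0.0.1:8443;   # terminated locally
        }

        server {
            listen 443;
            ssl_preread on;        # peek at the ClientHello without terminating TLS
            proxy_protocol on;     # prepend PROXY protocol so the backend keeps the client IP
            proxy_pass $backend;
        }
    }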

This approach works well when implementing a failover mechanism: if the default path to a server goes down, you can update DNS A records to point to a fallback machine running NGINX. That fallback instance can then route requests for specific domains to the original backend over an alternate path without needing to replicate the full TLS configuration locally.

However, this method won't work with HTTP/3. Since HTTP/3 uses QUIC over UDP and encrypts the SNI during the handshake, `ssl_preread_server_name` can no longer be used to route based on domain name.

What alternatives exist to support this kind of SNI-based routing with HTTP/3? Is the recommended solution to continue using HTTP/1.1 or HTTP/2 over TLS for setups requiring this behavior?

  • Clients supporting QUIC usually also support HTTPS DNS records, so you can use a lower priority record as a failover, letting the client potentially take care of it. (See for example: host -t https dgl.cx.)
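
    In zone-file terms such a fallback might look something like this (the names and the fallback target are made up; lower SvcPriority is preferred):

        example.com.  300 IN HTTPS 1 .                      alpn="h3,h2"
        example.com.  300 IN HTTPS 2 fallback.example.net.  alpn="h2"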

    That's the theory anyway. You can't always rely on clients to do that (see how much of the HTTPS record Chromium actually supports[1]), but in general if QUIC fails for any reason clients will transparently fall back, as well as respecting the Alt-Svc[2] header. If this is a planned failover you could stop sending an Alt-Svc header and wait for the alternative to time out, although that isn't strictly necessary.

    If you do really want to route QUIC however, one nice property is that the SNI is always in the first packet, so you can route flows by inspecting the first packet. See Cloudflare's udpgrm[3] (this on its own isn't enough to proxy to another machine, but the building block is there).

    Without Encrypted Client Hello (ECH), the client hello (including the SNI) is encrypted with a known key (this is to stop middleboxes which don't know about the version of QUIC from breaking it), so it is possible to decrypt it; see the code in udpgrm[4]. With ECH the "router" would need a key to decrypt the ECH, which it can then decrypt inline and make a decision on (this is different from the TLS key, and fallback HTTPS records can advertise a different ECH key than the non-fallback route; whether browsers currently support that is a different issue, but it is possible in the protocol). This is similar to how fallback with ECH could be supported with HTTP/2 and a TCP connection.

    [1]: https://issues.chromium.org/issues/40257146

    [2]: https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/...

    [3]: https://blog.cloudflare.com/quic-restarts-slow-problems-udpg...

    [4]: https://github.com/cloudflare/udpgrm/blob/main/ebpf/ebpf_qui...

  • > for setups requiring this behavior?

    Terminating TLS at your edge (which is presumably where the IP addresses attach) isn't any particular risk in a world of Let's Encrypt, where an attacker who gained access to that box could simply request a new certificate anyway, so you might as well terminate there and move on with life.

    Also: I've been unable to reproduce the performance and reliability claims of QUIC. I keep trying a couple of times a year to see if anything's gotten better, but I mostly leave it disabled for monetary reasons.

    > This approach works well when implementing a failover mechanism: if the default path to a server goes down...

    I'm not sure I agree: DNS updates can take minutes to be reflected, and dumb clients (like web browsers) don't fail over.

    So I use an onerror handler to load the second path. My ad tracking looks something like this:

        <img src="patha.domain1?tracking"
          onerror="this.src='pathb.domain2?tracking';this.onerror=function(){}">
    

    For the more complex APIs, fetch() is wrapped with a similar fallback in the client code I deliver to users. This works much better than anything else I've tried.

    • > […] isn't any particular risk in a world of Let's Encrypt where an attacker (who gained access to that box) could simply request a new SSL certificate

      You can use CAA records with validationmethods and accounturi to limit issuance, so access to the machine alone isn't enough (e.g. using DNS validation with an account key stored on a different machine).
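
      For example (RFC 8657 syntax; the account URL here is a placeholder):

          example.com.  IN CAA 0 issue "letsencrypt.org; accounturi=https://acme-v02.api.letsencrypt.org/acme/acct/1234567; validationmethods=dns-01"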

  • For a failover scenario, I wouldn't bother handling QUIC at all. If a browser can't make a QUIC connection (even if one is advertised in DNS), it will try HTTP/1.1 or HTTP/2 over TLS. Then you can use the same fallback mechanism you would use if QUIC weren't in the picture.

  • Unfortunately I think that falls under the "Not a bug" category of bugs. Keeping the endpoint concealed all the way to the TLS endpoint is a feature* of HTTP/3.

    * I do actually consider it a feature, but do acknowledge https://xkcd.com/1172/

    PS. HAProxy can proxy raw TLS, but can't direct based on hostname. Cloudflare tunnel I think has some special sauce that can proxy on hostname without terminating TLS but requires using them as your DNS provider.

    • Unless you're using ECH (Encrypted Client Hello), the endpoint is obscured (known keys), not concealed.

      PS: HAProxy can definitely do this too, using req.ssl_sni, something like this:

         frontend tcp-https-plain
             mode tcp
             tcp-request inspect-delay 10s
             bind [::]:443 v4v6 tfo
             acl clienthello req.ssl_hello_type 1
             acl example.com req.ssl_sni,lower,word(-1,.,2) example.com
             tcp-request content accept if clienthello
             tcp-request content reject if !clienthello
             default_backend tcp-https-default-proxy
             use_backend tcp-https-example-proxy if example.com
      

      Then tcp-https-example-proxy is a backend which forwards to a server listening for HTTPS (using send-proxy-v2, so the client IP is kept). Cloudflare really isn't doing anything special here; there are also other tools like sniproxy[1] which can intercept based on SNI (a common thing commercial proxies do for filtering reasons).

      [1]: https://github.com/ameshkov/sniproxy

  • Hm, that’s a good question. I suppose the same would apply to TCP+TLS with Encrypted Client Hello as well, right? Presumably the answer would be the same/similar between the two.

  • Not an expert on eSNI, but my understanding was that the encryption in eSNI is entirely separate from the "main" encryption in TLS, and the eSNI keys have to be the same for every domain served from the same IP address or machine.

    Otherwise, the TLS handshake would run into the same chicken/egg problem that you have: To derive the keys, it needs the certificate, but to select the certificate, it needs the domain name.

    So you only need to replicate the eSNI key, not the entire cert store.

    • Personally, I'd like to have the option of the outbound firewall doing the eSNI encryption; is that possible?

  • > That fallback instance can then route requests for specific domains to the original backend over an alternate path without needing to replicate the full TLS configuration locally.

    Won't you need to "replicate the TLS config" on the backend servers then? And how hard is it to configure TLS on the nginx side anyway? Can't you just use ACME?

  • QUIC v1 does encrypt the SNI in the client hello, but the keys are derived from a predefined salt and the Destination Connection ID. I don't see why decrypting this would be difficult for an nginx plugin.
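
    As a rough sketch of that derivation (standard library only; the salt and labels are the QUIC v1 values from RFC 9001, and the output can be checked against the test vectors in its appendix A):

        import hmac, hashlib

        # Fixed salt for QUIC v1 Initial packets (RFC 9001, section 5.2).
        INITIAL_SALT = bytes.fromhex("38762cf7f55934b34d179ae6a4c80cadccbb7f0a")

        def hkdf_extract(salt, ikm):
            return hmac.new(salt, ikm, hashlib.sha256).digest()

        def hkdf_expand_label(secret, label, length):
            # TLS 1.3 HkdfLabel structure with an empty context.
            full = b"tls13 " + label
            info = length.to_bytes(2, "big") + bytes([len(full)]) + full + b"\x00"
            out, block, i = b"", b"", 1
            while len(out) < length:
                block = hmac.new(secret, block + info + bytes([i]), hashlib.sha256).digest()
                out += block
                i += 1
            return out[:length]

        def client_initial_keys(dcid):
            # dcid: Destination Connection ID copied from the Initial packet header.
            initial_secret = hkdf_extract(INITIAL_SALT, dcid)
            client_secret = hkdf_expand_label(initial_secret, b"client in", 32)
            return {
                "key": hkdf_expand_label(client_secret, b"quic key", 16),  # AES-128-GCM key
                "iv":  hkdf_expand_label(client_secret, b"quic iv", 12),
                "hp":  hkdf_expand_label(client_secret, b"quic hp", 16),   # header protection
            }

    With those keys a router can remove the header protection, decrypt the Initial payload, and read the SNI from the ClientHello it carries.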

  • There is no way to demultiplex incoming QUIC or HTTP/3 connections based on plaintext metadata inside the protocol. The designers went one step too far in their fight against middleboxes of all sorts. Unless you can assign each destination at least its own (IP address, UDP port) pair you're shit out of luck and can't have end-to-end encryption. A QUIC proxy has to decrypt, inspect, and re-encrypt the traffic. Such a great performance and security improvement :-(. With IPv6 you can use unique IP addresses, which immediately undoes any of the supposed privacy advantages of encrypting the server name in the first place. With IPv4 you're pretty much fucked. Too bad SRV record support for HTTP(S) was never accepted because it would threaten business models. I guess your best bet is to try to redirect clients to unique ports.

  • Hiding SNI is more important than breaking rare cases of weird web server setups. This setup is not typical because large organizations like Google tend to put all the services behind the same domain name.

I thought nowadays every driver wants to be in userspace, precisely for speed, especially network ones, so that there's no overhead at the kernel/userspace boundary. Strange that this article claims the opposite; I'm definitely missing something.

I recall this article on QUIC disadvantages: https://www.reddit.com/r/programming/comments/1g7vv66/quic_i...

Seems like this is a step in the right direction to resolve some of those issues. I suppose nothing is preventing it from getting hardware support in future network cards as well.

  • QUIC does not work very well for use cases like machine-to-machine traffic. However, most traffic on the Internet today is from mobile phones to servers, and that is where QUIC and HTTP/3 shine.

    For other use cases we can keep using TCP.

    • Let me try providing a different perspective based on experience. QUIC works amazingly well for _some_ kinds of machine-to-machine traffic.

      ssh3, based on QUIC, is quicker at dropping into a shell than ssh. The latency difference was clearly visible.

      QUIC with the unreliable datagram extension is also a great way to implement port forwarding over ssh. Tunneling one reliable transport over another hides the packet losses from the upper layer.

    • Why doesn't QUIC work well for machine-to-machine traffic? Is it because it lacks the offloads/optimizations that TCP has, while machine-to-machine traffic tends to be high-volume/high-rate?

    • I don't understand what you mean by "machine-to-machine" if a phone (a machine) talking to a server (a machine) is not machine-to-machine.

What will the socket API look like for multiple streams? I guess it is implied it is the same as multiple connections, with caching behind the scenes.

I would hope for something more explicit, where you get a connection object and then open streams from it, but I guess that is fine for now.

Ah, but look at this: https://github.com/microsoft/msquic/discussions/4257 --- unless this is an extension, the server side can also create new streams once a connection is established. The client creating new "connections" (actually streams) cannot abstract over this. Something fundamentally new is needed.

My guess is recvmsg to get a new file descriptor for new stream.

I don't know about using it in the kernel, but I would love to see OpenSSH support QUIC so that I get some of the benefits of Mosh [1] while still having all the features of OpenSSH, including SFTP, SOCKS, port forwarding, fewer state-table and keepalive issues, roaming support, etc. Could OpenSSH leverage the kernel support?

[1] - https://mosh.org/

  • SSH would need a lot of work to replace its crypto and mux layers with QUIC. It's probably worth starting from scratch to create a QUIC login protocol. There are a bunch of different approaches to this in various states of prototyping out there.

    • Fair points. I suppose Mosh would be the proper starting point then. I'm just selfish and want the benefits of QUIC without losing all the really useful features of OpenSSH.

  • OpenSSH is an OpenBSD project, so I guess a Linux API isn't that interesting to them, but I could be wrong of course.

    • Once Linux implements it, I think odds are high that FreeBSD sooner or later does too. And maybe NetBSD and XNU/Darwin/macOS/iOS thereafter. And if they’ve all got it, that increases the odds that eventually OpenBSD also implements it. And if OpenBSD has the support in its kernel, then they might be willing to consider accepting code in OpenSSH which uses it. So OpenSSH supporting QUIC might eventually happen, but if it does, it is going to be some years away.

I have a question: the bottleneck for TCP is said to be the handshake, but that can be solved by reusing connections and/or multiplexing. The current implementation is 3-4x slower than the existing Linux TCP/TLS stack, and the performance gap is only expected to close over time.

If speed is touted as the advantage of QUIC and it is in fact slower, why bother with this protocol? Even the author of the patch set attributes some of the speed issues to the protocol design. Are there other problems in TCP that need fixing?

  • The article discusses many of the reasons QUIC is currently slower. Most of them seem to come down to "we haven't done any optimization for this yet".

    > Long offers some potential reasons for this difference, including the lack of segmentation offload support on the QUIC side, an extra data copy in transmission path, and the encryption required for the QUIC headers.

    All of these three reasons seem potentially very addressable.

    It's worth noting that the benchmark here is on pristine network conditions, a drag race if you will. If you are on mobile, your network will have a lot more variability, and there TCP's design limits are going to become much more apparent.

    TCP itself often has protocols run on top of it to do QUIC-like things; HTTP/2 is an example of this. So when you compare QUIC and TCP, it's kind of like comparing how fast a car goes with how fast an engine bolted to a frame with wheels on it goes. QUIC goes significantly up the OSI network stack, is layer 5+, whereas TCP+TLS is layer 3. That's less system design.

    QUIC also has wins for connecting faster, and especially for reconnecting faster. It also has IP mobility: if you're on mobile and your IP address changes (it happens!), QUIC can keep the session going without rebuilding it once the client sends the next packet.

    It's a fantastically well thought out & awesome advancement, radically better in so many ways. The advantage of having multiple non-blocking streams (like SCTP) massively reduces the scope that higher-level protocol design has to take on. And all that multi-streaming stuff being in the kernel means it's deeply optimizable in a way TCP can never enjoy.

    Time to stop driving the old rust bucket jalopy of TCP around everywhere, crafting weird elaborate handmade shit atop it. We need a somewhat better starting place for higher level protocols and man oh man is QUIC alluring.

    • > QUIC goes significantly up the OSI network stack, is layer 5+, whereas TCP+TLS is layer 3

      IP is layer 3 - network (ensures packets are routed to the correct host). TCP is layer 4 - transport (some people argue that TCP has functions from layer 5, e.g. establishing sessions between apps), while TLS adds a few functions from layer 6 (e.g. encryption), which QUIC also has.

  • That's just one bottleneck. The other issue is head-of-line blocking. When there is packet loss on a TCP connection, nothing sent after that is delivered until the loss is repaired.

  • > bottleneck for TCP is said to the handshake. But that can be solved by reusing connections

    You can't reuse a connection that doesn't exist yet. A lot of this is about reducing latency, not overall speed.

I'm confused, I thought the revolution of the past decade or so was in moving network stacks to userspace for better performance.

  • Most QUIC stacks are built upon in-kernel UDP. You get significant performance benefits if you can avoid your traffic going through kernel and userspace and the context switches involved.

    You can work that angle by moving networking into user space: setting up the NIC queues so that user space can access them directly, without needing to context switch into the kernel.

    Or you can work the angle by moving networking into kernel space: things like sendfile, which lets a TCP application instruct the kernel to send a file to the peer without needing to copy the content into userspace and then back into kernel space and finally into device memory (see the sketch below). If you have in-kernel TLS with sendfile then you can continue to skip copying to userspace; if you have NIC-based TLS, the kernel doesn't need to read the data from the disk; if you have NIC-based TLS and the disk can DMA to the NIC buffers, the data doesn't need to even hit main memory. Etc.

    But most QUIC stacks don't benefit from either side of that. They're reading and writing packets via syscalls, and they're doing all the packetization in user space. No chance to sendfile and skip a context switch and skip a copy. Batching I/O via io_uring or similar helps with context switches, but probably doesn't prevent copies.
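
    To make the contrast concrete, this is the kind of copy avoidance TCP gets today and current QUIC stacks don't (a minimal sketch; path and port are placeholders):

        import os, socket

        # Serve one file over TCP without the payload passing through userspace:
        # the kernel moves pages from the page cache straight into the socket.
        def serve_file_once(path, port=9000):
            with socket.create_server(("", port)) as srv:
                conn, _ = srv.accept()
                with conn, open(path, "rb") as f:
                    size = os.fstat(f.fileno()).st_size
                    sent = 0
                    while sent < size:
                        sent += os.sendfile(conn.fileno(), f.fileno(), sent, size - sent)

    With in-kernel TLS the same pattern extends to encrypted traffic; a userspace QUIC stack instead has to read the data, encrypt it, and hand each packet back to the kernel.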

    • Yeah, there are also a lot of offloads that can be done in the kernel with UDP (e.g. UDP segmentation offload, generic receive offload, checksum offload), and offloading QUIC entirely would be a natural extension to that.

      It just offers people a choice of the right solution at the right moment.

  • You are right but it's confusing because there are two different approaches. I guess you could say both approaches improve performance by eliminating context switches and system calls.

    1. Kernel bypass combined with DMA and techniques like dedicating a CPU to packet processing improve performance.

    2. What I think of as "removing userspace from the data plane" improves performance for things like sendfile and ktls.

    To your point, QUIC in the kernel seems to have neither advantage.

  • You still need to offload your bytes to a NIC buffer. Either you can do something like DMA, where you get privileged space to write your bytes to that the NIC reads from, or you have to cross the syscall barrier and have the kernel write the bytes into the NIC's buffer. Crossing the syscall barrier adds a huge performance penalty due to the switch in memory space and privilege rings, so userspace networking only makes sense if you're not having to deal with the privilege changes or you have DMA.

    • That is only a problem if you do one or more syscalls per packet, which is an utterly bone-headed design.

      The copy itself is going at 200-400 Gbps, so writing out a standard 1,500 byte (12,000 bit) packet takes 30-60 ns (in steady state with caches being prefetched). Of course you get slaughtered if you stupidly do a syscall (~100 ns hardware overhead) per packet, since that is like 300% overhead. You just batch like 32 packets so the write time is ~1,000-2,000 ns, and your overhead drops from 300% to 10%.

      At 1 Gbps throughput, that is ~80,000 packets per second, or one packet per ~12.5 us. So waiting for a 32-packet batch only adds an additional ~400 us to your end-to-end latency in return for 4x efficiency (assuming that was your bottleneck, which it is not for these implementations as they are nowhere near the actual limits). If you go up to 10 Gbps, that is only ~40 us of added latency, and at 100 Gbps you are only looking at ~4 us of added latency for a literal 4x efficiency improvement.
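
      Sanity-checking those numbers in a few lines (the constants are the ones assumed above):

          pkt_bits   = 1500 * 8        # one full-size packet
          copy_bps   = 400e9           # fast steady-state memcpy
          syscall_ns = 100             # rough per-syscall hardware overhead

          copy_ns = pkt_bits / copy_bps * 1e9                 # ~30 ns to copy one packet
          print(syscall_ns / copy_ns)                         # ~3.3  -> ~300% per-packet overhead
          print(syscall_ns / (32 * copy_ns))                  # ~0.10 -> ~10% with a 32-packet batch

          pkt_interval_us = pkt_bits / 1e9 * 1e6              # ~12 us between packets at 1 Gbps
          print(32 * pkt_interval_us)                         # ~400 us of added batching latency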

  • What is done for that is userspace gets the network data directly without (I believe) involving syscalls. It's not something you'd do for end-user software, only the likes of MOFAANG need it.

    In theory the likes of io_uring would bring these benefits across the board, but we haven't seen that delivered (yet, I remain optimistic).

  • The constant mode switching for hardware access is slow. TCP/IP remains in the kernel for Windows and Linux.

  • Performance comes from dedicating core(s) to polling, not from userspace.

  • Networking is much faster in the kernel. Even faster on an ASIC.

    Network stacks were moved to userspace because Google wanted to replace TCP itself (and upgrade TLS), but it only cared about the browser, so they just put the stack in the browser, and problem solved.

What is the need for mashing more and more stuff into the kernel? I thought the job of the kernel was to manage memory, hardware, and tasks. Shouldn't protocols built on top of IP be handled by userland?

  • Having networking, routing, VPN etc all not leave kernel space can be a performance improvement for some use cases.

    Similarly, splitting the networking/etc stacks out from the kernel into userspace can also be a performance improvement for some use cases.

    • Can't you say that about virtually anything? I'm sure having, say, MIDI synthesizers in the kernel would improve performance too, but not many think that is a good idea.

    • Yup, context switches between kernelspace and userspace are very expensive in high-performance situations, which is why these types of offloads are used.

      At specific workloads (think: load balancers / proxy servers / etc), these things become extremely expensive.

  • Maybe. Getting stuff into the kernel means (in theory) it’s been hardened, it has a serious LTS, and benefits from… well, the performance of being part of the kernel.

  • No, protocols directly on IP specifically can’t be used in userland because they can’t be multiplexed to multiple processes.

    If everything above IP was in userland, only one program at a time could use TCP.

    TCP and UDP being intermediated by the kernel allow multiple programs to use the protocols at the same time because the kernel routes based on port to each socket.

    QUIC sits a layer even higher because it cruises on UDP, so I think your point still stands, but it’s stuff on top of TCP/UDP, not IP.

Looks good. QUIC is a real game changer for many; the Internet should be a little faster with it. Probably we will not care because of 5G, but it is still valuable. I am wondering why there are two separate handshakes; I thought QUIC embeds TLS, but it seems I am wrong.

The general web is slowed down by bloated websites. But I guess this can make game latency lower.

  • https://en.m.wikipedia.org/wiki/Jevons_paradox

    The Jevons Paradox is applicable in a lot of contexts.

    More efficient use of compute and communications resources will lead to higher demand.

    In games this is fine. We want more, prettier, smoother pixels.

    In scientific computing this is fine. We need to know those simulation results.

    On the web this is not great. We don’t want more ads, tracking, JavaScript.

    • No, the last 20 years of browser improvements have made my static site incredibly fast!

      I'm benefiting from WebP, JS JITs, Flexbox, zstd, Wasm, QUIC, etc, etc

> Calls to bind(), connect(), listen(), and accept() can be used to initiate and accept connections in much the same way as with TCP, but then things diverge a bit. [...] The sendmsg() and recvmsg() system calls are used to carry out that setup

I wish the article explained why this approach was chosen, as opposed to adding a dedicated system call API that matches the semantics of QUIC.

This seems to be a categorical error, for reasons that are contained in the article itself. The whole appeal of QUIC is being immune to ossification, being free to change parameters of the protocol without having to beg Linux maintainers to agree.

  • IMHO, you likely want the server side to be in the kernel, so you can get to performance similar to in-kernel TCP, and ossification is less of a big deal, because it's "easy" to modify the kernel on the server side.

    OTOH, you want to be in user land on the client, because modifying the kernel on clients is hard. If you were Google, maybe you could work towards a model where Android clients could get their in-kernel protocol handling to be something that could be updated regularly, but that doesn't seem to be something Google is willing or able to do; Apple and Microsoft can get priority kernel updates out to most of their users quickly; Apple also can influence networks to support things they want their clients to use (IPv6, MP-TCP). </rant>

    If you were happy with congestion control on both sides of TCP, and were willing to open multiple TCP connections like http/1, instead of multiplexing requests on a single connection like http/2, (and maybe transfer a non-pessimistic bandwidth estimate between TCP connections to the same peer), QUIC still gives you control over retransmission that TCP doesn't, but I don't think that would be compelling enough by itself.

    Yes, there's still ossification in middle boxes doing TCP optimization. My information may be old, but I was under the impression that nobody does that in IPv6, so the push for v6 is both a way to avoid NAT and especially CGNAT, but also a way to avoid optimizer boxes as a benefit for both network providers (less expense) and services (less frustration).

    • One thing is that congestion control choice is sort of cursed in that it assumes your box/side is being switched but the majority of the rest of the internet continues with legacy limitations (aside from DCTCP, which is designed for intra-datacenter usage), which is an essential part of the question given that resultant/emergent network behavior changes drastically depending on whether or not all sides are using the same algorithm. (Cubic is technically another sort-of-exception, at least since it became the default Linux CC algorithm, but even then you’re still dealing with all sorts of middleware with legacy and/or pathological stateful behavior you can’t control.)

    • This is a perspective, but just one of many. The overwhelming majority of IP flows are within data centers, not over planet-scale networks between unrelated parties.

  • Ossification does not come about from the decisions of "Linux maintainers". You need to look at the people who design, sell, and deploy middleboxes for that.

    • I disagree. There is plenty of ossification coming from inside the house. Just some examples off the top of my head are the stuck-in-1974 minimum RTO and ack delay time parameters, and the unwillingness to land microsecond timestamps.

    • The "middleboxes" excuse for not improving (or replacing) protocols in the past was horseshit. If a big incumbent player in the networking world releases a new feature that everyone wants (but nobody else has), everyone else (including 'middlebox' vendors) will bend over backwards to support it, because if you don't your competitors will and then you lose business. It was never a technical or logistical issue, it was an economic and supply-demand issue.

      To prove it:

      1. Add a new OSI Layer 4 protocol called "QUIC" and give it a new protocol number, and just for fun, change the UDP frame header semantics so it can't be confused for UDP.

      2. Then release kernel updates to support the new protocol.

      Nobody's going to use it, right? Because internet routers, home wireless routers, servers, shared libraries, etc would all need their TCP/IP stacks updated to support the new protocol. If we can't ship it over a weekend, it takes too long!

      But wait. What if ChatGPT/Claude/Gemini/etc only supported communication over that protocol? You know what would happen: every vendor in the world would backport firmware patches overnight, bending over backwards to support it. Because they can smell the money.

  • The protocol itself is resistant to ossification, no matter how it is implemented.

    It is mostly achieved by using encryption, and it is a reason why it is such an important and mandatory part of the protocol. The idea is to expose as little as possible of the protocol between the endpoints, the rest is encrypted, so that "middleboxes" can't look at the packet and do funny things based on their own interpretation of the protocol stack.

    Endpoints can still do whatever they want, and ossification can still happen, but it helps against ossification at the infrastructure level, which is the worst kind. Updating the Linux kernel on your server is easier than changing the proprietary hardware that makes up the network backbone.

    The use of UDP instead of doing straight QUIC/IP is also an anti-ossification technique, as your app can just use UDP and a userland library regardless of the QUIC kernel implementation. In theory you could do that with raw sockets too, but that's much more problematic: because you don't have ports, you need the entire interface to yourself, and often root access.

  • Do you think putting QUIC in the kernel will significantly ossify QUIC? If so, how do you want to deal with the performance penalty for the actual syscalls needed? Your concern makes sense to me as the Linux kernel moves slower than userspace software and middleboxes sometimes never update their kernels.

That's so wrong, putting more and more stuff into the kernel and expanding attack surface. How long will it take before someone finds a vulnerability in QUIC handling?

The kernel should be as minimal as possible and everything that can be moved to userspace should be moved there. If you are afraid of performance issues then maybe you should stop using legacy processors with slow context switch timing.

  • Use a microkernel if this is your strong opinion. Linux is a monolithic kernel and includes a whole lot in kernel space for the sake of performance and (as mentioned in the article) hardware integration. A well designed microkernel may be able to provide similar performance with better security, but until people put serious work in, it won't be competitive with Linux.

    • Unfortunately the OS community puts 99% of its collective energy into Linux. There is definitely pent-up demand for a different architecture. China seems to be innovating here, but it's unclear if the West will get anything out of their designs.

    • Sadly Linux distributions use a large kernel, and there is no simple way to get a working desktop system with a microkernel.

  • > If you are afraid of performance issues then maybe you should stop using legacy processors with slow context switch timing.

    By the same logic, we should never improve performance in software and just tell everyone to buy new hardware instead. A bit ridiculous.

The article didn't discuss ACKs. I have often wondered if it makes sense for the protocol to not have ACKs and to leave that up to the application layer. I feel like the application layer has to ensure delivery anyway, so I don't know how much benefit there is to additionally supporting this at a lower layer.

> QUIC is meant to be fast, but the benchmark results included with the patch series do not show the proposed in-kernel implementation living up to that. A comparison of in-kernel QUIC with in-kernel TLS shows the latter achieving nearly three times the throughput in some tests. A comparison between QUIC with encryption disabled and plain TCP is even worse, with TCP winning by more than a factor of four in some cases.

Jesus, that's bad. Does anyone know if userspace QUIC implementations are also this slow?

  • I think the ‘fast’ claims are just different. QUIC is meant to make things fast by:

    - having a lower latency handshake

    - avoiding some badly behaved ‘middleware’ boxes between users and servers

    - avoiding resetting connections when users' IP addresses change

    - avoiding head of line blocking / the increased cost of many connections ramping up

    - avoiding poor congestion control algorithms

    - probably other things too

    And those are all things about working better with the kind of network situations you tend to see between users (often on mobile devices) and servers. I don’t think QUIC was meant to be fast by reducing OS overhead on sending data, and one should generally expect it to be slower for a long time until operating systems become better optimised for this flow and hardware supports offloading more of the work. If you are Google then presumably you are willing to invest in specialised network cards/drivers/software for that.

    • Yeah, I totally get that it optimizes for different things. But the trade-offs seem way too severe. Does saving one round trip on the handshake mean anything at all if you're only getting one-fourth of the throughput?

    • > - avoiding some badly behaved ‘middleware’ boxes between users and servers

      Surely badly behaving middleboxes won't just ignore UDP traffic? If anything, they'd get confused about udp/443 and act up, forcing clients to fall back to normal TCP.

  • Yes. msquic is one of the best performing implementations and only achieves ~7 Gbps [1]. The benchmarks for the Linux kernel implementation only get ~3 Gbps to ~5 Gbps with encryption disabled.

    To be fair, the Linux kernel TCP implementation only gets ~4.5 Gbps at normal packet sizes and still only achieves ~24 Gbps with large segmentation offload [2], both of which are ridiculously slow. It is straightforward to achieve ~100 Gbps/core at normal packet sizes without segmentation offload, with the same features as QUIC, with a properly designed protocol and implementation.

    [1] https://microsoft.github.io/msquic/

    [2] https://lwn.net/ml/all/cover.1751743914.git.lucien.xin@gmail...

  • Yes, they are. Worse, I've seen them shrink down to nothing in the face of congestion with TCP traffic. If QUIC is indeed the future protocol, it's a good thing to move it into the kernel IMO. It's just madness to provide these massive userspace implementations everywhere, on a packet-switched protocol no less, and expect it to beat good old TCP. It wouldn't surprise me if we need optimizations all the way down to the NIC layer, and maybe even middleboxes. Oh, and I haven't even mentioned the CPU cost of UDP.

    OTOH, TCP is like a quiet guy at the gym who always wears baggy clothes but does 4 plates on the bench when nobody is looking. Don't underestimate. I wasted months to learn that lesson.

  • QUIC performance requires careful use of batching. Using UDP sockets naively, i.e. sending one QUIC packet per syscall, will incur a lot of overhead - every time, the kernel has to figure out which interface to use, queue the packet up on a buffer, and all the rest. If one uses it like TCP, batching up lots of data and enqueuing packets in one "call" helps a ton. Similarly, the kernel WireGuard implementation can be slower than wireguard-go since it doesn't batch traffic. At the speeds offered by modern hardware, we really need to use vectored I/O to be efficient.
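
    For example, on Linux a sender can hand the kernel one large buffer and let it be split into MTU-sized datagrams (UDP segmentation offload). A rough sketch; the option value is taken from the uapi headers since Python doesn't expose it by name:

        import socket

        UDP_SEGMENT = 103    # from linux/udp.h; not exposed as socket.UDP_SEGMENT

        s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        s.connect(("192.0.2.1", 4433))                       # placeholder peer
        s.setsockopt(socket.IPPROTO_UDP, UDP_SEGMENT, 1200)

        # One syscall, one big buffer: the kernel (or the NIC, with hardware
        # offload) splits it into 1200-byte UDP datagrams. In a real QUIC stack
        # each 1200-byte chunk would be a complete, separately encrypted packet.
        s.send(b"\x00" * (1200 * 32))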

  • I would expect a protocol such as TCP to perform much better than QUIC in benchmarks. Now do a realistic benchmark over a roaming LTE connection and come back with the results.

    Without seeing actual benchmark code, it's hard to tell if you should even care about that specific result.

    If your goal is to pipe lots of bytes from A to B over an internal network or the public internet, there probably aren't many things, if any, that can outperform TCP. Decades were spent optimizing TCP for that. If HOL blocking isn't an issue for you, then you can keep using HTTP over TCP.

For the love of god, can we please move to microkernel-based operating systems already? We're adding a million lines of code to the Linux kernel every year. That's so much attack surface area. We're setting ourselves up for a Kessler syndrome of sorts with every system that we add to the kernel.

  • Most of that code is not loaded into the kernel; it's only loaded when needed.

    • True, but the last time I checked (several years ago), the size of the portion of code that is not drivers or kernel modules was still 7 million lines of code, and the average system still has to load a few million more via kernel modules and drivers. That is still a phenomenally large attack surface.

      The seL4 kernel is 10k lines of code. OKL4 is 13k. QNX is ~30k.

  • I might be wrong, but microkernels also need drivers, so the attack surface would be the same, or not?

    • You're not wrong, but monolithic-kernel drivers run at a privilege level that's even higher than root (ring 0), while microkernels run them in userspace, so they're only as dangerous as a normal program.

  • Naive question: are macOS or iOS microkernels? They seem to support HTTP/3 in their networking libraries and I'm wondering if it's userland-only or more.

    • macOS is a hybrid kernel, which has been becoming more microkernel-like over time, and they are aggressively pushing more and more things to userspace. I don't think it will ever be a full microkernel, but it is promising to see that happening there.

      Ironic (in the Alanis Morissette sense) that Apple has strictly controlled hardware AND OS-level software... if there's anybody out there that can possibly get away with a monolithic kernel in a safe way, it would be them. But Linux, where you have to support practically infinite variations in hardware and the full bazaar of software? That's a dumpster fire waiting to happen.

Brace for unauthenticated remote execution exploits on network stack.

I've been hearing about QUIC for ages, yet it is still an obscure tech and will likely end up like IPv6.