Comment by drewg123
7 hours ago
I come from a very different world (optimizing the FreeBSD kernel for the Netflix CDN, running on bare metal) but performance leaps like this are fascinating to me.
One of the things that struck me, reading this with only general knowledge of the Linux kernel, is: what makes things so terrible? Is iptables really that bad? Is something serialized onto a single core somewhere in the other 3 scenarios? Is the CPU at 100% in all cases? Is this TCP or UDP traffic? How many threads is iperf using? It would be cool to see the CPU utilization of all 4 scenarios, along with CPU flamegraphs.
In the case of XDP, the reason it's so much faster is that it requires zero allocations in the most common case. The DMA buffers are recycled in a page pool that has already allocated and mapped at least a queue depth's worth of buffers for each hardware queue. XDP simply runs on the raw buffer data, then tells the driver what the user wants done with the buffer. If all you are doing is rewriting an IP address, this is incredibly fast.
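To make that concrete, here is a minimal sketch of that kind of XDP program: it rewrites the IPv4 destination address directly in the DMA'd buffer and hands a verdict back to the driver. The address and the final XDP_PASS are made up for illustration; a real forwarder would also fix up the MAC addresses and return XDP_TX or XDP_REDIRECT.

```c
// SPDX-License-Identifier: GPL-2.0
// Illustrative sketch only. Build with: clang -O2 -g -target bpf -c
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

#define NEW_DADDR bpf_htonl(0xC6336407) /* 198.51.100.7, a TEST-NET example */

SEC("xdp")
int rewrite_daddr(struct xdp_md *ctx)
{
    void *data     = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;

    /* The verifier demands explicit bounds checks before every access. */
    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end || eth->h_proto != bpf_htons(ETH_P_IP))
        return XDP_PASS;

    struct iphdr *iph = (void *)(eth + 1);
    if ((void *)(iph + 1) > data_end || iph->ihl != 5)
        return XDP_PASS; /* ignore IP options to keep the sketch short */

    /* The write lands directly in the driver's RX buffer: no skb, no copy. */
    iph->daddr = NEW_DADDR;

    /* Recompute the 20-byte IPv4 header checksum from scratch. */
    iph->check = 0;
    __u32 sum = 0;
    __u16 *p = (__u16 *)iph;
#pragma unroll
    for (int i = 0; i < 10; i++)
        sum += p[i];
    sum = (sum & 0xffff) + (sum >> 16);
    sum = (sum & 0xffff) + (sum >> 16);
    iph->check = (__u16)~sum;

    /* The return value is the "tell the driver what to do" part:
     * XDP_PASS, XDP_DROP, XDP_TX (bounce out the same NIC), XDP_REDIRECT. */
    return XDP_PASS;
}

char _license[] SEC("license") = "GPL";
```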
In the non-XDP case (eBPF on TC) you have to allocate an sk_buff and initialize it. This is very expensive: there's tons of accounting in the struct itself, plus components that track every sk_buff. Then there are the various CPU-bound routing layers.
Overall, the network core of Linux is very efficient. The actual page pool buffer isn't copied until the user reads the data. But there are a million features the stack needs to support, and all of them cost efficiency.
Yes, I (with a few others) did a similar optimization for FreeBSD's firewall, with similar results but much greater simplicity, using what we call "pfil memory pointer hooks". We wrote a paper about it in 2020 for a conference that was cancelled due to Covid, so it's fairly unknown.
On what's now almost 10-year-old hardware, we could drop 44 Mpps of a volumetric DoS attack and still serve our nominal workload with no impact. See PFILCTL(8) and PFIL(9); focus on Ethernet (link-layer) packets.
It relies on the same principle: the NIC passes the RX buffer directly to the firewall (ipfw, pf, or ipfilter). If the firewall says the packet is OK, RX processing happens as normal. If it says to drop, then dropping is very fast because the driver can simply reuse the buffer without reallocating it, redoing the DMA mapping, etc.
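The principle is easy to sketch outside any particular kernel. The following is deliberately *not* the PFIL(9) API (consult the man page for the real hook signatures); it's a toy illustration of the idea: the filter is a pure function over the raw RX bytes, and a DROP verdict means the caller can recycle the buffer untouched.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical verdict for a hook that sees the raw RX buffer. */
enum verdict { PASS, DROP };

/* Toy classifier: drop UDP packets aimed at a port under attack.
 * 'buf' is the untouched NIC buffer; nothing has been allocated yet. */
static enum verdict check_rx(const uint8_t *buf, size_t len)
{
    if (len < 14 + 20 + 8)                  /* eth + min IPv4 + UDP */
        return PASS;
    if (buf[12] != 0x08 || buf[13] != 0x00) /* EtherType != IPv4 */
        return PASS;
    const uint8_t *ip = buf + 14;
    if ((ip[0] >> 4) != 4 || ip[9] != 17)   /* not IPv4, or not UDP */
        return PASS;
    size_t ihl = (ip[0] & 0x0f) * 4;        /* IP header length in bytes */
    if (len < 14 + ihl + 8)
        return PASS;
    const uint8_t *udp = ip + ihl;
    uint16_t dport = (uint16_t)(udp[2] << 8 | udp[3]);
    /* 11211 as an example: memcached reflection is a classic
     * volumetric attack vector. */
    return dport == 11211 ? DROP : PASS;
}
```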
This is an essential use case for XDP: it's how FB's firewall works, and above that their load balancer uses the same technology.
The beauty of XDP is that it's all eBPF: completely customizable, injecting policy exactly where it's needed, and native to the kernel.
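As a hedged sketch of what "injecting policy" looks like in practice: the XDP program is the fixed mechanism, and the policy lives in a BPF map that userspace can rewrite at runtime without reloading anything. The names `blocklist` and `xdp_firewall` are invented for the example.

```c
// SPDX-License-Identifier: GPL-2.0
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 65536);
    __type(key, __be32);   /* source IPv4 address */
    __type(value, __u64);  /* drop counter */
} blocklist SEC(".maps");

SEC("xdp")
int xdp_firewall(struct xdp_md *ctx)
{
    void *data     = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;

    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end || eth->h_proto != bpf_htons(ETH_P_IP))
        return XDP_PASS;

    struct iphdr *iph = (void *)(eth + 1);
    if ((void *)(iph + 1) > data_end)
        return XDP_PASS;

    /* Copy the key to the stack before the lookup (verifier-friendly). */
    __be32 saddr = iph->saddr;
    __u64 *hits = bpf_map_lookup_elem(&blocklist, &saddr);
    if (hits) {
        __sync_fetch_and_add(hits, 1);
        return XDP_DROP;   /* buffer is recycled; nothing was allocated */
    }
    return XDP_PASS;
}

char _license[] SEC("license") = "GPL";
```

Userspace pushes addresses into `blocklist` at runtime (e.g. via `bpf_map_update_elem()`), and drops happen before any skb exists.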
As far as we can tell, it's a mixture of a lot of things. One of the questions I got asked was how useful this is if you have a smaller performance requirement than 200Gbps (or, maybe a better way to put it: what if your host is small and can only do 10Gbps anyway?).
You'll have to wait for the follow-up post with the CNI plugin for the fully reproducible benchmark, but on a 16-core EC2 instance with a 10Gbps connection, iptables couldn't do more than 5Gbps of throughput (TCP!), whereas again XDP was able to do 9.84Gbps on average.
Furthermore, running bidirectional iPerf3 tests on the larger hosts shows that both ingress and egress throughput increase when we swap out iptables on just the egress path.
This is all to say: our current assumption is that when the CPU is being thrashed by iPerf3, the RSS queues, the Linux kernel's ksoftirqd threads, etc. all at once, performance falls apart. XDP moves some of that work off the normal kernel path, and at the same time the packet traverses the kernel stack only half as much as without XDP (only on the path before or after the veth).
It really is all CPU usage in the end as far as I can tell. It’s not like our checksumming approach is any better than what the kernel already does.
> IPtables couldn’t do more than 5Gbps of throughput (TCP!)
Is this for a single connection? IIRC, AWS has a 5Gbps limit per connection, does it not? I'm guessing that since you were able to get to ~10, it must be a multi-connection number.
No, this was multiple connections, and we tried with both `iperf2` and `iperf3`, UDP and TCP traffic. UDP actually does much worse under `iptables` than TCP, and I'm not sure why just yet.
The kernel will allocate, merge packets into skbs if needed, extract data, and do quite a lot more. XDP runs as early as possible in the datapath: pretty much all a driver has to do is call the XDP code when it receives an IRQ from the NIC.
You'll bypass a memory copy (ring buffer -> kernel memory), allocations (skb), parsing (IPs and such), firewalling, checking whether the packet is local, checksum validation... the list goes on.
The following diagram helps show everything that happens: https://upload.wikimedia.org/wikipedia/commons/3/37/Netfilte...
(Yes, XDP is the leftmost step, literally right after "card DMA'd packet in memory".)
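For completeness, attaching a program at that leftmost step with libbpf (1.0+) looks roughly like this; `xdp_prog.o`, `eth0`, and the program name are placeholders from the earlier sketch:

```c
/* Minimal libbpf loader sketch; error handling kept terse. */
#include <stdio.h>
#include <net/if.h>
#include <linux/if_link.h>
#include <bpf/libbpf.h>
#include <bpf/bpf.h>

int main(void)
{
    struct bpf_object *obj = bpf_object__open_file("xdp_prog.o", NULL);
    if (!obj || bpf_object__load(obj))
        return 1;

    struct bpf_program *prog =
        bpf_object__find_program_by_name(obj, "xdp_firewall");
    int ifindex = if_nametoindex("eth0");
    if (!prog || !ifindex)
        return 1;

    /* XDP_FLAGS_DRV_MODE requests the native driver hook ("as early
     * as possible"); generic mode would run later, after skb setup. */
    if (bpf_xdp_attach(ifindex, bpf_program__fd(prog),
        XDP_FLAGS_DRV_MODE, NULL))
        return 1;

    printf("attached\n");
    return 0;
}
```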
It's also a bit depressing that everyone is still using the slower iptables, when nftables has been in the kernel for over a decade.
Actually, the latest benchmarks were run on a Fedora 43 host, which as far as I can tell uses the nftables backend for iptables (you can check with `iptables -V`, which reports `nf_tables` or `legacy`)!
iptables uses nftables under the hood these days (via the iptables-nft backend).