
Comment by cptnntsoobv

6 hours ago

XDP, and the eBPF ecosystem in general, is quite neat. However, a word of caution:

* The BPF verifier's DX is not great yet. If it finds problems with your BPF code, it will spit out a rather inscrutable set of error messages that often require a good understanding of the verifier internals (e.g. the register nomenclature) to debug.

* For the same source code, the code the compiler generates can change across compiler versions in a breaking way, e.g. because a new compiler version implemented an optimization that the verifier then rejects (see https://github.com/iovisor/bcc/issues/4612).

* Checksum updating requires extra care. I believe you can only do incremental updates, not just for better perf as the post suggests but also because the verifier does not allow BPF programs to operate on unbounded buffers (so checksumming a whole packet of unknown size is tricky/cumbersome). This mostly works, but be careful with packets generated with csum offload: they don't carry a valid checksum yet, so there is nothing correct to update incrementally. (A minimal incremental-fixup sketch follows this list.)
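
For reference, the incremental fixup itself is small. Here is a rough sketch of an RFC 1624-style update of a single 16-bit field; csum_update_u16 is just a placeholder name, not a kernel helper, and note that the bpf_l4_csum_replace() helper only exists for TC/skb programs, not XDP, so in XDP you do it by hand:

    /* Sketch: incrementally update a 16-bit checksum after rewriting one
     * 16-bit value (e.g. a port), per RFC 1624. Values in network byte order. */
    static __always_inline void csum_update_u16(__u16 *csum, __u16 old_val, __u16 new_val)
    {
        __u32 sum = (__u16)~*csum;           /* undo the final complement */
        sum += (__u16)~old_val;              /* "subtract" the old value  */
        sum += new_val;                      /* add the new value         */
        sum = (sum & 0xffff) + (sum >> 16);  /* fold carries back in...   */
        sum = (sum & 0xffff) + (sum >> 16);  /* ...twice, to be safe      */
        *csum = ~sum;                        /* re-complement             */
    }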

As the blog post points out, the kernel networking stack does a lot of work that we don't generally think about. Once you start taking things into your own hands you don't have the luxury of ignorance anymore (think not just ARP but also MTU, routing, RP filtering etc.), something any user of userspace networking frameworks like DPDK will tell you.

My general recommendation is to stick with the kernel unless you have a very good justification for chasing better performance. If you do use eBPF, save yourself some trouble and try to limit yourself to read-only operations, if your use case allows.

Also, if you are trying to debug packet drops, newer kernels record a drop reason for every freed skb, which you can trace with bpftrace for much better diagnostics.

Example script (might have to adjust based on kernel version):

    bpftrace -e '
        kprobe:kfree_skb_reason {
            $skb = (struct sk_buff *)arg0;
            $ipheader = (struct iphdr *)($skb->head + $skb->network_header);
            printf("reason: %d %s -> %s\n", arg1, ntop($ipheader->saddr), ntop($ipheader->daddr));
        }'
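
On kernels and bpftrace versions where the drop reason is exposed through the skb:kfree_skb tracepoint, a tracepoint-based variant avoids the kprobe entirely (sketch; field availability varies by version):

    bpftrace -e '
        tracepoint:skb:kfree_skb {
            printf("reason: %d dropped at %s\n", args->reason, ksym(args->location));
        }'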

We absolutely ran into these issues.

A couple notes that help quite a bit:

1. Always build the eBPF programs in a container - this is great for reproducibility of course, but it also makes the DevX on macOS better for those who prefer to work there (a container build sketch follows the code below).

2. You actually can do a full checksum! You need to limit the MTU but you can:

  // Full (non-incremental) TCP checksum over a bounded buffer. The loop count
  // is clamped to MAX_TCP_PACKET_SIZE and every read is checked against
  // data_end, which is what keeps the verifier happy.
  static __always_inline void tcp_checksum(const struct iphdr *ip_header, struct tcphdr *tcp_header, const __u16 tcp_len, const void *data_end) {
    __u32 sum = 0;
    __u16 *buf = (void *)tcp_header;
    // Pseudo-header (addresses, protocol, TCP length) goes into the sum first.
    ip_header_pseudo_checksum(ip_header, tcp_len, &sum);
    // Zero the checksum field before summing over the segment.
    tcp_header->check = 0;
    // Clamp the iteration count so the verifier sees a bounded loop.
    __u16 max_packet_size = tcp_len;
    if (max_packet_size > MAX_TCP_PACKET_SIZE) {
        max_packet_size = MAX_TCP_PACKET_SIZE;
    }
    // Sum 16-bit words, bailing out before any read past data_end.
    for (int i = 0; i < max_packet_size / 2; i++) {
        if ((void *)(buf + 1) > data_end) {
            break;
        }
        sum += *buf;
        buf++;
    }
    // Handle a trailing odd byte, if any.
    if ((void *)buf + 1 <= data_end && ((__u8 *)buf - (__u8 *)tcp_header) < max_packet_size) {
        sum += *(__u8 *)buf;
    }
    tcp_header->check = csum_fold_helper(sum);
  }
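
For anyone following along, ip_header_pseudo_checksum and csum_fold_helper aren't shown above; roughly, they look something like this (a sketch of the usual shape, not necessarily our exact code; assumes bpf_htons from <bpf/bpf_endian.h>):

  // Sketch: add the TCP pseudo-header (addresses, protocol, TCP length) to the sum.
  static __always_inline void ip_header_pseudo_checksum(const struct iphdr *ip_header, __u16 tcp_len, __u32 *sum) {
    *sum += ip_header->saddr & 0xffff;
    *sum += (ip_header->saddr >> 16) & 0xffff;
    *sum += ip_header->daddr & 0xffff;
    *sum += (ip_header->daddr >> 16) & 0xffff;
    *sum += bpf_htons(IPPROTO_TCP);
    *sum += bpf_htons(tcp_len);
  }

  // Sketch: fold the 32-bit accumulator into 16 bits and take the one's complement.
  static __always_inline __u16 csum_fold_helper(__u32 sum) {
    sum = (sum & 0xffff) + (sum >> 16);
    sum = (sum & 0xffff) + (sum >> 16);
    return (__u16)~sum;
  }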
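
On the container build, the setup can be as simple as mounting the source into a toolchain image; the image name below is a placeholder - anything with clang/llvm and the libbpf headers will do:

    # placeholder image name; needs clang/llvm and libbpf headers installed
    docker run --rm -v "$PWD":/src -w /src your-bpf-build-image \
        clang -O2 -g -target bpf -c xdp_prog.c -o xdp_prog.o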

With that being said, it's not lost on me that XDP in general is something you should only reach for once you hit some sort of bottleneck. The original version of our network migration was actually implemented in userspace for this exact reason!

  • > You actually can do a full checksum

    Indeed! This is what I had in mind when I wrote "cumbersome" :).

    It's been a while, so I can't recall whether the problem was the verifier or me, and things may have improved since, but I remember the verifier choking on a static size limit too. Have you been able to use this trick successfully?

    > Always build the eBPF programs in a container

    That should work generally but watch out for any weirdness due to the fact that in a container you are already inside a couple of layers of networking (bridge, netns etc.).

  • Different kernels will be different levels of fussy about the bounded loop you're using there. Bounded loops are themselves a relatively recent feature.

    Of course, checksum fixups in eBPF are idiomatically incremental.

openonload is faster than the kernel even with the most basic configuration, which is pretty much drop-in and requires zero changes to your application.