
Comment by derefr

3 days ago

> It's a rather strange crate because network IO doesn't actually need contiguous memory.

Network IO doesn't need contiguous memory, no, but each side of the duplex kind of benefits from it in its own way:

1. on receive, you can treat a contiguous received network datagram as its own little memory arena — write code that sends sliced references to the contents of the datagram to other threads to work with, where those references keep the datagram arena itself alive for as long as it's being worked with; and then drop the whole thing when the handling of the datagram is complete.

(This is somewhat akin to the Erlang approach — where the received message is a globally-shared binary; it gets passed by refcount into an actor started just for handling that request; that actor is spawned with its own preallocated memory arena; into that arena, the actor spits any temporaries related to copying/munging the slices of the shared binary, without having to grow the arena; the actor quickly finishes and dies; the arena is deallocated without ever having had to GC, and the refcount of the shared binary goes to zero — unless non-copied slices of it were async-forwarded to other actors for further processing.)

Also note that the whole premise here is zero-copy networking (as the bytes docs say: https://docs.rs/bytes/1.9.0/bytes/#bytes). The "message" being received here isn't a copy of the one from the network card, but literally the same physical wired memory the PHY sees as being part of its IO ring-buffer — just also mapped into your process's memory on (zero-copy) receive. If this data came chunked, you'd need to copy some of it to assemble those chunks into a contiguous string or data structure. But since it arrives contiguously, you can just slice it, and cast the resulting slice into whatever type you like.
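To make point 1 concrete, here's a minimal sketch using the `bytes` crate's actual `Bytes::slice` API (the datagram contents and sizes are made up):

```rust
use bytes::Bytes;
use std::thread;

fn main() {
    // Stand-in for a contiguous datagram handed to us on receive.
    let datagram = Bytes::from(b"HDR:payload-one;payload-two".to_vec());

    // Slices are refcounted views into the same allocation: no copying.
    let header = datagram.slice(0..4);
    let body = datagram.slice(4..);

    let worker = thread::spawn(move || {
        // `body` keeps the whole datagram's backing memory alive
        // until this thread drops it.
        println!("worker got {} bytes", body.len());
    });

    println!("header: {:?}", header);
    worker.join().unwrap();
    // Once `header`, `body`, and `datagram` are all dropped, the
    // backing allocation is freed exactly once.
}
```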

2. on send — presuming you're doing non-blocking IO — it's nice to once again have a preallocated arena into which you can write out byte-sequences before flinging them at the kernel as [vectors of] large, contiguous DMA requests, without having to stop to allocate. (This removes the CPU as a bottleneck from IO performance — think writev(2).)

The ideal design here is that you allocate fixed-size refcounted buffers; fill them up until the next thing you want to write doesn't fit†; and then intentionally drop the current buffer, switching your write_arena reference to point to a freshly-allocated buffer; and repeat. Each buffer then lives until all its slice-references get consumed. This forms kind of a "memory-lifetime-managed buffer-persisted message queue" — with the backing buffers of your messages living until all the messages held in them get "ACKed" [i.e. dropped by the receiving threads.]

Also, rather than having the buffers deallocate when you "use them up" — requiring you to allocate the next time you need a buffer — you can instead have the buffer's destructor release the memory it's holding into a buffer pool; and then have your next-buffer-please logic pull from that pool in preference to allocating. But then you'll want a higher-level "writable stream that is actually a mempool + current write_arena reference" type. (Hey, that's BufMut!)

† And at that point, when the next message doesn't fit, you do not split the message. That violates the whole premise of vectorizing the writes. Instead, you leave some of the buffer unused, and push the large message into a fresh buffer, so that the message will still correspond to a single vectorized-write element / io_uring call / DMA request / etc. If the message is so large it won't fit in your default buffer size, you allocate a buffer just for that one message, or better yet, you use a special second pool of larger fixed-size buffers. "Jumbo" buffers, so to speak.
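Here's a minimal sketch of that fill/rotate/freeze scheme. `BytesMut::split`, `freeze`, and `put_slice` are the real `bytes` APIs; `MessageWriter`, `BUF_SIZE`, and the rotation policy are illustrative:

```rust
use bytes::{BufMut, Bytes, BytesMut};
use std::io::IoSlice;

const BUF_SIZE: usize = 4096; // illustrative fixed buffer size

// Hypothetical writer: fill a fixed-size buffer, never split a message,
// rotate to a fresh (possibly jumbo-sized) buffer when one doesn't fit.
struct MessageWriter {
    arena: BytesMut,   // current write arena
    ready: Vec<Bytes>, // frozen messages; one vectored-write entry each
}

impl MessageWriter {
    fn new() -> Self {
        Self { arena: BytesMut::with_capacity(BUF_SIZE), ready: Vec::new() }
    }

    fn push(&mut self, msg: &[u8]) {
        if msg.len() > self.arena.capacity() - self.arena.len() {
            // Rotate: leave the tail of the old buffer unused. Its backing
            // allocation stays alive until every frozen slice is dropped.
            self.arena = BytesMut::with_capacity(BUF_SIZE.max(msg.len()));
        }
        self.arena.put_slice(msg);
        // split().freeze() hands back the just-written message as a
        // refcounted view into the shared backing buffer.
        self.ready.push(self.arena.split().freeze());
    }

    // Each ready message maps to one iovec entry for writev-style IO.
    fn iovecs(&self) -> Vec<IoSlice<'_>> {
        self.ready.iter().map(|m| IoSlice::new(m)).collect()
    }
}
```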

(Get it yet? Networking hardware is also doing exactly what I'm describing here to pack and unpack your packets into frames. For a NIC or switch, the buffers are the [bodies of the] frames; a jumbo buffer is an Ethernet jumbo frame; and so on.)

> Get it yet

I'm not sure if your comment was meant to be condescending, but it really does come across that way. I'm very well versed in this domain.

Having a per-request/connection arena isn't the only option. What I have seen/use, which is still zero copy (as far as IO zero copy can be in Rust without resorting to bytemuck/blittable types), is to have a pool of buffers of a specific length - typically page-sized by default and definitely page-aligned. These buffers can come from a single large contiguous allocation. If you run out of space in a buffer you grab a new/reused one from the pool, add it to your vec of buffers, and carry on. At the end of the story you would use vectored IO to submit all of them at once - all the way down to the NIC and everything.
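A sketch of that approach (names are illustrative, and this glosses over page alignment; `write_vectored` maps to writev(2) on Unix):

```rust
use std::io::{IoSlice, Result, Write};

const PAGE: usize = 4096;

// Pool of reusable page-sized buffers.
struct BufPool {
    free: Vec<Vec<u8>>,
}

impl BufPool {
    fn lease(&mut self) -> Vec<u8> {
        self.free.pop().unwrap_or_else(|| Vec::with_capacity(PAGE))
    }
    fn release(&mut self, mut buf: Vec<u8>) {
        buf.clear();
        self.free.push(buf); // reuse instead of freeing
    }
}

// Submit every filled buffer in one syscall. Note that write_vectored
// may write fewer bytes than requested; a real implementation would
// loop over the remainder.
fn flush<W: Write>(sock: &mut W, bufs: &[Vec<u8>]) -> Result<usize> {
    let iov: Vec<IoSlice<'_>> = bufs.iter().map(|b| IoSlice::new(b)).collect();
    sock.write_vectored(&iov)
}
```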

This approach is more widespread mainly due to historical reasons: it's really easy to fragment a 32-bit address space, so allocating jumbo buffers simply wasn't an option if you didn't want your server OOMing with 1 GB of available (but non-contiguous) memory.

https://man7.org/linux/man-pages/man3/iovec.3type.html

https://learn.microsoft.com/en-us/windows/win32/api/ws2def/n...

  • > I'm very well versed in this domain.

    Apologies, I wasn't really responding to you directly; I was just taking the opportunity to write an educational-blog-post-as-comment aimed at the average HN reader (who has likely never considered what an Ethernet frame even is, or how a device that uses what are essentially DSPs does TDM packet scheduling) — with your comment being the parent because it's the necessary prerequisite reading to motivate the lesson.

    > Having a per-request/connection arena isn't the only option. What I have seen/use, which is still zero copy (as far as IO zero copy can be in Rust without resorting to bytemuck/blittable types), is to have a pool of buffers of a specific length - typically page-sized by default and definitely page-aligned. These buffers can come from a single large contiguous allocation. If you run out of space in a buffer you grab a new/reused one from the pool, add it to your vec of buffers, and carry on. At the end of the story you would use vectored IO to submit all of them at once - all the way down to the NIC and everything.

    I think you're focusing too much on the word "arena" here, because AFAICT we're both describing the same concept.

    In your model (closer to the one used in actual switching), there's a single global buffer pool that all concurrent requests lease from; in my model, there's global heap memory, and then a per-thread/actor/buf-object elastic buffer pool that allocates from the global heap every once in a while, but otherwise reuses buffers internally.

    I would say that your model is probably the one used in most zero-copy networking frameworks like DPDK, while my model is probably the one used in most language runtimes — especially managed + garbage-collected runtimes, where contending over a global language-exposed pool can be more expensive than "allocating" (especially when the runtime has its own buffer pool and "allocation" rarely hits the kernel).

    But both models are essentially the same from the perspective of someone using the buffer ADT and trying to understand why it's designed the way it is, what it gets them, etc. :)

    > it's really easy to fragment a 32-bit address space, so allocating jumbo buffers simply wasn't an option if you didn't want your server OOMing with 1 GB of available (but non-contiguous) memory.

    Maybe you're imagining something else here, but when I say "jumbo buffer", I don't mean custom buffers allocated on demand and right-sized to hold one message; rather, I'm speaking of something very closely resembling actual jumbo frames — i.e. another pre-allocated pool containing a smaller number of larger, fixed-size MTU-slot buffers.

    With this kind of jumbo-buffer pool, when your messages get big, you switch over from filling regular buffers to filling jumbo buffers — which holds off message fragmentation, but also means new messages go "out the door" a bit slower, maybe "platoon" a bit, and potentially overwhelm the recipient with each burst, etc. (which is why you don't just use the larger buffer pool as the only pool).

    But if your messages can be bigger than your set jumbo-buffer size, then there's nowhere to go from there; you still need to have a way to split messages across frames.

    (Luckily, in the case of `bytes`, splitting a message across frames just means the message now needs multiple iovec-list entries to submit, rather than implying a framing protocol / L2 message encoding with a continuation marker / sequence ID / etc.)
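    A minimal sketch of that two-pool scheme (sizes and names are illustrative, not taken from any particular framework):

    ```rust
    // Two pre-allocated pools: many regular slots, fewer jumbo slots.
    const REGULAR: usize = 2048;
    const JUMBO: usize = 9216; // on the order of an Ethernet jumbo frame

    struct Pools {
        regular: Vec<Vec<u8>>,
        jumbo: Vec<Vec<u8>>,
    }

    impl Pools {
        /// Lease a slot big enough for `len`, preferring the regular pool.
        /// `None` means the message exceeds even a jumbo slot and must be
        /// split across buffers (i.e. multiple iovec entries).
        fn lease(&mut self, len: usize) -> Option<Vec<u8>> {
            if len <= REGULAR {
                self.regular.pop()
            } else if len <= JUMBO {
                self.jumbo.pop()
            } else {
                None
            }
        }

        /// Return a drained slot to its pool rather than freeing it.
        fn release(&mut self, mut buf: Vec<u8>) {
            buf.clear();
            if buf.capacity() <= REGULAR {
                self.regular.push(buf);
            } else {
                self.jumbo.push(buf);
            }
        }
    }
    ```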

How does the bytes crate, or anyone else, offer zero-copy receive from kernel (as opposed to kernel-bypass) sockets?

As far as I know that is not possible: there's always a copy.

  • For network receive, I was assuming kernel-bypass sockets, not kernel sockets.

    `bytes` can give you "ring-buffer-like" one-copy kernel-socket receive by e.g. using a BufMut as the target for scheduling io_uring read/recv into.

    Also, RDMA is technically networking! (Though I think all the Rust RDMA libraries already provide ADTs that work like Buf/BufMut, rather than just saying "here's some network-shared memory, build your own ADT on top.")
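    For the io_uring case, a sketch of the one-copy pattern, assuming the tokio-uring crate's owned-buffer read API (`recv_one` is a made-up helper): the kernel copies the bytes once, into a buffer we own, and everything downstream is refcounted slicing.

    ```rust
    use bytes::Bytes;
    use tokio_uring::net::TcpStream;

    // The single copy happens when the kernel completes the io_uring
    // read into `buf`. After that, slicing is copy-free.
    async fn recv_one(stream: &TcpStream) -> std::io::Result<Bytes> {
        let buf = vec![0u8; 4096];
        // tokio-uring ops take buffer ownership and return it on completion.
        let (res, buf) = stream.read(buf).await;
        let n = res?;
        Ok(Bytes::from(buf).slice(0..n))
    }
    ```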

    • Thanks; you do explicitly mention kernel networking right below, regarding the send path:

      > before flinging them at the kernel as [vectors of] large, contiguous DMA requests, without having to stop to allocate

      So I had assumed you were talking about kernel networking elsewhere as well.

      BTW, on the kernel send path, there is again a copy, contiguous or not, regardless of what API you use.

      When using kernel networking I don't think contiguity matters in the way you suggest, as there is always a copy. Furthermore, "contiguous" in userspace doesn't correspond to contiguous in physical address space, so the hardware is often just going to see a userspace buffer as a series of discontiguous pages anyway: that's what happens with direct IO disk writes, which _are_ zero copy (huge pages help).