Comment by derefr

3 days ago

> I'm very well versed in this domain.

Apologies, I wasn't really responding to you directly; I was just taking the opportunity to write an educational-blog-post-as-comment aimed at the average HN reader (who has likely never considered what an Ethernet frame even is, or how a device that uses what are essentially DSPs does TDM packet scheduling) — with your comment being the parent because it's the necessary prerequisite reading to motivate the lesson.

> Having a per-request/connection arena isn't the only option. What I have seen/use, which is still zero copy (as far as IO zero copy can be in Rust without resorting to bytemuck/blittable types), is to have a pool of buffers of a specific length - typically page-sized by default and definitely page-aligned. These buffers can come from a single large contiguous allocation. If you run out of space in a buffer you grab a new/reused one from the pool, add it to your vec of buffers, and carry on. At the end of the story you would use vectored IO to submit all of them at once - all the way down to the NIC and everything.

I think you're focusing too much on the word "arena" here, because AFAICT we're both describing the same concept.

In your model (closer to the one used in actual switching), there's a single global buffer pool that all concurrent requests lease from; in my model, there's global heap memory, and then a per-thread/actor/buf-object elastic buffer pool that allocates from the global heap every once in a while, but otherwise reuses buffers internally.
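
To make that concrete, here's a minimal Rust sketch of the per-thread "elastic pool" shape (names are hypothetical, and the page-alignment / single-contiguous-backing-allocation details from your version are ignored for brevity): buffers are leased from the pool, handed back when the request is done, and the heap is only touched when the pool runs dry.

```rust
// Minimal sketch (hypothetical names) of the per-thread "elastic pool":
// lease page-sized buffers, reuse the ones that come back, and only touch
// the heap when the pool runs dry.

const BUF_SIZE: usize = 4096; // one page

struct BufferPool {
    free: Vec<Vec<u8>>, // buffers waiting to be reused
}

impl BufferPool {
    fn new() -> Self {
        Self { free: Vec::new() }
    }

    /// Lease a buffer: reuse one if available, otherwise allocate.
    fn lease(&mut self) -> Vec<u8> {
        self.free
            .pop()
            .unwrap_or_else(|| Vec::with_capacity(BUF_SIZE))
    }

    /// Hand a buffer back so later requests can reuse its allocation.
    fn give_back(&mut self, mut buf: Vec<u8>) {
        buf.clear(); // drop the contents, keep the capacity
        self.free.push(buf);
    }
}

fn main() {
    let mut pool = BufferPool::new();

    // A message that outgrows one slot just spreads across several leased
    // buffers; at the end they'd all be submitted together with vectored IO.
    let message = vec![0xABu8; 3 * BUF_SIZE + 100];
    let mut segments: Vec<Vec<u8>> = Vec::new();
    for chunk in message.chunks(BUF_SIZE) {
        let mut buf = pool.lease();
        buf.extend_from_slice(chunk);
        segments.push(buf);
    }

    // After the write completes, recycle the buffers instead of freeing them.
    for buf in segments {
        pool.give_back(buf);
    }
    println!("pool now holds {} reusable buffers", pool.free.len());
}
```

The shared-pool variant is the same ADT; the only real difference is whether that `BufferPool` is owned per thread/actor or shared (and therefore contended) across all of them.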

I would say that your model is probably the one used in most zero-copy networking frameworks like DPDK, while my model is probably the one used in most language runtimes, especially managed, garbage-collected runtimes, where contending over a global language-exposed pool can be more expensive than "allocating" (especially when the runtime has its own buffer pool and "allocation" rarely hits the kernel).

But both models are essentially the same from the perspective of someone using the buffer ADT and trying to understand why it's designed the way it is, what it gets them, etc. :)

> it's really easy to fragment 32bit address space, so allocating jumbo buffers simply wasn't an option if you didn't want your server OOMing with 1GB of available (but non-contiguous) memory.

Maybe you're imagining something else here, but when I say "jumbo buffer", I don't mean custom buffers allocated on demand and right-sized to hold one message; rather, I'm speaking of something very closely resembling actual jumbo frames — i.e. another pre-allocated pool containing a smaller number of larger, fixed-size MTU-slot buffers.

With this kind of jumbo-buffer pool, when your messages get big, you switch over from filling regular buffers to filling jumbo buffers. That holds off message fragmentation, but it also means new messages go "out the door" a bit slower, maybe "platoon" a bit and potentially overwhelm the recipient with each burst, etc. (which is why you don't just use the larger buffer pool as the only pool).
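As a very rough sketch of that switch-over (the slot sizes and the threshold here are made up, not from any real stack):

```rust
// Made-up sizes, just to show the shape of the switch-over: small messages
// go into regular page-sized slots, big ones into jumbo slots.

const REGULAR_SIZE: usize = 4 * 1024; // ordinary slot
const JUMBO_SIZE: usize = 64 * 1024;  // jumbo slot

/// Pick which pool's slot size a message of `len` bytes should be written into.
fn pick_slot_size(len: usize) -> usize {
    if len <= REGULAR_SIZE {
        REGULAR_SIZE
    } else {
        // Big message: fill jumbo slots instead of fragmenting across many
        // regular ones.
        JUMBO_SIZE
    }
}

fn main() {
    for len in [512usize, 9_000, 200_000] {
        let slot = pick_slot_size(len);
        let frames = len.div_ceil(slot);
        println!("{len} B message -> {frames} slot(s) of {slot} B");
    }
}
```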

But if your messages can be bigger than your set jumbo-buffer size, then there's nowhere to go from there; you still need to have a way to split messages across frames.

(Luckily, in the case of `bytes`, splitting a message across frames just means the message now needs multiple iovec-list entries to submit, rather than implying a framing protocol / L2 message encoding with a continuation marker / sequence ID / etc.)
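A minimal sketch of that, assuming the `bytes` crate as a dependency (the segment sizes are arbitrary): each "frame" is just another refcounted view into the same message, and the chained segments become iovec entries for a single vectored write, with no continuation markers or sequence IDs anywhere in the data.

```rust
// Sketch: split a message into cheap Bytes views, then gather them into
// iovec entries for one vectored write.

use bytes::{Buf, Bytes};
use std::io::{IoSlice, Write};

fn main() -> std::io::Result<()> {
    // A message that outgrew a single buffer slot.
    let message = Bytes::from(vec![0x42u8; 10_000]);

    // "Splitting across frames": slicing a Bytes is zero-copy; each slice is
    // a refcount bump pointing into the same allocation.
    let seg_a = message.slice(0..4096);
    let seg_b = message.slice(4096..8192);
    let seg_c = message.slice(8192..);

    // Chain the segments into one logical Buf, still without copying.
    let chained = seg_a.chain(seg_b).chain(seg_c);

    // Gather the segments as iovec-style entries...
    let mut iovecs = [IoSlice::new(&[]); 8];
    let n = chained.chunks_vectored(&mut iovecs);

    // ...and submit them in one vectored write (a socket in practice; a sink
    // here so the example stays self-contained).
    let written = std::io::sink().write_vectored(&iovecs[..n])?;
    println!("submitted {n} iovecs, wrote {written} bytes");
    Ok(())
}
```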