Comment by ComputerGuru

3 days ago

Rust is missing an abstraction over non-contiguous chunks of contiguous allocations of data that would make handling ropes seamless and more natural even for smaller sizes.

C# has the concept of “Sequences”, which is basically a generalization of a deque, with associated classes and APIs such as ReadOnlySequence and SequenceReader that encourage reduced allocations, reuse of existing buffers/slices (even for composition), etc.

Knowing the rust community, I wouldn’t be surprised if there’s already an RFC for something like this.
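
For illustration, here's roughly what such an abstraction could look like in Rust. Everything below is made up for the sake of the example (there's no `ChunkReader` in std or, as far as I know, in any RFC): a reader over a sequence of borrowed chunks that pulls data straight out of the existing buffers instead of concatenating them.

    use std::io::{self, Read};

    /// Hypothetical stand-in for a ReadOnlySequence/SequenceReader-style type:
    /// a reader over non-contiguous chunks, each of which is its own
    /// contiguous allocation, borrowed rather than copied into one buffer.
    struct ChunkReader<'a> {
        chunks: &'a [&'a [u8]], // the borrowed chunks, in order
        chunk: usize,           // which chunk we're currently reading
        offset: usize,          // how far into that chunk we are
    }

    impl<'a> ChunkReader<'a> {
        fn new(chunks: &'a [&'a [u8]]) -> Self {
            Self { chunks, chunk: 0, offset: 0 }
        }
    }

    impl Read for ChunkReader<'_> {
        fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
            // Skip over any chunks we've fully consumed.
            while self.chunk < self.chunks.len() && self.offset == self.chunks[self.chunk].len() {
                self.chunk += 1;
                self.offset = 0;
            }
            if self.chunk == self.chunks.len() {
                return Ok(0); // end of the sequence
            }
            let chunk = self.chunks[self.chunk];
            let n = buf.len().min(chunk.len() - self.offset);
            buf[..n].copy_from_slice(&chunk[self.offset..self.offset + n]);
            self.offset += n;
            Ok(n)
        }
    }

Composing existing buffers is then just `ChunkReader::new(&[header, body])`, and downstream code can read straight across the chunk boundary without anything being glued together first.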

I think you might be looking for the bytes crate, which is pretty widely used in networking code: https://docs.rs/bytes/latest/bytes/index.html

In general, this sort of structure is the kind of thing I'd expect to see in an external crate in Rust, not the standard library. So it's unlikely there are any RFCs, and more likely there are a few competing implementations lying around.

  • Bytes is essentially multiple slices over an (optimistically single) contiguous Arc'd buffer. It's basically the inverse of what the root comment is after (an array of buffers). It's a rather strange crate, because network IO doesn't actually need contiguous memory.

    std does actually have a vague version of what the root comment wants: https://doc.rust-lang.org/std/io/struct.IoSlice.html and its sibling IoSliceMut (slicing, appending, inserting, etc. are out of scope for both - so not usable for rope stuff)
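
    For what it's worth, this is the sort of thing IoSlice is for (assuming some `out: impl Write`, e.g. a TcpStream): several existing buffers handed to the OS in one writev-style call, with no copying or concatenation first.

        use std::io::{IoSlice, Result, Write};

        // Write a header and a body that live in separate buffers with one
        // vectored call. Note that write_vectored may write fewer bytes than
        // requested, so real code still has to loop over the remainder.
        fn send_parts<W: Write>(out: &mut W, header: &[u8], body: &[u8]) -> Result<usize> {
            out.write_vectored(&[IoSlice::new(header), IoSlice::new(body)])
        }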

    • > It's a rather strange crate because network IO doesn't actually need contiguous memory.

      Network IO doesn't need contiguous memory, no, but each side of the duplex kind of benefits from it in its own way:

      1. on receive, you can treat a contiguous received network datagram as its own little memory arena — write code that sends sliced references to the contents of the datagram to other threads to work with, where those references keep the datagram arena itself alive for as long as it's being worked with; and then drop the whole thing when the handling of the datagram is complete.

      (This is somewhat akin to the Erlang approach — where the received message is a globally-shared binary; it gets passed by refcount into an actor started just for handling that request; that actor is spawned with its own preallocated memory arena; into that arena, the actor spits any temporaries related to copying/munging the slices of the shared binary, without having to grow the arena; the actor quickly finishes and dies; the arena is deallocated without ever having had to GC, and the refcount of the shared binary goes to zero — unless non-copied slices of it were async-forwarded to other actors for further processing.)

      Also note that the whole premise here is zero-copy networking (as the bytes docs say: https://docs.rs/bytes/1.9.0/bytes/#bytes). The "message" being received here isn't a copy of the one from the network card, but literally the same physical wired memory the PHY sees as being part of its IO ring-buffer — just also mapped into your process's memory on (zero-copy) receive. If this data came chunked, you'd need to copy some of it to assemble those chunks into a contiguous string or data structure. But since it arrives contiguously, you can just slice it, and cast the resulting slice into whatever type you like.

      2. on send — presuming you're doing non-blocking IO — it's nice to once again have a preallocated arena into which you can write out byte-sequences before flinging them at the kernel as [vectors of] large, contiguous DMA requests, without having to stop to allocate. (This removes the CPU as a bottleneck from IO performance — think writev(2).)

      The ideal design here is that you allocate fixed-size refcounted buffers; fill them up until the next thing you want to write doesn't fit†; and then intentionally drop the current buffer, switching your write_arena reference to point to a freshly-allocated buffer; and repeat. Each buffer then lives until all its slice-references get consumed. This forms kind of a "memory-lifetime-managed buffer-persisted message queue" — with the backing buffers of your messages living until all the messages held in them get "ACKed" [i.e. dropped by the receiving threads.]

      Also, rather than having the buffers deallocate when you "use them up" — requiring you to allocate the next time you need a buffer — you can instead have the buffer's destructor release the memory it's holding into a buffer pool; and then have your next-buffer-please logic pull from that pool in preference to allocating. But then you'll want a higher-level "writable stream that is actually a mempool + current write_arena reference" type. (Hey, that's BufMut!) (There's a rough sketch of the basic arena-swapping part at the end of this comment.)

      † And at that point, when the next message doesn't fit, you do not split the message. That violates the whole premise of vectorizing the writes. Instead, you leave some of the buffer unused, and push the large message into a fresh buffer, so that the message will still correspond to a single vectorized-write element / io_uring call / DMA request / etc. If the message is so large it won't fit in your default buffer size, you allocate a buffer just for that one message, or better yet, you utilize a special second pool of larger fixed-size buffers. "Jumbo" buffers, as it were.

      (Get it yet? Networking hardware is also doing exactly what I'm describing here to pack and unpack your packets into frames. For a NIC or switch, the buffers are the [bodies of the] frames; a jumbo buffer is an Ethernet jumbo frame; and so on.)
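
      (If it helps make this concrete, here's a rough sketch of that arena-swapping part using the bytes crate. The names and the 64 KiB size are mine, not any crate's API; the point is just that each message gets frozen into a refcounted Bytes slice that keeps its arena alive, and an arena that can't fit the next message is retired rather than having the message split across buffers.)

          use bytes::{Bytes, BytesMut};

          const ARENA_SIZE: usize = 64 * 1024; // arbitrary fixed buffer size

          struct WriteArena {
              current: BytesMut, // the buffer currently being filled
          }

          impl WriteArena {
              fn new() -> Self {
                  Self { current: BytesMut::with_capacity(ARENA_SIZE) }
              }

              /// Copy `msg` into the current arena once, and hand back a zero-copy,
              /// refcounted view of it. That view keeps the arena's allocation alive
              /// until every outstanding message slice has been dropped.
              fn push(&mut self, msg: &[u8]) -> Bytes {
                  if msg.len() > self.current.capacity() - self.current.len() {
                      // The message doesn't fit: leave the tail of the old arena
                      // unused and start a fresh one, "jumbo"-sized if this one
                      // message demands it. The old arena's memory is freed once
                      // its outstanding Bytes handles are dropped.
                      self.current = BytesMut::with_capacity(ARENA_SIZE.max(msg.len()));
                  }
                  self.current.extend_from_slice(msg);
                  // Split off the just-written bytes; the frozen Bytes shares the
                  // arena's allocation rather than copying it again. These handles
                  // are what you'd feed to a vectored write.
                  self.current.split().freeze()
              }
          }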

    • Yah, I'd say Bytes' chief use is avoiding copies when dealing with distinct portions of (contiguous) buffers.

      It is not a tool for composing disparate pieces into one (while avoiding copies).
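
      For a concrete (toy) example of that first use, where the buffer and the ranges are just stand-ins:

          use bytes::Bytes;

          fn main() {
              // Several zero-copy views into ONE contiguous buffer.
              let buf = Bytes::from_static(b"HTTP/1.1 200 OK");
              let version = buf.slice(0..8); // "HTTP/1.1"
              let status = buf.slice(9..12); // "200"
              // Both are refcounted handles into buf's allocation; nothing is
              // copied, but they still describe pieces of one buffer rather than
              // composing many disparate buffers into one logical sequence.
              assert_eq!(&version[..], b"HTTP/1.1");
              assert_eq!(&status[..], b"200");
          }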

I wrote a UTF-8-capable (but also fully generic over element type) rope implementation in Rust last fall (edit: 2023), and the main issue I ran into was the lack of a suitable regex library capable of working across slice boundaries. With some finagling I did manage to get it to work with most/all of the other relevant iterator/reader traits, IIRC, and it benchmarked fairly well from a practical perspective, though it's not as fast as some of the other explicitly performance-focused implementations out there.

I'm afraid I might not have that much free time again for a long time, but maybe when I do, somebody will have solved the regex issue for me...

Hmm. It's similar to, but not quite, a `BufRead`? Maybe a `BufRead + Seek`. The slicing ability isn't really covered by those traits, but I think you could wrap a `BufRead + Seek` in something that effectively slices it.
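
Something like this, maybe (a minimal sketch; `slice_view` is a made-up name, not a std API):

    use std::io::{BufRead, Read, Result, Seek, SeekFrom};

    /// Present the byte range [start, start + len) of any seekable, buffered
    /// source as its own reader, without pulling it all into memory first.
    fn slice_view<R: BufRead + Seek>(mut src: R, start: u64, len: u64) -> Result<impl BufRead> {
        src.seek(SeekFrom::Start(start))?; // jump to the start of the "slice"
        Ok(src.take(len))                  // and stop reads at its end
    }

That gets you a read-only window, but not cheap re-slicing of the same source from multiple places at once, which is where it stops feeling like a real in-memory slice.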

A `BufRead + Seek` need not be backed by memory, though, except in the midst of being read. (A buffered normal file implements `BufRead + Seek`, for example.)

I feel like either Iterator or, in the rare case of requiring generic indexing, Index is more important than "it is composed of some number of linked memory allocations"?

A ReadOnlySequence seems to imply a linked list of memory sections, though; I'm not sure a good rope is going to be able to interface with that non-trivially, since the rope is a tree: walking the nodes in sequence is possible, but it's a tree walk, and something like ReadOnlySequenceSegment::Next() is then a bit tricky. (You could gather the set of nodes into an array ahead of time, but then merely building that is O(nodes), which is sad.)

(And while it might be tempting to say "have the leaf nodes be an LL", I don't think you want to, as it means that inserts need to adjust those links, and I think you would rather have mutations produce a cheaply made but entirely new tree, which I don't think permits an LL of the leaves. You want this to make undo/redo cheap: it's just "go back to the last rope", and then all the ropes share the underlying character data that's not changing from rope to rope. The rope in the OP seems to support this: "Cloning ropes is extremely cheap. Rope clones share data,")
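
To illustrate that last point, here's a bare-bones persistent rope sketch (made-up types, not the OP crate's API): concatenation builds one new node, old roots stay valid so undo is just holding onto the previous rope, and reading the text in order is a stack-based tree walk over the leaves rather than following next-pointers.

    use std::sync::Arc;

    // A persistent (immutable) rope: every edit builds a few new nodes and
    // shares the untouched subtrees and leaf text via Arc.
    enum Rope {
        Leaf(Arc<str>),
        Node { left: Arc<Rope>, right: Arc<Rope>, len: usize },
    }

    impl Rope {
        fn leaf(s: &str) -> Arc<Rope> {
            Arc::new(Rope::Leaf(Arc::from(s)))
        }

        fn len(&self) -> usize {
            match self {
                Rope::Leaf(s) => s.len(),
                Rope::Node { len, .. } => *len,
            }
        }

        /// Concatenation allocates exactly one new node; neither side is copied.
        fn concat(left: Arc<Rope>, right: Arc<Rope>) -> Arc<Rope> {
            let len = left.len() + right.len();
            Arc::new(Rope::Node { left, right, len })
        }

        /// Visiting the text front-to-back is a tree walk over the leaves
        /// (explicit stack here), not a linked-list "give me the next segment".
        fn for_each_leaf(&self, f: &mut impl FnMut(&str)) {
            let mut stack: Vec<&Rope> = vec![self];
            while let Some(node) = stack.pop() {
                match node {
                    Rope::Leaf(s) => f(s),
                    Rope::Node { left, right, .. } => {
                        stack.push(&**right); // pushed first, visited second
                        stack.push(&**left);
                    }
                }
            }
        }
    }

    fn main() {
        let v1 = Rope::concat(Rope::leaf("hello "), Rope::leaf("world"));
        // An "edit": v1 is untouched and still usable, i.e. cheap undo.
        let v2 = Rope::concat(Arc::clone(&v1), Rope::leaf("!"));

        let mut text = String::new();
        v2.for_each_leaf(&mut |s| text.push_str(s));
        assert_eq!(text, "hello world!");
        assert_eq!(v1.len(), 11); // the old version is still intact
    }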