Comment by woodruffw

2 months ago

I think the framing in the post is that it's specific to Rust, relative to what Python packaging tools are otherwise written in (Python). It's not very easy to do zero-copy deserialization in pure Python, from experience.

(But also, I think Rust can fairly claim that it's made zero-copy deserialization a lot easier and safer.)

11 comments

woodruffw

stefan_ 2 months ago

I suppose it can fairly claim that now every other library and blog post invokes "zero-copy" this and that, even in the most nonsensical scenarios. It's a technique for when you can literally not afford the memory bandwidth, because you are trying to saturate a 100Gbps NIC or handling 8k 60Hz video, not for compromising your data serialization schemes portability for marketing purposes while all applications hit the network first, disk second and memory bandwidth never.

vlovich123 2 months ago

You’ve got this backward. The vast majority of time due to spatial and temporal locality, in practice for any application you’re actually usually doing CPU registers first, cache second, memory third, disk fourth, network cache fifth, and network origin sixth. So this stuff does actually matter for performance.
Also, aside from memory bandwidth, there’s a latency cost inherent in traversing object graphs - 0 copy techniques ensure you traverse that graph minimally, just what’s needed to actually be accessed which is huge when you scale up. There’s a difference between one network request and fetching 1 MB vs making 100 requests to fetch 10kib and this difference also appears in memory access patterns unless they’re absorbed by your cache (not guaranteed for object graph traversal that a package manager would be doing).
woodruffw 2 months ago

Many of the hot paths in uv involve an entirely locally cached set of distributions that need to be loaded into memory, very lightly touched/filtered, and then sunk to disk somewhere else. In those contexts, there are measurable benefits to not transforming your representation.
(I'm agnostic on whether zero-copy "matters" in every single context. If there's no complexity cost, which is what Rust's abstractions often provide, then it doesn't really hurt.)
zahlman 2 months ago

The point is that the packaging tool can analyze files from within the archives it downloads, without writing them to disk.

zahlman 2 months ago

I can't even imagine what "safety" issue you have in mind. Given that "zero-copy" apparently means "in-memory" (a deserialized version of the data necessarily cannot be the same object as the original data), that's not even difficult to do with the Python standard library. For example, `zipfile.ZipFile` has a convenience method to write to file, but writing to in-memory data is as easy as

  with zipfile.ZipFile(archive_name) as a:
      with a.open(file_name) as f, io.BytesIO() as b:
          b.write(f.read())
          return b.getvalue()

(That does, of course, copy data around within memory, but.)

woodruffw 2 months ago
> Given that "zero-copy" apparently means "in-memory" (a deserialized version of the data necessarily cannot be the same object as the original data), that's not even difficult to do with the Python standard library
This is not what zero-copy means. Here's a working definition[1].
Specifically, it's not just about keeping things in memory; copying in memory is normal. The goal is to not make copies (or more precisely, what Rust would call "clones"), but to instead convey the original representation/views of that representation through the program's lifecycle where feasible.
> a deserialized version of the data necessarily cannot be the same object as the original data
rust-asn1 would be an example of a Rust library that doesn't make any copies of data unless you explicitly ask it to. When you load e.g. a Utf8String[2] in rust-asn1, you get a view into the original input buffer, not an intermediate owning object created from that buffer.
> (That does, of course, copy data around within memory, but.)
Yes, that's what makes it not zero-copy.
[1]: https://rkyv.org/zero-copy-deserialization.html
[2]: https://docs.rs/asn1/latest/asn1/struct.Utf8String.html
- zahlman 2 months ago
  
  > Yes, that's what makes it not zero-copy.
  Yeah, so you'd have to pass around the `BytesIO` instead.
  I know that zero-copy doesn't ordinarily mean what I described, but that seemed to be how TFA was using it, based on the logic in the rest of the sentence.
  
  3 replies →
SpaceNugget 2 months ago
As a quick and kind of oversimplified example of what zero copy means, imagine you read the following json string from a file/the network/whatever:
json = '{"user":"nugget"}' // from somewhere
A simple way to extract json["user"] to a new variable would be to copy the bytes. In pythony/c pseudo code
let user = allocate_string(6 characters) for i in range(0, 6) user[i] = json["user"][i] // user is now the string "nugget"
instead, a zero copy strategy would be to create a string pointer to the address of json offset by 9, and with a length of 6.
{"user":"nugget"} ^ ]end
The reason this can be tricky in C is that when you call free(json), since user is a pointer to the same string that was json, you have effectively done free(user) as well.
So if you use user after calling free(json), You have written a classic _memory safety_ bug called a "use after free" or UAF. Search around a bit for the insane number of use after free bugs there have been in popular software and the havoc they have wreaked.
In rust, when you create a variable referencing the memory of another (user pointing into json) it keeps track of that (as a "borrow", so that's what the borrow checker does if you have read about that) and won't compile if json is freed while you still have access to user. That's the main memory safety issue involved with zero-copy deserialization techniques.