
Comment by zahlman

16 hours ago

I can't even imagine what "safety" issue you have in mind. Given that "zero-copy" apparently means "in-memory" (a deserialized version of the data necessarily cannot be the same object as the original data), that's not even difficult to do with the Python standard library. For example, `zipfile.ZipFile` has a convenience method to write to file, but writing to in-memory data is as easy as

  import io, zipfile

  def read_member(archive_name, file_name):
      with zipfile.ZipFile(archive_name) as a:
          with a.open(file_name) as f, io.BytesIO() as b:
              b.write(f.read())
              return b.getvalue()

(That does, of course, copy data around within memory, but.)

> Given that "zero-copy" apparently means "in-memory" (a deserialized version of the data necessarily cannot be the same object as the original data), that's not even difficult to do with the Python standard library

This is not what zero-copy means. Here's a working definition[1].

Specifically, it's not just about keeping things in memory; copying in memory is normal. The goal is to not make copies (or more precisely, what Rust would call "clones"), but to instead convey the original representation/views of that representation through the program's lifecycle where feasible.

> a deserialized version of the data necessarily cannot be the same object as the original data

rust-asn1 would be an example of a Rust library that doesn't make any copies of data unless you explicitly ask it to. When you load e.g. a Utf8String[2] in rust-asn1, you get a view into the original input buffer, not an intermediate owning object created from that buffer.

> (That does, of course, copy data around within memory, but.)

Yes, that's what makes it not zero-copy.
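
To make the distinction concrete on the Python side, here is a minimal sketch (my own illustration, not something from TFA) contrasting `BytesIO.read()`, which hands back a separate `bytes` object, with `BytesIO.getbuffer()`, which is documented to return a view over the contents without copying them. The buffer contents are made up for the example.

```python
import io

b = io.BytesIO(b"hello world")

# read() returns an immutable bytes object holding its own copy
# of the first five bytes.
chunk = b.read(5)

# getbuffer() exposes the BytesIO's contents as a writable
# memoryview instead of copying them out.
view = b.getbuffer()
view[0:5] = b"HELLO"      # writes through to the BytesIO's storage

print(chunk)              # the earlier copy is unaffected
print(b.getvalue())       # the BytesIO sees the write

view.release()            # drop the view so the BytesIO can be resized again
```

The copy and the view diverge after the write, which is the practical difference between "in-memory" and "zero-copy".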

[1]: https://rkyv.org/zero-copy-deserialization.html

[2]: https://docs.rs/asn1/latest/asn1/struct.Utf8String.html

  • > Yes, that's what makes it not zero-copy.

    Yeah, so you'd have to pass around the `BytesIO` instead.

    I know that zero-copy doesn't ordinarily mean what I described, but that seemed to be how TFA was using it, based on the logic in the rest of the sentence.

    • > Yeah, so you'd have to pass around the `BytesIO` instead.

      That wouldn’t be zero-copy either: BytesIO is an I/O abstraction over a buffer, so it intentionally masks the “lifetime” of the original buffer. In effect, reading from the BytesIO creates new copies of the underlying data by design, in new `bytes` objects.

      (This is actually a great capsule example of why zero-copy design is difficult in Python: the Pythonic thing to do is to make lots of bytes/string/rich objects as you parse, each of which owns its data, which in turn means copies everywhere.)


As a quick and somewhat oversimplified example of what zero-copy means, imagine you read the following JSON string from a file/the network/whatever:

    json = b'{"user":"nugget"}'  # from somewhere

A simple way to extract json["user"] to a new variable would be to copy the bytes. In Python:

    user = bytearray(6)            # allocate room for 6 bytes
    for i in range(6):
        user[i] = json[9 + i]      # the value starts at offset 9
    # user is now bytearray(b"nugget")

Instead, a zero-copy strategy would be to create a string pointer to the address of json offset by 9, with a length of 6.

    {"user":"nugget"}
             ^     ]end
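
The closest Python analogue to that pointer-plus-length idea is probably a `memoryview` slice. A rough sketch, assuming the data sits in a `bytearray` (the names `buf`, `user_copy` and `user_view` are just for illustration):

```python
# Stand-in for data read from a file or socket.
buf = bytearray(b'{"user":"nugget"}')

# Copying approach: this allocates a new bytes object with its own storage.
user_copy = bytes(buf[9:15])

# Zero-copy approach: the sliced memoryview only records an offset (9)
# and a length (6) and refers back to buf's storage.
user_view = memoryview(buf)[9:15]

# Changing the original shows up through the view but not in the copy.
buf[9] = ord("N")
print(user_copy)
print(bytes(user_view))   # bytes() here copies, only so we can print it
```

The view seeing the modified byte is the evidence that it shares storage with `buf` rather than owning a copy.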

The reason this can be tricky in C is that user points into the same allocation as json, so when you call free(json) you have effectively freed the memory behind user as well.

So if you use user after calling free(json), you have written a classic _memory safety_ bug called a "use after free" (UAF). Search around a bit for the insane number of use-after-free bugs there have been in popular software and the havoc they have wreaked.

In Rust, when you create a variable that references the memory of another (user pointing into json), the compiler tracks that reference as a "borrow" (this is what the borrow checker does, if you have read about that), and the program won't compile if json could be freed while user is still usable. That's the main memory safety issue involved with zero-copy deserialization techniques.
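
Python has no borrow checker, but as a loose runtime analogue (my own illustration, and only a sketch): a `bytearray` refuses to be resized while a `memoryview` into it is still alive, because resizing could move the storage out from under the view. The check happens when the code runs rather than at compile time.

```python
buf = bytearray(b'{"user":"nugget"}')
whole = memoryview(buf)
user = whole[9:15]          # view into buf, no copy

try:
    buf.extend(b"!")        # growing buf could relocate its storage
    blocked = False
except BufferError:
    blocked = True          # refused while views are outstanding

print(blocked)

# Once every view is released, resizing is allowed again.
user.release()
whole.release()
buf.extend(b"!")
```

The difference from Rust is when you find out: here the mistake surfaces as an exception at runtime, whereas the borrow checker rejects the program before it ever runs.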