
Comment by gcr

2 years ago

Why does this conclusion follow?

Of course similar text compresses more efficiently, but NNs don’t work with compressed (varying-size) representations; they work with vector representations that happen to be close in similarity space.

They do work with compressed representations: you take arbitrary information with varying entropy and map it into a fixed-size vector representation; that's a compression.
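A toy sketch of that claim (not any particular model; the bucketing scheme here is purely an illustration): map arbitrarily long text into a fixed number of buckets, so the output size stays constant no matter how much text goes in, and information is necessarily discarded for long inputs.

    import hashlib

    DIM = 8  # fixed output size, regardless of input length

    def embed(text):
        """Crude fixed-size 'representation' of variable-length text:
        hash each token into one of DIM buckets and count occurrences."""
        vec = [0.0] * DIM
        for token in text.split():
            h = int(hashlib.md5(token.encode()).hexdigest(), 16)
            vec[h % DIM] += 1.0
        return vec

    print(len(embed("a short sentence")))           # 8
    print(len(embed("a much longer text " * 1000))) # still 8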

  • That's like saying hashing is compression because the output is always x bits. You see what I mean, right?

    • Hashing is literally compression (and is considered so); it's just not easily decompressible.

      It's also often lossy, though it depends on your input space.

      For example, if you accept all text of all length, then it's lossy for sure. But if you were to accept something like "the text of any book", then it's easy to make it non-lossy.

      There's only like 140-150 million books in the world, at a rough estimate, so you could easily losslessly compress any existing book in the world down to a few bytes (there's a rough sketch of the arithmetic after this thread). Even if you multiply by all variants and translations, it would still likely hash to less than 100 bytes. But you still have to store the table of books somewhere at least once, and it wouldn't be able to compress new books :P

    • I'm not sure I see the distinction. Hashing is compression because the output is fixed size. It's mapping to a codebook that could be used to try to map back to the original space. It's incredibly lossy by design, but it is a type of (bad) compression.

    • > hashing is compression because the output is always x bits

      Well, in certain specific scenarios, a perfect hash (a minimal perfect hash function, MPHF) can be used as a "compression" method; a toy version is sketched below.
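A rough sketch of the "table of books" arithmetic above (the book count is the commenter's estimate; the codebook here is a hypothetical stand-in): indexing ~150 million titles takes about 28 bits, i.e. 4 bytes per book, but the shared table still has to be stored somewhere and unseen books can't be encoded.

    import math

    NUM_BOOKS = 150_000_000  # rough estimate from the thread
    bits = math.ceil(math.log2(NUM_BOOKS))
    print(bits, "bits, about", math.ceil(bits / 8), "bytes per book")  # 28 bits, about 4 bytes

    # Hypothetical codebook: the table both sides must already share.
    codebook = ["Moby-Dick", "Don Quixote", "War and Peace"]

    def compress(book_text):
        return codebook.index(book_text)   # raises ValueError for books not in the table

    def decompress(index):
        return codebook[index]             # lossless, given the same table

    assert decompress(compress("Don Quixote")) == "Don Quixote"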
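And a minimal sketch of the MPHF point, assuming a small fixed key set (the brute-force salt search is just for illustration; real MPHF constructions are far cleverer): find a salt under which every key lands in a distinct slot of a table exactly the size of the key set, so storing only the salt and the table size maps each known key to a unique small integer.

    import hashlib

    def slot(key, salt, n):
        digest = hashlib.sha256(f"{salt}:{key}".encode()).hexdigest()
        return int(digest, 16) % n

    def find_salt(keys):
        """Brute-force a salt that makes `slot` a minimal perfect hash
        for `keys`: every key gets a distinct slot in range(len(keys))."""
        n = len(keys)
        for salt in range(1_000_000):
            if len({slot(k, salt, n) for k in keys}) == n:
                return salt
        raise RuntimeError("no salt found; widen the search range")

    keys = ["moby dick", "don quixote", "war and peace", "ulysses"]
    salt = find_salt(keys)
    print({k: slot(k, salt, len(keys)) for k in keys})  # four distinct slots in 0..3

Keys outside the original set still collide with known ones, which is where the "compression" reading stops being lossless.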