Comment by nullc

6 years ago

The paper claims that it is more space efficient than golomb coded sequences, but this is clearly not the case at least when the golomb parameters and FP rate are set optimally[1] a GCS can be well within a fraction of 1% of the information theoretic limit.

The main limitation of GCS in terms of communications efficiency is just that there aren't that many optimal rate options. Though arithmetic coding the differences isn't really that much more complex to implement and it's optimal for whatever parameters you want.

Now, if it wanted to talk about _access_ efficiency, then sure, GCS don't result in an data structure thats efficient for random access.

[1] https://gist.github.com/sipa/576d5f09c3b86c3b1b75598d799fc84...

(Co-author here) For GCS, I also found the Golomb-Rice parameter Q-1 to be better, this is what we used. According to my calculation, the overhead of GCS is around 1.5 bit/key. See also https://github.com/0xcb/Golomb-coded-map . Sure, with arithmetic coding, or ANS coding, it should be lower; it would be interesting to see how much. I kind of like GCS.

If I were coding a static approximate membership query structure, I wouldn't use either bloom filters or GCS, I'd use an Elias-Fano-coded sequence of fingerprints. That has nearly the same compression efficiency as Golomb coding but can be efficiently queried with no decompression.

  • Interesting idea! Elias-Fano monotone sequences unfortunately do need a bit more space than just a list of Rice codes. What is possible, without having to increase the space usage, is to re-arrange the Rice codes in each bucket, so that all variable parts come first, and the fixed parts at the end. That way, lookup speed would be much, much faster for large buckets. We used this for RecSplit: https://arxiv.org/abs/1910.06416 I will try that out for GCS as well!