Comment by nullc

6 years ago

> but the real minimum is simply log2(M)

The minimum entropy to code such a thing is log2(MN choose N)/N bits per element.

log2(eM) is just an approximation of this, valid when N is large.

You cannot code N uniform choices out of MN options with fewer than log2(MN choose N) bits on average, by the pigeonhole principle.
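
A quick numeric check (my own sketch; M and the N values below are just illustrative):

```python
# Compare the exact per-element minimum, log2(C(M*N, N))/N, with the
# large-N approximation log2(e*M). Parameters are illustrative.
from math import comb, log2, e

M = 2 ** 20
for N in (1, 10, 100, 1000):
    exact = log2(comb(M * N, N)) / N   # pigeonhole lower bound, bits/element
    print(N, round(exact, 3), round(log2(e * M), 3))
```

The exact bound starts at log2(M) for N = 1 and climbs toward log2(eM) as N grows.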

> the unary terminator alone

The terminator carries information. If M is chosen well the terminator carries almost one bit of information.
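
As a sketch of that point (binary-entropy arithmetic, mine rather than from the thread): a terminator bit that fires with probability p carries H(p) bits, which is close to 1 whenever the Golomb/Rice parameter puts p near 1/2.

```python
# Binary entropy of the unary continue/terminate bit: it carries nearly
# a full bit of information when the stop probability is near 1/2.
from math import log2

def binary_entropy(p):
    return -p * log2(p) - (1 - p) * log2(1 - p)

for p in (0.5, 0.55, 0.6):
    print(p, round(binary_entropy(p), 3))   # 1.0, 0.993, 0.971
```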

The minimum for approximate membership testers is log2((M choose N)/((N + fp_rate * (M - N)) choose N))/N, assuming M >> N. The asymptotic is log2(1/fp_rate). For simplicity's sake below I pretend we're dealing with a set of size 1, and so 1/fp_rate = M.
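
To make that formula concrete (a sketch with assumed parameters; the M, N, and fp_rate below are mine, not from the thread):

```python
# Per-element lower bound for an approximate membership tester:
# log2((M choose N) / ((N + fp_rate*(M-N)) choose N)) / N,
# compared against the asymptotic log2(1/fp_rate).
from math import comb, log2

M, N, fp_rate = 2 ** 20, 100, 2 ** -10
accepted = N + round(fp_rate * (M - N))   # set size the filter may accept
bound = (log2(comb(M, N)) - log2(comb(accepted, N))) / N
print(round(bound, 2), log2(1 / fp_rate))   # ~9.94 vs 10.0 bits per element
```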

I thought you had this in mind in the first paragraph with "1 / log(2) (around 1.44) times more compact than Bloom filters". However, using your definition of the minimum, it isn't possible in practice to be anywhere near 1.44 times more compact than Bloom filters.

Using your example of M=2^20, log2(M) * log2(e) / log2(eM) = 1.34. Personally I've never spent more than 45 bits per item on such structures, but even at 64 bits per item (i.e. log2(eM) = 64) the ratio only reaches 1.41.

The oft-cited 1.44x Bloom filter tax comes from dividing its per-item cost of log2(M) * log2(e) by the information-theoretic minimum of log2(M), leaving just the log2(e) multiplier.
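
Checking the numbers in the last two paragraphs (a sketch; the 64-bit case just solves log2(eM) = 64 for log2(M)):

```python
# Ratio of the Bloom per-item cost, log2(M)*log2(e), to the bit-set
# minimum log2(eM), versus to the membership-tester minimum log2(M).
from math import log2, e

M = 2 ** 20
bloom = log2(M) * log2(e)
print(bloom / log2(e * M))   # ~1.346: the "1.34" ratio against log2(eM)
print(bloom / log2(M))       # ~1.443: the usual log2(e) Bloom filter tax

# Even at 64 bits per item, log2(eM) = 64, so log2(M) = 64 - log2(e):
print((64 - log2(e)) * log2(e) / 64)   # ~1.41
```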

  • I think I was caught up thinking the discussion was all about the efficiency of the GCS coding for the underlying bit-set, which it is indeed nearly optimal for (but only for some parameters, and those don't happen to be where the fp rate is 1/2^k, which unfortunately is the case usually considered in papers!). Sorry about that.

    The Bloom filter costs 1.44 times the lower bound. The difference between an optimally coded bit-set and the asymptotic lower bound is an additive 1.44 bits per element. This is a pretty big difference! (See the quick check below.)
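
For that additive gap, a one-line check (the M here is illustrative; the gap is independent of M):

```python
# Gap between the optimally coded bit-set cost, log2(eM), and the
# asymptotic membership-tester bound, log2(M): exactly log2(e) bits.
from math import log2, e

print(log2(e * 2 ** 20) - log2(2 ** 20))   # ~1.4427 bits per element
```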