Comment by jstimpfle

3 months ago

Having gone full-in on this approach before, with some good success, it still feels wrong to me today. Contiguous storage may work for reasonable numbers of elements, but it's potentially blocking a huge contiguous chunk of address space especially for large numbers of elements.

I probably say this because I still have to main 32-bit binaries (only 2G of address space), but it can potentially be problematic even on 64-bit machines (typically 256 TB of address space), especially if the data structure should be a reusable container with unknown number of instances. If you don't know a reasonable upper bound of elements beforehand, you have to reallocate later, or drastically over-reserve from the start. The former removes a pointer stability guarantee, the later is uneconomical, it may even be uneconomical on 64-bit depending on how many instances of the data structures you plan to have. And having to reallocate when overflowing the preallocated space makes operations less deterministic with regards to execution time.

1 comment

jstimpfle

josephg 3 months ago

> Having gone full-in on this approach before, with some good success, it still feels wrong to me today. Contiguous storage may work for reasonable numbers of elements, but it's potentially blocking a huge contiguous chunk of address space especially for large numbers of elements.

That makes sense. If my btree was gigabytes in size, I might rethink the approach for a number of reasons. But in my case, even for quite large input, the data structure never gets more than a few megabytes in size. Thats small enough that resizing the vec has a negligible performance impact.

It helps that my btree stores its contents using lossless internal run-length encoding. Eg, if I have values like this:

    {key: 5, value: 'a'}
    {key: 6, value: 'a'}
    {key: 7, value: 'a'}

Then I store them like this:

    {key: [5..8), value: 'a'}

In my use case, this compaction decreases the size of the data structure by about 20x. There's some overhead in joining and splitting values - but its easily worth it.