Comment by rtheunissen

2 years ago

That is true, but in a persistent setting B-trees likely also copy more data, and iterator invalidation might be a concern in some cases when values move around within a B-tree node. The motivation was not really to come up with the best tree structure or to even consider memory hierarchy at all.

Instead, within just the scope of binary search trees specifically, what cool things might we uncover by taking a closer look? In a hypothetical future, our hardware might be so different and caches so large and advanced that the best practices of today might not apply all the same.

Binary search trees today are probably only viable for absolutely massive collections, and at that scale there is almost certainly a better, bespoke solution available. Exploring in the abstract, without necessarily thinking about the hardware, has been a fun sandbox to play in. The fact that we still teach binary search trees suggests that they provide value beyond practical application.

Thank you for your feedback; you are absolutely right. In most practical use cases today, a B-tree would likely achieve better results.

> caches so large and advanced that the best practices of today might not apply all the same.

All trends are going in the wrong direction for binary trees to ever become useful again, and some of the limits involved are physics, not design.

- Branch predictors hate binary trees.

- CPU pipelines hate pointer chasing (see the sketch just after this list).

- Larger caches work less efficiently with binary trees than with alternatives such as "cache-oblivious" algorithms.

- The number of operations wasted per cache miss due to memory latency has been going up, not down.

Etc...
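
To make the first two points concrete, here is a minimal sketch (my own illustration, not code from this thread) contrasting a lookup in a pointer-based binary search tree with a lookup over the same keys stored in an implicit Eytzinger (BFS-order) array, the kind of layout cache-conscious search structures tend to use:

```c
#include <stddef.h>

/* Node-based BST: every level costs a dependent pointer load to a
 * potentially far-away address, plus a hard-to-predict branch. */
struct node {
    int key;
    struct node *left, *right;
};

int bst_contains(const struct node *n, int key) {
    while (n) {
        if (key == n->key) return 1;
        n = (key < n->key) ? n->left : n->right;
    }
    return 0;
}

/* Same keys in an implicit Eytzinger (BFS-order) array: a[1] is the root
 * and the children of a[i] live at a[2*i] and a[2*i + 1]. Child positions
 * are computed rather than loaded, there is no per-node pointer overhead,
 * and the top few levels of the tree share a handful of cache lines.
 * (Building the layout from sorted input is omitted here.) */
int eytzinger_contains(const int *a, size_t n, int key) {
    size_t i = 1;                  /* 1-indexed root            */
    while (i <= n) {
        if (a[i] == key) return 1;
        i = 2 * i + (a[i] < key);  /* left = 2i, right = 2i + 1 */
    }
    return 0;
}
```

Neither version changes the number of comparisons; the difference the bullets point at is entirely in how those comparisons interact with caches, prefetchers, and branch predictors.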

> even consider memory hierarchy at all.

There's an often-overlooked physics aspect here: physical memory requires physical space. As you add "N" bytes of memory, the layout of the circuits needs to expand into space. If deployed in a flat plane, as is typically done, this is a shape with radius roughly proportional to √N. For a space-filling volumetric design, the radius is at best proportional to ∛N.

Access to a piece of data requires a round-trip that takes time proportional to the distance taken through the computer. This is a simple consequence of the speed of light.

This means that unpredictable or random memory access time scales as ∛N to √N, not the constant "1" as typically assumed in big O analysis!

This is inescapable for computers made from physical components instead of pencil & paper abstractions.

If you add in this additional term, a lot of algorithms traditionally considered to be efficient suddenly look much worse. Conversely, algorithms preferred by developers with a knack for performance tuning don't get such a big penalty.
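
As a rough, made-up illustration of that penalty, the sketch below compares the classic RAM-model cost of a balanced binary search tree lookup (log₂ N unit-cost accesses) with the same lookup when every random access is charged ~√N "distance units", and with a hypothetical B-tree of branching factor 64 under the same charge:

```c
/* Toy cost model under the distance argument above: one random access
 * costs ~sqrt(N) instead of the constant 1 assumed by the RAM model.
 * A balanced BST does ~log2(N) such accesses; a B-tree with branching
 * factor B does ~log_B(N). The constants are invented; only the shape
 * of the growth matters. Compile with: cc -O2 cost.c -lm */
#include <math.h>
#include <stdio.h>

int main(void) {
    const double B = 64.0; /* assumed B-tree branching factor */
    printf("%10s %14s %22s %24s\n",
           "N", "RAM: log2 N", "BST: log2 N * sqrt N", "B-tree: logB N * sqrt N");
    for (double n = 1e3; n <= 1e12; n *= 1e3) {
        double ram   = log2(n);
        double bst   = log2(n) * sqrt(n);
        double btree = (log(n) / log(B)) * sqrt(n);
        printf("%10.0e %14.1f %22.3e %24.3e\n", n, ram, bst, btree);
    }
    return 0;
}
```

Between N = 10³ and N = 10¹² the RAM-model column grows by a factor of about four, while the distance-adjusted columns grow by roughly five orders of magnitude; the B-tree beats the binary tree only by the constant factor log₂ 64 = 6, but the √N term dominates both.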

  • What if some future technology or material breakthrough yields a sort of self-adjusting liquid memory with true constant-time access to any address? I'm not being entirely serious, of course, as I dream about sequences across nodes on planets in other solar systems.

    Focusing on fundamental algorithms in the abstract provides a fun playground to explore, learn, and teach, before you learn about the memory hierarchy and all your hopes and dreams of the ideal data structure fade away.

    I don't think any time is wasted exploring the fundamentals. Who knows what technology might see a renaissance in the future as hardware continues to change: analog computers, binary search trees, who knows.

    It's fun to dream and take a break from reality sometimes, digging deep into a simple concept with a rich design space and complex analysis.

  • I would not disregard that direction. Architectures that do well on pointer chasing have been attempted in the past (e.g. Cray XMT) and are cropping up in prototypes (PIUMA https://arxiv.org/abs/2010.06277) and startups. The caveat is that latency is hidden with massive amounts of parallelism.

    edit: To be clear, I agree with the sentiment that on-paper abstract machine models are inadequate in most practical cases. There are workloads where you're dealing with asymptotically large datasets, such as genome assembly and analysis. Even there, the RAM model abstracts away the bottlenecks of modern computer systems.

  • Interesting idea. I'm not confident it holds; access time to memory is often described as some count of cycles, e.g. 3 cycles to L1 and some number of hundreds to somewhere else in the hierarchy.

    Memory is also positioned at some distance from the CPU (or whatever silicon is doing the arithmetic), and copying from one place to another involves copying into and then back out of the CPU.

    More memory is slower, but within a given level of the cache hierarchy, I'd guess access time to any address is constant. How much variation is there in latency as a function of physical address, within, say, system-level DDR4?

    • The variation is not smooth in real systems, but, just as you've noticed, it's right there in the L1->L2->L3->RAM->Disk hierarchy.

      Each one is physically bigger, further away, and higher latency.

      We might one day have 1 PB memory systems with 1 TB of on-chip cache… but the larger memory will still need more space and be further away…
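
      One rough way to see those steps on a real machine is a pointer-chasing sketch like the one below (my own illustration, not a calibrated tool; it assumes a POSIX clock_gettime, and the working-set sizes and step count are arbitrary):

      ```c
      /* Walk a random cycle through a working set of a given size and report
       * the average time per dependent load. As the working set outgrows L1,
       * L2 and L3 and spills into DRAM, the per-access latency should step up.
       * Compile with something like: cc -O2 chase.c */
      #include <stdio.h>
      #include <stdlib.h>
      #include <time.h>

      static double now_sec(void) {
          struct timespec ts;
          clock_gettime(CLOCK_MONOTONIC, &ts);
          return ts.tv_sec + ts.tv_nsec * 1e-9;
      }

      static double ns_per_access(size_t n_elems, size_t steps) {
          size_t *next = malloc(n_elems * sizeof *next);
          if (!next) return -1.0;

          /* Sattolo's algorithm builds a single random cycle, so every load
           * depends on the previous one and the prefetcher cannot help. */
          for (size_t i = 0; i < n_elems; i++) next[i] = i;
          for (size_t i = n_elems - 1; i > 0; i--) {
              size_t j = (size_t)rand() % i;
              size_t tmp = next[i]; next[i] = next[j]; next[j] = tmp;
          }

          volatile size_t pos = 0;
          double t0 = now_sec();
          for (size_t s = 0; s < steps; s++) pos = next[pos];
          double elapsed = now_sec() - t0;

          free(next);
          return elapsed / (double)steps * 1e9; /* ns per dependent load */
      }

      int main(void) {
          srand(1);
          /* Working sets from ~32 KiB (L1-sized) up to ~512 MiB (DRAM-sized). */
          for (size_t bytes = (size_t)32 << 10; bytes <= (size_t)512 << 20; bytes *= 4) {
              size_t n = bytes / sizeof(size_t);
              printf("%8zu KiB: %6.1f ns/access\n", bytes >> 10, ns_per_access(n, (size_t)1 << 24));
          }
          return 0;
      }
      ```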