
Comment by zawerf

8 years ago

Sort of off topic: In theory hash tables are supposed to be O(1) (under the RAM model). But looking at these graphs, they seem to grow linearly after they stop fitting in the cache, which is especially apparent on the unsuccessful-lookup graph. Since it's on a log-scale graph, O(log(n)) would probably fit better.

Just thought it's an interesting divergence between what's taught in school and in practice. I wonder if there's something that's more predictive than big Oh for these types of analysis.

The biggest divergence between what's taught in school and what happens in practice is that the Big O limits you saw in school assume a different abstract machine than the one we effectively program today. Specifically, I'm talking about the assumption that random memory access is O(1), whereas on modern machines the cache hierarchy makes it behave more like O(log n) or O(sqrt n) or something to that effect.

  • Is that really true, though? It's certainly the case that a cache miss takes longer than a cache hit, but the time for a cache miss doesn't depend on how much memory you have. Or does it? Like, a cache miss isn't going to be slower if you have 64 gigabytes of memory compared to having 1 gigabyte of memory, it's just a cache miss. That makes random memory access O(1) in the sense of complexity analysis.

    • There is typically L1 cache, L2 cache, L3 cache, RAM, and swap. These range in size from KB to MB to GB, and each level in that list is faster than all the levels after it.

      Most of the relevant memory here is L2 and L3. As you have more items in memory, the statistical likelihood that the page you're requesting will be in an efficient cache level decreases. Eventually you have so many items that you're statistically only getting the performance of RAM (let's ignore the swap case)... that's the point where you'll get closer to a flat `O(1)`... but it will be a lot slower than the `O(1)` with more cache hits.


    • More RAM doesn't mean more cache misses, but more elements in your data set does — "n" isn't the amount of memory in your system, it has the same meaning here as it does everywhere else in complexity analysis.

      Small datasets fit in L3, smaller datasets fit in L2, and if they're smaller still they'll fit in L1, all of which are progressively faster. If your data set is bigger than L3, you need to hit RAM, and bigger still you'll need to hit the SSD or HDD — going progressively slower.


    • You should see it from the other side: the average memory access takes longer the bigger the data set you're trying to fit in your caches, because more of your accesses fall through to bigger and slower cache levels.

      So, as your N (the length of your data) changes, access time changes too; the rough benchmark sketch below illustrates the effect.
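
      A minimal, hypothetical benchmark sketch of that effect (the array sizes, step count, and seed are arbitrary picks; it chases random pointers so every load depends on the previous one):

      ```java
      import java.util.Random;

      // Rough sketch: average latency of dependent random loads as the working
      // set grows past L1/L2/L3 into RAM. All sizes and counts are arbitrary.
      public class WorkingSetLatency {
          public static void main(String[] args) {
              Random rnd = new Random(42);
              for (int size = 1 << 12; size <= 1 << 24; size <<= 2) {
                  int[] next = new int[size];
                  for (int i = 0; i < size; i++) next[i] = i;
                  // Sattolo's algorithm: a random permutation with a single cycle,
                  // so the pointer chase really visits the whole array.
                  for (int i = size - 1; i > 0; i--) {
                      int j = rnd.nextInt(i);
                      int tmp = next[i]; next[i] = next[j]; next[j] = tmp;
                  }
                  int steps = 20_000_000;
                  int p = 0;
                  long start = System.nanoTime();
                  for (int s = 0; s < steps; s++) p = next[p];
                  double ns = (System.nanoTime() - start) / (double) steps;
                  System.out.printf("%,12d ints  ~%.1f ns/access  (checksum %d)%n", size, ns, p);
              }
          }
      }
      ```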

You should think of memory as fixed-length contiguous blocks of different sizes.

Inside the processor there are cache lines (usually 64 bytes). They are blocks of memory that are tagged by the CPU as a unit and moved together.

In the typical n-way set-associative cache architecture, main RAM is divided into blocks, and each set in the cache can hold at most n lines from the same block. An 8-way cache would divide 1 GB of RAM into 1,024 blocks of 1 MB each. If you work with more than 512 bytes (= 8 × 64) at a time within such a block, there will be cache misses. In other words, CPU caches have a limited number of cache lines dedicated to any large contiguous block of RAM (unless they are fully associative).

Between the CPU and DRAM there are typically 64-byte and 4096-byte regions with boundary-crossing penalties. I think the 64-byte crossing penalty is typically less than 10 cycles, while the 4096-byte crossing penalty is several tens of cycles (this is on top of the penalty of accessing DRAM at all).
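
As a deliberately simplified illustration of that mapping, here is a sketch with assumed example parameters (32 KB cache, 8-way, 64-byte lines; not any particular CPU) showing how an address splits into offset, set index, and tag:

```java
// Sketch of how an address maps onto an n-way set-associative cache.
// Example parameters are assumed for illustration, not taken from a real CPU.
public class CacheIndexing {
    static final int LINE_SIZE = 64;          // bytes per cache line
    static final int WAYS = 8;                // lines per set
    static final int CACHE_SIZE = 32 * 1024;  // total cache bytes
    static final int SETS = CACHE_SIZE / (LINE_SIZE * WAYS); // 64 sets

    public static void main(String[] args) {
        long address = 0x7f3a_1c40L;
        long offset = address % LINE_SIZE;               // byte within the line
        long set = (address / LINE_SIZE) % SETS;         // which set it competes in
        long tag = address / ((long) LINE_SIZE * SETS);  // identifies the line within its set
        System.out.printf("addr=0x%x -> set=%d, offset=%d, tag=0x%x%n", address, set, offset, tag);
        // Lines whose (address / LINE_SIZE) values differ by a multiple of SETS
        // land in the same set, and only WAYS of them can be resident at once.
    }
}
```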

Interestingly, in Java 8 they changed HashMap buckets to hold a binary tree instead of a linked list, which changes the worst case from O(n) to O(log n).

https://stackoverflow.com/questions/24673567/change-to-hashm...
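
For a sense of what that change protects against, here is a rough sketch (the block count of 16 is arbitrary) that builds keys with identical String.hashCode() values so they all land in one bucket; on Java 8+ that overloaded bucket is treeified, so the lookups stay tolerable:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;

// "Aa" and "BB" share a String.hashCode(), so every concatenation of 16 such
// blocks collides with all the others. On Java 8+ the single overloaded bucket
// becomes a tree, making lookups O(log n) in the bucket instead of O(n).
public class CollidingKeys {
    public static void main(String[] args) {
        List<String> keys = new ArrayList<>(List.of(""));
        for (int block = 0; block < 16; block++) {       // 2^16 colliding keys
            List<String> next = new ArrayList<>(keys.size() * 2);
            for (String k : keys) { next.add(k + "Aa"); next.add(k + "BB"); }
            keys = next;
        }
        HashMap<String, Integer> map = new HashMap<>();
        for (int i = 0; i < keys.size(); i++) map.put(keys.get(i), i);

        long start = System.nanoTime();
        long sum = 0;
        for (String k : keys) sum += map.get(k);
        System.out.printf("%d keys, all with hashCode %d, looked up in %.1f ms (sum=%d)%n",
                keys.size(), keys.get(0).hashCode(), (System.nanoTime() - start) / 1e6, sum);
    }
}
```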

  • To be more accurate, binary trees are constructed only once a bucket ends up containing more than a small number of elements, which should only happen rarely unless a bad hash function is chosen.

  • That's defending against hash collisions which isn't what the grandparent is noticing (the slowing down of memory accesses due to not fitting into cache).

  • I think they implemented that change to deal with a denial-of-service attack on HTTP servers. A malicious client could send a large list of parameters with identical String::hashCode() values, resulting in a single large bucket. Other workarounds included a size limit on the parameter list or a hash function with some parameters randomized at program start.

  • FYI this only happens if the objects implement Comparable; if they do not, the buckets will not become binary trees.

Big O is about doing the big-picture, high-level analysis without wasting time on every little detail, like the complexity of addition and multiplication. High-performance implementation, on the other hand, is all about twiddling the constant factors in your algorithm(s) for your specific hardware architecture/topology.

See:

http://kaldewey.com/pubs/FAST__SIGMOD10.pdf

  • The parent's claim implies that on a machine with an abstract cache, the complexity of hashmaps could be proven to be something else. You may be able to recover the right behavior by modeling a machine with infinite memory, divided up into an infinite tower of caches where each level is P times bigger and Q times slower.

  • You definitely have to take the time complexity of addition/multiplication into account when doing big O analysis. But in most cases they are O(1). It is the constant factors you don’t worry about.

    • Yeah, but the same could be said about memory accesses. You can always pessimize and assume every access misses the LLC, at which point it becomes a constant factor.


> divergence between what's taught in school and in practice

I remember this sort of thing being mentioned, but not in as much detail as I would have gone into (though I'd self-taught a lot of this before Uni, so for me it was more an exercise in unlearning mistakes and formalising & rounding off knowledge than taking in new concepts; perhaps for others, cramming the extra detail in would have been counterproductive). Algorithm courses would say things like "of course caching should be taken into consideration, you'll cover this in architecture modules", and architecture modules would say "you'll cover the effect on some algorithms in much more detail in your software engineering modules"!

All the JS sorting demos that were an active fad a short while ago were simplified in the same way: they assumed either that one of comparison or swap was always the most expensive operation and the other could be discounted, or that all swaps/comparisons cost the same due to uniform memory speed (no accounting for caching), and no consideration was given to what was being sorted (inserts are much more expensive in an array than in a linked list, for instance). I keep meaning to find the time to put together a less superficial demo that lets such things be modelled at a basic level, perhaps even illustrating the effect of a more distributed architecture (where more than one processing unit is involved and everything isn't in local fast memory).

Of course, for the most part these simplifications are perfectly valid, but they should always carry a caveat noting which simplifying assumptions are being applied, so people learning know the issues can be more complicated and that some critical thinking should be applied in any real-world situation.

> Just thought it's an interesting divergence between what's taught in school and in practice.

Not really, no. This isn't a matter of complexity theory being less useful than promised, it's just a matter of understanding what it tells you.

In complexity theoretic terms, platform-specific micro-optimisations are constant factors. They can have a real impact, but they're only worth worrying about after you've got good complexity.

Bubble-sort in fine-tuned SIMD assembly code will out-perform bubble-sort in Python, but will be far slower than a Python quicksort (except for small inputs).

Notice that we're discussing which hash-map to use, as opposed to which data-structure to use in the first place. Their basic time-complexities are the same, so we have the luxury of worrying about the constant factors.

Linear-scan dictionaries can't compete here, no matter how well micro-optimised. (Again, except for very small inputs.) Complexity theory tells us why.

Additionally, complexity theory is useful when reasoning about large problems (high values of n), but it doesn't tell you how to do the platform-specific micro-optimisations that are important when attacking small problems. Implementation details will dominate there, rather than the number of algorithmic operations.

Anyone who teaches complexity theory without explaining these points, has done their students a disservice.

> I wonder if there's something that's more predictive than big Oh for these types of analysis.

When it comes to real-world performance, there's no substitute for just measuring real-world performance. Detailed analysis of things like cache behaviour, and behaviour under highly concurrent access, could be neat though.

Edit: For completeness, I should mention that in extreme cases there are indeed algorithms with superior complexity which are completely useless in practice, because the break-even point is so awfully high that it is never reached. Matrix multiplication, for instance, where the algorithms with the best time-complexity are quite literally never useful. https://en.wikipedia.org/w/index.php?title=Computational_com...

Perhaps that contradicts my whole point :-P

The other limitation of complexity theory is that in practice we often care about average-case performance, whereas theorists tend to care more about worst-case performance.

  • The increasing latency [1] of access to larger amounts of storage isn't a constant factor, though, even asymptotically. There was a fun post on HN a while ago arguing that random access latency to N bytes is asymptotically O(N^0.5), both from empirical data about real systems and theoretically from the Bekenstein bound and the speed of light. And even if you don't buy that, it clearly can't be less than O(N^(1/3)). Don't go preaching this in job interviews, though :-)

    The truth is that complexity theory is always done relative to some abstract machine model, and as with all models it's up to you to determine whether the model is a good fit for your practical reality. "To the extent that mathematics reflects reality it is not certain, and to the extent that it is certain it does not reflect reality." There exist abstract models, like the cache-oblivious model, that do a somewhat better job of grappling with this issue but don't require you to just throw up your hands and rely solely on benchmarks of real hardware. There is probably room for new theory in this area.

    [1] not throughput, which isn't as hard to scale up

    • The article in question: https://news.ycombinator.com/item?id=12383012

      It's fun to think about, but the presentation played fast and loose with a few concepts. Sparked some lively discussion because of that. I really like the idea of needing to count random memory accesses as costing more than sequential accesses when analyzing algorithms in a big-O-type notation.


The performance gap between cache-hot lookups and lookups that miss the cache can be 10X. The increase in the graphs is more a reflection of that than of doing more operations.

The RAM model doesn't concern itself with memory hierarchies or slow ops (i.e. divisions are assumed to be as fast as additions).

For example, if you were benchmarking cuckoo hashing lookups, the results would look similar even though the number of operations is deterministically O(1) rather than expected O(1).
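
For reference, a minimal sketch of what a cuckoo lookup looks like (a toy two-table variant; h1/h2 are made-up hash functions, and insertion, which does the "kicking out", is omitted):

```java
// Toy sketch of a cuckoo-hash lookup: every key lives in one of exactly two
// slots (one per table/hash function), so a lookup inspects at most two
// buckets no matter how many entries the table holds.
public class CuckooLookup {
    private final String[] keys1, keys2;
    private final Integer[] vals1, vals2;

    CuckooLookup(int capacity) {
        keys1 = new String[capacity]; vals1 = new Integer[capacity];
        keys2 = new String[capacity]; vals2 = new Integer[capacity];
    }

    Integer get(String key) {
        int i1 = Math.floorMod(key.hashCode(), keys1.length);              // slot from h1
        int i2 = Math.floorMod(key.hashCode() * 0x9E3779B1, keys2.length); // slot from h2 (toy mixer)
        if (key.equals(keys1[i1])) return vals1[i1];
        if (key.equals(keys2[i2])) return vals2[i2];
        return null; // definitely absent: there is no probe sequence to walk
    }
}
```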

It's not a log-scale graph for the thing you're talking about: the Y-axes are linear. So for an O(1) algorithm, time has an upper bound and plateaus once caches stop being helpful and every operation has to go all the way out to RAM.

If we were talking about the worst lookups, then even in the RAM model they would become slower as the map grows.

To be more specific, if you place n balls into n bins, on average each bin will contain 1 ball, but the largest one, with high probability, will contain O(log(n)/log(log(n))) of them.
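
A quick simulation sketch of that bound, if you want to see it (the values of n are arbitrary):

```java
import java.util.Random;

// Throw n balls into n bins uniformly at random and report the fullest bin
// next to log(n)/log(log(n)). The values of n are arbitrary.
public class BallsInBins {
    public static void main(String[] args) {
        Random rnd = new Random();
        for (int n = 1_000; n <= 10_000_000; n *= 10) {
            int[] bins = new int[n];
            int max = 0;
            for (int i = 0; i < n; i++) {
                int b = rnd.nextInt(n);
                if (++bins[b] > max) max = bins[b];
            }
            System.out.printf("n=%,d  fullest bin=%d  log n / log log n=%.2f%n",
                    n, max, Math.log(n) / Math.log(Math.log(n)));
        }
    }
}
```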

can any hash table whose hash function operates over the entire string ever show O(1) time?

  • The O(1) amortized access of a hash table is for the number of entries, not key hashing or comparison. Doubling the number of entries does not double the retrieval time.

    As a hash table gets larger, the cost of key hashing (for the same domain of keys) does not increase. Of course hashing and comparison performance can still be important; it's just not what is generally being analyzed with basic complexity theory.

    • I guess I get the idea that the O(1) is for the number of entries, but in my experience 99% of prod time is spent hashing keys on amortized-O(1) systems, so focusing on the basic complexity theory would seem to leave a lot of performance on the table, and it also contributes to confusion.

  • > can any hash table whose hash function operates over the entire string ever show O(1) time?

    If we limit the length of the hash key, the time taken by the hash function is also limited and turns into a constant. Big-O is all about asymptotic behavior.

    OTOH noticing how the hash function's run time depends on the size of the key can be separately useful. I suspect all such functions are linear in practice, though.

    • I'm talking about functions where the entire string is used; that's necessary on the internet when dealing with billions of strings that are similar to each other.
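
      A minimal sketch of such a full-string hash (this is the polynomial scheme String.hashCode() uses): its cost is one pass over the key, so it is linear in the key's length and independent of the table's size.

      ```java
      public class PolyHash {
          // One pass over every character: cost is linear in the key length and
          // independent of how many entries the hash table holds.
          static int polyHash(String s) {
              int h = 0;
              for (int i = 0; i < s.length(); i++) {
                  h = 31 * h + s.charAt(i);
              }
              return h;
          }
      }
      ```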

One thing to think about is that accessing a byte of memory in a system with n bytes of memory is pretty much guaranteed to be at best a log(n) operation. After all, you need a log(n)-sized address to index the memory, and you need to read every single one of the bits in that address.

In practice on any particular machine all the addresses are the same length (and we have finite memory), so it doesn't quite explain the effect you are observing in the article. But it does hint that it makes sense.

I thought a hash table has O(n) access.

  • Average O(1), worst case O(n)

    So worse than a B-tree or a Red-black tree for the worst case.

    • "Average O(1)" is an oxymoron. Big Oh is about the worst case. The only thing that makes sense is "Amortized O(1)", but I believe its possible to create access cases where hash tables are slower.

  • Only some primitive hash table algorithms have O(n) worst case access. No one implements those algorithms anymore though.

  • Big-O specifically refers to the worst case, though in practice people use it to describe either average-case or worst-case scaling.