← Back to context

Comment by TeMPOraL

4 years ago

I'm guessing GP means the complexity guarantee sidesteps the complexity of the hashing function. It probably doesn't matter all that much in typical case - I'm guessing 80-90% of hash map use is with very short strings.

Well-written complexity guarantees specify the operations they count. Otherwise sorting in O(n log(n)) also "sidesteps" the cost of comparison too.

  • And the analysis of hashmaps is not such a well-written guarantee -- as you resize, you need a bigger hash function output to reach all possible buckets. A bigger hash function output, assuming you have to keep the avalanche effect to keep output well-scrambled, requires more computations.

    Earlier discussion: https://news.ycombinator.com/item?id=9807739

Short strings, long strings; they're going to use the same key length. Calculating the key may take longer for the long string, if you're basing the hash on the contents of the string[1], but the key won't end up being a different size. The md5 of a 3-byte string is 16 bytes and the md5 of a 40GB string is also 16 bytes.

[1] Not typical. e.g. Java takes the hash key of an object to be its address in memory, which doesn't require looking at the contents.

  • Calculating the key may take longer for the long string

    Right, that’s exactly what they are warning about.

    Not typical. e.g. Java takes the hash key of an object to be its address in memory

    No, that’s just the base implementation in Object (and arguably it was a bad idea). All useful “value type” classes will override it with a real hash of the content, including String.

    There are some cases in Java where you do want to use IDs instead of values as your map keys, but they’re rare.

    • > All useful “value type” classes will override it with a real hash of the content

      Well, this is necessary for a lot of sensible things you'd want to do with non-numeric value types as hash keys...

      > including String

      ...except String is something of an intermediate case. There are loads of use cases where what you're really using is a set of constant strings, not variables that contain arbitrary character data. In that case, you should intern the strings, resulting in non-"value type" keywords where the only thing you care about for equality is whether two keywords do or don't have the same machine address.

      I don't actually know how Java handles this, but I had the vague idea that two equal String literals will in fact share their machine address. And String is specifically set up to accommodate this; Strings are immutable, so in theory it could easily be the case that any two equal Strings must share their machine address, even if you got them from user input.

      6 replies →