Comment by thaumasiotes
4 years ago
Short strings, long strings; they're going to use the same key length. Calculating the key may take longer for the long string, if you're basing the hash on the contents of the string[1], but the key won't end up being a different size. The md5 of a 3-byte string is 16 bytes and the md5 of a 40GB string is also 16 bytes.
[1] Not typical. e.g. Java takes the hash key of an object to be its address in memory, which doesn't require looking at the contents.
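To make the fixed-size point concrete, here is a minimal Java sketch using the standard `MessageDigest` API. The class and method names (`DigestLength`, `md5Length`) are made up for illustration, and a 10 MB zero-filled array stands in for the "long string", since a 40GB input isn't practical to demo.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class DigestLength {
    // Returns the number of bytes in the MD5 digest of the given input.
    static int md5Length(byte[] input) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            return md.digest(input).length;
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required algorithm on every Java platform, so this is unexpected.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] small = "abc".getBytes(StandardCharsets.UTF_8); // 3 bytes
        byte[] large = new byte[10_000_000];                   // stand-in for a long input
        System.out.println(md5Length(small)); // 16
        System.out.println(md5Length(large)); // 16
    }
}
```

Hashing the larger input takes longer, but both digests come out at 16 bytes.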
> Calculating the key may take longer for the long string
Right, that’s exactly what they are warning about.
> Not typical. e.g. Java takes the hash key of an object to be its address in memory
No, that’s just the base implementation in Object (and arguably it was a bad idea). All useful “value type” classes will override it with a real hash of the content, including String.
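A rough sketch of the difference, in case it helps. `Plain` is a hypothetical class that overrides nothing, so it inherits `Object.hashCode()`; `String` overrides it with a content hash. Note that the inherited value is an identity-based hash (the same one `System.identityHashCode` returns), which is not necessarily the actual memory address.

```java
public class HashCodeDemo {
    // Hypothetical class that does NOT override hashCode() or equals().
    static final class Plain {
        final String name;
        Plain(String name) { this.name = name; }
    }

    // Two distinct String objects with equal content have equal hash codes.
    static boolean sameContentSameHash() {
        String a = new String("hello");
        String b = new String("hello");
        return a != b && a.hashCode() == b.hashCode();
    }

    // A class that inherits Object.hashCode() gets the identity hash.
    static boolean plainUsesIdentityHash() {
        Plain p = new Plain("hello");
        return p.hashCode() == System.identityHashCode(p);
    }

    public static void main(String[] args) {
        System.out.println(sameContentSameHash());   // true
        System.out.println(plainUsesIdentityHash()); // true
    }
}
```

Because `String.hashCode()` is computed from the characters, the first call on a long string has to walk the whole thing, which is the cost the original warning is about.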
There are some cases in Java where you do want to use IDs instead of values as your map keys, but they’re rare.
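For the rare identity-keyed case, the standard library has `java.util.IdentityHashMap`, which compares keys with `==` instead of `equals()`. A small sketch of how it differs from `HashMap` (the class and method names here are illustrative only):

```java
import java.util.HashMap;
import java.util.IdentityHashMap;
import java.util.Map;

public class IdentityKeys {
    // Inserts two distinct String objects with equal content and
    // returns how many entries the map ends up holding.
    static int countEntries(Map<String, Integer> map) {
        String a = new String("key");
        String b = new String("key"); // equal to a, but a different object
        map.put(a, 1);
        map.put(b, 2);
        return map.size();
    }

    public static void main(String[] args) {
        System.out.println(countEntries(new HashMap<>()));         // 1: keys compared by value
        System.out.println(countEntries(new IdentityHashMap<>())); // 2: keys compared by reference
    }
}
```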
> All useful “value type” classes will override it with a real hash of the content
Well, this is necessary for a lot of sensible things you'd want to do with non-numeric value types as hash keys...
> including String
...except String is something of an intermediate case. There are loads of use cases where what you're really using is a set of constant strings, not variables that contain arbitrary character data. In that case, you should intern the strings, resulting in non-"value type" keywords where the only thing you care about for equality is whether two keywords do or don't have the same machine address.
I don't actually know how Java handles this, but I had the vague idea that two equal String literals will in fact share their machine address. And String is specifically set up to accommodate this; Strings are immutable, so in theory it could easily be the case that any two equal Strings must share their machine address, even if you got them from user input.
Java does intern string literals and constants, but you can’t rely on reference equality unless you intern every string you create at runtime by formatting or decoding, and it isn’t specified whether that creates strong references that will never be GC’d.
Yes, Strings are immutable, so they only calculate their hashCode once, then cache it. However, you need to explicitly intern them with String.intern() if you want to avoid multiple copies of the same String.
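A quick sketch of the interning behavior described in the two comments above, assuming nothing beyond the standard `String` API (`InternDemo` and its method names are just for illustration):

```java
public class InternDemo {
    // Equal string literals are interned, so they share one reference.
    static boolean literalsShareReference() {
        String a = "color";
        String b = "color";
        return a == b;
    }

    // A string built at runtime is a separate object, even if its content matches a literal.
    static boolean runtimeStringIsDistinct() {
        String literal = "color";
        String built = new StringBuilder("col").append("or").toString();
        return built != literal && built.equals(literal);
    }

    // Explicitly interning the runtime string yields the shared reference.
    static boolean internRestoresSharing() {
        String literal = "color";
        String built = new StringBuilder("col").append("or").toString();
        return built.intern() == literal;
    }

    public static void main(String[] args) {
        System.out.println(literalsShareReference());  // true
        System.out.println(runtimeStringIsDistinct()); // true
        System.out.println(internRestoresSharing());   // true
    }
}
```

So reference equality only works as a substitute for `equals()` if every string reaching the comparison has gone through `intern()`.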
> Strings are immutable, so in theory it could easily be the case that any two equal Strings must share their machine address, even if you got them from user input.
Hey, and now you have two problems: String hashing and finding all strings which are equal to each other in memory.