Comment by thaumasiotes
4 years ago
> All useful “value type” classes will override it with a real hash of the content
Well, this is necessary for a lot of sensible things you'd want to do with non-numeric value types as hash keys...
> including String
...except String is something of an intermediate case. There are loads of use cases where what you're really using is a set of constant strings, not variables that contain arbitrary character data. In that case, you should intern the strings, resulting in non-"value type" keywords where the only thing you care about for equality is whether two keywords do or don't have the same machine address.
I don't actually know how Java handles this, but I had the vague idea that two equal String literals will in fact share their machine address. And String is specifically set up to accommodate this; Strings are immutable, so in theory it could easily be the case that any two equal Strings must share their machine address, even if you got them from user input.
Java does intern string literals and constants, but you can’t rely on reference equality unless you intern every string you create at runtime by formatting or decoding, and it isn’t specified whether that creates strong references that will never be GC’d.
Yes, Strings are immutable, so they only calculate their hashCode once, then cache it. However, you need to explicitly intern them with String.intern() if you want to avoid multiple copies of the same String.
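To make the two replies above concrete, here is a minimal sketch of the difference between literals and strings built at runtime. The class name `InternDemo` and the helper `buildAtRuntime` are made up for illustration; only `String.intern()`, `equals()` and `==` are real API.

```java
class InternDemo {
    // Hypothetical helper: builds a String at runtime so the compiler
    // cannot fold the result into a compile-time constant.
    static String buildAtRuntime(String a, String b) {
        return new StringBuilder(a).append(b).toString();
    }

    public static void main(String[] args) {
        String literal = "keyword";
        String sameLiteral = "keyword";
        String runtime = buildAtRuntime("key", "word");

        // Equal literals refer to the same interned instance.
        System.out.println(literal == sameLiteral);      // true
        // A string created at runtime is a separate object...
        System.out.println(literal == runtime);          // false
        // ...even though its contents are equal.
        System.out.println(literal.equals(runtime));     // true
        // intern() returns the canonical instance from the pool.
        System.out.println(literal == runtime.intern()); // true
    }
}
```

So reference equality only works as a stand-in for `equals()` if every string on both sides of the comparison has gone through `intern()`.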
> Strings are immutable, so in theory it could easily be the case that any two equal Strings must share their machine address, even if you got them from user input.
Hey, and now you have two problems: String hashing and finding all strings which are equal to each other in memory
Well, no, the whole point of this discussion is that solving the second problem means the first problem never comes up.
And this isn't exactly some exotic approach; how often do you think people write Hashes in Ruby where the keys they use are all symbols? It's so common that there's dedicated syntax for it.
It's as old as Lisp, but there's a reason symbols exist separately from strings - they're used differently. Strings are frequently transformed, symbols almost never are. Strings are frequently taken from end-user input, symbols very rarely. Strings sometimes are very large, symbol names are almost universally very short.
The problem is, interning is an expensive operation. It means adding to an ever-growing database of strings, but first checking whether the string is already there. You don't want to do that every time you change case or flip a letter in a string, or use it to access a hash table. I'm not saying it can't be done, but I honestly have no idea how to implement sane, generic, automatic interning of strings. I feel more comfortable having a symbol type, and control over turning strings into symbols.
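As a rough illustration of the "database of strings" described above, here is a sketch of an explicit symbol table. `SymbolTable` is a hypothetical class written for this example, not how the JVM's own string pool is implemented; it only uses `java.util.HashMap`.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical explicit interner: the caller decides which strings become "symbols".
final class SymbolTable {
    // Maps each distinct string value to its canonical instance.
    private final Map<String, String> canonical = new HashMap<>();

    // Returns the canonical instance for s, registering s on first sight.
    // Every call pays for a hash of the contents plus an equals() check on a hit.
    String intern(String s) {
        String existing = canonical.putIfAbsent(s, s);
        return existing != null ? existing : s;
    }

    // Number of distinct strings retained so far.
    int size() {
        return canonical.size();
    }
}
```

Note that the lookup still hashes and compares the contents, so interning does not remove that work; it moves it to the point where the symbol is created. The map also holds strong references, so in this sketch nothing is ever released, which is the growth concern raised above.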