Comment by sigmoid10
2 months ago
That would be purely statistical and not based on any algorithmic insight. In fact, for hash functions it is quite a common problem that this exact assumption turns out not to hold, even though you might assume it does for any "real" scenario.
> That would be purely statistic and not based on any algorithmic insight.
This is machine learning research?
Usually we still ask for statistics to be at least valid (i.e. to show a significant signal against a null hypothesis). This paper doesn't even do that. It's like claiming no humans have been to the moon and then "verifying" this by asking a million random strangers on the street if they've been there.
I'm not quite getting your point. Are you saying that their definition of "collision" is completely arbitrary (agreed), or that they didn't use enough data points to draw any conclusions because there could be some unknown algorithmic effect that could eventually cause collisions, or something else?
I think they are saying that there is no proof of injectivity. The hash argument is essentially this: running the same experiment on a hash function would yield a similar result, yet hash functions are not injective by definition. So you cannot conclude from this experimental result that language models are injective.
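To make the hash argument concrete, here is a minimal sketch in Python. The helper `h64` (SHA-256 truncated to 64 bits) and the sample size are my own choices for illustration, not anything taken from the paper. The function maps arbitrarily long inputs to 8 bytes, so it cannot be injective by the pigeonhole principle, yet sampling distinct inputs will almost certainly find no collision.

```python
import hashlib


def h64(data: bytes) -> bytes:
    """Hypothetical toy hash: the first 8 bytes (64 bits) of SHA-256."""
    return hashlib.sha256(data).digest()[:8]


def count_collisions(n: int) -> int:
    """Hash n distinct inputs and count how many digests were already seen."""
    seen = set()
    collisions = 0
    for i in range(n):
        digest = h64(str(i).encode())  # str(i) is distinct for each i
        if digest in seen:
            collisions += 1
        seen.add(digest)
    return collisions


# 100,000 distinct inputs into a 64-bit output space: a collision is
# extremely unlikely, even though collisions provably exist.
observed = count_collisions(100_000)
print(f"collisions observed: {observed}")
```

Observing zero collisions here says nothing about injectivity; it only says the sample was far too small to hit one.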
That's not formally true: there are so-called perfect hash functions that are injective over a certain domain. In most parlance, though, hashing is not considered injective.
Sure, but the paper doesn't claim absolute injectivity. It claims injectivity for practical purposes ("almost surely injective"). That's the same standard to which we hold hash functions -- most of us would consider it reasonable to index an object store with SHA256.
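For a rough sense of what "injective for practical purposes" means with SHA256, here is a back-of-the-envelope sketch. It assumes the hash behaves like a uniformly random function, which is a modelling assumption and not a proven property of SHA256; the function name `birthday_bound` and the object count are made up for illustration.

```python
def birthday_bound(n: int, bits: int) -> float:
    """Union-bound estimate of the chance that any two of n items collide
    in a 2**bits output space, assuming uniformly random outputs."""
    pairs = n * (n - 1) // 2          # number of distinct pairs
    return min(1.0, pairs / 2**bits)  # each pair collides with prob 2**-bits


# Hypothetical object store holding 10**18 objects, indexed by a 256-bit hash.
p = birthday_bound(10**18, 256)
print(f"collision probability is at most about {p:.1e}")
```

The bound is tiny but not zero, which is exactly the distinction being argued here: "almost surely no collisions in practice" is a different claim from "injective".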