Comment by gnfargbl
2 months ago
The nature of high-dimensional spaces kind of intuitively supports the argument for invertibility though, no? In the sense that:
> I would expect the chance of two inputs to map to the same output under these constraints to be astronomically small.
That would be purely statistical and not based on any algorithmic insight. In fact, for hash functions it is quite a common problem that this exact assumption turns out not to hold, even though you might assume it does for any "real" scenario.
> That would be purely statistical and not based on any algorithmic insight.
This is machine learning research?
Usually we still ask for the statistics to be at least valid (i.e. to show a significant signal against a null hypothesis). This paper doesn't even do that. It's like claiming no humans have been to the moon and then "verifying" this by asking a million random strangers on the street if they've been there.
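To put rough numbers on the moon analogy, here is a minimal sketch. The event rate of 1e-9 is a made-up value chosen only for illustration, and the function names are hypothetical. It shows that a million clean samples are almost guaranteed for a rare-but-real event, and that zero hits in n samples only bounds the rate at roughly 3/n.

```python
import math

def prob_zero_hits(rate: float, n_samples: int) -> float:
    """Probability that n independent samples all miss an event with the given rate."""
    # log1p keeps precision when the rate is tiny.
    return math.exp(n_samples * math.log1p(-rate))

def upper_bound_95(n_samples: int) -> float:
    """95% upper confidence bound on the rate after observing zero hits.

    Solves (1 - p)^n = 0.05 for p; for large n this is about 3 / n.
    """
    return -math.expm1(math.log(0.05) / n_samples)

n = 1_000_000
hypothetical_rate = 1e-9  # assumption for illustration, not a measured value

p_miss = prob_zero_hits(hypothetical_rate, n)
bound = upper_bound_95(n)

print(f"P(no hits in {n} samples) = {p_miss:.4f}")
print(f"95% upper bound on the rate = {bound:.3e}")
```

Under these assumptions the survey comes back empty almost every time, and the bound it supports is orders of magnitude weaker than "the rate is zero".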
I'm not quite getting your point. Are you saying that their definition of "collision" is completely arbitrary (agreed), or that they didn't use enough data points to draw any conclusions because there could be some unknown algorithmic effect that could eventually cause collisions, or something else?
I think they are saying that there is no proof of injectivity. The hash argument is essentially that running the same experiment on a hash function would yield a similar result, yet hash functions are not injective by definition. So you cannot conclude from this experimental result that language models are injective.
That's not really formally true: there are so-called perfect hash functions that are injective over a certain domain. But in most parlance hashing is not considered injective.
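The hash argument can be sketched concretely. This is only an illustration, not the paper's experiment: it assumes SHA-256 truncated to a few bytes as a stand-in function, and `truncated_hash` and `count_collisions` are hypothetical helper names. Sampling 100,000 inputs against a 64-bit output finds no collisions, even though the function cannot be injective on longer inputs. Shrinking the output to 16 bits and feeding in 2^16 + 1 distinct inputs makes the pigeonhole collision visible.

```python
import hashlib

def truncated_hash(data: bytes, n_bytes: int) -> bytes:
    """SHA-256 digest cut down to its first n_bytes bytes."""
    return hashlib.sha256(data).digest()[:n_bytes]

def count_collisions(n_inputs: int, n_bytes: int) -> int:
    """Hash the distinct inputs b"0", b"1", ... and count repeated outputs."""
    seen = set()
    collisions = 0
    for i in range(n_inputs):
        digest = truncated_hash(str(i).encode(), n_bytes)
        if digest in seen:
            collisions += 1
        seen.add(digest)
    return collisions

# 64-bit output, 100,000 samples: the experiment "passes".
sampled = count_collisions(100_000, 8)

# 16-bit output, 2^16 + 1 distinct inputs: a collision is unavoidable.
forced = count_collisions(2**16 + 1, 2)

print("collisions at 64 bits:", sampled)
print("collisions at 16 bits:", forced)
```

The empty result at 64 bits reflects the sample size relative to the output space, not injectivity.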
I don't think that intuition is entirely trustworthy here. The entire space is high-dimensional, true, but the subspace encompassing linguistically sensible sequences of tokens will necessarily be restricted and have some sort of structure. And within such subspaces there may occur some sort of sink or attractor. Proving that those don't exist in general seems highly nontrivial to me.
An intuitive argument against the claim could be made from the observation that people "jinx" each other IRL every day, despite reality being vast, if you get what I mean.
I do get what you're saying, and it sounds almost analogous to visualisations of bad PRNGs, e.g. https://www.reddit.com/r/dataisbeautiful/comments/gv4fhr/oc_...
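For a concrete case of hidden structure in a bad PRNG, here is a small sketch using RANDU, a well-known flawed generator. It is picked here as an assumption; the linked post may show a different one. Its outputs look scattered, but every three consecutive values satisfy one fixed linear relation mod 2^31, which is why the triples fall onto a small number of planes when plotted in 3D.

```python
def randu(seed: int, count: int) -> list[int]:
    """RANDU: x_{k+1} = 65539 * x_k mod 2^31 (the seed should be odd)."""
    values = []
    x = seed
    for _ in range(count):
        x = (65539 * x) % 2**31
        values.append(x)
    return values

xs = randu(seed=1, count=10_000)

# Since 65539 = 2^16 + 3, we get 65539^2 = 6 * 65539 - 9 (mod 2^31),
# so x_{k+2} - 6 * x_{k+1} + 9 * x_k must be 0 mod 2^31 for every triple.
violations = sum(
    1
    for a, b, c in zip(xs, xs[1:], xs[2:])
    if (c - 6 * b + 9 * a) % 2**31 != 0
)

print("distinct values:", len(set(xs)))
print("triples violating the relation:", violations)
```

No value repeats in the sample, yet the sequence is completely constrained, which is the kind of structure a plain collision count would never reveal.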