Comment by danielmarkbruce
20 hours ago
It's just plain old addition. There is nothing fancy about the operation. The fancy part is training a model such that it would produce vector representations of words which had this property of conceptually making sense.
If someone asks "conceptually, what is king - man + woman?", one might reasonably say "queen". This isn't some well-defined math thing, just sort of a common-sense thing.
Now, imagine you have a function (let's call it an "embedding model") which turns words into vectors. The function turns king into [3, 2], man into [1, 1], woman into [1.5, 1.5] and queen into [3.5, 2.5].
Now for king - man + woman you get [3, 2] - [1, 1] + [1.5, 1.5] = [3.5, 2.5] and hey presto, that's the same as queen, [3.5, 2.5].
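To make the "plain old addition" point concrete, here is a minimal sketch using the made-up 2-D vectors above (the numbers are invented for illustration; real embedding models use hundreds of dimensions, but the operation is the same element-wise add):

```python
import numpy as np

# Made-up 2-D "embeddings", chosen so the analogy works out exactly.
king  = np.array([3.0, 2.0])
man   = np.array([1.0, 1.0])
woman = np.array([1.5, 1.5])
queen = np.array([3.5, 2.5])

result = king - man + woman          # element-wise: [3-1+1.5, 2-1+1.5]
print(result)                        # [3.5 2.5]
print(np.allclose(result, queen))    # True -- only because we picked the numbers
```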
Now you have to ask: how do you get a function to produce those numbers? If you look at the word2vec paper, you'll see they use a couple of methods to train a model, and if you think about those methods and the data, you'll realize it's not entirely surprising (in retrospect) that you could end up with a function that produced vectors with such properties. And if, at the same time, you are sort of mind-blown, welcome to the club. It blew Jeff Dean's big brain too.
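For anyone who wants to poke at this themselves, here is a hedged sketch of training a tiny model with gensim's Word2Vec (this assumes gensim 4.x; the toy corpus is made up, so the resulting vectors will be essentially noise -- the analogy property only emerges from large corpora and the training methods described in the paper):

```python
from gensim.models import Word2Vec

# Tiny made-up corpus, purely for illustration.
corpus = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["the", "man", "walks", "to", "town"],
    ["the", "woman", "walks", "to", "town"],
]

# sg=1 selects skip-gram, one of the two training methods from the paper.
model = Word2Vec(corpus, vector_size=50, window=3, min_count=1, sg=1, epochs=50)

vec = model.wv["king"] - model.wv["man"] + model.wv["woman"]
print(model.wv.similar_by_vector(vec, topn=3))
```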
I'm sorry, but I think you are overestimating your knowledge.
Have you gone through abstract algebra? Are you familiar with monoids, groups, rings, fields, algebras, and so on?
Because it seems you aren't aware that these structures exist and are a critical part of mathematics. It's probably why you're not understanding the conversation. @yellowcake seems to understand that "addition" doesn't mean 'addition' (sorry, I assumed you meant how normal people use the word lol). You may not realize it, but you're already showing that addition doesn't have a single meaning: 1 + 1 = 2, but [1, 0] + [0, 1] = [1, 1], and (1 + 0i) + (0 + 1i) = 1 + i. The operator symbol is the same, but the operation actually isn't.
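A quick illustration of that point, i.e. that the same "+" symbol names a different operation depending on the structure (scalar, vector, complex):

```python
import numpy as np

print(1 + 1)                                # 2       (integer addition)
print(np.array([1, 0]) + np.array([0, 1]))  # [1 1]   (element-wise vector addition)
print((1 + 0j) + (0 + 1j))                  # (1+1j)  (complex addition)
```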
The same as? Or is queen the closest?
If it were just "plain old addition" then @yellowcake (or me![0]) wouldn't have any confusion. We literally just proved that it isn't "plain old addition". So stop being overly confident and look at the facts.
This is commonly true, but not necessarily: floating-point arithmetic is not associative.
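A minimal example of the non-associativity (the exact values assume IEEE 754 double precision, but the grouping effect is the point):

```python
a, b, c = 1e16, -1e16, 1.0

print((a + b) + c)   # 1.0
print(a + (b + c))   # 0.0 -- c is absorbed by b before the cancellation happens
```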
Except it doesn't work as well as you think, and that's the issue. There are many examples of it working, and that is indeed surprising, but the effect does not generalize. If you go back to Jeff's papers you'll find some reasonable assumptions that are also limiting. Look at "Distributed Representations of Words and Phrases and their Compositionality"[1], Figure 2. See anything interesting? Notice that the capitals aren't always the closest? You might notice Ankara is closer to Japan than Tokyo. You'll also notice that the lines don't all point in the same direction. So if we assume the space is well defined, then clearly we aren't following the geodesic.

There's also a second issue you probably didn't notice: PCA only captures linear structure, yet the model is not linear. There aren't many details on how they did the PCA, but it is easy to add information implicitly, and there's a good chance that happened here. On top of that, the model still faces the usual challenges of metrics in high-dimensional spaces, where notions such as distance become ill-defined.
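If you want to check the "closest" question yourself, here is a hedged sketch using gensim and the pretrained Google News vectors (the download is large, roughly 1.6 GB). One caveat worth knowing: most_similar() silently excludes the query words, whereas a raw nearest-neighbour search on the composed vector can return "king" itself ahead of "queen", which is part of the point above:

```python
import gensim.downloader as api

kv = api.load("word2vec-google-news-300")   # pretrained KeyedVectors

# The standard analogy query: input words are excluded from the candidates.
print(kv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

# Raw nearest neighbours of the composed vector, inputs included.
vec = kv["king"] - kv["man"] + kv["woman"]
print(kv.similar_by_vector(vec, topn=3))
```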
I've met Jeff and even talked with him at length. He's a brilliant dude and I have no doubt about that. But I don't believe he thinks this works in general. I'm aware he isn't a mathematician, but anyone who plays around with vector embeddings will run into the results I'm talking about. He certainly seems to understand that there are major limits to these models, but also that just because something has limits doesn't mean it isn't useful. The paper says as much and references several works that go into this even further. If you've interpreted me as saying embeddings are not useful, then you're sorely mistaken. But neither should we talk about tools as if they are infallible and work perfectly. All that does is make us bad tool users.
[0] I also have no idea what mathematical structure vector embeddings follow. I'm actually not sure anyone does. This is definitely an under-researched domain despite being very important. The issue applies even to modern LLMs! But good luck getting funding for that kind of research. You're going to have a hard time getting it at a big lab (despite its high value), and in academia you don't have the time unless you're tenured, but then you have students to prioritize.
[1] https://proceedings.neurips.cc/paper/2013/file/9aa42b31882ec...
Maybe spend more time reading a response than writing one. Yellowcake doesn't know what you are talking about either (note the "pulling teeth" comment).
The examples you gave are a result of the embedding model in question not producing vectors which map to most people's conceptual view of the world. Go through the website you quote from and see for yourself: it's just element-wise addition.
The examples I gave are entirely made-up, 2-dimensional vectors to explain what plain old addition means (i.e., plain old "add the vectors element-wise") in the context of embedding models. And yes, it's "the same as", because I defined it that way. Your website uses 300 dimensions, not 2.
As I mentioned, not all embedding models work the same way (or, as you've said, "this doesn't generalize"). They get trained differently, on different data. The word "similar" is used very loosely.
You even directly quote me and don't seem to be able to read the quote. The word "could" is there. You could end up with a model which had these nice properties.
The entire point of my post was to highlight that yellowcake's confusion arises because he assumes some esoteric definition of addition is what produces your examples, when it's not that.
Quite ironic considering
I actually said
Which is entirely based off of
I didn't assume their knowledge; they straight up told me, and I updated my understanding based on that. That's how conversations work. And the fact that they understand operator overloading doesn't mean they understand more, either. Do they understand monoids, fields, groups, and rings? Who knows? We'll have to let yellowcake tell us.
Regardless, what you claim I assumed about yellowcake's knowledge is quite different than what I actually said. So maybe take your own advice.
I write a lot because, unlike you, I understand these things are complex. Were it simpler, I would not need as many words.