
Comment by danielmarkbruce

1 day ago

You guys are debating this as though embedding models and/or layers work the same way. They don't.

Vector addition is absolutely associative. The question is more "does it magically line up with what sounds correct in a semantic sense?".

I'm just trying to get an idea of what the operation is such that man - man + man = woman, but it's like pulling teeth.

  • It's just plain old addition. There is nothing fancy about the operation. The fancy part is training a model such that it would produce vector representations of words which had this property of conceptually making sense.

    If someone asks "conceptually, what is king - man + woman?", one might reasonably say "queen". This isn't some well-defined math thing, just sort of a common-sense thing.

    Now, imagine you have a function (let's call it an "embedding model") which turns words into vectors. The function turns king into [3,2], man into [1,1], woman into [1.5, 1.5] and queen into [3.5, 2.5].

    Now for king - man + woman you get [3,2] - [1,1] + [1.5,1.5] = [3.5, 2.5] and hey presto, that's the same as queen [3.5, 2.5].

    Now you have to ask - how do you get a function to produce those numbers? If you look at the word2vec paper, you'll come to see they use a couple of methods to train a model and if you think about those methods and the data, you'll realize it's not entirely surprising (in retrospect) that you could end up with a function that produced vectors which had such properties. And, if at the same time you are sort of mind blown, welcome to the club. It blew Jeff Dean's big brain too.
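
    Here's that toy arithmetic as a few lines of Python, just to make the mechanics concrete (the 2-D numbers are the made-up ones from above; a real embedding model hands you vectors with hundreds of dimensions):

        import numpy as np

        # Toy "embedding model": a hand-picked lookup table, purely for illustration.
        embed = {
            "king":  np.array([3.0, 2.0]),
            "man":   np.array([1.0, 1.0]),
            "woman": np.array([1.5, 1.5]),
            "queen": np.array([3.5, 2.5]),
        }

        result = embed["king"] - embed["man"] + embed["woman"]
        print(result)                               # [3.5 2.5]
        print(np.allclose(result, embed["queen"]))  # True -- lands exactly on "queen" in this toy table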

    •   > It's just plain old addition
      

      I'm sorry, but I think you are overestimating your knowledge.

      Have you gone through abstract algebra? Are you familiar with monoids, groups, rings, fields, algebras, and so on?

      Because it seems you aren't aware that these structures exist and are a critical part of mathematics. That's probably why you're not understanding the conversation. @yellowcake seems to understand that "addition" doesn't mean 'addition' (sorry, I assumed you meant how normal people use the word lol). You may not realize it, but you're already showing that addition doesn't have a single meaning: 1 + 1 = 2, but [1,0] + [0,1] = [1,1], and (1 + 0i) + (0 + 1i) = 1 + i. The operator symbol is the same, but the operation actually isn't.
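
      You can see the "same symbol, different operation" point in a couple of lines of Python (the types here are chosen purely for illustration):

          import numpy as np

          print(1 + 1)                                # 2      -- addition on integers
          print((1 + 0j) + (0 + 1j))                  # (1+1j) -- addition on complex numbers
          print(np.array([1, 0]) + np.array([0, 1]))  # [1 1]  -- componentwise vector addition
          print([1, 0] + [0, 1])                      # [1, 0, 0, 1] -- "+" on plain Python lists is concatenation, yet another operation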

        > Now for king - man + woman you get [3,2] - [1,1] + [1.5,1.5] = [3.5, 2.5] and hey presto, that's the same as queen [3.5, 2.5].
      

      The same as? Or is queen the closest?

      If it were just "plain old addition" then @yellowcake (or me![0]) wouldn't have any confusion. Because

           man - man  + man 
        = (man - man) + man 
        =      0      + man 
        = man != woman
      

      We literally just proved that it isn't "plain old addition". So stop being overly confident and look at the facts.
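
      And there's a step everyone glosses over: in practice the "answer" is whatever vocabulary word is nearest (usually by cosine similarity) to the result vector, and implementations like gensim's most_similar typically drop the query words from the results, which is exactly how man - man + man can come back as something other than man. A minimal numpy sketch with made-up vectors:

          import numpy as np

          # Made-up 2-D vectors, just to show the lookup mechanics (not real word2vec output).
          vocab = {
              "man":   np.array([1.0, 1.0]),
              "woman": np.array([1.2, 0.9]),
              "king":  np.array([3.0, 2.0]),
              "queen": np.array([3.2, 1.9]),
          }

          def analogy(pos, neg, exclude_inputs=True):
              """Rank vocabulary words by cosine similarity to sum(pos) - sum(neg)."""
              query = sum(vocab[w] for w in pos) - sum(vocab[w] for w in neg)
              cos = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
              used = set(pos) | set(neg)
              candidates = [w for w in vocab if not (exclude_inputs and w in used)]
              return sorted(candidates, key=lambda w: cos(vocab[w], query), reverse=True)

          # man - man + man: the query vector is literally just vec(man) ...
          print(analogy(["man", "man"], ["man"], exclude_inputs=False))  # 'man' comes first -- the algebra wins
          print(analogy(["man", "man"], ["man"], exclude_inputs=True))   # 'man' is filtered out, so the nearest *other* word wins ('woman' with these toy numbers)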

        >>> Vector addition is absolutely associative
      

      This is true of exact real arithmetic, but not necessarily of what a computer actually computes: floating point arithmetic is not associative.
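
      Concretely:

          a, b, c = 0.1, 0.2, 0.3
          print((a + b) + c)                  # 0.6000000000000001
          print(a + (b + c))                  # 0.6
          print((a + b) + c == a + (b + c))   # False -- float addition is not associative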

        > you'll realize it's not entirely surprising that you could end up with a function that produced vectors which had such properties
      

      Except it doesn't work as well as you think, and that's the issue. There are many examples of it working, and that is indeed surprising, but the effect does not generalize. If you go back to Jeff's papers you'll find some reasonable assumptions that are also limiting. Go look at "Distributed Representations of Words and Phrases and their Compositionality"[1] and look at Figure 2. See anything interesting? Notice that the capitals aren't always the closest? You might notice Ankara is closer to Japan than Tokyo. You'll also notice that the lines don't all point in the same direction. So if we assume the space is well defined, then clearly we aren't following the geodesic.

      There's a second issue you probably didn't notice: PCA only works on linear representations, yet the model is not linear. There aren't many details on what they did for the PCA, but it is easy to add information implicitly, and there's a good chance that happened here. The model is also still facing the challenges of metrics in high-dimensional spaces, where notions such as distance become ill-defined.
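
      That last point is easy to see numerically: the gap between a point's nearest and farthest neighbour shrinks, relative to the distances themselves, as dimension grows (random Gaussian points here are only a stand-in for real embeddings):

          import numpy as np

          rng = np.random.default_rng(0)
          for dim in (2, 10, 100, 1000):
              points = rng.standard_normal((1000, dim))
              query = rng.standard_normal(dim)
              d = np.linalg.norm(points - query, axis=1)
              # "Relative contrast": (max - min) / min distance; it heads toward 0 as dim grows.
              print(dim, (d.max() - d.min()) / d.min())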

      I've met Jeff and even talked with him at length. He's a brilliant dude and I have no doubt about that. But I don't believe he thinks this works in general. I'm aware he isn't a mathematician, but anyone who plays around with vector embeddings will experience the results I'm talking about. He certainly seems to understand that there are major limits to these models, but also that just because something has limits doesn't mean it isn't useful. The paper says just as much and references several works that go into that even further. If you've misinterpreted me as saying embeddings are not useful, then you're sorely mistaken. But neither should we talk about tools as if they are infallible and work perfectly. All that does is make us bad tool users.

      [0] I also have no idea what mathematical structure vector embeddings follow. I'm actually not sure anyone does. This is definitely an under-researched domain despite being very important, and the issue applies even to modern LLMs! But good luck getting funding for that kind of research: you're going to have a hard time getting it at a big lab (despite the high value), and you don't have the time in academia unless you're tenured, but then you've got students to prioritize.

      [1] https://proceedings.neurips.cc/paper/2013/file/9aa42b31882ec...
