Comment by drabbiticus

2 days ago

The specific cherry-picked examples from GP make sense to me.

   data + plural    = datasets 
   data - plural    = datum

If +/- plural can be taken to mean "make explicitly plural or singular", then this roughly works.
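The mechanics behind these examples are just vector addition followed by a nearest-neighbor lookup over the vocabulary. Here's a minimal sketch with made-up toy vectors (the 4-d values and the tiny vocab are hypothetical, chosen only to make the arithmetic work out; real models learn 100-300+ dims from corpora):

```python
import numpy as np

# Hypothetical toy embeddings, constructed so the analogy holds exactly.
vocab = {
    "data":     np.array([0.9, 0.1, 0.0, 0.5]),
    "datum":    np.array([0.9, 0.1, 0.0, -0.5]),
    "datasets": np.array([0.9, 0.1, 0.0, 1.5]),
    "plural":   np.array([0.0, 0.0, 0.0, 1.0]),
}

def nearest(vec, exclude=()):
    """Return the vocab word whose embedding has the highest cosine
    similarity to vec, skipping the query words (standard practice
    in analogy evaluations, or 'data + plural' would just return 'data')."""
    best, best_sim = None, -2.0
    for word, emb in vocab.items():
        if word in exclude:
            continue
        sim = vec @ emb / (np.linalg.norm(vec) * np.linalg.norm(emb))
        if sim > best_sim:
            best, best_sim = word, sim
    return best

print(nearest(vocab["data"] + vocab["plural"], exclude={"data", "plural"}))  # datasets
print(nearest(vocab["data"] - vocab["plural"], exclude={"data", "plural"}))  # datum
```

Note that excluding the query words matters a lot in practice: the raw nearest neighbor of "data + plural" in a real model is very often "data" itself.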

   king - crown     = ruler

Rearrange (because embeddings are just vector math), and you get "king = ruler + crown". Yes, a king is a ruler who has a crown.
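The rearrangement really is just algebra on vectors, so "king - crown = ruler" and "king = ruler + crown" are the same statement. A trivial sketch (toy 2-d vectors, chosen so the identity holds exactly):

```python
import numpy as np

# Hypothetical toy vectors; "king" is defined as ruler + crown by construction.
crown = np.array([1.0, 0.0])
ruler = np.array([0.0, 1.0])
king = ruler + crown

# Subtracting crown from both sides rearranges one equation into the other.
assert np.allclose(king - crown, ruler)
assert np.allclose(ruler + crown, king)
```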

   king - princess  = man

This isn't great, I'll grant, but there are many YA novels where someone eventually becomes king through marriage to a princess, or where there's intrigue over the princess's hand for reasons of kingly succession, so "king = man + princess" roughly works.

   king - queen     = prince
   queen - king     = woman

I agree it's hard to make sense of "king - queen = prince". "A queen is a woman king" is often how queens are described to young children. In Chinese, it's actually the literal breakdown of 女王. I also agree there's a gender bias, but also literally everything about LLMs and various AI trained on large human-generated data encodes the bias of how we actually use language and thought patterns. It's one of the big concerns of those in the civil liberties space. Search "llm discrimination" or similar for more on this.

Playing around with age/time-related words gives a lot of interesting results:

    adult + age = adulthood
    child + age = female child
    year + age = chronological age
    time + year = day
    child + old = today
    adult - old = adult body
    adult - age = powerhouse
    adult - year = man

I think a lot of words are hard to distill into a single embedding. A word may carry several conceptually distinct senses, but my (incomplete) understanding is that these static embeddings are not context-sensitive, right? So averaging those distinct senses into one vector per label is probably fraught with problems when you try to do meaningful vector math with it, which is exactly the kind of ambiguity that context/attention in a full model can help resolve.
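To make the averaging concern concrete, here's a toy sketch (the word "bank" and its 2-d sense vectors are entirely hypothetical, just to illustrate): if a static embedding effectively averages a word's senses, the resulting vector sits between them and is only moderately similar to either one, so arithmetic with it blends meanings rather than picking one.

```python
import numpy as np

# Hypothetical sense vectors for the polysemous word "bank".
bank_river = np.array([1.0, 0.0])    # riverbank sense
bank_finance = np.array([0.0, 1.0])  # financial sense

# A static (context-free) embedding effectively averages the senses
# seen in training, so the single "bank" vector lands between them.
bank_static = (bank_river + bank_finance) / 2

def cos(a, b):
    """Cosine similarity between two vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Equally (and only moderately, ~0.707) similar to both senses:
print(cos(bank_static, bank_river))
print(cos(bank_static, bank_finance))
```

A contextual model (attention over the surrounding sentence) can instead produce a vector near the riverbank sense in "fishing by the bank" and near the financial sense in "the bank raised rates".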

[EDIT:formatting is hard without preview]