
Comment by adtac

1 year ago

There's definitely some lossy compression when you snap it to the nearest known vector: enumerating every word ever written in human history wouldn't even come close to the 2^(16*D) representable points for a D-dimensional float16 embedding vector. In fact, even adding two float16 values is a form of lossy compression for most additions.
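To make that second point concrete, here's a minimal sketch (using numpy's float16, nothing specific to embeddings): a float16 addition rounds its result to the nearest representable float16, so small contributions can be thrown away entirely.

```python
import numpy as np

a = np.float16(1024.0)
b = np.float16(0.4)
c = np.float16(0.4)

# float16 stores a 10-bit mantissa, so between 1024 and 2048 the spacing
# between representable values is 1.0; adding 0.4 gets rounded away.
print(a + b)        # 1024.0, not 1024.4
print(a + b == a)   # True: the addition lost b completely

# The same rounding breaks associativity.
print((a + b) + c)  # 1024.0
print(a + (b + c))  # 1025.0: b + c ~= 0.8 survives, and 1024.8 rounds up
```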

But I'd be surprised if either of those were the primary reason. The words "sea" and "ocean" map to different vectors, but those vectors will be very close to each other. Both salt + water = sea and salt + water = ocean sound correct to me, so the problem is more about whether v_salt + v_water can even get into the vicinity of either v_sea or v_ocean.
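That's easy to check empirically. Here's a sketch assuming gensim and its pretrained GloVe vectors are available (the model choice and word pairs are my assumptions, not anything canonical): add the two word vectors and see which known vectors the sum lands closest to.

```python
import numpy as np
import gensim.downloader as api

# Any pretrained word-vector model would do; this small GloVe model is
# just a convenient assumption for the sketch.
model = api.load("glove-wiki-gigaword-50")

v_sum = model["salt"] + model["water"]

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# How close does the raw sum get to "sea" and "ocean"?
print(cosine(v_sum, model["sea"]))
print(cosine(v_sum, model["ocean"]))

# "Snapping to the nearest known vector": which vocabulary entries are
# closest to v_salt + v_water?
print(model.similar_by_vector(v_sum, topn=10))
```

In practice the nearest neighbours of a raw sum are often dominated by the input words themselves, which is why analogy code usually works with normalized vectors; either way, the interesting question is whether "sea" or "ocean" shows up anywhere near the top.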

If we constrain ourselves to a pool of words from, say, Wikipedia entries, minutes names, and maybe some other stuff, and use a "super node" like "addition" to act as a math operation... maybe this makes more sense?