Comment by sansseriff

2 years ago

Can anyone clarify what is meant by "Mythology: we are modifying the meaning of each token based on what we've seen before it in the context, with similar meanings reinforcing each other"? At this point in the text, it seems like the kernel smoothing is being applied to each embedding vector in isolation. I don't see why any one y_t vector derived and smoothed from token x_i would be influenced by the nearby tokens in the sequence.

When you add the r_t tokens, sure, then I see how context matters. But is that the only thing that takes context into account?