
Comment by vessenes

8 hours ago

Context is the vector of tokens (numbers) that goes into the first layers of the neural network.

When you train, you teach the model to, among other things, ‘self-attend’ to the input vector, ultimately projecting that vector into a large embedding space.

Thought experiment: if 99% of the time the last 100,000 entries of your vector were zero, how likely is it that you’d end up with high-quality embeddings by doing gradient descent on those outputs?

That’s what the paper is referring to.
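To make the intuition concrete, here is a minimal sketch (assuming PyTorch; all sizes and names are illustrative, not from the paper). It fills most of a context window with padding-like zero tokens and shows that only the embedding rows for tokens that actually appear receive any gradient, which is the sense in which a mostly-zero tail gives you very little training signal.

```python
# Hypothetical sketch: mostly-padded context -> sparse gradient on the embedding table.
import torch
import torch.nn as nn

vocab_size, d_model, ctx_len = 1000, 64, 128  # illustrative sizes
embed = nn.Embedding(vocab_size, d_model, padding_idx=0)

# A "short" input: a handful of real tokens up front, the long tail filled with
# padding id 0 -- mimicking the 99%-zeros situation in the thought experiment.
real = torch.randint(1, vocab_size, (8,))
tokens = torch.cat([real, torch.zeros(ctx_len - 8, dtype=torch.long)])

loss = embed(tokens).sum()  # stand-in for a real loss
loss.backward()

# Only rows for tokens that actually occurred get updated; the padding row is
# frozen (padding_idx) and the zero-filled tail contributes nothing.
updated = (embed.weight.grad.abs().sum(dim=1) > 0).sum().item()
print(f"embedding rows with nonzero gradient: {updated} of {vocab_size}")
```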