Comment by ryougi

5 hours ago

>it’s disingenuous to say the inference is on the next token because it’s actually not, it’s in the model’s parameter space across a set of nonlinear activation functions then effectively projected into the token. The idea that it’s predictive of the token isn’t actually the case, it really is a much more complex and more semantic relationship

Do you, or anyone reading, have any worthwhile links that make a strong case for this (that there is a richer semantic relationship than simple next-token prediction)? I would like to read more about this.