Comment by trhway

10 days ago

That obviously should and will be fixed architecturally.

>For each new token generated, the model must access the intermediate state of all previous tokens.

Not all previous tokens are equal; not all deserve the same attention, so to speak. The farther back a token is, the more opportunity there is for it to be pruned and/or collapsed with other similarly distant, less meaningful tokens in a given context. So instead of O(n^2) it would be more like O(n log(n)).
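As a toy sketch of that idea (my own illustration, not anything a specific model actually does): suppose each new token attends to a fixed window of recent tokens plus exponentially spaced "landmark" positions further back. Each step then touches O(log n) prior tokens instead of all n, and total work over n tokens grows roughly like n log(n). The `window` size and doubling schedule here are arbitrary choices for illustration.

```python
# Hypothetical sparse-attention pattern: recent window + log-spaced landmarks.
# Each token attends to O(log t) earlier positions instead of all t of them.

def attended_positions(t, window=4):
    """Positions token t attends to: a recent window plus landmarks at
    exponentially increasing distances (distant tokens get collapsed
    into ever-sparser representatives)."""
    recent = set(range(max(0, t - window), t))
    landmarks = set()
    d = window
    while t - d >= 0:
        landmarks.add(t - d)
        d *= 2  # doubling the look-back distance -> ~log2(t) landmarks
    return sorted(recent | landmarks)

n = 1024
total = sum(len(attended_positions(t)) for t in range(n))
# total grows like n*log(n); full attention would be n*(n-1)/2 pairs
```

For n = 1024 this attends to a few thousand positions in total, versus roughly half a million pairs for full quadratic attention.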

I mean, you'd expect that, for example, "knowledge worker" models (vs., say, "poetry" models) would possess some perturbative stability wrt. changes to/pruning of the remote previous tokens, at least those tokens which are less meaningful in the current context.

Personally, I feel the situation is good: performance engineering work becomes somewhat valuable again, as we're reaching the N where O(n^2) forces management to throw some money at engineers instead of at the hardware :)