Comment by thegeomaster

1 day ago

What's the "attention window"? Are you alleging these frontier models use something like SWA? Seems highly unlikely.

Well, attention is a matrix at the end of the day, and it scales quadratically (not exponentially) with context length: naively materializing the full score matrix for 1M tokens would take terabytes per head per layer, far more than any single accelerator holds. The windows may well be larger, say 16k to 32k, but you can look at how the GLM models work for more information.
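Rough numbers to make the point concrete. This is a minimal sketch, assuming fp16 scores and a naive kernel that actually materializes the matrix (FlashAttention-style kernels avoid storing it, but the quadratic compute scaling is the same), plus a toy sliding-window mask:

```python
import numpy as np

def full_attention_bytes(seq_len: int, dtype_bytes: int = 2) -> int:
    # Naive full attention materializes a seq_len x seq_len score
    # matrix per head, per layer: O(n^2) memory.
    return seq_len * seq_len * dtype_bytes

for n in (16_384, 32_768, 1_000_000):
    print(f"{n:>9} tokens -> {full_attention_bytes(n) / 1e9:,.1f} GB per head/layer")
# 1M tokens -> ~2,000 GB per head/layer if you stored it outright.

def sliding_window_mask(seq_len: int, window: int) -> np.ndarray:
    # Sliding-window attention (SWA): each token attends only to the
    # previous `window` tokens (itself included), so the live part of
    # the matrix grows as O(n * window) instead of O(n^2).
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)

print(sliding_window_mask(8, window=3).astype(int))
```

With a fixed window the per-token cost stays constant no matter how long the context gets, which is the whole appeal of SWA-style schemes.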

DeepSeek is the frontrunner in this kind of efficient long-context attention afaik (see their multi-head latent attention work, which compresses the KV cache rather than shrinking the window).