
Comment by macleginn

14 days ago

With architectural modifications such as FlashAttention and Ring Attention, we never need to "materialise" the N×N attention matrix, so memory has not been a real constraint for a couple of years now. As for compute, I suppose models operating with larger context windows will impose some kind of block sparsity on the attention weights, so they won't have to compute all N×N entries either.
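
To make the first point concrete, here is a minimal NumPy sketch (my own illustration, not the FlashAttention kernel or its API) of the tiled, online-softmax idea: keys and values are processed in blocks with a running max and running denominator per query row, so only an N×tile slice of scores ever exists in memory rather than the full N×N matrix. The tile size and function name are arbitrary choices.

```python
import numpy as np

def tiled_attention(Q, K, V, tile=128):
    """O = softmax(Q K^T / sqrt(d)) V, computed one key/value tile at a time."""
    n, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros_like(Q)
    row_max = np.full(n, -np.inf)   # running max of scores per query row
    row_sum = np.zeros(n)           # running softmax denominator per row

    for start in range(0, K.shape[0], tile):
        k_blk = K[start:start + tile]          # (t, d)
        v_blk = V[start:start + tile]          # (t, d)
        scores = (Q @ k_blk.T) * scale         # (n, t): only a slice, never n x n

        new_max = np.maximum(row_max, scores.max(axis=1))
        # Rescale previously accumulated results to the new running max.
        correction = np.exp(row_max - new_max)
        p = np.exp(scores - new_max[:, None])
        row_sum = row_sum * correction + p.sum(axis=1)
        out = out * correction[:, None] + p @ v_blk
        row_max = new_max

    return out / row_sum[:, None]

# Quick check against the naive NxN implementation on a small example.
rng = np.random.default_rng(0)
n, d = 512, 64
Q, K, V = rng.normal(size=(3, n, d))
scores = (Q @ K.T) / np.sqrt(d)
w = np.exp(scores - scores.max(axis=1, keepdims=True))
ref = (w / w.sum(axis=1, keepdims=True)) @ V
assert np.allclose(tiled_attention(Q, K, V), ref, atol=1e-6)
```

The same loop structure also shows where block sparsity would slot in: skipping key/value tiles for a given query block removes both the memory and the compute for those blocks.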

A less obvious, but in the limit more serious, problem with such large contexts is the training data. There aren't that many documents with 10M tokens to give to the model at test time, let alone during training. The creators of the IBM Granite model series had to resort to synthetic data just to scale to 128k tokens during training. Overall, this looks more like a marketing statement to me.