Comment by fheinsen

17 days ago

As the error introduced by a linear approximation approaches the same magnitude as the numerical error of the exact quadratic computation, don't the two become comparable in practice?

I ask because in practice, for inference, attention is typically computed with low-precision (4-bit, 8-bit, 16-bit) floats.

Numerical error may, in fact, be a key factor in why quadratic attention exhibits context rot in practice as the context grows longer, analogous to an RNN:

https://www.anthropic.com/engineering/effective-context-engi...

That website says nothing about numerical error potentially causing context rot.

  • As far as I know, there is no widely accepted explanation for context rot.

    Numerical error in long sequences of query-key dot-products may be a key factor.

    • That should be easy to test: run a 16-bit model on various benchmarks, once with fresh context and once with the context padded with irrelevant tokens, and record the relative performance degradation. Then repeat the same procedure with a quantized model and compare the two. If the quantized model shows a significantly larger relative performance drop from context rot, numerical error is likely a contributing cause.
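One cheap way to see the kind of arithmetic effect being hypothesized, short of the full benchmark experiment above, is to compare a softmax-weighted sum over n values computed in float16 against a float64 reference. This is a minimal NumPy sketch, not a measurement on any real model: the sequential float16 loop stands in for a naive low-precision accumulator, and as n grows the individual softmax weights shrink toward float16's subnormal range.

```python
import numpy as np

rng = np.random.default_rng(0)

def float16_weighted_sum_error(n: int) -> float:
    """Absolute error of a softmax-weighted sum over n scalar 'values'
    accumulated in float16, versus the same sum in float64."""
    logits = rng.standard_normal(n)
    values = rng.standard_normal(n)
    w = np.exp(logits - logits.max())
    w /= w.sum()                        # float64 softmax weights
    ref = float(np.sum(w * values))     # float64 reference result

    # Naive sequential accumulation entirely in float16, mimicking a
    # low-precision attention-output accumulator.  For large n the
    # weights w ~ 1/n approach float16's subnormal range (~6e-5).
    w16 = w.astype(np.float16)
    v16 = values.astype(np.float16)
    acc = np.float16(0.0)
    for wi, vi in zip(w16, v16):
        acc = np.float16(acc + np.float16(wi * vi))
    return abs(float(acc) - ref)

errs = {n: float16_weighted_sum_error(n) for n in (64, 1024, 16384)}
for n, e in errs.items():
    print(n, e)
```

Whether this arithmetic-level error actually explains context rot is exactly what the model-level comparison above would probe; the sketch only shows that the error is nonzero and dtype-dependent, not that it matters at the task level.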