Comment by OutOfHere
1 month ago
Context requires VRAM that scales quadratically with its length. That's why OpenAI still hasn't supported even a 200k context length for its 4o model.
Is there a trick that bypasses this scaling constraint while strictly preserving attention quality? I suspect that most such tricks lead to a performance loss deep in the context.
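To make the quadratic part concrete, here's a minimal sketch of naive attention (PyTorch, fp16 score matrix; the exact shapes and sizes are just back-of-the-envelope assumptions, not any particular model's config):

```python
# Naive attention builds an L x L score matrix per head, so its memory
# grows quadratically with context length L.
import torch

def naive_attention(q, k, v):
    # q, k, v: (batch, heads, L, d)
    scores = (q @ k.transpose(-2, -1)) / q.shape[-1] ** 0.5  # (batch, heads, L, L)
    return torch.softmax(scores, dim=-1) @ v

for L in (1_024, 8_192, 131_072):
    gib = L * L * 2 / 2**30  # fp16 bytes for one head's score matrix
    print(f"L={L:>7}: {gib:8.3f} GiB per head just for the scores")
```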
I wouldn't bet against this. Whether it's Ring attention, Mamba layers, or online fine-tuning, I assume this technical challenge will be conquered sooner rather than later. Gemini is getting good results on needle-in-a-haystack tests at 1M context length.
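For what it's worth, the core trick behind approaches like Ring attention (and FlashAttention) is a blockwise, online softmax, so the full L x L matrix never has to exist at once. Here's a rough single-process sketch, not a real distributed ring, with names of my own choosing:

```python
import torch

def blockwise_attention(q, k, v, block=1024):
    # q, k, v: (L, d). K/V are consumed block by block with a running
    # (online) softmax, so the full L x L score matrix is never stored.
    # Ring Attention shards these blocks across devices; here the "ring"
    # is just a Python loop in one process.
    d = q.shape[-1]
    out = torch.zeros_like(q)
    running_max = torch.full((q.shape[0], 1), float("-inf"))
    running_sum = torch.zeros(q.shape[0], 1)
    for start in range(0, k.shape[0], block):
        kb, vb = k[start:start + block], v[start:start + block]
        scores = q @ kb.T / d ** 0.5                          # (L, block)
        block_max = scores.max(dim=-1, keepdim=True).values
        new_max = torch.maximum(running_max, block_max)
        scale = torch.exp(running_max - new_max)              # rescale old state
        p = torch.exp(scores - new_max)
        running_sum = running_sum * scale + p.sum(dim=-1, keepdim=True)
        out = out * scale + p @ vb
        running_max = new_max
    return out / running_sum
```

Sanity check: for small L this matches plain softmax(q kᵀ / √d) @ v to numerical precision; the distributed version just passes the K/V blocks around a set of devices instead of looping locally.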
I suspect the sustainable value will be in providing context that isn't easily accessible as a copy-and-paste from your hard drive. Whatever that looks like.
Even subpar attention quality is typically better than human memory; we can imagine models that do some sort of triaging between a shorter high-quality attention context and an extremely long linear (or other) context.
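Something like this toy sketch of the triage idea (entirely hypothetical: the window size, the mean-pooled "memory", and the function names are placeholders, not how any shipping model works):

```python
import torch

WINDOW = 4_096  # tokens that still get full, exact attention (assumed size)

def triage_context(hidden_states):
    # hidden_states: (L, d) representations of the whole history.
    recent = hidden_states[-WINDOW:]          # short, high-quality attention span
    older = hidden_states[:-WINDOW]
    if older.shape[0] == 0:
        return recent
    # Stand-in for the "extremely long linear context": compress everything
    # older than the window into one cheap summary vector.
    memory = older.mean(dim=0, keepdim=True)  # (1, d)
    return torch.cat([memory, recent], dim=0)
```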
> Context requires quadratic VRAM
Even if this is not solved, there is so much economic benefit that tens of TBs of VRAM will become feasible.
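Back-of-the-envelope: even the linear part (the KV cache) gets huge at long contexts. Assuming a hypothetical 70B-class config (80 layers, 8 KV heads of dim 128, fp16) purely for illustration:

```python
LAYERS, KV_HEADS, HEAD_DIM, FP16_BYTES = 80, 8, 128, 2

def kv_cache_gib(context_len):
    # 2x for keys and values; one (context_len, head_dim) tensor per KV head per layer.
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * context_len * FP16_BYTES / 2**30

for n_tokens in (200_000, 1_000_000, 10_000_000):
    print(f"{n_tokens:>10} tokens -> {kv_cache_gib(n_tokens):7,.0f} GiB of KV cache")
```

Serve a batch of such requests concurrently, or a bigger model, and you're quickly into terabytes, so "tens of TBs of VRAM" isn't a crazy target.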