Comment by OutOfHere
1 month ago
Context requires VRAM that scales quadratically with its length. That's why OpenAI still hasn't supported even a 200k context length for its 4o model.
Is there a trick that bypasses this scaling constraint while strictly preserving attention quality? I suspect that most such tricks lead to a performance loss deep in the context.
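To make the quadratic part concrete, here's a minimal sketch of naive attention (PyTorch, fp16 score matrix; the exact shapes and sizes are just back-of-the-envelope assumptions, not any particular model's config):

```python
# Naive attention builds an L x L score matrix per head, so its memory
# grows quadratically with context length L.
import torch

def naive_attention(q, k, v):
    # q, k, v: (batch, heads, L, d)
    scores = (q @ k.transpose(-2, -1)) / q.shape[-1] ** 0.5  # (batch, heads, L, L)
    return torch.softmax(scores, dim=-1) @ v

for L in (1_024, 8_192, 131_072):
    gib = L * L * 2 / 2**30  # fp16 bytes for one head's score matrix
    print(f"L={L:>7}: {gib:8.3f} GiB per head just for the scores")
```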
I wouldn't bet against this. Whether it's Ring attention, Mamba layers, or online fine-tuning, I assume this technical challenge will be conquered sooner rather than later. Gemini is getting good results on needle-in-a-haystack tests at 1M context length.
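For what it's worth, the core trick behind approaches like Ring attention (and FlashAttention) is a blockwise, online softmax, so the full L x L matrix never has to exist at once. Here's a rough single-process sketch, not a real distributed ring, with names of my own choosing:

```python
import torch

def blockwise_attention(q, k, v, block=1024):
    # q, k, v: (L, d). K/V are consumed block by block with a running
    # (online) softmax, so the full L x L score matrix is never stored.
    # Ring Attention shards these blocks across devices; here the "ring"
    # is just a Python loop in one process.
    d = q.shape[-1]
    out = torch.zeros_like(q)
    running_max = torch.full((q.shape[0], 1), float("-inf"))
    running_sum = torch.zeros(q.shape[0], 1)
    for start in range(0, k.shape[0], block):
        kb, vb = k[start:start + block], v[start:start + block]
        scores = q @ kb.T / d ** 0.5                          # (L, block)
        block_max = scores.max(dim=-1, keepdim=True).values
        new_max = torch.maximum(running_max, block_max)
        scale = torch.exp(running_max - new_max)              # rescale old state
        p = torch.exp(scores - new_max)
        running_sum = running_sum * scale + p.sum(dim=-1, keepdim=True)
        out = out * scale + p @ vb
        running_max = new_max
    return out / running_sum
```

Sanity check: for small L this matches plain softmax(q kᵀ / √d) @ v to numerical precision; the distributed version just passes the K/V blocks around a set of devices instead of looping locally.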
I suspect the sustainable value will be in providing context that isn't easily accessible as a copy-and-paste from your hard drive. Whatever that looks like.
Even subpar attention quality is typically better than human memory; we can imagine models that do some sort of triaging between a shorter high-quality attention context and an extremely long linear (or other) context.
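Something like this toy sketch of the triage idea (entirely hypothetical: the window size, the mean-pooled "memory", and the function names are placeholders, not how any shipping model works):

```python
import torch

WINDOW = 4_096  # tokens that still get full, exact attention (assumed size)

def triage_context(hidden_states):
    # hidden_states: (L, d) representations of the whole history.
    recent = hidden_states[-WINDOW:]          # short, high-quality attention span
    older = hidden_states[:-WINDOW]
    if older.shape[0] == 0:
        return recent
    # Stand-in for the "extremely long linear context": compress everything
    # older than the window into one cheap summary vector.
    memory = older.mean(dim=0, keepdim=True)  # (1, d)
    return torch.cat([memory, recent], dim=0)
```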
> Context requires quadratic VRAM
Even if this is not solved, there is so much economic benefit that tens of TBs of VRAM will become feasible.
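Back-of-the-envelope: even the linear part (the KV cache) gets huge at long contexts. Assuming a hypothetical 70B-class config (80 layers, 8 KV heads of dim 128, fp16) purely for illustration:

```python
LAYERS, KV_HEADS, HEAD_DIM, FP16_BYTES = 80, 8, 128, 2

def kv_cache_gib(context_len):
    # 2x for keys and values; one (context_len, head_dim) tensor per KV head per layer.
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * context_len * FP16_BYTES / 2**30

for n_tokens in (200_000, 1_000_000, 10_000_000):
    print(f"{n_tokens:>10} tokens -> {kv_cache_gib(n_tokens):7,.0f} GiB of KV cache")
```

Serve a batch of such requests concurrently, or a bigger model, and you're quickly into terabytes, so "tens of TBs of VRAM" isn't a crazy target.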