Comment by zozbot234

8 hours ago

I think speeding up long context and opening up the use of models with larger shared layers is ultimately more relevant than hosting unused MoE layers. Of course you could do the latter as a last resort, i.e. when running with a smaller context leaves some VRAM free.

Long context will be solved: recall will be capped and turned into a Θ(1) operation or, at worst, Θ(log n). People don't have infinite perfect recall, so agents don't need it either. Also, there are really good solutions to this that just aren't explored enough right now, since transformer architectures are where everyone is dumping money and time. I suspect someone will very soon have a much better system that just takes over, and then the idea of context limits will be a thing of the past. I've actually built something myself that allows infinite context / perfect recall in Θ(1) (minor asterisk there, as there has to be, but meh). I know others have solutions too.

  • There are already models with capped long context, but if you make the whole model that way, needle-in-a-haystack search becomes impossible, and that's actually a very common operation. Which is why Qwen 3.5 only caps a portion of its layers, and AIUI the new Nemotron models are broadly similar.
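The needle-in-a-haystack point can be shown with attention masks. This is a generic sketch of sliding-window ("capped") vs. full causal attention, with made-up sequence length, window size, and needle position; it isn't taken from any of the models named above:

```python
# Sketch: a query attending through a sliding-window layer cannot reach a
# "needle" that fell outside the window, while a full causal layer in the
# same stack still can. All sizes below are arbitrary for illustration.

def sliding_window_mask(seq_len: int, window: int) -> list[list[bool]]:
    # mask[q][k] is True when query position q may attend to key position k
    return [[(q - window < k <= q) for k in range(seq_len)]
            for q in range(seq_len)]

def full_causal_mask(seq_len: int) -> list[list[bool]]:
    # standard causal mask: every query sees all earlier positions
    return [[k <= q for k in range(seq_len)] for q in range(seq_len)]

seq_len, window, needle_pos = 1024, 128, 3
win = sliding_window_mask(seq_len, window)
full = full_causal_mask(seq_len)

# The last query cannot reach the early needle through the windowed layer...
print(win[seq_len - 1][needle_pos])   # False
# ...but a full-attention layer in a hybrid stack still can.
print(full[seq_len - 1][needle_pos])  # True
```

This is why hybrid stacks keep some full-attention layers: the windowed layers give near-linear cost on most of the depth, and the few full layers preserve long-range retrieval.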