Comment by Baeocystin
14 days ago
I assume they're getting these massive windows via RAG trickery, vectorization, and other tricks behind the curtain, because I've noticed the same as you: things start dipping in quality pretty quickly.
Does anyone know if I am correct in my assumption?
There's no "RAG trickery" or vector search. They changed the way they encode positions such that in theory they're less sensitive to where the token appears in the string.
Previous long-context models took a similar approach, but as most people noticed, the earlier iterations didn't hold up well: technically the model "worked" with longer contexts, but it would definitely get dumber. It's still too early to tell how this newer variant performs, although I'd assume it's at least somewhat better.
The large context windows generally involve RoPE [0], a trick that lets the training context window stay small while expanding to longer contexts at inference time. It seems like they have a new variant, "iRoPE", which might perform better?
[0] https://arxiv.org/pdf/2104.09864
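
For anyone curious what RoPE actually does, here's a minimal sketch in PyTorch (this uses the half-split channel pairing common in open implementations; the paper pairs adjacent channels, and all names here are my own, not from any particular codebase):

  import torch

  def rope(x, positions, base=10000.0):
      # Rotate each channel pair of x (shape: [n, dim]) by an angle
      # proportional to the token's position; frequencies follow the
      # paper's theta_i = base^(-2i/dim) schedule.
      dim = x.shape[-1]
      half = dim // 2
      freqs = base ** (-2.0 * torch.arange(half, dtype=torch.float32) / dim)
      angles = positions.to(torch.float32)[:, None] * freqs[None, :]
      cos, sin = angles.cos(), angles.sin()
      x1, x2 = x[..., :half], x[..., half:]
      return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

  # The property that makes length-extension tricks possible: the score
  # between a query at position m and a key at position n depends only
  # on the offset m - n, not on the absolute positions.
  q, k = torch.randn(1, 64), torch.randn(1, 64)
  near = rope(q, torch.tensor([5])) @ rope(k, torch.tensor([3])).T
  far = rope(q, torch.tensor([105])) @ rope(k, torch.tensor([103])).T
  print(torch.allclose(near, far, atol=1e-3))  # True: same offset, same score

Because attention scores only see relative offsets, you can train on short sequences and probe longer ones at inference, with interpolation tricks layered on top to keep the rotation angles in a range the model has actually seen.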