Comment by frodo8sam
2 months ago
I would guess that the FFT scales better as you increase the number of tokens in the context window. It's interesting that Google's models outperform their competitors on context size.
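Roughly the scaling argument, as a hedged sketch: if "the FFT" here means FNet-style Fourier token mixing (an assumption on my part), the mixing step costs O(n log n) in sequence length n, versus O(n^2) for full self-attention. The toy attention below is unprojected single-head, just to show the asymptotics:

```python
import numpy as np

def fft_token_mixing(x):
    """FNet-style mixing: 2D FFT over (sequence, hidden) dims, keep the real part.
    Cost is O(n log n) in the sequence length n."""
    return np.fft.fft2(x).real

def self_attention(x):
    """Toy single-head attention (no learned projections, for comparison only).
    Building the n x n score matrix makes this O(n^2) in sequence length."""
    scores = x @ x.T / np.sqrt(x.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x

n, d = 4096, 64          # sequence length, hidden size (arbitrary example values)
x = np.random.randn(n, d)
_ = fft_token_mixing(x)  # ~n log n work: grows gently with context length
_ = self_attention(x)    # ~n^2 work: quadratic blow-up with context length
```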
2 months ago
I'm glad someone else had the same thought. I've been wondering for a while what their "secret sauce" is, given that their models don't degrade at long context nearly as much as other LLMs that are otherwise competitive. It could also just be that they used longer-context training data than anyone else, though.