Comment by ttul

2 years ago

This paper was written by a very small team at Google. It strikes me as similar in that regard to the original transformers paper. If this technique scales well, Google is no doubt already exploiting it for their next generation models -- and I think there are signs that Gemini 2.0 models already exploit this.

0 comments