Comment by jszymborski
6 hours ago
Totally agree that it is "encoding" in the general sense, but I think they are referring to the lack of an "encoder" neural network.
In hindsight I may have been pedantic.
I had a similar thought to you, and found your question and the resulting discussion helpful!
Not at all. I had the same reaction the first time I read it. I think the key is that the "encoder" they're using is just a linear projection, which is probably pretty fast and memory-efficient. A single matmul instead of a ViT encoder is probably a huge win.
Not at all. Getting really pedantic, tokenization is also a form of encoding, so whatever modality you're using, you'll end up doing some kind of encoding along the way.
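To make the "single matmul" point concrete, here is a minimal NumPy sketch of a linear patch projection. It assumes ViT-style fixed-size square patches. The function names, patch size, and embedding width are illustrative choices of mine, not taken from the work being discussed.

```python
import numpy as np

def patchify(image, patch):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    gh, gw = h // patch, w // patch
    # (gh, patch, gw, patch, c) -> (gh, gw, patch, patch, c)
    x = image.reshape(gh, patch, gw, patch, c).transpose(0, 2, 1, 3, 4)
    return x.reshape(gh * gw, patch * patch * c)

def linear_patch_embed(image, weight, bias, patch):
    """The whole 'encoder': one matrix multiply plus a bias."""
    return patchify(image, patch) @ weight + bias

# Hypothetical sizes, chosen only for illustration.
rng = np.random.default_rng(0)
patch, d_model = 16, 64
image = rng.standard_normal((64, 48, 3))
weight = rng.standard_normal((patch * patch * 3, d_model)) * 0.02
bias = np.zeros(d_model)

tokens = linear_patch_embed(image, weight, bias, patch)
print(tokens.shape)  # (12, 64): 4 x 3 patches, each mapped to 64 dims
```

The only learned parameters here are `weight` and `bias`. A ViT encoder would stack attention and MLP layers on top of a projection like this one, which is where the extra compute and memory would go.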
Tokens are such a strange base unit. Couldn't we do something that naturally conforms better to reality than such choppy units that cause all sorts of artifacts? Making everything 'language based' prevents true multi-modality. Thinking isn't done in language. Thinking outputs language, but it's far more like multiple waves of data coalescing into an 'idea', internal... subjectively (n=1) at least. I think wave/signal-based transformers are the next jump.
After that, an s1/s2 system (fast generation, with slow wave correction / observation operating over the fast generation) seems like the next leap forward.
Tokens create and hide too many problems to be the 'optimal' solution.