Comment by jszymborski

5 hours ago

Totally agree that it is "encoding" in the general sense, but I think they are referring to the lack of an "encoder" neural network.


jszymborski

In hindsight I may have been pedantic.

wilkystyle 5 hours ago

I had a similar thought to you, and found your question and the resulting discussion helpful!
santiagobasulto 2 hours ago

Not at all, I had the same feeling as you the first time I read it. I think the key is that the "encoder" they're using is just a linear projection, which is probably pretty fast and memory-efficient. A single matmul vs. a ViT encoder is probably a huge win.
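
To make the "single matmul" point concrete, here is a rough sketch of what a linear-projection "encoder" could look like. This is my own toy illustration, not code from the work being discussed: the function names, the 8x8 image, the patch size, and the embedding width are all made up, and it assumes the image dimensions divide evenly by the patch size.

```python
import random

def patchify(image, patch):
    """Split an H x W grayscale image (list of rows) into flattened patch vectors.

    Assumes H and W are multiples of `patch` (a simplification for this sketch).
    """
    h, w = len(image), len(image[0])
    patches = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            patches.append([image[top + i][left + j]
                            for i in range(patch) for j in range(patch)])
    return patches

def linear_project(patches, weight):
    """One matmul: (num_patches x patch_dim) @ (patch_dim x embed_dim)."""
    embed_dim = len(weight[0])
    return [[sum(p * w_row[k] for p, w_row in zip(vec, weight))
             for k in range(embed_dim)]
            for vec in patches]

random.seed(0)
image = [[random.random() for _ in range(8)] for _ in range(8)]  # toy 8x8 image
patch, embed_dim = 4, 6                                          # hypothetical sizes
# Hypothetical learned weights; here just small random numbers.
weight = [[random.gauss(0, 0.02) for _ in range(embed_dim)]
          for _ in range(patch * patch)]

tokens = linear_project(patchify(image, patch), weight)
print(len(tokens), len(tokens[0]))  # 4 patch embeddings, each of width 6
```

The whole "encoder" is the one weight matrix: no attention layers, no stack of blocks, which is where the speed and memory savings over a ViT would presumably come from.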
alberto467 4 hours ago
Not at all. Getting really pedantic, tokenization is also a form of encoding, so whatever modality you're using, you'll end up doing some type of encoding one way or another.
- altruios 4 hours ago
  
  Tokens are such a strange base unit. Couldn't we do something that conforms more naturally to reality than such choppy units, which cause all sorts of artifacts? Making everything 'language based' prevents true multi-modality. Thinking isn't done in language. Thinking outputs language, but it's far more like multiple waves of data coalescing into an 'idea', internal... subjectively (n=1) at least. I think wave/signal-based transformers are the next jump.
  After that, an s1/s2 system (fast generation, with slow wave correction / observation operating over the fast generation) seems like the next leap forward.
  Tokens create and hide too many problems to be the 'optimal' solution.
  