Comment by jszymborski

5 hours ago

Totally agree that it is "encoding" in the general sense, but I think they are referring to the lack of an "encoder" neural network.

In hindsight I may have been pedantic.

  • Not at all, I had the same reaction the first time I read it. I think the key is that the "encoder" they're using is just a linear projection, which is probably pretty fast and memory-efficient. A single matmul vs. a ViT encoder is probably a huge win.
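
To make the "single matmul" point concrete, here is a minimal NumPy sketch of a linear patch projection. The image size, patch size, embedding width, random weights, and the helper names `patchify` and `linear_patch_embed` are all illustrative assumptions, not details taken from the work being discussed.

```python
import numpy as np


def patchify(image, patch):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    gh, gw = h // patch, w // patch
    x = image.reshape(gh, patch, gw, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)  # -> (gh, gw, patch, patch, c)
    return x.reshape(gh * gw, patch * patch * c)


def linear_patch_embed(image, weight, bias, patch):
    """The whole 'encoder': one matrix multiply over the flattened patches."""
    return patchify(image, patch) @ weight + bias


# Hypothetical sizes, chosen only for illustration.
patch, d_model = 16, 768
rng = np.random.default_rng(0)
image = rng.standard_normal((224, 224, 3))
weight = rng.standard_normal((patch * patch * 3, d_model)) * 0.02
bias = np.zeros(d_model)

tokens = linear_patch_embed(image, weight, bias, patch)
print(tokens.shape)  # (196, 768): one embedding per 16x16 patch
```

A ViT encoder would run a stack of attention and MLP layers on top of an embedding like this, which is where the extra compute and memory would go.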

  • Not at all. Getting really pedantic, tokenization is also a form of encoding, so whatever modality you're using, you'll end up doing some kind of encoding.

    • Tokens are such a strange base unit. Couldn't we use something that conforms more naturally to reality than such choppy units, which cause all sorts of artifacts? Making everything 'language based' prevents true multi-modality. Thinking isn't done in language. Thinking outputs language, but it's far more like multiple waves of data coalescing into an 'idea', internal... subjectively (n=1) at least. I think wave/signal-based transformers are the next jump.

      After that, an s1/s2 system seems like the next leap forward: fast generation, with slow wave correction/observation operating over it.

      Tokens create and hide too many problems to be the 'optimal' solution.
