Comment by minimaxir

6 hours ago

In hindsight I may have been pedantic.

Not at all, I had the same feeling as yours the first time I read it. I think the key is that the "encoder" they're using is just a linear projection, which is probably pretty fast and memory efficient. A single matmul vs a ViT encoder is probably a huge win.
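To make the "single matmul" point concrete, here is a minimal sketch of a linear-projection image "encoder": cut the image into patches, flatten each one, and multiply by one weight matrix. The names (`patchify`, `linear_project`), the sizes, and the random weights are all made up for illustration, and it assumes the image dimensions divide evenly by the patch size. It isn't meant to reflect any particular model's actual code.

```python
import random


def patchify(image, patch):
    """Split an H x W grayscale image (list of rows) into flattened patches.

    Assumes H and W are both multiples of `patch`.
    """
    h, w = len(image), len(image[0])
    patches = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            patches.append([image[top + r][left + c]
                            for r in range(patch)
                            for c in range(patch)])
    return patches


def linear_project(patches, weight):
    """One matmul: (num_patches x patch_dim) @ (patch_dim x d_model)."""
    d_model = len(weight[0])
    return [[sum(x * w_row[j] for x, w_row in zip(p, weight))
             for j in range(d_model)]
            for p in patches]


random.seed(0)
H = W = 8        # hypothetical image size
PATCH = 4        # hypothetical patch size
D_MODEL = 6      # hypothetical embedding width

image = [[random.random() for _ in range(W)] for _ in range(H)]
# Random stand-in for a learned projection matrix.
weight = [[random.gauss(0, 0.02) for _ in range(D_MODEL)]
          for _ in range(PATCH * PATCH)]

patches = patchify(image, PATCH)
embeddings = linear_project(patches, weight)
print(len(patches), len(patches[0]), len(embeddings[0]))
```

There are no attention layers and no stack of transformer blocks in front of the main model here, which is where the speed and memory savings over a ViT encoder would come from.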

Not at all. Getting really pedantic, tokenization is also a form of encoding, so whatever modality you're using, you'll end up doing some type of encoding one way or another.
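A small sketch of why tokenization counts as encoding: text becomes integer IDs, and the embedding lookup that follows is equivalent to multiplying a one-hot vector by a matrix, i.e. another linear projection. The byte-level tokenizer and the toy embedding table are assumptions chosen to keep the example short; real tokenizers usually use subword vocabularies.

```python
def tokenize(text):
    """Toy byte-level tokenizer: each UTF-8 byte is one token ID (0..255)."""
    return list(text.encode("utf-8"))


def embed_lookup(ids, table):
    """The usual embedding step: pick row `i` of the table for each ID."""
    return [table[i] for i in ids]


def embed_one_hot(ids, table):
    """Same result, written as one-hot vector times the embedding matrix."""
    vocab, dim = len(table), len(table[0])
    out = []
    for i in ids:
        one_hot = [1.0 if k == i else 0.0 for k in range(vocab)]
        out.append([sum(one_hot[k] * table[k][j] for k in range(vocab))
                    for j in range(dim)])
    return out


VOCAB, DIM = 256, 4  # hypothetical sizes
# Deterministic stand-in for a learned embedding table.
table = [[k * 0.01 + j for j in range(DIM)] for k in range(VOCAB)]

ids = tokenize("hi")
by_lookup = embed_lookup(ids, table)
by_matmul = embed_one_hot(ids, table)
print(ids, by_lookup == by_matmul)
```

Seen this way, a text token and an image patch both enter the model through a learned linear map. The difference is only whether the input vector is one-hot or dense.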

  • Tokens are such a strange base unit. Couldn't we do something that naturally conforms better to reality than such choppy units that cause all sorts of artifacts? Making everything 'language based' prevents true multi-modality. Thinking isn't done in language. Thinking outputs language, but it's far more like multiple waves of data coalescing into an 'idea', internal... subjectively (n=1) at least. I think wave/signal based transformers are the next jump.

    After that, an s1/s2 system (fast generation, with slow wave correction / observation operating over the fast generation) seems like the next leap forward.

    Tokens create and hide too many problems to be the 'optimal' solution.

    • > making everything 'language based' prevents true multi-modality. Thinking isn't done in language. Thinking outputs language

      Your problem isn't with tokens, but with "language". Tokens have little to do with language, other than usually being consumed in sequence, but that's true of anything that has to span over time. Thinking of tokens as letters or subwords is mistaking the general with the specific. We may have started with letters and words and subwords (trying to find the best balance for training), but then people figured why not add pixel patches to the dictionary, and then sounds, and then other signals, and after iterating on it a bit, we now have image and sound and symbol sequence data all being part of the same token space.

      LLMs stopped being about "language" - in the sense of English, or C++ - a long, long time ago. We're still using tokens, but they're more like quanta of sensory input now.

      You can take it in two directions, I guess - either consider "Large Language Model" to be an anachronym, a name that couldn't keep up with times, but we got used to it back when it made sense, or alternatively, just broaden your understanding of "language" to encompass any pattern of quantized sensory inputs, regardless of modality :).

    • Not to be too snarky, but there are a few trillion dollars and some of the brightest minds of our generation working on this. I’m sure there’s a reason why they’ve settled for or are stuck on tokenization.
