Comment by altruios

5 hours ago

Tokens are such a strange base unit. Couldn't we do something that naturally conforms better to reality than such choppy units that cause all sorts of artifacts? Making everything 'language based' prevents true multi-modality. Thinking isn't done in language. Thinking outputs language, but it's far more like multiple waves of data coalescing into an 'idea', internal... subjectively (n=1) at least. I think wave/signal based transformers are the next jump.

After that, an s1/s2 system (fast generation, with slow wave correction/observation operating over the fast generation) seems like the next leap forward.

Tokens create and hide too many problems to be the 'optimal' solution.

> making everything 'language based' prevents true multi-modality. Thinking isn't done in language. Thinking outputs language

Your problem isn't with tokens, but with "language". Tokens have little to do with language, other than usually being consumed in sequence, but that's true of anything that has to span over time. Thinking of tokens as letters or subwords is mistaking the general for the specific. We may have started with letters and words and subwords (trying to find the best balance for training), but then people figured why not add pixel patches to the dictionary, and then sounds, and then other signals, and after iterating on it a bit, we now have image and sound and symbol sequence data all being part of the same token space.

LLMs stopped being about "language" - in the sense of English, or C++ - a long, long time ago. We're still using tokens, but they're more like quanta of sensory input now.

You can take it in two directions, I guess - either consider "Large Language Model" to be an anachronym, a name that couldn't keep up with the times, but that we got used to back when it made sense, or alternatively, just broaden your understanding of "language" to encompass any pattern of quantized sensory inputs, regardless of modality :).
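To make "pixel patches in the dictionary" concrete, here is a minimal sketch of one way it can work: cut the image into patches, snap each patch to its nearest entry in a codebook, and shift those code indices past the text IDs so both live in one ID range. Everything named here is an assumption for illustration (the sizes, the random `codebook` standing in for a learned one, the `image_to_tokens` helper); real models differ, and many skip discrete image codes entirely.

```python
import numpy as np

TEXT_VOCAB_SIZE = 1000  # hypothetical size of the text vocabulary
CODEBOOK_SIZE = 256     # hypothetical number of image codes
PATCH = 4               # hypothetical patch side, in pixels

rng = np.random.default_rng(0)
# Stand-in for a learned codebook: one prototype vector per image code.
codebook = rng.normal(size=(CODEBOOK_SIZE, PATCH * PATCH * 3))


def image_to_tokens(image: np.ndarray) -> np.ndarray:
    """Map an (H, W, 3) image to one token ID per PATCH x PATCH patch.

    Assumes H and W are multiples of PATCH.
    """
    h, w, c = image.shape
    # Cut the image into non-overlapping patches and flatten each one.
    patches = (
        image.reshape(h // PATCH, PATCH, w // PATCH, PATCH, c)
        .transpose(0, 2, 1, 3, 4)
        .reshape(-1, PATCH * PATCH * c)
    )
    # Nearest codebook entry per patch (squared Euclidean distance).
    dists = ((patches[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    codes = dists.argmin(axis=1)
    # Shift image codes past the text IDs so both share one ID range.
    return codes + TEXT_VOCAB_SIZE


text_tokens = np.array([17, 42, 7])  # made-up text token IDs
image_tokens = image_to_tokens(rng.normal(size=(8, 8, 3)))
# One flat sequence of integer IDs; the model never sees "text" vs "image".
sequence = np.concatenate([text_tokens, image_tokens])
```

From the sequence model's point of view, `sequence` is just integers from a single vocabulary of `TEXT_VOCAB_SIZE + CODEBOOK_SIZE` entries, which is the sense in which tokens are not tied to language.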

  • Can you elaborate more on what a token looks like as a pixel patch/sound/general signal as it currently is (in this model)?

    My understanding of pixel representation is: slice the image into a grid; each square slice gets projected into a number array x long (not sure how long x is, or if it's variable), which then gets projected down to a token representing that space (3-4 alphanumeric characters long) and AGAIN gets passed into a "position detector", which outputs a token representing that pixel/position. That then gets passed into the lmm (as a significantly reduced/translated signal in token space).

    First, before continuing: do I have that mostly correct?
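For comparison, a common arrangement in ViT-style vision encoders can be sketched as follows. This is an assumption about a typical design, not a description of the model under discussion. Each flattened patch is multiplied by a projection matrix to get a fixed-length vector (the "x long" array), and a position vector is added to it; there is no intermediate alphanumeric token. The names `W_proj`, `pos_table`, and `embed_patches`, the sizes, and the random values standing in for learned weights are all hypothetical.

```python
import numpy as np

PATCH = 4     # hypothetical patch side, in pixels
GRID = 2      # hypothetical grid size: GRID x GRID patches per image
D_MODEL = 32  # hypothetical embedding width (the "x" in the question)

rng = np.random.default_rng(0)
# Random stand-ins for weights that would normally be learned.
W_proj = rng.normal(size=(PATCH * PATCH * 3, D_MODEL)) * 0.02
pos_table = rng.normal(size=(GRID * GRID, D_MODEL)) * 0.02


def embed_patches(image: np.ndarray) -> np.ndarray:
    """Turn a (GRID*PATCH, GRID*PATCH, 3) image into per-patch vectors."""
    h, w, c = image.shape
    assert h == w == GRID * PATCH, "sketch only handles one fixed image size"
    # Slice the image into a grid of patches and flatten each patch.
    patches = (
        image.reshape(GRID, PATCH, GRID, PATCH, c)
        .transpose(0, 2, 1, 3, 4)
        .reshape(GRID * GRID, PATCH * PATCH * c)
    )
    # Linear projection to D_MODEL, plus one position vector per grid cell.
    return patches @ W_proj + pos_table


embeddings = embed_patches(rng.normal(size=(8, 8, 3)))
```

In this sketch `embeddings` has one row per patch and `D_MODEL` columns, and those rows are what would be fed to the transformer alongside text embeddings of the same width.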

Not to be too snarky, but there are a few trillion dollars and some of the brightest minds of our generation working on this. I’m sure there’s a reason why they’ve settled for or are stuck on tokenization.

This sounds like when crystal people talk quantum physics.

  • I agree with the GP. The idea that there's not a better intermediate representation between tokens and embedding vectors seems absurd. But how to arrive at such a representation and implement it effectively is a few zeroes above my pay grade.

    • I find your agreement seductive because it sidesteps the unfounded assertions and simply asserts there must be something different and we don’t know it, which is easy for me to agree with too. Or maybe hard to disagree with.