Comment by tcdent

3 months ago

"Kill the tokenizer" is such a wild proposition but is also founded in fundamentals.

Tokenizing text is such a hack even though it works pretty well. The state-of-the-art comes out of the gate with an approximation for quantifying language that's wrong on so many levels.

It's difficult to wrap my head around pixels being a more powerful representation of information, but someone's gotta come up with something other than the tokenizer.

I consume all text as images when I read, being a vision-capable person, so it kinda passes the "evolution does it that way" test, and maybe we shouldn't be that surprised that vision is a great input method?

Actually, thinking more about that: I consume “text” as images and also as sounds… I kinda wonder, if instead of render-and-OCR like this suggests we did TTS and just encoded, say, the mp3 sample of the vocalization of the word, whether that would be fewer bytes than the rendered-pixels version… probably depends on the resolution / sample rate.
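
Back-of-envelope, with made-up numbers for glyph size, bit depth, and codec bitrate (assumptions for illustration, not measurements), the comparison looks something like this:

```python
# Rough comparison of bytes-per-word: rendered pixels vs. a compressed speech sample.
# Every constant below is an assumption for illustration, not a measurement.

GLYPH_W, GLYPH_H = 8, 16      # assumed pixel footprint per character
BITS_PER_PIXEL = 1            # assume 1-bit black/white rendering
CHARS_PER_WORD = 5            # rough average English word length

pixel_bytes = CHARS_PER_WORD * GLYPH_W * GLYPH_H * BITS_PER_PIXEL / 8

WORD_DURATION_S = 0.4         # roughly 150 words per minute of speech
MP3_BITRATE_BPS = 32_000      # assume a low-bitrate, speech-quality mp3

audio_bytes = WORD_DURATION_S * MP3_BITRATE_BPS / 8

print(f"pixels: ~{pixel_bytes:.0f} B/word, mp3: ~{audio_bytes:.0f} B/word")
# With these assumptions: ~80 B/word of pixels vs ~1600 B/word of mp3,
# so the rendered version wins unless the audio is compressed much harder.
```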

  • Funny, I habitually read while engaging TTS on the same text. I have even made a Chrome extension for web reading: it highlights text and reads it, while keeping the current position in the viewport. I find that using two modalities at the same time improves my concentration. TTS is sped up to 1.5x to match my reading speed. Maybe it is just because I want to reduce visual strain; since I consume a lot of text every day, it can be tiring.

    • This feature is also built into Edge (and I agree it's great), but I mostly use it so I can listen to pages while doing chores around the office or closing my eyes.

      What I would love is an easy way to just convert a page to an mp3 that queues into my podcast app, to listen to while taking a walk or driving. It probably exists, but I haven't spent a lot of time looking into it.

    • I do this too. It's great. The term I've seen used to describe this is 'Immersion Reading'. It seems to be quite a popular way for neurodivergent people to get into reading.

    • Any chance you could share the source?

      I found that I can read better if individual words or chunks are highlighted in alternating pastel colors while I scan them with my eyes.

  • The pixels-to-sound conversion would pass through “reading”, so there might be information loss. It is no longer just pixels.

Ok but what are you going to decode into at generation time, a jpeg of text? Tokens have value beyond how text appears to the eye, because we process text in many more ways than just reading it.

  • There are some concerns here that should be addressed separately:

    > Ok but what are you going to decode into at generation time, a jpeg of text?

    Presumably, the output may still be in token space, but for the purpose of conditioning on context for the immediate next token, it must then be immediately translated back into a suitable input space (a toy version of that loop is sketched after these replies).

    > we process text in many more ways than just reading it

    Since a token stream is a straightforward function of textual input, in the case of textual input we should expect the conversion of the character stream into semantic/syntactic units to happen inside the LLM.

    Moreover, in the case of OCR, graphical information preserves or degrades information in the way that humans expect; what comes to mind is the eggplant emoji's phallic symbolism, or smiling emoji sharing a graphical similarity that can't be deduced from proximity in Unicode code points.

  • Output really doesn't have to be the same datatype as the input. Text tokens are good enough for a lot of interesting applications, and transforming percels (a name suggested by another commenter here) into text tokens is exactly what an OCR model is trained to do anyway.
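
To make the "decode tokens, re-render into input space" point above concrete, here is a toy sketch; both helper functions and the whole loop are stand-ins I made up, not anything from the paper or the tweet:

```python
# Toy sketch of generating in token space while conditioning on pixels.
# Both functions below are stand-in stubs; nothing here is the actual
# interface of any real model.

from typing import List

def render_to_patches(text: str) -> List[bytes]:
    """Stub for rasterizing text and cutting it into fixed-size vision patches."""
    return [text[i:i + 4].encode() for i in range(0, len(text), 4)]

def decode_next_token(visual_context: List[bytes]) -> str:
    """Stub for a decoder head that still emits ordinary text tokens."""
    return " world" if len(visual_context) < 3 else "<eos>"

def generate(prompt: str, max_steps: int = 16) -> str:
    text = prompt
    for _ in range(max_steps):
        context = render_to_patches(text)   # everything produced so far goes back in as pixels
        token = decode_next_token(context)  # but the output itself is still a text token
        if token == "<eos>":
            break
        text += token
    return text

print(generate("hello"))  # -> "hello world"
```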

Using pixels is still tokenizing. What's needed is something more like "Byte Latent Transformers", which use dynamically sized patches based on information content rather than tokens.
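
For a concrete feel of what "dynamically sized patches based on information content" means, here is a toy sketch. The real Byte Latent Transformer uses a small learned model to estimate next-byte entropy; this stand-in just uses byte-frequency surprisal:

```python
# Toy illustration of the idea: patch boundaries driven by information content.
# A crude frequency-based surprisal stands in for a learned entropy model.

from collections import Counter
import math

def byte_surprisal(data: bytes) -> list[float]:
    counts = Counter(data)
    total = len(data)
    return [-math.log2(counts[b] / total) for b in data]   # -log2 p(byte)

def dynamic_patches(data: bytes, threshold: float = 4.0) -> list[bytes]:
    """Open a new patch whenever the current byte is 'surprising' enough."""
    patches, start = [], 0
    for i, s in enumerate(byte_surprisal(data)):
        if s > threshold and i > start:
            patches.append(data[start:i])
            start = i
    patches.append(data[start:])
    return patches

print(dynamic_patches(b"aaaaaaaaXbbbbbbbbYcccccccc"))
# -> [b'aaaaaaaa', b'Xbbbbbbbb', b'Ycccccccc']
# Rare bytes (X, Y) carry more information and open new patches;
# long predictable runs get lumped together.
```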

I guess it is because of the absurdly high information density of text - so text is quite a good input.

I do not get it, either. How can a picture of text be better than the text itself? Why not take a picture of the screen while you're at it, so the model learns how cameras work?

  • In a very simple way: because the image can be fed directly into the network without first having to transform the text into a series of tokens, as we do now (a toy version is sketched at the end of this thread).

    But the tweet itself is kinda an answer to the question you're asking.

  • From the paper I saw that the model includes an approximation of the layout, diagrams and other images of the source documents.

    Now imagine growing up only allowed to read books and the internet through a browser with CSS, images and JavaScript disabled. You’d be missing out on a lot of context and side-channel information.
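
As a very simplified version of "fed directly into the network" from a few comments up: the sketch below just rasterizes a string with Pillow and cuts the bitmap into ViT-style patches. The canvas size, patch size, and default font are arbitrary choices, and a real model would hand these patches to a vision encoder rather than print their shape.

```python
# Minimal sketch of "feed the text in as pixels": rasterize a string,
# then cut the bitmap into fixed-size patches.

import numpy as np
from PIL import Image, ImageDraw

PATCH = 16

def text_to_patches(text: str, width: int = 256, height: int = 32) -> np.ndarray:
    img = Image.new("L", (width, height), color=255)    # white grayscale canvas
    ImageDraw.Draw(img).text((2, 8), text, fill=0)      # black text, default bitmap font
    pixels = np.asarray(img)                            # (H, W) uint8 array
    rows, cols = height // PATCH, width // PATCH
    patches = pixels.reshape(rows, PATCH, cols, PATCH).transpose(0, 2, 1, 3)
    return patches.reshape(rows * cols, PATCH * PATCH)  # one flattened patch per row

patches = text_to_patches("The quick brown fox")
print(patches.shape)  # (32, 256): 32 patches of 16x16 pixels, no tokenizer involved
```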