
Comment by Tarq0n

3 months ago

Ok but what are you going to decode into at generation time, a jpeg of text? Tokens have value beyond how text appears to the eye, because we process text in many more ways than just reading it.

There are some concerns here that should be addressed separately:

> Ok but what are you going to decode into at generation time, a jpeg of text?

Presumably the output can stay in token space, but to condition on the context for the immediate next token, the generated text must then be translated straight back into a suitable input space.
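A minimal sketch of that loop, with stand-in functions (`render_to_patches` and `next_token` are hypothetical placeholders, not any real model's API):

```python
# Hypothetical sketch of generation when the input space is visual:
# output tokens are text, but before each step the running context is
# re-rendered into image-like patches that the model conditions on.

def render_to_patches(text):
    # Stand-in for a real rasterizer + vision encoder; here we just
    # chop the string into fixed-width "patches" for illustration.
    return [text[i:i + 4] for i in range(0, len(text), 4)]

def next_token(patches):
    # Stand-in for the decoder; a real model would run inference here.
    return "x"

context = "hello"
for _ in range(3):
    patches = render_to_patches(context)  # translate back into input space
    context += next_token(patches)        # output stays in token/text space
```

The point is only the shape of the loop: output and input spaces differ, so every generated token round-trips through the renderer before the next step.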

> we process text in many more ways than just reading it

Since a token stream is a straightforward function of textual input, in the case of textual input we should expect the conversion from character stream to semantic/syntactic units to happen inside the LLM itself.
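To make "a straightforward function of textual input" concrete, here's a toy greedy longest-match tokenizer (the vocabulary is invented; real BPE vocabularies are learned from data). Same text in, same tokens out, deterministically:

```python
# Toy greedy tokenizer: tokens are a pure function of the input text.
# The vocabulary below is made up for illustration, not a real BPE vocab.
VOCAB = ["read", "ing", "re", "ad", "r", "e", "a", "d", "i", "n", "g"]

def tokenize(text):
    out, i = [], 0
    while i < len(text):
        # Greedily take the longest vocab entry matching at position i.
        piece = next(v for v in sorted(VOCAB, key=len, reverse=True)
                     if text.startswith(v, i))
        out.append(piece)
        i += len(piece)
    return out

print(tokenize("reading"))  # ['read', 'ing']
```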

Moreover, in the case of OCR, graphical input preserves and degrades information in the way humans expect; what comes to mind is the eggplant/dick emoji symbolism, or smiling emojis sharing a graphical similarity that can't be deduced from their proximity in Unicode codepoints.
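For instance, two visually similar smiling faces sit nowhere near each other in codepoint space, so the similarity is only recoverable from the glyphs themselves:

```python
# WHITE SMILING FACE (U+263A) vs SLIGHTLY SMILING FACE (U+1F642):
# visually near-identical, but over 100k codepoints apart.
a, b = "\u263A", "\U0001F642"
print(ord(b) - ord(a))  # 118792
```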

Output really doesn't have to be the same datatype as the input. Text tokens are good enough for a lot of interesting applications, and transforming percels (the name suggested by another commenter here) into text tokens is exactly what an OCR model is trained to do anyway.