Comment by jhanschoo

3 months ago

There are some concerns here that should be addressed separately:

> Ok but what are you going to decode into at generation time, a jpeg of text?

Presumably the output can remain in token space, but in order to condition on it as context for the immediately following token, each emitted token must then be translated back into a suitable input space (e.g., rendered as pixels) before the next step.
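A minimal toy sketch of that loop, assuming a pixel-like input space: the model emits discrete tokens, but the running context is re-rendered into the input representation before every step. All function names here are hypothetical illustrations, not any real model's API.

```python
def render_to_input_space(text):
    """Stand-in for rendering text as pixels: here, just character codes."""
    return [ord(c) for c in text]

def next_token(input_repr):
    """Stand-in for the model: a deterministic toy 'prediction' policy."""
    vocab = ["hello", " world", "!"]
    return vocab[len(input_repr) % len(vocab)]

def generate(prompt, steps):
    text = prompt
    for _ in range(steps):
        pixels = render_to_input_space(text)  # context re-encoded into input space
        text += next_token(pixels)            # output still lives in token space
    return text

print(generate("", 3))
```

The point is only the shape of the loop: decoding stays in token space, while conditioning happens in the (rendered) input space.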

> we process text in many more ways than just reading it

Since a token stream is a straightforward function of the textual input, in the case of textual input we should expect the conversion of the character stream into semantic/syntactic units to happen inside the LLM itself.

Moreover, in the case of OCR, graphical input preserves and degrades information in the ways humans expect; what comes to mind is the eggplant emoji's phallic symbolism, or the fact that smiling emojis share a graphical similarity that can't be deduced from their proximity in Unicode codepoints.