
Comment by ttul

3 months ago

This is a great summary. If you think about it a bit, text is an expanded representation of concepts meant for display on a two-dimensional surface that can then be read back by human eyes; our brains convert the two-dimensional information into concepts again.

So to me it’s not a surprise that you can transform the two-dimensional representation of the same information into concepts again without losing much.

The paper talks about using this approach to generate large amounts of LLM training data rapidly. That’s intriguing. It suggests that one of the best ways of training models on a wide variety of input data with very long context is to provide them with an image representation instead of text tokens.

Text is actually one-dimensional; writing is two-dimensional.

To a pure LLM, characters 15 and 16 on line 1 are considered adjacent, but there's no relationship between character 15 of line 1 and character 15 of line 2.

To a vision model (which sees text as squiggles rather than UTF-8 codepoints), such a relationship does exist.
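
As a concrete illustration of that point, here's a minimal Python sketch (purely illustrative, assuming hypothetical fixed-width 80-character lines) showing how two characters can be 80 positions apart in a flat character stream yet only one row apart on the rendered page:

```python
# Sketch, not from the paper: compare positional distance in a flat 1-D
# character stream with spatial distance in a naive 2-D page layout.
# Assumes fixed-width lines of 80 characters purely for illustration.

LINE_WIDTH = 80  # hypothetical fixed line length


def stream_distance(line_a: int, col_a: int, line_b: int, col_b: int) -> int:
    """Distance between two characters when text is one long 1-D sequence."""
    index_a = line_a * LINE_WIDTH + col_a
    index_b = line_b * LINE_WIDTH + col_b
    return abs(index_a - index_b)


def grid_distance(line_a: int, col_a: int, line_b: int, col_b: int) -> float:
    """Euclidean distance between the same characters laid out on a 2-D page."""
    return ((line_a - line_b) ** 2 + (col_a - col_b) ** 2) ** 0.5


# Character 15 vs. character 16 on line 1: adjacent either way.
print(stream_distance(0, 14, 0, 15))  # 1
print(grid_distance(0, 14, 0, 15))    # 1.0

# Character 15 of line 1 vs. character 15 of line 2: 80 positions apart in
# the stream, but directly adjacent (one row apart) on the rendered page.
print(stream_distance(0, 14, 1, 14))  # 80
print(grid_distance(0, 14, 1, 14))    # 1.0
```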