Comment by esafak

3 months ago

I do not get it, either. How can a picture of text be better than the text itself? Why not take a picture of the screen while you're at it, so the model learns how cameras work?

Put very simply: because the image can be fed directly into the network, without first having to transform the text into a sequence of tokens as we do now.
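
Here is a minimal sketch of that difference, under my own assumptions (a toy whitespace vocabulary, 16-pixel patches); it is not the paper's actual pipeline, just the shape of the two input paths:

    # Two ways to feed the same text to a network: discrete token ids,
    # versus pixels of a rendered page cut into ViT-style patches.
    import numpy as np
    from PIL import Image, ImageDraw

    text = "How can a picture of text be better than the text itself?"

    # Path 1: the usual route, text mapped to a sequence of token ids
    # (toy whitespace tokenizer here; real models use BPE or similar).
    vocab = {w: i for i, w in enumerate(sorted(set(text.lower().split())))}
    token_ids = [vocab[w] for w in text.lower().split()]
    print(len(token_ids), "text tokens")

    # Path 2: render the same text to an image and slice it into patches;
    # the network sees continuous pixels (layout and all), no text tokenizer.
    img = Image.new("L", (256, 64), color=255)
    ImageDraw.Draw(img).text((4, 24), text, fill=0)
    pixels = np.asarray(img, dtype=np.float32) / 255.0
    P = 16  # hypothetical patch size
    patches = (pixels.reshape(64 // P, P, 256 // P, P)
                     .swapaxes(1, 2)
                     .reshape(-1, P * P))
    print(patches.shape[0], "image patches of", patches.shape[1], "values each")

Note that the image path still yields "tokens" in the sequence sense (the patches), but they are continuous vectors that carry the page's fonts and layout for free.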

But the tweet itself is kind of an answer to the question you're asking.

From the paper, I saw that the model preserves an approximation of the layout, diagrams, and other images in the source documents.

Now imagine growing up allowed to read books and the internet only through a browser with CSS, images, and JavaScript disabled. You’d be missing out on a lot of context and side-channel information.