Comment by fbouvier
15 hours ago
Yes, HTML is too heavy and too expensive for LLMs. We are working on a text-based format better suited to AI.
15 hours ago
What do you think of the DeepSeek OCR approach where they say that vision tokens might better compress a document than its pure text representation?
https://news.ycombinator.com/item?id=45640594
I've spent some time feeding LLMs scraped web pages, and I've found that retaining some style information (text size, visibility, decoration, image content) is non-trivial.
Keeping some kind of style information is definitely important to understand the semantics of the webpage.
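To make that concrete, here is a rough sketch of the idea using only Python's standard `html.parser`. It is not the format mentioned above; the marker scheme (`#` for headings as a stand-in for text size, `**`/`_` for decoration, `[image: ...]` for alt text) and the names `StyledText` and `to_styled_text` are made up for illustration. It only looks at inline `style` attributes and the `hidden` attribute, so anything hidden through a stylesheet or JavaScript would still leak through, and it assumes reasonably well-nested markup.

```python
from html.parser import HTMLParser

# Hypothetical marker scheme, chosen for illustration only.
HEADINGS = {"h1": "# ", "h2": "## ", "h3": "### "}
INLINE = {"b": "**", "strong": "**", "i": "_", "em": "_"}
VOID = {"img", "br", "hr", "input", "meta", "link"}


class StyledText(HTMLParser):
    """Collects visible text plus a few lightweight style markers."""

    def __init__(self):
        super().__init__()
        self.out = []
        self.stack = []  # one bool per open element: is it hidden?

    def _hidden(self):
        return any(self.stack)

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "img":
            # Keep image content only as its alt text, if any.
            if not self._hidden() and a.get("alt"):
                self.out.append(f"[image: {a['alt']}]")
            return
        if tag in VOID:
            return
        style = (a.get("style") or "").replace(" ", "").lower()
        hidden = (
            tag in ("script", "style")
            or "hidden" in a
            or "display:none" in style
            or "visibility:hidden" in style
        )
        self.stack.append(hidden)
        if not self._hidden():
            self.out.append(HEADINGS.get(tag) or INLINE.get(tag) or "")

    def handle_endtag(self, tag):
        if tag in VOID:
            return
        if tag in INLINE and not self._hidden():
            self.out.append(INLINE[tag])
        if tag in HEADINGS or tag in ("p", "div", "li"):
            self.out.append("\n")
        if self.stack:
            self.stack.pop()

    def handle_data(self, data):
        if not self._hidden():
            self.out.append(data)


def to_styled_text(html):
    parser = StyledText()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace and drop empty lines.
    lines = (" ".join(line.split()) for line in "".join(parser.out).splitlines())
    return "\n".join(line for line in lines if line)


page = (
    '<h1>Sale</h1><p>Only <b>today</b> '
    '<span style="display: none">tracking</span>'
    '<img src="x.png" alt="red shoes"></p>'
)
print(to_styled_text(page))
```

Even this toy version shows where the difficulty is: visibility and text size mostly live in external CSS, so a faithful version would probably need computed styles from a real browser engine rather than a plain parser.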