Comment by yndoendo

7 days ago

The issue with that is that some writing is not word-based. People use acronyms and jargon that can be temporal, personal, industry-specific, or global. At the beginning of the year, there were some HN posts about moving from dictionary-word encoding to character encoding for LLMs, because writing varies so much.
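
To make that concrete, here is a rough Python sketch of the difference. The vocabulary, the sample sentence, and the function names are all made up for illustration; real LLM tokenizers are much larger and are usually subword-based rather than purely word- or character-based.

```python
# Hypothetical toy vocabulary, standing in for a fixed dictionary of known words.
WORD_VOCAB = {"the", "pump", "failed", "at", "noon"}


def word_tokens(text):
    """Keep words found in the vocabulary; replace everything else with <unk>."""
    return [w if w.lower() in WORD_VOCAB else "<unk>" for w in text.split()]


def char_tokens(text):
    """Encode each character as its Unicode code point; nothing is out of vocabulary."""
    return [ord(c) for c in text]


def decode_chars(codes):
    """Invert char_tokens, recovering the original string exactly."""
    return "".join(chr(c) for c in codes)


# Made-up sentence mixing plain words with an acronym and shorthand.
sample = "The VFD pump failed at noon w/ E-stop"

print(word_tokens(sample))                # acronym and shorthand collapse to <unk>
print(decode_chars(char_tokens(sample)))  # round-trips unchanged
```

The word-level version throws away exactly the acronyms and shorthand, while the character-level version keeps them at the cost of longer sequences.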

Even I use symbols with different meanings as shorthand when sketching out an idea.

I see it the same way as laws. Their word definitions are anchored in time to the common dictionaries of the era. Grammar, spelling, and meanings all change over time. An LLM would require time-scoped information to properly parse content from 1400 versus 1900. An LLM would be for trying to extract meaning from the content rather than for retaining the works as written.

Character-based OCR ignores the rules, spelling, and meaning of words and provides what is most likely there. This retains any spelling and grammar errors, whether they are true positives or false positives, based on the rules of their day.
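
A small Python sketch of what I mean, under loud assumptions: the per-character scores are invented, the word list is a tiny stand-in for a modern dictionary, and `decode_characters` / `snap_to_dictionary` are hypothetical helpers, not any real OCR engine's API. The only library call is `difflib.get_close_matches` from the standard library.

```python
import difflib

# Hypothetical modern word list, standing in for a dictionary-based corrector.
MODERN_WORDS = ["when", "that", "april", "with", "his", "showers", "sweet"]


def decode_characters(char_scores):
    """Pick the highest-scoring character at each position, with no word-level rules."""
    return "".join(max(scores, key=scores.get) for scores in char_scores)


def snap_to_dictionary(word, cutoff=0.6):
    """Replace a word with its closest dictionary entry, if one is similar enough."""
    matches = difflib.get_close_matches(word.lower(), MODERN_WORDS, n=1, cutoff=cutoff)
    return matches[0] if matches else word


# Invented recognizer scores for the Middle English spelling "shoures".
char_scores = [
    {"s": 0.9, "5": 0.1},
    {"h": 0.8, "b": 0.2},
    {"o": 0.9, "0": 0.1},
    {"u": 0.6, "w": 0.4},
    {"r": 0.9, "n": 0.1},
    {"e": 0.9, "c": 0.1},
    {"s": 0.9, "5": 0.1},
]

raw = decode_characters(char_scores)
print(raw)                      # the spelling as it appears on the page
print(snap_to_dictionary(raw))  # the modernized spelling a dictionary pass would produce
```

The character-level pass keeps the period spelling, while the dictionary pass silently rewrites it to the nearest modern word, which is the loss of the original work I am talking about.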