The LLM isn't multimodal. An LLM can only process textual tokens, so what would those tokens be for pictures? The LLM gets fed a textual representation of whatever was optically recognized by another process. That's my understanding.
GPT-4o is multimodal. The "o" in it stands for "omni".
https://news.ycombinator.com/item?id=40608269
Thanks for the link; I'll have a look at it. If you tokenize tiles and then feed them serially to an LLM, I really wouldn't know why someone would think that's a good idea: you lose all local spatial context, not to mention global context if the scan is produced at a slight angle. It's a really stupid idea. Of course, given enough computational power, one might brute-force a solution that works somewhat well.
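For reference, the tile tokenization discussed in that thread is usually a ViT-style patch embedding: each tile is linearly projected to a vector, and a position embedding is added, so the serialized sequence still carries each tile's 2D location. A minimal sketch in PyTorch; the sizes here are illustrative assumptions, not GPT-4o's (unpublished) internals:

    # Sketch of ViT-style image tokenization. Sizes are illustrative
    # assumptions, not any particular model's actual architecture.
    import torch
    import torch.nn as nn

    class PatchTokenizer(nn.Module):
        def __init__(self, image_size=224, patch_size=16, embed_dim=768):
            super().__init__()
            self.num_patches = (image_size // patch_size) ** 2
            # A strided convolution slices the image into non-overlapping
            # tiles and linearly projects each tile to an embedding vector.
            self.proj = nn.Conv2d(3, embed_dim,
                                  kernel_size=patch_size, stride=patch_size)
            # Learned position embeddings re-inject each tile's location in
            # the 2D grid, which is how the serialized token sequence
            # retains spatial context.
            self.pos_embed = nn.Parameter(
                torch.zeros(1, self.num_patches, embed_dim))

        def forward(self, images):  # images: (batch, 3, H, W)
            x = self.proj(images)             # (batch, dim, H/p, W/p)
            x = x.flatten(2).transpose(1, 2)  # (batch, num_patches, dim)
            return x + self.pos_embed         # tokens for the transformer

    tokens = PatchTokenizer()(torch.randn(1, 3, 224, 224))
    print(tokens.shape)  # torch.Size([1, 196, 768])

So the serialization is not as lossy as it sounds on paper: the position embeddings preserve where each tile sits, and the transformer attends across all tiles at once. Whether that handles a skewed scan gracefully is an empirical question.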