Comment by 2-3-7-43-1807

14 days ago

the llm isn't multimodal. an llm can only process textual tokens, so what would those tokens even be for pictures? the llm gets fed a textual representation of whatever was optically recognized by a separate process. that's my understanding.
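
roughly the kind of pipeline i mean, as a minimal sketch (pytesseract as the ocr step, the function name and the prompt wording are just placeholders i picked for illustration, nothing confirmed in this thread):

    # ocr runs as a separate process; the llm itself only ever sees text tokens
    from PIL import Image
    import pytesseract

    def image_to_prompt(path: str) -> str:
        # optical recognition happens outside the llm ...
        text = pytesseract.image_to_string(Image.open(path))
        # ... and only this textual representation gets handed to the model
        return "here is the scanned text:\n" + text + "\nplease summarize it."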

gpt-4o is multimodal. the "o" stands for "omni".

https://news.ycombinator.com/item?id=40608269

  • thanks for the link, will have a look at it. if you tokenize tiles and then feed those serially to an llm, i really wouldn't know why someone thinks that's a good idea (rough sketch of what i mean below). you lose all local spatial context, not to mention global context if the scan is produced at a slight angle. it's a really stupid idea. of course, provided enough computational power, one might brute-force a solution that works somewhat well.
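
    here is a rough sketch of the tile-tokenization idea i'm objecting to, in the style of vit patch embedding; the 16x16 patch size, the 768-dim embedding, the random projection and the function name are assumptions picked for illustration, not anything from the linked thread:

      # cut an image into fixed-size tiles and turn each tile into one "token" embedding;
      # feeding these embeddings serially to the model is the scheme discussed above
      import numpy as np

      def image_to_tile_tokens(img: np.ndarray, patch: int = 16, dim: int = 768) -> np.ndarray:
          h, w, c = img.shape
          # crop to a multiple of the patch size, then carve out a grid of patch x patch tiles
          tiles = (img[:h - h % patch, :w - w % patch]
                   .reshape(h // patch, patch, w // patch, patch, c)
                   .transpose(0, 2, 1, 3, 4)
                   .reshape(-1, patch * patch * c))
          # linearly project every flattened tile into the embedding space;
          # real vit-style models also add positional embeddings per tile at this point
          proj = np.random.randn(patch * patch * c, dim)
          return tiles @ proj  # shape: (num_tiles, dim), one "token" per tile

    the serialization here is row-major, which is exactly where the 2d neighbourhood structure would get lost unless the model re-injects position information some other way.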