Comment by esafak

15 days ago

Some like gpt-4o are multi-modal.

The LLM isn't multimodal. An LLM can only process textual tokens, and what should those tokens be for pictures? The LLM gets fed a textual representation of what was optically recognized by another process. That's my understanding.

  • gpt-4o is multimodal. The o in it stands for omni.

    https://news.ycombinator.com/item?id=40608269

    • Thanks for the link, I'll have a look at it. But if you tokenize tiles and then feed those serially to an LLM, I really wouldn't know why someone thinks that's a good idea: you lose all local spatial context, not to mention global context if the scan is produced at a slight angle. It's a really stupid idea. Of course, given enough computational power, one might brute-force a solution that works somewhat well.
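For what it's worth, here is a minimal numpy sketch of what "tokenizing tiles" usually means in ViT-style vision models (my reading of the approach, not any particular model's internals): the image is cut into fixed-size patches, each patch is flattened into a vector, and a positional embedding is added so the sequence retains where each tile came from. The flattening does destroy 2D adjacency, which is exactly why the positional embeddings are added before the transformer sees the sequence.

```python
import numpy as np

def patchify(image, patch=16):
    """Split an (H, W, C) image into a sequence of flattened patches."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0
    rows, cols = h // patch, w // patch
    tokens = []
    for r in range(rows):
        for col in range(cols):
            tile = image[r * patch:(r + 1) * patch,
                         col * patch:(col + 1) * patch, :]
            tokens.append(tile.reshape(-1))  # flatten the tile to a vector
    return np.stack(tokens)  # shape: (rows * cols, patch * patch * C)

rng = np.random.default_rng(0)
img = rng.random((224, 224, 3))
seq = patchify(img)
# 224/16 = 14 patches per side -> 196 tokens, each 16*16*3 = 768 values
print(seq.shape)  # (196, 768)

# A (normally learned) positional embedding per patch position is added
# before the sequence enters the transformer; this is how the model can
# recover the 2D layout that flattening threw away. Random here just to
# show the shapes line up.
pos = rng.random(seq.shape)
tokens_in = seq + pos
```

Whether that recovers enough spatial structure in practice is a separate question, but the serialization itself isn't lossy as long as the position information rides along with each token.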