Comment by esafak

15 days ago

Some like gpt-4o are multi-modal.

The LLM isn't multimodal. An LLM can only process textual tokens, and what should those tokens be for pictures? The LLM gets fed a textual representation of what was optically recognized by another process. That's my understanding.

  • gpt-4o is multimodal. The o in it stands for omni.

    https://news.ycombinator.com/item?id=40608269

    • Thanks for the link, I'll have a look at it. But if you tokenize tiles and then feed those serially to an LLM, I really wouldn't know why someone thinks that's a good idea: you lose all local spatial context, not to mention global context if the scan is produced at a slight angle. It's a really stupid idea. Of course, given enough computational power, one might brute-force a solution that works somewhat well.
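For what it's worth, here is a minimal numpy sketch of what "tokenizing tiles" usually means in ViT-style vision models (my reading of the approach, not any particular model's internals): the image is cut into fixed-size patches, each patch is flattened into a vector, and a positional embedding is added so the sequence retains where each tile came from. The flattening does destroy 2D adjacency, which is exactly why the positional embeddings are added before the transformer sees the sequence.

```python
import numpy as np

def patchify(image, patch=16):
    """Split an (H, W, C) image into a sequence of flattened patches."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0
    rows, cols = h // patch, w // patch
    tokens = []
    for r in range(rows):
        for col in range(cols):
            tile = image[r * patch:(r + 1) * patch,
                         col * patch:(col + 1) * patch, :]
            tokens.append(tile.reshape(-1))  # flatten the tile to a vector
    return np.stack(tokens)  # shape: (rows * cols, patch * patch * C)

rng = np.random.default_rng(0)
img = rng.random((224, 224, 3))
seq = patchify(img)
# 224/16 = 14 patches per side -> 196 tokens, each 16*16*3 = 768 values
print(seq.shape)  # (196, 768)

# A (normally learned) positional embedding per patch position is added
# before the sequence enters the transformer; this is how the model can
# recover the 2D layout that flattening threw away. Random here just to
# show the shapes line up.
pos = rng.random(seq.shape)
tokens_in = seq + pos
```

Whether that recovers enough spatial structure in practice is a separate question, but the serialization itself isn't lossy as long as the position information rides along with each token.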