Comment by dragonwriter
2 days ago
> Wait… that's the specific question I had, because rendered text would require OCR to be read by a machine. Why would an AI do that costly process in the first place? Is it part of the multi-modal system without it being able to differenciate that text from the prompt?
Its part of the multimodal system that the image itself is part of the prompt (other than tuning parameters that control how it does inference, there is no other input channel to a model except the prompt.) There is no separate OCR feature.
(Also, that the prompt is just the initial and fixed part of the context, not something meaningfully separate from the output. All the structure—prompt vs. output, deeper structure within either prompt or output for tool calls, media, etc.—in the context is a description of how the toolchain populated or treats it, but fundamentally isn't part of how the model itself operates.)
No comments yet
Contribute on Hacker News ↗