← Back to context

Comment by 0xbadcafebee

10 hours ago

Configure a subagent in your coding harness to spin up a new sub-session with any vision model for those tasks and feed the result back to the main model. No need for "one model that does everything"

That doesn’t work well in a lot of scenarios. The text LLM doesn’t know what to look for in an image before it sees a description, you might need multiple rounds of back and forth.

  • Vision decoding outside of the latent space of the model is lossy, but claude opus's vision isn't that great outside of UI screenshots. I mean it works in a pinch. At least in my testing, if you're looking at non UI images, there are better image to text models that can turn into a very precise documents that any LLM can easily parse.

Are you suggesting it should summarize the image in text or generate it in HTML or something else?