Comment by js4ever

8 hours ago

"GLM-5.2 hit a problem here, because it can't read images. It isn't multimodal. So instead of looking at a screenshot, it fell back on a hacky workaround: it wrote scripts to read the raw pixel data and check whether the colors came out roughly as expected."
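For context, the kind of pixel check described in that quote might look roughly like the sketch below. This is not the script the model actually wrote. The function names, the flat list-of-RGB-tuples layout, and the tolerance of 16 are all assumptions made for illustration.

```python
from typing import Sequence, Tuple

RGB = Tuple[int, int, int]
Box = Tuple[int, int, int, int]  # (left, top, right, bottom), right/bottom exclusive


def mean_color(pixels: Sequence[RGB], width: int, box: Box) -> RGB:
    """Average the RGB values inside `box` of a row-major flat pixel list."""
    left, top, right, bottom = box
    totals = [0, 0, 0]
    count = 0
    for y in range(top, bottom):
        for x in range(left, right):
            r, g, b = pixels[y * width + x]
            totals[0] += r
            totals[1] += g
            totals[2] += b
            count += 1
    if count == 0:
        raise ValueError("box contains no pixels")
    return (totals[0] // count, totals[1] // count, totals[2] // count)


def roughly_matches(actual: RGB, expected: RGB, tolerance: int = 16) -> bool:
    """True if every channel is within `tolerance` of the expected value."""
    return all(abs(a - e) <= tolerance for a, e in zip(actual, expected))


# Hypothetical 4x2 "screenshot": left half near-red, right half pure blue.
row = [(250, 5, 5), (250, 5, 5), (0, 0, 255), (0, 0, 255)]
pixels = row + row

left_half = mean_color(pixels, width=4, box=(0, 0, 2, 2))
print(roughly_matches(left_half, (255, 0, 0)))  # left half is close to red
print(roughly_matches(left_half, (0, 0, 255)))  # left half is not blue
```

A check like this only confirms that a region has about the right color. It says nothing about layout, text, or shapes, which is why it reads as a workaround.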

A better way would be to use a multimodal model such as MiniCPM-V: https://github.com/openbmb/MiniCPM-V


twobitshifter 8 hours ago

Right, just give the text LLM access to a vision-specific agent and that problem is solved. Or, if you really want, let it call Opus with an image. It seems like you'd still save money.
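A minimal sketch of that delegation idea, assuming the text-only model can emit tool calls as a name plus a dict of arguments. Everything here is hypothetical: `describe_image`, `make_describe_image_tool`, `run_tool_call`, and the stub backend are illustrative names, not any real provider's API. In practice the backend would wrap whichever vision model you pick.

```python
from typing import Callable, Dict

# Hypothetical backend signature: (image_path, question) -> text answer.
VisionBackend = Callable[[str, str], str]
Tool = Callable[[Dict[str, str]], str]


def make_describe_image_tool(backend: VisionBackend) -> Tool:
    """Wrap a vision backend as a tool the text-only model can call."""
    def tool(args: Dict[str, str]) -> str:
        image_path = args["image_path"]
        question = args.get("question", "Describe this image.")
        return backend(image_path, question)
    return tool


def run_tool_call(tools: Dict[str, Tool], name: str, args: Dict[str, str]) -> str:
    """Dispatch one tool call emitted by the text model; the result goes back as text."""
    if name not in tools:
        raise KeyError(f"unknown tool: {name}")
    return tools[name](args)


def stub_vision_backend(image_path: str, question: str) -> str:
    """Stand-in for a real vision model call (assumption: replace with your own client)."""
    return f"[stub description of {image_path} for question: {question}]"


tools = {"describe_image": make_describe_image_tool(stub_vision_backend)}
answer = run_tool_call(
    tools,
    "describe_image",
    {"image_path": "screenshot.png", "question": "Is the header rendered in red?"},
)
print(answer)
```

The text model never sees the image, only the returned description, so the cost of the vision call is paid per screenshot rather than on every token of the main task.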