Comment by Wowfunhappy

2 months ago

Claude is not very good at using screenshots. The model may technically be multi-modal, but its strength is clearly in reading text. I'm not surprised it failed here.

8 comments

Wowfunhappy

fnordpiglet 2 months ago

Especially since it decomposes the image into a semantic vector space rather than the actual grid of pixels. Once the image is transformed into patch embeddings all sense of pixels is entirely destroyed. The author demonstrates a profound lack of understanding for how multimodal LLMs function that a simple query of one would elucidate immediately.

The right way to handle this is not to build it grids and whatnot, which all get blown away by the embedding encoding but to instruct it to build image processing tools of its own and to mandate their use in constructing the coordinates required and computing the eccentricity of the pattern etc in code and language space. Doing it this way you can even get it to write assertive tests comparing the original layout to the final among various image processing metrics. This would assuredly work better, take far less time, be more stable on iteration, and fits neatly into how a multimodal agentic programming tool actually functions.

mcbuilder 2 months ago
Yeah, this is exactly what I was thinking. LLMs don't have precise geometrical reasoning from images. Having an intuition of how the models work is actually.a defining skill in "prompt engineering"
- thecr0w 2 months ago
  
  Yeah, still trying to build my intuition. Experiments/investigations like this help me. Any other blogs or experiments you'd suggest?
  
  1 reply →
thecr0w 2 months ago

Great, thanks for that suggestion!

dcanelhas 2 months ago

Even with text, parsing content in 2D seems to be a challenge for every LLM I have interacted with. Try getting a chatbot to make an ascii-art circle with a specific radius and you'll see what I mean.

Wowfunhappy 2 months ago

I don't really consider ASCII art to be text. It requires a completely different type of reasoning. A blind person can be understand text if it's read out loud. A blind person really can't understand ASCII art if it's read out loud.