Comment by jamilton

3 months ago

I know this has been said many times before, but I wonder why this is such a common outcome. Maybe from negative outcomes being underrepresented in the training data? Maybe that plus being something slightly niche and complex?

The screenshot method not working is unsurprising to me. VLLMs' visual reasoning is very bad with details because they (as far as I understand) do not really have access to those details, just the image embedding and maybe an OCR'd transcript.