Comment by svnt
7 days ago
Parent says “I taught my 5yo how to” — this means their 5yo learned a process.
OP says “I taught LLM how to see” and this should mean the LLM (which is capable of being taught/learning) internalized how to see. It did not; it was given a tool that does the seeing and tells it what things are.
People are very interested in getting good local LLMs with vision integrated, and so they want to read about it. Next to nobody would click on the honest title, “I enabled an LLM to use a Google service to identify objects in images”, which is what OP actually did.
I can second this. I've been trying to get local LLMs to play through Pokemon Emerald (with virtually zero success).
I'm under the impression I'm being hampered by a separation of 'brain' and 'eyes', as I have yet to find a reasoning + vision local model that fits on my Mac. I've tried running two instances of Qwen (one for vision, one for reasoning) to work around this, but no real breakthroughs yet. The requirements I've given myself are fully local models, and no reading data from the ROM that a human player could not be aware of.
I was hoping OP was able to retrofit vision onto blind models, not just offload it to a cloud model. It's still an interesting write-up, but I for sure got click-baited.