Comment by HarHarVeryFunny
3 days ago
I'm not sure what you're getting at. What's useful about LLMs, and especially multi-modal ones, is that you can ask them anything and they'll answer to the best of their ability (especially if well prompted). I'm not sure that o3, as a "reasoning" model, is adding much value here - since there is not a whole lot of reasoning going on.
This is basically fine-grained image captioning followed by nearest neighbor search, which is certainly something you could have built as soon as decent NN-based image captioning became available, at least 10 years ago. Did anyone do it? I've no idea, although it'd seem surprising if not.
As noted, what's useful about LLMs is that they are a "generic solution", so one doesn't need to create a custom ML-based app to be able to do things like this, but I don't find much of a surprise factor in them doing well at geoguessing since this type of "fuzzy lookup" is exactly what a predict-next-token engine is designed to do.
How does nearest neighbor search relate to this?
If you forget the LLM implementation, fundamentally what you are trying to do here is first detect a bunch of features in the photo (i.e. fine-grained image captioning: "in foreground a firepit with safety warning on glass, in background a model XX car parked in front of a bungalow, in distance rolling hills", etc.), then do a fuzzy match of this feature set against other photos you have seen - which ones have the greatest number of things in common with the photo you are looking up? You could implement this in a custom app by creating a high-dimensional feature-space embedding then looking for nearest neighbors, similar to how face recognition works.
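Something like this minimal sketch, working at the caption level: the hard-coded captions stand in for an image-captioning model's output, and TF-IDF vectors stand in for a learned embedding (all names and data here are made up for illustration):

```python
# Minimal sketch: TfidfVectorizer turns captions into sparse feature
# vectors; NearestNeighbors returns the reference photo whose caption
# shares the most (weighted) features with the query.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# Fine-grained captions of previously seen, geotagged photos (made up).
reference_captions = [
    "fire pit with safety warning on glass, bungalow, rolling hills, CA plate",
    "red phone box, terraced brick houses, wet cobblestones",
    "eucalyptus trees, corrugated iron roof, right-hand-drive ute",
]
reference_locations = ["California, USA", "Manchester, UK", "NSW, Australia"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(reference_captions)
index = NearestNeighbors(n_neighbors=1, metric="cosine").fit(X)

def guess_location(query_caption: str) -> str:
    # Photos sharing the most features land nearest in the vector space.
    q = vectorizer.transform([query_caption])
    _, idxs = index.kneighbors(q)
    return reference_locations[idxs[0][0]]

print(guess_location("glass fire pit in foreground, bungalow behind, hills, CA plate"))
# -> California, USA
```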
Of course an LLM is performing this a bit differently, and with a bit more flexibility, but the starting point is going to be the same - image feature/caption extraction, with the extracted features in combination recalling related training samples (both text-only, and perhaps multi-modal) that are used to predict the location answer you have asked for. The flexibility of the LLM is that it isn't just treating each feature ("fire pit", "CA licence plate") as independent, but will naturally recall contexts where multiple of these occur together - though IMO not so different in that regard from high-dimensional nearest neighbor search.
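For comparison, a rough sketch of the LLM version of the same lookup, using the OpenAI Python SDK's chat interface; the model name, prompt, and function name are assumptions for illustration, not a reference implementation:

```python
# Sketch: one prompt replaces the custom captioning + lookup pipeline.
# The model does feature extraction, co-occurrence recall, and the
# final location guess in a single call.
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def guess_location_llm(image_path: str) -> str:
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model="o3",  # assumed model name; any multimodal model slots in here
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "List the identifiable features in this photo "
                         "(signage, vegetation, vehicles, architecture), "
                         "then infer the most likely location."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```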
Thanks, that's a good explanation.
My hunch is that the way the latest o3/o4-mini "reasoning" models work is different enough to be notable.
If you read through their thought traces they're tackling the problem in a pretty interesting way, including running additional web searches for extra contextual clues.
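Very roughly, the loop those traces suggest looks something like the sketch below, where ask_model() and search_web() are hypothetical stand-ins for the model's built-in tool calls, not a real API:

```python
# Hypothetical sketch of a reasoning model interleaving inference
# with web searches for extra contextual clues.
def geoguess_with_tools(image, max_searches=3):
    # Extract initial clues from the image.
    notes = ask_model("Describe every location clue in this photo.", image)
    for _ in range(max_searches):
        # Let the model decide whether more context would help.
        query = ask_model(
            "Given these clues:\n" + notes + "\n"
            "What one web search would best narrow down the location? "
            "Reply NONE if none is needed.")
        if query.strip() == "NONE":
            break
        notes += "\n" + search_web(query)  # e.g. look up a road-sign style
    return ask_model("Clues and search results:\n" + notes +
                     "\nWhat is the most likely location?")
```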