Comment by thegeomaster
2 days ago
For all of the images I've tried, the base model (e.g. 4o) already has a ~95% accurate idea of where the photo is, and then o3 does so much tool use only to confirm its intuition from the base model and slightly narrow down. For OP's initial image, 4o in fact provides a more accurate initial guess of Carmel-by-the-Sea (d=~100mi < 200mi), and its next guess is also Half Moon Bay, although it did not figure out the exact town of El Granada [0].
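(Aside: distance comparisons like "d=~100mi < 200mi" are great-circle distances between the guessed and actual spots; if you want to check such a claim yourself, the haversine formula is enough. The coordinates below are rough, illustrative values for the places named here, not the exact ones from the chat.)

```python
from math import radians, sin, cos, asin, sqrt

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, lon) points."""
    R = 3958.8  # Earth's mean radius in miles
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * R * asin(sqrt(a))

# Rough coordinates (illustrative only)
el_granada    = (37.50, -122.47)  # actual location of OP's photo
carmel        = (36.55, -121.92)  # 4o's initial guess, Carmel-by-the-Sea
half_moon_bay = (37.46, -122.43)  # the follow-up Half Moon Bay guess

print(haversine_miles(*el_granada, *carmel))         # on the order of tens of miles
print(haversine_miles(*el_granada, *half_moon_bay))  # just a few miles
```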
The clue is in the CoT - you can briefly see the almost-correct location as the very first reasoning step. The model then seems to ignore it and try many other locations, with a ton of tool use, always coming back to the initial guess.
For pictures where the base model has no clue, I haven't seen o3 do anything smart, it just spins in circles.
I believe the model has been RL-ed to death in a way that incentivizes correct answers no matter the number of tools used.
[0]: https://chatgpt.com/c/680d011a-9470-8002-97a0-a0d2b067eacf
I tried this using a photo I took with the metadata removed, and the thought process initially guessed the photo was of Adelaide. But then the reasoning realised that some features didn't match what it expected of Adelaide, and instead came up with the correct answer of Canberra. It then narrowed it down further to the exact suburb the photo was taken in.
When I used GPT-4o, it got a completely wrong answer: Melbourne, which is quite far off.
I had a similar experience. I tried with some photos from various European cities, and while it pretty much always got the city correct, it was hilariously confidently incorrect about the exact location within the city. The guesses were plausible but nowhere near the level of accuracy the article describes. All the images had distinctly recognizable landmarks which a resident of the city would know, and which also have images available online if one knows the landmark's name, so I'm not particularly impressed.
In fact some of the answers were completely geographically impossible where it said "The image is taken from location X showing location Y" when it's not possible to see location Y if one is standing at location X. Like saying "The photo is taken in Central Park looking north showing the Statue of Liberty".
I've been trying some with GPT-4. It does come up with some impressive clues, but hasn't gotten the right answer - it says "Latin American city ..." but guesses the wrong one. And when asked for more specificity, it does some more reasoning to confidently name an exact corner in the wrong city. That seems to be a common LLM problem: it would rather give a wrong answer than say "I'm not sure".
I know this post was about the o3 model. I'm just using the ChatGPT unpaid app: "What model are you?" it says GPT-4. "How do I use o3?" it says it doesn't know what "o3" means. ok.
Try this prompt to give it a CoT nudge:
Though I've found that it doesn't even need that for the "easier" guesses.
However, I live in a small European country and neither 4o nor o3 can figure out most of the spots, so your results are kinda expected.
4o is already really good. For most of the pictures I tried, they gave comparable results. However, for one image 4o was only able to narrow it down to the country level (even with your CoT prompt it listed three plausible countries), while o3 was able to narrow it down to the correct area of the correct city, being off by only about 500m. That's an impressive jump.
Is it possible to share the picture? I've been looking for exactly that kind of jump the other day when playing around.
Did you try reasoning https://chat.qwen.ai/? I was very successful with it
Kind of like it's just trying to make the answer look earned instead of blurting it out right away.
For my image I chose a large landscape with lots of trees and a single piece of infrastructure.
During its reasoning, o3 correctly guessed the municipality, but it landed on naming some nearby municipalities instead and gave the general area as its final answer.
Given the piece of infrastructure, getting close should have led to an exact result, but the reasoning never considered the piece of infrastructure. This seems to be in spite of all the resizing of the image.
In one of my tests I gave it a photo I shot myself, from a point on an unmarked trail, with trees, a bit of a mountain line in the background, and a power line.
It correctly guessed the area with 2 mi accuracy. Impressive.
Did you try https://chat.qwen.ai/ with reasoning on?