Comment by SamPatt
2 days ago
I play competitive Geoguessr at a fairly high level, and I wanted to test this out to see how it compares.
It's astonishingly good.
It will use information it knows about you to arrive at the answer - it gave me the exact trailhead of a photo I took locally, and when I asked it how, it mentioned that it knows I live nearby.
However, I've given it vacation photos from ages ago, and not only from tourist destinations either. It got them all as well as or better than a pro human player would: various European, Central American, and US locations.
The process by which it arrives at a conclusion is somewhat similar to a human's. It looks at vegetation, terrain, architecture, road infrastructure, and signage, and it seemingly knows everything about all of them.
Humans can do this too, but it takes many thousands of games or serious study, and the results won't be as broad. I have a flashcard deck with hundreds of entries to help me remember road lines, power poles, bollards, architecture, license plates, etc. These models have more than an individual mind could conceivably memorize.
I find this type of problem is what current AI is best at: where the actual logic isn't very hard, but it requires pulling together and assimilating a huge amount of fuzzy, known information from various sources.
They are, after all, information-digesters.
Which also fits with how it performs at software engineering (in my experience). Great at boilerplate code, tests, simple tutorials, and common puzzles, but bad at novel and complex things.
This is also why I buy the apocalyptic headlines about AI replacing white-collar labor - most white-collar employment is creating the same things (a CRUD app, a landing page, a business plan) with a few custom changes.
Not a lot of labor is actually engaged in creating novel things.
The marketing plan for your small business is going to be the same as the marketing plan for every other small business with some changes based on your current situation. There’s no “novel” element in 95% of cases.
17 replies →
Definitely matches my experience as well. I've been working away on a very quirky, non-idiomatic 3D codebase, and LLMs are a mixed bag there. Y is down, there's no perspective distortion or Z buffer, there are no meshes, it's a weird place.
It's still useful to save me from writing 12 variations of x1 = sin(r2) - cos(r1) while implementing some geometric formula, but it's absolutely awful at understanding how those fit into a deeply atypical environment. I also have to put blinders on it. Giving it too much context just throws it back into that typical 3D rut and has it trying to slip in perspective distortion again.
4 replies →
Yep. But wonderful at aggregating details from twelve different man pages to write a shell script I didn't even know was possible to write using the system utils.
3 replies →
How often are we truly writing novel programs that are complex in a way AI does not excel at?
There are many types of complexity, and often what is complex for a human coder is trivial for AI and its skill set.
5 replies →
> novel and complex things
a) What's an example?
b) Is 90% (or more) of programming mundane, and not really novel?
10 replies →
I've been surprised that so much focus was put on generative uses for LLMs and similar ML tools. It seems to me like they have a way better chance of being useful when tasked with interpreting given information rather than generating something meant to appear new.
Yeah, the "generative" in "generative AI" gives a little bit of a false impression. I like Laurie Voss's take on this: https://seldo.com/posts/what-ive-learned-about-writing-ai-ap...
> Is what you're doing taking a large amount of text and asking the LLM to convert it into a smaller amount of text? Then it's probably going to be great at it. If you're asking it to convert into a roughly equal amount of text it will be so-so. If you're asking it to create more text than you gave it, forget about it.
3 replies →
FWIW, I do a lot of talks about AI in the physical security domain and this is how I often describe AI, at least in terms of what is available today. Compared to humans, AI is not very smart, but it is tireless and able to recall data with essentially perfect accuracy.
It is easy to mistake the speed, accuracy, and scope of training data for "intelligence", but it's really more like a tireless 5th grader.
Something I have found quite amusing about LLMs is that they are computers that don't have perfect recall - unlike every other computer for the past 60+ years.
That is finally starting to change now that they have reliable(ish) search tools and are getting better at using them.
“best where the actual logic isn’t very hard”?
Yeah, well, it's also one of the top scorers on the math olympiads.
My guess is that those questions are very typical and follow very normal patterns and use well-established processes. Give it something weird and it'll continuously trip over itself.
My current project is nothing too bizarre, it's a 3D renderer. Well-trodden ground. But my project breaks a lot of core assumptions and common conventions, and so any LLM I try to introduce—Gemini 2.5 Pro, Claude 3.7 Thinking, o3—they all tangle themselves up between what's actually in the codebase and the strong pull of what's in the training data.
I tried layering on reminders and guidance in the prompting, but ultimately I just end up narrowing its view, limiting its insight, and removing even the context that this is a 3D renderer and not just pure geometry.
6 replies →
LLMs struggle with limited context windows, so as long as the problem can be solved within their small windows, they do great.
Human neural networks are constantly being retrained, so their effective context window is huge. An LLM may be better at a complex, well-specified 200-line Python program, but the human brain is better at the 1M-line real-world application. It takes some study, though.
It's all just compression?
Always has been.
LLMs are like knowledge aggregators. The reasoning models have the potential to be usefully creative, but I have yet to see evidence of it, like inventing something scientifically novel.
Be that as it may, do not forget that in the pursuit of the most textually plausible output, gaps may be filled in for you.
The mistake, and it's a common one, is in using phrases like "the actual logic" to explain to ourselves what is happening.
It takes a lot of energy to compress the data, and a lot to actually extract something sensible, while you could often optimize the single problem you have quite easily.
It's just a huge database with nothing except fuzzy search.
I was absolutely gobsmacked by the three minute chain of reasoning this thing did, and how it absolutely nailed the location of the photo based on plants, the color of a fence, comparison with nearby photos, and oh yeah, also the EXIF data containing the exact lat/long coordinates that I accidentally left in the file. https://bsky.app/profile/matthewdgreen.bsky.social/post/3lnq...
Lol it's very easy to give the models what they need to cheat.
For my test I used screenshots to ensure no metadata.
I mentioned this in another comment, but I was part of an AI safety fellowship last year where we created a benchmark for LLMs' ability to geolocate. The models were doing unbelievably well, even the bad open-source ones, until we realized our image pipeline was including location data in the filename!
They're already way better than even last year.
I was and am pretty impressed by Google Photos/Lens IDs. But I realized fairly early on that of course it knew the locations of my iPhone photos from the geo info stored in them.
I dropped into Google Street View and tried to recreate your location. How did I do?
https://cdn.jsdelivr.net/gh/sampatt/media@main/posts/2025-04...
Here's the model's response:
https://cdn.jsdelivr.net/gh/sampatt/media@main/posts/2025-04...
I don't think it needed the EXIF data. I'd be curious if you tried it again yourself.
This is super easy to test, though (whether EXIF is being used). Open up the GeoGuessr app, take a screenshot, paste it into o3. Doing this, o3 took too long (for the guessing period) but nailed 3 of 3 locations to within a kilometer.
Edit: An interesting nuance of the modern OpenAI chat interface is the "access to all previous chats" element. When I attempted to test o4-mini using the same image, I inspected the reasoning and spotted: "At first glance, the image looks like Ghana. Given the previous successful guess of Accra Ghana, let's start in that region".
Super cool, man. Watching pro Geoguessr is my latest break-time activity; these geo-gods never cease to impress me.
One thing I'm curious about - in high-level play, how much of the meta involves knowing characteristics of the photography/equipment/etc. that Google used when they shot it? Frequently I'll watch rainbolt immediately know an African country from nothing but the road. Is there something I'm missing?
I was a very casual GeoGuessr player for a few months — and I found it pretty remarkable how quickly (and without a lot of dedicated study time) you could learn a lot of the tells of specific regions — and get reasonably good (certainly not pro good or anything, but good enough to hit the right country ~80% of the time).
Another thing is how many areas of the world have surprisingly distinct looks. In one of my early games, before I knew much about anything, I was dropped on a trail in the woods. I've spent a fair amount of time hiking in Northern New England — and I could tell immediately that's where I was, just from vibes (i.e. the look of the trees and the rocks) — not something I would have guessed I'd be able to recognize.
I went to watch the Minecraft movie, and when the scene where they arrive outside their new house came on I was like... that feels so much like New Zealand. Then a few weeks later I went to visit my mum in Huntly, and she was like "oh yeah, they filmed part of it in Huntly!".
So, yeah vibes are a real thing.
> knowing characteristics about the photography/equipment/etc. that Google used when they shot it?
A lot at the top levels - the camera can tell you which contractor, year, location, etc. At anything less than top, not so much - more street line painting, cars, etc.
In the stream commentary for some of the competitive GeoGuessr I've watched, they definitely often mention the color and shape of the car (visible edges, shadow, reflections), so I assume pro players know very well which cars were used where.
Also things like follow cars (some countries had government officials follow the streetview car), the season in which coverage was created, camera glitches, the quality of the footage, etc.
There is a lot of "legitimate" knowledge. With just a street you have the type of road surface, its condition, the type of road markings, the bollards, and the type of soil and vegetation next to the road, as well as the presence and type of power poles next to the road, to name a few. But there is also a lot of information leakage from the way google takes streetview footage.
4 replies →
Definitely. The season in which coverage was done can be a big thing too. In Russia you'll be looking at the car, the antenna type, and the season as pretty much the first indicators of where you might be.
Copyright year and camera gen is a big thing in some countries too.
Obviously they can still figure out a lot without all that, and NMPZ obviates aspects of it (you can't hide camera gens, copyright, and season, and there are often still traces of the car in some manner). It's definitely not all 'meta', but to be competitive at that level you really do need to be using it. I think Gingey is the only world league player who doesn't use car meta.
Even as a fairly good but nowhere near pro player, it's weird how I associate particular places with particular types of weather. I think if I saw Almaty in the summer, for example, it would feel very weird. I've decided not to deliberately learn car meta but have still picked up quite a lot without trying, and your 'vibe' of a place can certainly include camera gen.
That sounds exactly like shortcut learning.
Meh, meta is so boring and uninteresting to me personally. Knowing you're in Kenya because of the snorkel - that's just simple memorization. Picking up on geography, architecture, language, sun and street position - that's what I love.
It's clearly necessary to compete at the high level though.
I hear you, a lot of people feel the same way. You can always just play NMPZ if you want to limit the meta.
I still enjoy it because of the competitive aspect - you both have access to the same information; who put in the effort to remember and recall it better?
If it were only meta I would hate it too. But there's always a nice mix in the vast majority of rounds. And always a few rounds here and there that are so hard they'll humble even the very best!
How is stuff like geography, architecture, or language not memorization too?
3 replies →
Thanks. I also love watching the pros play.
>One thing I'm curious about - in high level play, how much of the meta involves knowing characteristics about the photography/equipment/etc. that Google used when they shot it?
The photography matters a great deal - they're categorized into "Generations" of coverage. Gen 2 is low resolution, Gen 3 is pretty good but has a distinct car blur, Gen 4 is highest quality. Each country tends to have only one or two categories of coverage, and some are so distinct you can immediately know a location based solely on that (India is the best example here).
You're asking about photography and equipment, and that's a big part of it, but there's a huge amount other 'meta' information too.
It is somewhat dependent on game mode. There are three game modes:
1. Moving - You can move around freely
2. No Move - You can't move, but you can pan the camera around and zoom
3. NMPZ - No Move, No Pan, No Zoom
In Moving and No Move you have all the meta information available to you, because you can look down at the car and up at the sky and zoom in to see details.
This can't be overstated. Much of the data is about the car itself. I have an entire flashcard section dedicated to car blur alone; here's a sample:
https://cdn.jsdelivr.net/gh/sampatt/media@main/posts/2025-04...
And another only on antennas:
https://cdn.jsdelivr.net/gh/sampatt/media@main/posts/2025-04...
You get the idea. The real pros will go much further. All Google Street View images have a copyright year somewhere in the image. They memorize what years certain countries were covered and match it to the images to help narrow down possibilities.
It's all about narrowing down possibilities based on each additional piece of information. The pros have seen so much and memorized so much that it looks like cheating to an outsider, but they're just able to extract information that most people wouldn't even know exists.
NMPZ is a bit different because you have substantially less information. Little to no car meta, harder to check copyright, and of course without zooming or panning you just have less information. That's why a lot of pros (like Zi8gzag) really hang their hat on NMPZ play, because it's a better test of skill.
> when I asked it how, it mentioned that it knows I live nearby.
> The process for how it arrives at the conclusion is somewhat similar to humans. It looks at vegetation, terrain, architecture, road infrastructure, signage, and it just knows seemingly everything about all of them.
Can we trust what the model says when we ask it about how it comes up with an answer?
Not at all. Models have no invisible internal state that they can access between prompts. If you ask "how did you know that?" you are effectively asking "given the previous transcript of our conversation, come up with a convincing rationale for what you just said".
On the other hand, since they "think in writing", they also do not keep any reasoning secret from us. Whatever they actually did is based on the past transcript plus training.
2 replies →
You're just asking the left-brain interpreter [1] its opinion about what the right-brain did.
[1] https://en.wikipedia.org/wiki/Left_brain_interpreter
Probably not, see https://www.anthropic.com/research/reasoning-models-dont-say...
It would be interesting to apply interpretability techniques to understand how the model really reasons about it.
>I have a flashcard deck with hundreds of entries to help me remember road lines, power poles, bollards, architecture, license plates, etc.
You're basically training yourself the same way an AI is trained at that point.
Geoguessr pro zi8gzag tried out one of the AIs in a video: https://www.youtube.com/watch?v=mQKoDSoxRAY It was indeed extremely impressive and for sure would have annihilated me, but I believe it would have no chance of beating zi8gzag or any other top player. But give it a year or two and I'm sure it will crush any human player. Geoguessr is, afaict, primarily about rote memorization of various features (such as types of electricity poles, road signage, foliage, etc.), which AIs excel at.
Looks like that video uses Gemini 2.0 (probably Flash) in streaming mode (via AI studio) from a few months ago. Gemini 2.5 might do better, but in my explorations so far o3 is hugely more capable than even Gemini 2.5 right now.
Try Alibaba's https://chat.qwen.ai/ with reasoning activated.
> when I asked it how, it mentioned that it knows I live nearby
Did it mention it in its chain of thought? Otherwise, it could definitely output something because of X and then, when asked why, "rationalize" that it did it because of Y.
Is that flashcard deck a commercial/community project or is it something you assembled yourself? Sounds fascinating!
I made it myself.
I use Obsidian and the Spaced Repetition plugin, which I highly recommend if you want a super simple markdown format for flashcards and use Obsidian:
https://www.stephenmwangi.com/obsidian-spaced-repetition/
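The cards themselves are just markdown in your notes. If I remember the plugin's defaults right, "::" separates question from answer on single-line cards and a lone "?" separates them on multi-line cards; the clue/answer pairings below are my own illustration, not pulled from the actual deck:

    #flashcards/geoguessr
    Green wooden street signs::Chilean coastal towns

    Which countries use yellow rear license plates?
    ?
    UK, Netherlands, Luxembourg (among others)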
There are pre-made Geoguessr decks for Anki. However, I wouldn't recommend using them. In my experience, a fundamental part of spaced repetition's efficacy is in creating the flashcards yourself.
For example I have a random location flashcard section where I will screenshot a location which is very unique looking, and I missed in game. When I later review my deck I'm way more likely to properly recall it because I remember the context of making the card. And when that location shows up in game, I will 100% remember it, which has won me several games.
If there's interest I can write a post about this.
> In my experience, a fundamental part of spaced repetition's efficacy is in creating the flashcards yourself.
+1 to this, have found the same when going through the Genki Japanese-language textbook.
I'm assuming you're finding the Anki workflow just a little too annoying? I haven't yet strayed from it, but I may check out your Obsidian setup.
I'd be fascinated to read more about this. I'd love to see a sample screenshot of a few of your cards too.
3 replies →
I’m interested from a learning science perspective. It’s a nice finding even if anecdotal
Did you include location metadata with the photos by chance? I’m pretty surprised by these results.
No, I took screenshots to rule that out.
Your skepticism is warranted though - I was a part of an AI safety fellowship last year and our project was creating a benchmark for how good AI models are at geolocation from images. [This is where my Geoguessr obsession started!]
Our first run showed results that seemed way too good; even the bad open source models were nailing some difficult locations, and at small resolutions too.
It turned out that the pipeline we were using to get images was including location data in the filename, and the models were using that information. Oops.
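If you're building something similar, it's cheap to sanity-check the pipeline. Here's a minimal sketch using Pillow (a recent version with Exif.get_ifd; the filenames are hypothetical) that lists any GPS EXIF tags in an image and writes a metadata-free copy. Note it wouldn't have caught a leak like ours, which lived in the filename itself:

    from PIL import Image
    from PIL.ExifTags import GPSTAGS

    GPS_IFD = 0x8825  # standard EXIF pointer to the GPS info block

    def gps_tags(path):
        # Return any GPS EXIF tags present in the image (empty dict if none)
        exif = Image.open(path).getexif()
        return {GPSTAGS.get(tag, tag): value
                for tag, value in exif.get_ifd(GPS_IFD).items()}

    def save_clean_copy(src, dst):
        # Pillow drops EXIF on save unless you explicitly pass exif=...,
        # so a plain re-encode yields a metadata-free copy
        Image.open(src).convert("RGB").save(dst, "JPEG")

    print(gps_tags("benchmark_image.jpg") or "no GPS tags found")
    save_clean_copy("benchmark_image.jpg", "benchmark_image_clean.jpg")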
The models have improved very quickly since then. I assume the added reasoning is a major factor.
As a further test, I dropped the Street View marker on a random point in the US, near Wichita, Kansas. Here's the image:
https://cdn.jsdelivr.net/gh/sampatt/media@main/posts/2025-04...
I fed it to o3; here's the response:
https://cdn.jsdelivr.net/gh/sampatt/media@main/posts/2025-04...
Nailed it.
There's no metadata there, and the reasoning it outputs makes perfect sense. I have no doubt it'll be tricky when it can be, but I can't see a way for it to cheat here.
This is right by where I grew up and the broadcast tower and turnpike sign were the first two things I noticed too, but the ability to realize it was the East side instead of the West side because the tower platforms are lower is impressive.
1 reply →
A) o3 is remarkably good, better than benchmarks seem to indicate in many circumstances
B) it definitely cheats when it can — see this chat where it cheated by extracting EXIF data and wasn’t ashamed when I complained about it cheating: https://chatgpt.com/share/6802e229-c6a0-800f-898a-44171a0c7d...
Note that they can claim to guess a location based on reasonable clues, but actually use EXIF data. See https://news.ycombinator.com/item?id=43732866
Maybe it's not what happened in your examples, but definitely something to keep an eye on.
Yes, I'm aware. I've been using screenshots only to avoid that. Check my last few comments for examples without EXIF data if you're interested to see o3's capabilities.
Makes me wonder what the ceiling even is for human players if AI can now casually flex knowledge that would take us years to grind out.
https://www.youtube.com/watch?v=QRqKPDJYyLE
> It will use information it knows about you to arrive at the answer.. and when I asked it how, it mentioned that it knows I live nearby.
Oh! RIP privacy :(
I’ve pretty much given up on the idea that we can fully protect our privacy while still getting the most out of these services. In the end, it’s a tradeoff—and I’ve accepted that.
Is it meaningful to conclude that this is an algorithm that pro GeoGuessr players all follow, which one of them perhaps explained somewhere and the model picked up? Is geo-guessing something that can be presented as an algorithm, or as a series of steps? Perhaps it is not as challenging as it seems, once one knows what to look for?
Not as challenging as, say, complex differential geometry.
One thing I'm curious about is whether they are so good, and use a technique similar to humans', because they are trained on people writing out their thought processes. This isn't an attempt to say they are cheating or that this isn't impressive. But I do wonder how much of the approach taken is “trained in”.
> how much of the approach taken is “trained in”.
100% of it is. There is no other source of data except human-generated text and images.
Have you gleaned anything from watching o3 make decisions on a photo? (i.e., have you noticed whether it has thought of anything you, and other high-level players like you, have not?)
This is an interesting question.
I watch the output with fascination, mostly because of the sheer breadth of knowledge. But thus far I can't think of anything that is categorically different from what humans do; it just has an insane amount of knowledge available to it.
For example, I gave it an image from a town on a small Chilean island. I was shocked when it nailed it, and in the output it said, "I can see a green wooden street sign, common to Chilean coastal towns on [the specific island]."
I have an entire flashcard section for street signage, but just for practicality I'm limited to memorizing scores of signs, possibly hundreds if I'm insanely dedicated. I would still probably never have had this one remote Chilean island.
It does that for everything in every category.
> These models have more than an individual mind could conceivably memorize.
#computers
> It looks at vegetation, terrain, architecture, road infrastructure, signage, and it just knows seemingly everything about all of them.
Someone explain to me how this is dystopian. Are Jeopardy champions dystopian too?
It's not crazy to be able to ID trees and know their geographic range; likewise for architecture, likewise for highway signs. Finding someone who knows all of these together is rarer, but IMO not exactly dystopian.
Edit: Why am I being downvoted for saying this? If anyone wants to go on a walk with me, I can help them ID trees. It's a fun skill to have and something anyone can learn.
I wonder how it compares with StreetCLIP.
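For anyone who wants to run that comparison: StreetCLIP is a CLIP checkpoint fine-tuned for geolocation, and (assuming the public Hugging Face id geolocal/StreetCLIP) it can be queried zero-shot with the standard transformers API. The image path and candidate list here are just placeholders:

    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("geolocal/StreetCLIP")
    processor = CLIPProcessor.from_pretrained("geolocal/StreetCLIP")

    image = Image.open("street_scene.jpg")  # hypothetical test image
    countries = ["Chile", "Ghana", "Japan", "New Zealand", "United States"]
    prompts = [f"A Street View photo in {c}." for c in countries]

    # Score the image against each candidate country, softmax the logits
    inputs = processor(text=prompts, images=image,
                       return_tensors="pt", padding=True)
    probs = model(**inputs).logits_per_image.softmax(dim=1)[0]

    for country, p in sorted(zip(countries, probs.tolist()),
                             key=lambda t: -t[1]):
        print(f"{country}: {p:.1%}")

It won't match o3's chain-of-thought reasoning, but it makes a useful baseline.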
GeoGuessr... well, I guess that must have been a great training source for the models.
And yet many will continue to confidently declare that this is nothing more than fancy autocomplete stochastic parrots.
I will.
I am willing to bet that most of this geolocation success is based on overfitting to Google Street View peculiarities.
I.e., feed it images from a military drone and the success rate will plummet.
You're anthropomorphizing a likelihood maximization calculator.
And no, human brains are not likelihood maximization calculators.