Comment by eichin

1 month ago

Trying not to turn this into "falsehoods developers believe about geographic names", but having done natural-language geocoding at scale (MetaCarta 2002-2010, acquired by Nokia), the most valuable thing we had was a growing set of tagged training data: partly because we were actually building the models out of it, but also because it would detect regressions. I suspect you needed something similar to "keep the LLMs in line", but you need it for any more artisanal development approach too. (I'm a little surprised you even have a single-value-return search() function. issue#44 is just the tip of the iceberg - https://londonist.com/london/features/places-named-london-th... is a pretty good hint that a range of answers with probabilities attached is a minimum starting point...)
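To make the regression point concrete, here is a minimal sketch of scoring a geocoder against tagged examples. Everything in it is an assumption for illustration: the names top1_accuracy, check_no_regression, TAGGED and toy_search are hypothetical, the "Name, CC" labels are made up, and search is assumed to return a list ordered best-first. It is not how MetaCarta or the project under discussion actually did it.

```python
def top1_accuracy(search, tagged_examples):
    """Fraction of tagged queries whose best-ranked result is the expected one."""
    if not tagged_examples:
        raise ValueError("need at least one tagged example")
    hits = 0
    for query, expected in tagged_examples:
        results = search(query)  # assumed: list of labels, best first
        if results and results[0] == expected:
            hits += 1
    return hits / len(tagged_examples)


def check_no_regression(search, tagged_examples, baseline, tolerance=0.0):
    """True if accuracy has not dropped below the recorded baseline."""
    return top1_accuracy(search, tagged_examples) >= baseline - tolerance


# Hypothetical tagged data: (query text, expected label).
TAGGED = [
    ("London", "London, GB"),
    ("London, Ontario", "London, CA"),
    ("Paris", "Paris, FR"),
]


def toy_search(query):
    """Stand-in geocoder used only to exercise the harness."""
    table = {"London": ["London, GB", "London, CA"], "Paris": ["Paris, FR"]}
    return table.get(query, [])
```

Run on every change, a check like this flags the commit where accuracy drops, and the tagged set can keep growing as new failure cases turn up.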

tomaytotomato 1 month ago

Thanks for this - it's interesting that I have come to the same conclusion as well.

My reworked approach is to return a list of results with a probability or certainty score.

In the situation of someone searching for London, I need to add some sort of priority for London, UK.
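A rough sketch of what a ranked result list with a prior could look like, assuming population is used as the prior (one possible choice, not necessarily the right one). Candidate, rank and GAZETTEER are hypothetical names, and the population figures are rough illustrative values rather than data from the actual dataset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    country: str
    population: int


def rank(query, gazetteer):
    """Return (candidate, score) pairs for exact name matches, best first.

    The score is each match's share of the total population of all matches,
    so scores sum to 1 and large, well-known places come first.
    """
    wanted = query.casefold()
    matches = [c for c in gazetteer if c.name.casefold() == wanted]
    total = sum(c.population for c in matches)
    if total == 0:
        # No population data (or no matches): fall back to a uniform score.
        return [(c, 1.0 / len(matches)) for c in matches]
    scored = [(c, c.population / total) for c in matches]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Illustrative entries only; populations are approximate.
GAZETTEER = [
    Candidate("London", "US", 8_000),
    Candidate("London", "GB", 8_800_000),
    Candidate("London", "CA", 420_000),
]

if __name__ == "__main__":
    for candidate, score in rank("london", GAZETTEER):
        print(f"{candidate.name}, {candidate.country}: {score:.3f}")
```

A population prior is only a default. Context from the query (a country, a region, a nearby place name) should be able to override it, which is one more reason to return the whole list rather than a single value.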

My dataset is sourced from an open-source JSON file, which I am now pre-processing to identify all the name collisions in it.

There are so many collisions!
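For reference, the collision pass can be as small as grouping records by normalised name. This is a sketch under assumptions: the "name" and "country" fields and the find_collisions helper are hypothetical, since the real JSON schema isn't shown here.

```python
from collections import defaultdict


def find_collisions(places):
    """Group records by case-folded name and keep names shared by 2+ records."""
    by_name = defaultdict(list)
    for place in places:
        key = place["name"].strip().casefold()
        by_name[key].append(place)
    return {name: group for name, group in by_name.items() if len(group) > 1}


# Hypothetical records standing in for the pre-processed JSON file.
PLACES = [
    {"name": "London", "country": "GB"},
    {"name": "london", "country": "CA"},
    {"name": "Paris", "country": "FR"},
    {"name": "London ", "country": "US"},
]

if __name__ == "__main__":
    for name, group in find_collisions(PLACES).items():
        print(name, [p["country"] for p in group])
```

Case-folding and trimming are the bare minimum. Alternate spellings, transliterations and abbreviations would need their own normalisation step before grouping.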

Could I pick your brains and have you critique my approach? Thanks