Looks like it is hosted at Equinix in NL? Or maybe just part of it? Is it behind a load balancer, maybe something like HAProxy? If so, were stick tables set up to limit rates by cookie, require people to be logged in with unique accounts, and limit anonymous access after so many requests? I know limiting anonymous access is not great, but it is something that could be enabled under high load so that instead of the site going offline for everyone, it would just be limited for the anonymous users. Degradation vs. critical outage.
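(For illustration only, a rough HAProxy stick-table sketch of that idea -- the frontend name, cookie name, table size, threshold and cert path below are all invented, not OpenStreetMap's actual setup:)

    frontend fe_www
        bind :443 ssl crt /etc/haproxy/site.pem
        # Treat requests without a session cookie as anonymous.
        acl has_session req.cook(_osm_session) -m found
        # Count anonymous requests per source IP over a sliding 10-minute window.
        stick-table type ip size 1m expire 10m store http_req_rate(10m)
        http-request track-sc0 src if !has_session
        # Degrade instead of dying: 429 anonymous clients that blow the budget,
        # leave logged-in users alone.
        http-request deny deny_status 429 if !has_session { sc_http_req_rate(0) gt 300 }
        default_backend be_app   # placeholder backend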
On a separate note, have tcpdump captures been done on these excessive connections? Minus the IP, what do their SYN packets look like? Minus the IP, what do the corresponding log entries look like in the web server? Are they using HTTP/1.1 or HTTP/2.0? Are they missing any headers you would expect from a real person, such as Accept-Language or the Sec-Fetch-Mode values a browser sends (cors, no-cors, navigate)?
tcpdump -p --dont-verify-checksums -i any -nnvvv -B 32768 -c 32 -s 0 'port 443 and tcp[13] == 2'
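(And for the header side of the question, one option -- again assuming the same hypothetical HAProxy frontend -- is to capture the browser-identifying headers into the request log so missing Accept-Language / Sec-Fetch-* values stand out:)

    frontend fe_www
        # Captured samples show up between braces in the default HTTP log line.
        http-request capture req.hdr(User-Agent)      len 120
        http-request capture req.hdr(Accept-Language) len 32
        http-request capture req.hdr(Sec-Fetch-Mode)  len 16
        http-request capture req.hdr(Sec-Fetch-Site)  len 16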
Is there someone at OpenStreetMap that can answer these questions?
Disclosure: I am part of the mostly volunteer-run OpenStreetMap ops team.
Technically, we are able to block and restrict the scrapers after the initial request from an IP. We've seen 400,000 IPs in the last 24 hours, and each IP only makes a few requests. Most are not very good at faking browsers, but they are getting better (HTTP/1.1 vs HTTP/2, obviously faked headers, etc.).
The problem has been going on for over a year now. It isn't going away. We need journalists and others to help us push back.
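(A concrete sketch of the kind of mismatch check mentioned above -- an HTTP/1.1 request whose User-Agent claims a current Chrome. The assumption that real desktop Chrome on TLS almost always negotiates HTTP/2 is a heuristic, so this is illustrative, not what the ops team actually runs:)

    # Inside the (hypothetical) HAProxy frontend:
    acl is_h1         req.ver -m str 1.1
    acl claims_chrome req.fhdr(User-Agent) -m sub Chrome/
    # Tarpit and log first rather than hard-deny, since some real clients
    # still fall back to HTTP/1.1.
    http-request tarpit if is_h1 claims_chrome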
I think it could be worth trying to block them with TLS fingerprinting, or since they think it's residential proxies they are being hammered by, https://spur.us could be worth a try.
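(One low-effort way to test the TLS-fingerprinting idea, assuming packet captures like the ones discussed above and a reasonably recent tshark build that computes JA3 -- field names vary by version:)

    # Count how many ClientHellos share each JA3 hash; a handful of hashes
    # spread across hundreds of thousands of "residential" IPs would point at
    # a single automation stack behind them.
    tshark -r scrapers.pcap -Y 'tls.handshake.type == 1' -T fields -e tls.handshake.ja3 | sort | uniq -c | sort -rn | head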
I'll ask a dumb question: if they are "open source", then why are they bothered by it? Is it the scraping itself? Is their data not freely available for download?
Their data is freely available to download. There are weekly dumps of the entire planet and several sources for partial data. There's no need for most legitimate use cases to scrape their API.
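(For anyone tempted to scrape instead: the whole planet can be fetched in one go, for example over the published torrents. The filename below is illustrative only -- check planet.openstreetmap.org for the current index:)

    # aria2c handles .torrent files directly; the weekly planet PBF is a
    # multi-tens-of-GB single download.
    aria2c https://planet.openstreetmap.org/pbf/planet-latest.osm.pbf.torrent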
Make the data available through BitTorrent and IPFS. Redirect IPs that make excessive requests to a response only kilobytes in size: "use the torrents and IPFS".
As an SRE, I'd say the only legitimate concern here is the bandwidth cost. But QoS tuning should solve that too.
Supposedly technical people crying out for a journalist to help them is super lame. Everything about this looks super lame.
Perfect. Now all they need to do is set up the redirect.
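(Building on the hypothetical stick-table sketch earlier in the thread, that "redirect" could be as small as a canned 429 body -- thresholds and ACL names are the same invented ones as above:)

    # Answer over-limit anonymous clients with a tiny plain-text pointer to the
    # bulk downloads instead of serving the full page.
    http-request return status 429 content-type text/plain string "Rate limited; bulk data: https://planet.openstreetmap.org/" if !has_session { sc_http_req_rate(0) gt 300 }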
Every bot is doing something on behalf of a human. Now that LLMs can churn out half-assed bot scripts, every "look I installed Arch Linux and ohmyzsh" script kiddie has bots too.
Bots aren't going anywhere.
"Use the web the way it was over 10 years ago plox" isn't going to do it.
They also posted about this on Mastodon / Fedi: https://en.osm.town/@osm_tech/115968544599864782
https://xcancel.com/openstreetmap/status/2016320492420878531
https://nitter.poast.org/openstreetmap/status/20163204924208...
The number of idiotic vibe-coded repos I've seen on GH lately that are doing things like crawling OSM for POI data is mind-boggling!
#OpenStreetMap hammered by scrapers hiding behind residential proxy/embedded-SDK networks.
More like hammered by Google and Apple so you'll use their apps instead.
Unlikely. The data is freely available for download from Geofabrik and other sources.
So the problem is that someone is stupid enough to scrape without realizing they can just download 100 GB at once?
And there are so many such idiots that it's overwhelming their servers?
Something doesn't math here.
Someone has to pay for bandwidth. And that someone would like the bandwidth to go to human users.
That data is already available. Including torrents.
https://planet.openstreetmap.org/
Disclosure: I am part of the OpenStreetMap mostly-volunteer sysadmin team fighting this.
The scrapers try hard to make themselves look like valid browsers, sending requests via residential IP addresses (400,000+ IPs at last count).
I reached out to journalists because, despite strong technical measures, the abuse will not go away on its own.