Comment by PaulDavisThe1st
3 months ago
Several people in the comments seem to be blaming Github for taking this step for no apparent reason.
Those of us who self-host git repos know that this is not true. Over at ardour.org, we've passed 1M unique IPs banned due to AI trawlers sucking our repository one commit at a time. It was killing our server before we put fail2ban to work.
I'm not arguing that the specific steps Github have taken are the right ones. They might be, they might not, but they do help to address the problem. Our choice for now has been based on noticing that the trawlers are always fetching commits, so we tweaked things such that the overall http-facing git repo works, but you cannot access commit-based URLs. If you want that, you need to use our github mirror :)
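(For anyone curious what that kind of block looks like in practice, here is a minimal sketch. It assumes an nginx front end and cgit/gitweb-style per-commit URLs -- the actual server and paths at ardour.org aren't described here, so treat the hostname, port and pattern as illustrative.)

    # Hypothetical nginx server block fronting an HTTP-facing git/cgit instance.
    server {
        listen 80;
        server_name git.example.org;            # placeholder hostname

        # Per-commit pages are what the trawlers hammer; hide just those.
        location ~* /commit/ {
            return 404;
        }

        # Everything else (clone/fetch over HTTP, summaries, logs) still works.
        location / {
            proxy_pass http://127.0.0.1:8080;   # assumed cgit/gitweb backend
        }
    }

Pointing fail2ban at the access log for clients that keep hitting those 404s is one way the two pieces fit together.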
Only they didn't start doing this just now. For many years, GitHub has been gradually crippling unauthenticated browsing to gauge the response. When unauthenticated, code search doesn't work at all and issue search stops working after, like, 5 clicks at best.
This is egregious behavior because Microsoft hasn't been upfront about any of it. Many open source projects are probably unaware that their issue tracker has been walled off, creating headaches for their users without the maintainers even knowing.
Just sign in, problem solved. It baffles me that a site can provide a useful service that costs money to run, and all you need to do to use it is create a free account -- and people still find that egregious.
That's not how consent works. GitHub captured the open source ecosystem under the premise that its code and issue trackers would remain open to all. Silently changing the deal afterwards is reprehensible.
> Several people in the comments seem to be blaming Github for taking this step for no apparent reason.
I mean...
* Github is owned by Microsoft.
* The reason for this is AI crawlers.
* The reason AI crawlers exist in such masses is the absurd hype around LLM+AI technology.
* The reason for that is... ChatGPT?
* The main investor of ChatGPT happens to be...?
almost like we bomb children because a politician told us to think of the children. crazy.
That is also a problem on a side project I've been running for several years. It is built on a heavily rate-limited third-party API, and the main problem is that bots often cause (huge) traffic spikes which essentially DDoS the application. Luckily, in my specific case a large part of these bots can easily be detected from their behaviour. I started serving them trash data and have not been DDoSed since.
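(The comment doesn't say what the behavioural tell is, so the sketch below is a generic illustration only, using a simple per-IP request-rate threshold. Clients that blow past the limit get cheap static junk instead of anything that would touch the rate-limited upstream API. Names and thresholds are made up.)

    import time
    from collections import defaultdict, deque
    from wsgiref.simple_server import make_server

    WINDOW_SECS = 60
    MAX_REQS_PER_WINDOW = 30          # assumed threshold, tune for real traffic
    hits = defaultdict(deque)         # ip -> timestamps of recent requests

    def looks_like_a_bot(ip):
        # Crude heuristic: too many requests inside the sliding window.
        now = time.time()
        q = hits[ip]
        q.append(now)
        while q and now - q[0] > WINDOW_SECS:
            q.popleft()
        return len(q) > MAX_REQS_PER_WINDOW

    def app(environ, start_response):
        ip = environ.get("REMOTE_ADDR", "unknown")
        start_response("200 OK", [("Content-Type", "application/json")])
        if looks_like_a_bot(ip):
            # Trash data: syntactically valid, semantically worthless.
            return [b'{"items": []}']
        return [b'{"items": ["real data would come from the upstream API here"]}']

    if __name__ == "__main__":
        make_server("", 8000, app).serve_forever()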
Have you noticed significant slowdown and CPU usage from fail2ban with that many banned IPs? I saw it becoming a huge resource hog with far fewer IPs than that.
Yeah, when we hit about 80-100k banned hosts, iptables causes issues.
There are versions of iptables available that apparently can scale to 1M+ addresses, but our approach is just to unban all at that point, and then let things accumulate again.
Since we began responding with 404 to all commit URLs, the rate of banned-address accumulation has slowed down quite a bit.
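(On the iptables scaling point: the usual trick is to back the bans with an ipset, so the kernel checks membership in one hash set instead of walking tens of thousands of individual rules. A minimal jail.local sketch, assuming a stock fail2ban install with its bundled ipset actions; the jail name, filter and thresholds are hypothetical.)

    # /etc/fail2ban/jail.local (sketch)
    [git-trawlers]
    enabled   = true
    filter    = git-trawlers          # hypothetical filter matching commit-URL 404s
    logpath   = /var/log/nginx/access.log
    maxretry  = 10                    # thresholds are illustrative
    findtime  = 600
    bantime   = 86400
    # Bans go into a kernel ipset; iptables then needs only a single rule that
    # tests set membership, which stays fast with hundreds of thousands of entries.
    banaction = iptables-ipset-proto6-allports

nftables sets solve the same problem if you've already moved off iptables.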
you mean AI crawlers from Microsoft, owners of Github?
The big companies tend to respect robots.txt. The problem is other, unscrupulous actors use fake user agents and residential IPs and don't respect robots.txt or act reasonably.
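(For reference, "respecting robots.txt" means honouring rules like the sketch below -- these user-agent tokens are the published ones for a few large crawlers, and compliance is entirely voluntary, which is the whole problem.)

    # robots.txt (sketch) - only works against crawlers that choose to obey it
    User-agent: GPTBot
    Disallow: /

    User-agent: ClaudeBot
    Disallow: /

    User-agent: CCBot
    Disallow: /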
Big companies have thrown robots.txt to the wind when it comes to their precious AI models.
I have no idea where they are from. I'd be surprised if MS is using a network of 1M+ residential IP addresses, but they've surprised me before ...
Surely most AI trawlers have special support for git and just clone the repo once?
The AI companies could do work or they could not do work.
They've pretty widely chosen to not do work and just slam websites from proxy IPs instead.
You would think they'd use their own products to do the work, if those worked as well as advertised...
I think you vastly overestimate the average dev and their care for handling special cases that are mostly other people’s aggregate problem.
Can’t they use the AIs to do it?
not if you vibe coded your crawler
Apparently, the vibe coding session didn't account for it. /s
I would more readily assume a large social networking company filled with bright minds would have worked out some kind of agreement on, say, a large corpus of copyrighted training data before using it.
It's the wild wild west right now. Data is king for AI training.