I don’t think the publication date (May 8, as I type this) on the GitHub blog article is the same date this change became effective.
From a long-term, clean network I have been consistently seeing these “whoa there!” secondary rate limit errors for over a month when browsing more than 2-3 files in a repo.
My experience has been that once they’ve throttled your IP under this policy, you cannot even reach a login page to authenticate. The docs direct you to file a ticket (if you’re a paying customer, which I am) if you consistently get that error.
I was never able to file a ticket when this happened because their rate limiter also applies to one of the required backend services that the ticketing system calls from the browser. Clearly they don’t test that experience end to end.
Maybe they expect you to file the ticket from a different IP.
60 req/hour for unauthenticated users
5000 req/hour for authenticated - personal
15000 req/hour for authenticated - enterprise org
According to https://docs.github.com/en/rest/using-the-rest-api/rate-limi...
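For anyone wanting to verify which bucket they're actually in, the REST API's /rate_limit endpoint reports your current limits and, per the docs, does not count against the quota. A minimal Python sketch, assuming an optional token in the GITHUB_TOKEN environment variable:

    import json
    import os
    import urllib.request

    # Query GitHub's /rate_limit endpoint; unauthenticated requests show the
    # low anonymous bucket, authenticated ones the 5000/hr (or higher) bucket.
    req = urllib.request.Request(
        "https://api.github.com/rate_limit",
        headers={"Accept": "application/vnd.github+json"},
    )
    token = os.environ.get("GITHUB_TOKEN")  # optional
    if token:
        req.add_header("Authorization", f"Bearer {token}")

    with urllib.request.urlopen(req) as resp:
        core = json.load(resp)["resources"]["core"]

    print(f"limit={core['limit']} remaining={core['remaining']} reset={core['reset']}")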
I bump into this just browsing a repo's code (unauthenticated). It seems like one of the side effects of the AI rush.
Why would the changelog update not include this? It's the most salient piece of information.
I thought I was just misreading it and failing to see where they stated what the new rate limits were, since that's what anyone would care about when reading it.
> Why would the changelog update not include this?
I don't know. The limits in the comment that you're replying to are unchanged from where they were a year ago.
So far I don't see anything that has changed, and without an explanation from GitHub I don't think we'll know for sure what has changed.
Because it will go way lower soon, and because they don't have to.
They already have all your code. They've won.
1 request a minute?! Wow, that's just absurd; you hit it just by looking through code.
I opened a repo in a spare computer browser and clicked on a couple things and got a rate limit error. It feels effectively unusable unless you're logged in now (couldn't search from before, now you can't even browse).
Agreed. When I first read the title I thought "oh, what did they up the rates to?" Then I realized it's more of a "downgraded rate limits."
Thanks, GitHub, for the worse experience.
I've hit this over the past week browsing the web UI. For some reason, GitHub sessions are set really short, and you don't realise you're not logged in until you get the error message.
I really wish github would stop logging me out.
Hmmmm, GitHub keeps me logged in for months, I feel like. Unless I'm misunderstanding the GitHub security logs, my current login has been active since March.
Something strange is going on. I think GH has kept me logged in for months at a time. I honestly can’t remember the last time I had to authenticate.
Yes, it's not the rate limits that are the problem per se but GitHub's tendency to log you out and make you go through 2fa.
If they would let me stay logged in for a year then I wouldn't care so much.
1/min? That’s insanely low.
60/hr is not the same as 1/min, unless you're trying to continually make as many requests as possible, like a crawler. And if that is your use case, then your traffic is probably exactly what they're trying to block.
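To make the arithmetic concrete, here's a toy trailing-window counter (my own illustration, not GitHub's actual accounting): a burst of clicks briefly exceeds 1/min but barely dents a 60/hr budget, while sustained crawler traffic exhausts it.

    # Toy hourly budget: a burst of 10 clicks in under a minute uses 10 of the
    # 60-request allowance; only sustained traffic exhausts it.
    LIMIT_PER_HOUR = 60

    def requests_remaining(request_timestamps, now):
        """Count requests in the trailing hour and return what's left."""
        recent = [t for t in request_timestamps if now - t < 3600]
        return LIMIT_PER_HOUR - len(recent)

    burst = [5.0 * i for i in range(10)]        # 10 clicks within a minute
    crawler = [50.0 * i for i in range(144)]    # one request every 50s for ~2 hours
    print(requests_remaining(burst, now=600))    # 50 left: fine for a human
    print(requests_remaining(crawler, now=7200)) # negative: the crawler is cut off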
I bump into these limits just using a few public install scripts for things like Atuin, Babashka, and Clojure on a single machine on my home IP. They're way too low to be reasonable.
Several people in the comments seem to be blaming Github for taking this step for no apparent reason.
Those of us who self-host git repos know that this is not true. Over at ardour.org, we've passed 1M unique IPs banned due to AI trawlers sucking down our repository one commit at a time. It was killing our server before we put fail2ban to work.
I'm not arguing that the specific steps Github have taken are the right ones. They might be, they might not, but they do help to address the problem. Our choice for now has been based on noticing that the trawlers are always fetching commits, so we tweaked things such that the overall http-facing git repo works, but you cannot access commit-based URLs. If you want that, you need to use our github mirror :)
Only they didn't just start doing this now. For many years, GitHub has been crippling unauthenticated browsing, doing it gradually to gauge the response. When unauthenticated, code search doesn't work at all and issue search stops working after like, 5 clicks at best.
This is egregious behavior because Microsoft hasn't been upfront about it. Many open source projects are probably unaware that their issue tracker has been walled off, creating headaches unbeknownst to them.
Just sign in, problem solved. It baffles me that a site can provide a useful service that costs money to run, and all you need to do to use it is create a free account -- and people still find that egregious.
> Several people in the comments seem to be blaming Github for taking this step for no apparent reason.
I mean...
* Github is owned by Microsoft.
* The reason for this are AI crawlers.
* The reason AI crawlers exist in masses is an absurd hype around LLM+AI technology.
* The reason for that is... ChatGPT?
* The main investor of ChatGPT happens to be...?
almost like we bomb children because a politician told us to think of the children. crazy.
That is also a problem on a side project I've been running for several years. It is based on a heavily rate-limited third-party API. And the main problem is that bots often cause (huge) traffic spikes which essentially DDoSes the application. Luckily, a large part of these bots can easily be detected based on their behaviour in my specific case. I started serving them trash data and have not been DDoSed since.
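For anyone curious what that can look like in practice, here's a minimal WSGI-style sketch; the heuristic and thresholds are made-up placeholders, not the parent's actual setup:

    from wsgiref.simple_server import make_server

    SUSPICIOUS_HITS = {}          # crude per-IP counter for a telltale pattern

    def looks_like_bot(environ):
        # Hypothetical heuristic: clients hammering deep pagination URLs with
        # no referer are treated as scrapers. Tune to whatever your bots do.
        ip = environ.get("REMOTE_ADDR", "?")
        if "/page/" in environ.get("PATH_INFO", "") and not environ.get("HTTP_REFERER"):
            SUSPICIOUS_HITS[ip] = SUSPICIOUS_HITS.get(ip, 0) + 1
        return SUSPICIOUS_HITS.get(ip, 0) > 20

    def app(environ, start_response):
        if looks_like_bot(environ):
            # Cheap, static garbage: costs almost nothing to serve and poisons
            # the scraper's dataset instead of loading the real backend/API.
            start_response("200 OK", [("Content-Type", "text/plain")])
            return [b"lorem ipsum " * 50]
        start_response("200 OK", [("Content-Type", "text/plain")])
        return [b"real response would be rendered here"]

    if __name__ == "__main__":
        make_server("", 8000, app).serve_forever()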
Have you noticed significant slowdown and CPU usage from fail2ban with that many banned IPs? I saw it becoming a huge resource hog with far fewer IPs than that.
Yeah, when we hit about 80-100k banned hosts, iptables causes issues.
There are versions of iptables available that apparently can scale to 1M+ addresses, but our approach is just to unban all at that point, and then let things accumulate again.
Since we began responding with 404 to all commit URLs, the rate of banned-address accumulation has slowed down quite a bit.
You mean AI crawlers from Microsoft, owners of GitHub?
The big companies tend to respect robots.txt. The problem is other, unscrupulous actors use fake user agents and residential IPs and don't respect robots.txt or act reasonably.
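For reference, honouring robots.txt costs a crawler author almost nothing; Python even ships a parser in the standard library. A small sketch with a placeholder URL and user agent:

    from urllib.robotparser import RobotFileParser

    # A well-behaved crawler checks the target site's robots.txt before fetching.
    rp = RobotFileParser("https://example.com/robots.txt")
    rp.read()

    user_agent = "ExampleResearchBot/1.0"      # placeholder UA string
    url = "https://example.com/some/commit/abc123"
    if rp.can_fetch(user_agent, url):
        print("allowed to fetch", url)
    else:
        print("robots.txt disallows", url, "- skipping")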
I have no idea where they are from. I'd be surprised if MS is using a network of 1M+ residential IP addresses, but they've surprised me before ...
Surely most AI trawlers have special support for git and just clone the repo once?
The AI companies could do work or they could not do work.
They've pretty widely chosen to not do work and just slam websites from proxy IPs instead.
You would think they'd use their own products to do the work, if those products worked as well as advertised...
I think you vastly overestimate the average dev and their care for handling special cases that are mostly other people’s aggregate problem.
Not if you vibe coded your crawler.
Apparently, the vibe coding session didn't account for it. /s
I would more readily assume a large social networking company filled with bright minds would have worked out some kind of agreement on, say, a large corpus of copyrighted training data before using it.
It's the wild wild west right now. Data is king for AI training.
If a company the size of MS isn't able to handle the DoS caused by the LLM slurpers, then it really is game over for the open internet. We are going to need government-approved, ID-based logins just to read the adverts at this rate.
But this feels like a further attempt to create a walled garden around 'our' source code. I say our, but the first push to KYC, asking for phone numbers, was enough for me to delete everything and close my account. Being on the outside, it feels like those walls get taller every month. I often see an interesting project mentioned on HN and clone the repo, but more and more often that fails. Browsing online is now limited, and they recently disabled search without an account.
For such a critical piece of worldwide technology infrastructure, maybe it would be better run by a not-for-profit independent foundation. I guess, since it is just git, anyone could start this, and migration would be easy.
I'm pretty sure they can handle it, but given their continuous (if somewhat bittersweet) relationship with OpenAI, I'm pretty sure they are just trying to protect "their IP" or something.
These exist, and you can self host.
However, a lot of people think Github is the only option, and it benefits from network effects.
Non-profit alternatives suffer from a lack of marketing and deal making. True of most things these days.
They also don’t have the resources to ensure perf and reliability if they get really popular, or to invest in UI and other goodness.
Still great for some applications and developers, but not all.
> Non-profit alternatives suffer from a lack of marketing and deal making
Sad but true. I’m trying to promote these whenever I can.
you mean https://savannah.gnu.org?
Or maybe https://codeberg.org/.
Codeberg, Gitea, Forgejo.
> These changes will apply to operations like cloning repositories over HTTPS, anonymously interacting with our REST APIs, and downloading files from raw.githubusercontent.com.
Or randomly when clicking through a repository file tree. The first time I hit a rate limit was when I was skimming through a repository on my phone, and by about the 5th file I clicked I was denied and locked out. Not for a few seconds, either; it lasted long enough that I gave up on waiting and refreshing every ~10 seconds.
This can affect hosting databases in GitHub repositories.
Yes, it doesn't look like intended service usage, but I used it for a demo: https://github.com/ClickHouse/web-tables-demo/
Anyway, will try to do the same with GitHub pages :)
Does it seem to anyone like eventually the entire internet will be login only?
At this point knowledge seems to be gathered and replicated to great effect and sites that either want to monetize their content OR prevent bot traffic wasting resources seem to have one easy option.
Static, near-static (not generated on demand; regenerated only on real content updates), and login-only seem like the likely outcomes.
AI not caching things is a real issue. Sites being difficult TO cache / failing the 'wget mirror test' is the other side of the issue.
What about AI not respecting robots.txt? I myself have never run into this, but I've seen complaints from many people who have.
This means it's no longer safe to point to github-hosted repos in `git:` or `github:` dependencies in ruby bundler, yes?
I forget because I don't use them, but weren't there some products meant as dependency package repositories that github had introduced at some point, for some platforms? Does this apply to them? (I would hope not unless they want to kill them?)
This rather enormously changes github's potential place in ecosystems.
What with the poor announcement/rollout -- also unusual for what we expect of github, if they had realized how much this affects -- I wonder if this was an "emergency" thing not fully thought out in response to the crazy decentralized bot deluge we've all been dealing with. I wonder if they will reconsider and come up with another solution -- this one and the way it was rolled out do not really match the ingenuity and competence we usually count on from github.
I think it will hurt github's reputation more than they realize if they don't provide a lot more context, with suggested workarounds for various use cases, and/or a rollback. This is actually quite an impactful change, in a way that the subtle rollout seems to suggest they didn't realize?
Are the scraper sites using a large number of IP addresses, like a distributed denial of service attack? If not, rather than explicit blocking, consider using fair queuing. Do all the requests from IP addresses that have zero requests pending. Then those from IP addresses with one request pending, and so forth. Each IP address contends with itself, so making massive numbers of requests from one address won't cause a problem.
I put this on a web site once, and didn't notice for a month that someone was making queries at a frantic rate. It had zero impact on other traffic.
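A rough sketch of that idea (not anything GitHub runs, and ignoring all the distributed-systems plumbing): keep a queue per client IP and always serve the client with the fewest requests already waiting or in flight.

    from collections import defaultdict, deque

    class FairQueue:
        """Serve the IP with the fewest pending/in-flight requests first,
        so a greedy client only competes with itself."""
        def __init__(self):
            self.queues = defaultdict(deque)    # ip -> waiting requests
            self.pending = defaultdict(int)     # ip -> waiting + in-flight count

        def enqueue(self, ip, request):
            self.queues[ip].append(request)
            self.pending[ip] += 1

        def next_request(self):
            # Pick the IP with the smallest pending count that has work waiting.
            candidates = [ip for ip, q in self.queues.items() if q]
            if not candidates:
                return None
            ip = min(candidates, key=lambda ip: self.pending[ip])
            return ip, self.queues[ip].popleft()

        def done(self, ip):
            # Call when a request finishes so the IP's load count drops again.
            self.pending[ip] -= 1

    # A scraper enqueueing 1000 requests still waits behind every
    # one-request client.
    fq = FairQueue()
    for i in range(1000):
        fq.enqueue("203.0.113.7", f"scrape-{i}")
    fq.enqueue("198.51.100.2", "normal-user-request")
    print(fq.next_request())   # ('198.51.100.2', 'normal-user-request')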
Exactly that. It's an arms race between companies that offer a large number of residential IPs as proxies and companies that run unauthenticated web services trying not to die from denial of service.
https://brightdata.com/
Huh, that sounds very reasonable, and it's the first time I've heard it mentioned. Why isn't this more widespread?
Complex, stateful.
I'm not even sure what that would look like for a huge service like GitHub. Where do you hold those many thousands of concurrent http connections and their pending request queues in a way that you can make decisions on them while making more operational sense than a simple rate limit?
A lot of things would be easy if it were viable to have one big all-knowing giga load balancer.
I remember Rap Genius wrote a blog post whining that Heroku did random routing to their dynos instead of routing to the dyno with the shortest request queue. As opposed to just making an all-knowing infiniscaling giga load balancer that knows everything about the system.
Because it doesn't help against DDoS attacks with bogus request sources.
It's a good mitigation when you have legit requests, and some requestors create far more load than others. If Github used fair queuing for authenticated requests, heavy users would see slower response, but single requests would be serviced quickly. That tends to discourage overdoing it.
Still, if "git clone" stops working, we're going to need a Github alternative.
Yes, LLM-era scrapers are frequently making use of large numbers of IP addresses from all over the place. Some of them seem to be bot nets, but based on IP subnet ownership it seems also pretty frequently to be cloud companies, many of them outside the US. In addition to fanning out to different IPs, many of the scrapers appear to use User Agent strings that are randomised, or perhaps in some cases themselves generated by the slop factory. It's pretty fucking bleak out there, to be honest.
Sounds like a violation of the Computer Fraud and Abuse Act. If a big company training an LLM is doing that, it should be possible to find them and have them prosecuted.
Wow, I'm realizing this applies to even browsing files in the web UI without being logged in, and the limits are quite low?
This rather significantly changes the place of github hosted code in the ecosystem.
I understand it is probably a response to the ill-behaved decentralized bot-nets doing mass scraping with cloaked user-agents (that everyone assumes is AI-related, but I think it's all just speculation and it's quite mysterious) -- which is affecting most of us.
The mystery bot net(s) are kind of destroying the open web, by the counter-measures being chosen.
What does “secondary” stand for here in the error message?
> You have exceeded a secondary rate limit.
Edit and self-answer:
> In addition to primary rate limits, GitHub enforces secondary rate limits
(…)
> These secondary rate limits are subject to change without notice. You may also encounter a secondary rate limit for undisclosed reasons.
https://docs.github.com/en/rest/using-the-rest-api/rate-limi...
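The same docs describe how to back off when you hit one: wait for the retry-after header if present, otherwise until x-ratelimit-reset. A rough client-side sketch, assuming those headers are returned:

    import time
    import urllib.error
    import urllib.request

    def get_with_backoff(url, max_tries=5):
        """Fetch a GitHub API URL, sleeping as the rate-limit headers ask."""
        for attempt in range(max_tries):
            try:
                with urllib.request.urlopen(url) as resp:
                    return resp.read()
            except urllib.error.HTTPError as e:
                if e.code not in (403, 429):
                    raise
                retry_after = e.headers.get("retry-after")
                reset = e.headers.get("x-ratelimit-reset")
                if retry_after:                      # secondary limit: wait as told
                    time.sleep(int(retry_after))
                elif reset:                          # primary limit: wait for the reset
                    time.sleep(max(0, int(reset) - int(time.time())))
                else:
                    time.sleep(60 * (attempt + 1))   # fallback: increasing pause
        raise RuntimeError(f"still rate limited after {max_tries} attempts: {url}")

    # e.g. get_with_backoff("https://api.github.com/repos/torvalds/linux")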
I assume they're trying to keep AI bots from strip mining the whole place.
Or maybe your IP/browser is questionable.
What's being strip mined is the openness of the Internet, and AI isn't the one closing up shop. Github was created to collaborate on and share source code. The company in the best position to maximize access to free and open software is now just a dragon guarding other people's coins.
The future is a .txt file of John Carmack pointing out how efficient software used to be, locked behind a repeating WAF captcha, forever.
AI isn't the one closing up shop, it’s the one looting all the stores and taking everything that isn’t bolted down. The AI companies are bad actors that are exploiting the openness of the internet in a fashion that was obviously going to lead to this result - the purpose of these scrapers is to grab everything they can and repackage it into a commercial product which doesn’t return anything to the original source. Of course this was going to break the internet, and people have been warning about that from the first moment these jackasses started - what the hell else was the outcome of all this going to be?
Free and open source software is on GitHub, but AI and other crawlers do not respect the licenses. As someone who writes a lot of code under specific FOSS licenses, I welcome any change that makes it harder for machines to take my code and just steal it.
I encountered this on GitHub last week. Very aggressive rate limiting. My browser and IP are very ordinary.
Since Microsoft is struggling to make ends meet, maybe they could throw up a captcha or a proof-of-work check like Anubis by Xe Iaso.
They already disabled code search for unauthenticated users. It's totally plausible they will disable code browsing as well.
That hit me, too. I thought it was an accidental bug and didn’t realize it was actually malice.
Just sign in if it's an issue for your usage.
> Or maybe your IP/browser is questionable.
I'm using Firefox and Brave on Linux from a residential internet provider in Europe and the 429 error triggers consistently on both browsers. Not sure I would consider my setup questionable considering their target audience.
I’m browsing from an iPhone in Europe right now and can browse source code just fine without being logged in.
*Other AI bots; MS will obviously mine anything on there.
Personally, I like sourcehut (sr.ht)
Same way Reddit sells all its content to Google, then stops everyone else from getting it. Same way Stack Overflow sells all its content to Google, then stops everyone else from getting it.
(Joke's on Reddit, though, because Reddit content became pretty worthless since they did this, and everything before they did this was already publicly archived)
Other bots or MS bots too?
The truth is this won't actually stop AI crawlers and they'll just move to a large residential proxy pool to work around it. Not sure what the solution is honestly.
Criminal charges under CFAA to actual CEOs of actual companies doing this, with long jail terms.
I don't know if I ever recall seeing a CEO go to jail for practically anything, ever. I'm sure there are examples, but at this point in my life I have derived a rule of thumb of "if you want to commit a crime, just disguise it as a legitimate business," based on seeing so many CEOs get off scot-free.
Criminally charging Russian and Chinese companies does not work. The solution would be to drop these countries off the internet if we want to play hard.
The US cannot even stop NSO from hacking systems with spyware, and Israel is a political ally.
At GitHub's scale, crawlers will run out of IP addresses even if they use residential addresses.
The blog post is tagged with "improvement" - ironic for more restrictive rate limits.
Also, neither the new nor the old rate limits are mentioned.
A take that I'm not seeing in all the "LLM scrapers are heading to our site, run for your lives!" threads is this:
Why can't people harden their software with guards? Proper DDoS protection? Better caching? Rewrite the hot paths in C, Rust, Zig, Go, Haskell etc.?
It strikes me as very odd, the atmosphere of these threads. So much doom and gloom. If my site was hit by an LLM scraper I'd be like "oh, it's on!", a big smile, and I'll get to work right away. And I'll have that work approved because I'll use the occasion to convince the executives of the need. And I'll have tons of fun.
Can somebody offer a take on why we, the forefront of the tech sector, are just surrendering almost without firing a single shot?
Because our sites are written in layers of abstraction and terrible design, which leads to requests taking serious server resources. If we hosted everything "well", you'd get 10-20k req/s per CPU core, but we don't.
True. I am simply wondering -- is the resistance from executives so powerful that it can never be overcome? Can't we ever just tell them "Look, this is like your car with plastic suspension -- it will work for a few days or even months but we can't rely on it forever; it's time to do it properly"?
Especially when the car's plastic suspension is costing them extra money? I don't get it here, for real. I would think that selfish capitalistic interests would have them come around at some point! (Clarification: invest $5M for a year before the whole thing starts costing you an extra $30M a year, for example.)
And don't even get me started on the fact that GitHub is written in one of the most hardware-inefficient web frameworks (Rails). I mean OK, Rails is absolutely great for many things, because most people's business will never scale that much, so the initial boost in developer velocity is an unquestionable one-sided win. I get that, and I stopped hating Rails a long time ago (even though I dislike it, I do recognize where it's a solid and even preferred choice). But I've made a lot of money trying to modernize and maintain Rails monoliths; it's just not suited beyond a certain scale, not without paying for extremely expensive consultants, that is. Everything can be made to work, but it starts costing exponentially more from a certain scale onward.
And yet nobody at GitHub figures "Maybe it's time we rewrite some of the hot paths?" or just "Make more aggressive caching even if it means some users see data outdated by 30 seconds or so"? Nothing at all?
Sorry, I am kind of ranting and not really saying anything to you per se. I am just very puzzled about how paralyzed GitHub seems under Microsoft.
This was announced https://github.blog/changelog/2025-05-08-updated-rate-limits...
(This was originally posted as a reply to https://news.ycombinator.com/item?id=43981344 but we're merging the threads)
Doesn't make it any better.
Collateral damage of AI I guess
It's even more hilarious because this time it's Microsoft/Github getting hit by it. (It's funny because MS themselves are such a bad actor when it comes to AIAIAI).
Most of these unauthenticated requests are read-only.
All of public github is only 21TB. Can't they just host that on a dumb cache and let the bots crawl to their heart's content?
I guess you're getting the size from the Arctic Code Vault? https://github.blog/news-insights/company-news/github-archiv... That was 5 years ago and is presumably in git's compressed storage format. Caching the corresponding GitHub HTML would take significantly more.
You're talking about the 21TB captured to the arctic code vault, but that 21TB isn't "all of public github"
Quoting from https://archiveprogram.github.com/arctic-vault/
> every *active* public GitHub repository. [active meaning any] repo with any commits between [2019-11-13 and 2020-02-02 ...] The snapshot consists of the HEAD of the default branch of each repository, minus any binaries larger than 100KB in size
So no files larger than 100KB, no commit history, no issues or PR data, no other git metadata.
If we look at this blog post from 2022, the number we get is 18.6 PB for just git data https://github.blog/engineering/architecture-optimization/sc...
Admittedly, that includes private repositories too, and there's no public number for just public repositories, but I'm certain it's at least a noticeable fraction of that ~19PB.
> At GitHub, we store a lot of Git data: more than 18.6 petabytes of it, to be precise.
About $250,000 for 1,000 HDDs and you get all the data. Meaning private individuals, such as top FAANG engineers, could afford a copy of the whole dataset with 2-3 years of salary. For companies dealing in AI, such a raw price is nothing at all.
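The rough arithmetic behind that figure, with drive size and price as my own assumptions rather than anything from the post:

    # Rough cost of privately mirroring GitHub's ~18.6 PB of git data,
    # assuming ~20 TB drives at ~$250 each and ignoring servers, redundancy,
    # bandwidth, and the time it would take to actually pull it all.
    total_pb = 18.6
    drive_tb = 20
    drive_cost_usd = 250

    drives_needed = total_pb * 1000 / drive_tb     # ~930 drives
    raw_cost = drives_needed * drive_cost_usd       # ~$232,500
    print(f"{drives_needed:.0f} drives, about ${raw_cost:,.0f} in raw disk")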
Also https://github.com/orgs/community/discussions/157887 "Persistent HTTP 429 Rate Limiting on *.githubusercontent.com Triggered by Accept-Language: zh-CN Header" but the comments show examples with no language headers.
I encountered this too once, but thought it was a glitch. Worrying if they can't sort it.
I remember getting this error a few months ago; this does not seem like a temporary glitch. They don't want LLM makers to slurp all the data.
Isn't git clone faster than browsing the web UI?
Yep. But AI trawlers don't use it. Ask them why.
It's good that tools like Homebrew, which rely heavily on GitHub, usually support environment variables like GITHUB_TOKEN.
Did I miss where it says what the new rate limits are? Or are they secret?
Even with authenticated requests, viewing a pull request and adding `.diff` to the end of the URL is currently ratelimited at 1 request per minute. Incredibly low, IMO.
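If you're scripting this, the REST API can return the same diff for a pull request via its diff media type, which at least uses the normal authenticated quota instead of whatever the web `.diff` endpoint allows. A sketch with placeholder owner/repo/number and a token assumed in GITHUB_TOKEN:

    import os
    import urllib.request

    # Fetch a pull request as a unified diff via the REST API instead of the
    # web-UI `.diff` URL. Owner/repo/PR number here are placeholders.
    owner, repo, number = "octocat", "hello-world", 1
    req = urllib.request.Request(
        f"https://api.github.com/repos/{owner}/{repo}/pulls/{number}",
        headers={
            "Accept": "application/vnd.github.diff",   # ask for raw diff output
            "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
        },
    )
    with urllib.request.urlopen(req) as resp:
        print(resp.read().decode())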
This is going to make the job of package managers a PITA. Especially Nix.
Probably to throttle scraping from AI competitors, and have them pay for the privilege as many other services have been doing
How would this affect Go dependencies?
Go doesn't pull dependencies directly from GitHub, they are pulled from https://proxy.golang.org/ by default
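You can see that indirection directly: the module proxy speaks a plain HTTP protocol, so version lists and source archives come from proxy.golang.org rather than from GitHub on every build. A quick illustration, using github.com/stretchr/testify as an example module:

    import urllib.request

    # The Go module proxy protocol: version lists (plus .info, .mod, and .zip
    # files) are served by proxy.golang.org, not fetched from GitHub by each client.
    module = "github.com/stretchr/testify"       # example module path
    url = f"https://proxy.golang.org/{module}/@v/list"
    with urllib.request.urlopen(url) as resp:
        versions = resp.read().decode().split()
    print(f"{len(versions)} versions of {module} available via the proxy")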
Time for Mozilla (and other open-source projects) to move repositories to sourcehut/Codeberg or self-hosted Gitlab/Forgejo?
Moving to an alternative GitHub experience works, until the crawlers go after the alternative too and force it to impose the same rate limits as well.
Not Mozilla.
Once again people post in the "community", but nobody official replies; these discussion-pages are just users shouting into the void.
You mean you want to better track users.
See also: https://github.com/orgs/community/discussions/159123
It sucks that we've collectively surrendered the urls to our content to centralized services that can change their terms at any time without any control. Content can always be moved, but moving the entire audience associated with a url is much harder.
Gitea [1] is honestly awesome and lightweight. I've been running my own for years, and since they've put Actions in a while ago (with GitHub compatibility) it does everything I need it to. It doesn't have all the AI stuff in it (but for some that's a positive :P)
1. https://about.gitea.com/
Gitea’s been great, but I think a lot of its development has moved to Forgejo: https://forgejo.org/
That’s what I run on my personal server now.
https://github.com/orgs/community/discussions/157887 This has been going on for weeks and is clearly not a simple mistake.
(We detached this subthread from https://news.ycombinator.com/item?id=43981673 so we could include it in the merged thread)
Triggered by Chinese language on the client side? Interesting.
Just tried it on chrome incognito on iOS and do hit this 429 rate limit :S That sucks, it’s already bad enough when GitHub started enforcing login to even do a simple search.