Comment by Animats

3 months ago

Are the scraper sites using a large number of IP addresses, like a distributed denial of service attack? If not, rather than explicit blocking, consider using fair queuing. Serve all the requests from IP addresses that have zero requests pending, then those from IP addresses with one request pending, and so forth. Each IP address contends with itself, so making massive numbers of requests from one address won't cause a problem.

I put this on a web site once, and didn't notice for a month that someone was making queries at a frantic rate. It had zero impact on other traffic.
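
(For concreteness, here is a minimal sketch of the fair-queuing idea described above, assuming a single-process asyncio front end. The round-robin-over-per-IP-queues realisation and the name PerIPFairQueue are illustrative choices, not details from the comment.)

    import asyncio
    from collections import defaultdict, deque

    class PerIPFairQueue:
        """One FIFO per source IP, served round-robin, so each address mostly
        contends with itself: a client flooding the server only lengthens its
        own backlog instead of everyone else's."""

        def __init__(self):
            self._queues = defaultdict(deque)   # ip -> its pending requests
            self._rotation = deque()            # IPs that currently have work
            self._work = asyncio.Event()

        def enqueue(self, ip, request):
            if not self._queues[ip]:            # first pending item for this IP
                self._rotation.append(ip)
            self._queues[ip].append(request)
            self._work.set()

        async def dequeue(self):
            while not self._rotation:
                self._work.clear()
                await self._work.wait()
            ip = self._rotation.popleft()       # next address in the rotation
            request = self._queues[ip].popleft()
            if self._queues[ip]:                # still backlogged: back of the line
                self._rotation.append(ip)
            else:
                del self._queues[ip]
            return ip, request

A worker pool would loop on dequeue() and service one request per free slot; a heavy client's own backlog grows, while an address with a single pending request gets served on the next pass.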

Exactly that. It's an arms race between companies that offer a large number of residential IPs as proxies and companies that run unauthenticated web services trying not to die from denial of service.

https://brightdata.com/

Huh, that sounds very reasonable, and it's the first time I've heard it mentioned. Why isn't this more widespread?

  • Complex, stateful.

    I'm not even sure what that would look like for a huge service like GitHub. Where do you hold the many thousands of concurrent HTTP connections and their pending request queues in a way that lets you make scheduling decisions across them, while still making more operational sense than a simple rate limit?

    A lot of things would be easy if it were viable to have one big all-knowing giga load balancer.

    I remember Rap Genius wrote a blog post whining that Heroku did random routing to their dynos instead of routing to the dyno with the shortest request queue, as if Heroku could just build an all-knowing, infiniscaling giga load balancer that knows everything about the system.

    • A giga load balancer is no less viable than a giga Redis cache or a giga database. Rate limiting is inherently stateful: you can't rate limit a request without knowledge of prior requests, and that knowledge has to be stored somewhere (the token-bucket sketch after this subthread shows the kind of state involved). You can shift the state around, but you can't eliminate it.

      Sure, some solutions tend to be more efficient than others, but the differences typically boil down to implementation details rather than fundamental limitations of the system design.

    • > Where do you hold the many thousands of concurrent HTTP connections and their pending request queues in a way that lets you make scheduling decisions across them, while still making more operational sense than a simple rate limit?

      Holding open an idle HTTP connection is cheap today. That's the use case for "async"; the asyncio sketch after this subthread shows the pattern. Servicing a GitHub fetch is much more expensive.

  • Because it doesn't help against DDoS attacks with bogus request sources.

    It's a good mitigation when you have legit requests and some requestors create far more load than others. If GitHub used fair queuing for authenticated requests, heavy users would see slower responses, but single requests would be serviced quickly. That tends to discourage overdoing it.

    Still, if "git clone" stops working, we're going to need a GitHub alternative.
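
(To make the point above about rate limiting being inherently stateful concrete, here is a minimal token-bucket sketch. The per-key dict is purely illustrative; in a real deployment that state might be sharded or pushed into a shared store such as Redis, but it still has to live somewhere.)

    import time

    class TokenBucket:
        """The state in question: rate limiting a key means remembering
        something about its past requests. Here that memory is a per-key
        (tokens, last_seen) pair in a local dict; moving it elsewhere
        relocates the state rather than eliminating it."""

        def __init__(self, rate_per_sec: float, burst: float):
            self.rate = rate_per_sec    # tokens replenished per second
            self.burst = burst          # maximum bucket size
            self._state = {}            # key (e.g. client IP) -> (tokens, last_seen)

        def allow(self, key: str) -> bool:
            now = time.monotonic()
            tokens, last = self._state.get(key, (self.burst, now))
            tokens = min(self.burst, tokens + (now - last) * self.rate)
            if tokens < 1.0:
                self._state[key] = (tokens, now)
                return False            # over the limit: reject or queue
            self._state[key] = (tokens - 1.0, now)
            return True

    # e.g. limiter = TokenBucket(rate_per_sec=5, burst=10); limiter.allow("203.0.113.7")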
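
(And a sketch of the point that holding an idle connection is cheap while servicing the fetch is expensive: an asyncio server that parks arbitrarily many clients while a small semaphore gates the costly backend call. The 32-worker budget and expensive_fetch are made up for illustration.)

    import asyncio

    async def handle_client(reader, writer, workers):
        # A parked connection is just a suspended coroutine plus a socket, so
        # holding tens of thousands of them open costs megabytes, not threads.
        request = await reader.readline()
        async with workers:                  # only the expensive part is gated
            response = await expensive_fetch(request)
        writer.write(response)
        await writer.drain()
        writer.close()
        await writer.wait_closed()

    async def expensive_fetch(request: bytes) -> bytes:
        await asyncio.sleep(0.5)             # stand-in for the real, costly work
        return b"HTTP/1.1 200 OK\r\nContent-Length: 2\r\n\r\nok"

    async def main():
        workers = asyncio.Semaphore(32)      # hypothetical concurrency budget
        server = await asyncio.start_server(
            lambda r, w: handle_client(r, w, workers), "0.0.0.0", 8080)
        async with server:
            await server.serve_forever()

    if __name__ == "__main__":
        asyncio.run(main())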

Yes, LLM-era scrapers frequently make use of large numbers of IP addresses from all over the place. Some of them seem to be botnets, but based on IP subnet ownership the traffic also comes pretty frequently from cloud companies, many of them outside the US. In addition to fanning out to different IPs, many of the scrapers appear to use User-Agent strings that are randomised, or perhaps in some cases generated by the slop factory itself. It's pretty fucking bleak out there, to be honest.

  • Sounds like a violation of the Computer Fraud and Abuse Act. If a big company training an LLM is doing that, it should be possible to find them and have them prosecuted.