Comment by immibis
3 days ago
My issue with Gitea (which Forgejo is a fork of) was that crawlers would hit the "download repository as zip" link over and over. Each access creates a new zip file on disk which is never cleaned up. I disabled that (by setting the temporary zip directory to read-only, so the feature won't work) and haven't had a problem since then.
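(Aside, worth verifying against the docs for your version: if I remember right, newer Gitea and Forgejo can also turn the feature off cleanly in app.ini, instead of relying on the read-only-directory trick:)

```ini
; Untested sketch for app.ini - double-check the config cheat sheet
; for your version before relying on this option name.
[repository]
DISABLE_DOWNLOAD_SOURCE_ARCHIVES = true
```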
It's easy to assume "I received a lot of requests, therefore the problem is too many requests", but you can successfully handle many requests.
This is a clever way of doing a minimally invasive botwall, though. I like it.
> Each access creates a new zip file on disk which is never cleaned up.
That sounds like a bug.
I think that was fixed in Forgejo a long time ago.
It used to be like that, but they changed it to a POST request a while ago.
> you can successfully handle many requests.
There is a point where your web server becomes fast enough that the scraping problem becomes irrelevant. Especially at the scale of a self-hosted forge with a constrained audience. I find this to be a much easier path.
I wish we could find a way to not conflate intellectual property concerns with technological performance concerns. That conflation seems to be much of what keeps the AI scraping drama going. We can definitely make the self-hosted git forge so fast that anything short of ~a federal crime would have no meaningful effect.
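Concretely, the usual trick is a short-TTL cache in front of the forge so repeated anonymous hits never reach it. A rough nginx sketch, with the hostname, port, and TTLs as placeholder assumptions rather than a tested config:

```nginx
# Untested sketch: a 60-second "microcache" in front of a self-hosted
# forge, so repeated anonymous scraper hits are served from cache
# instead of regenerating every page.
proxy_cache_path /var/cache/nginx/forge levels=1:2
                 keys_zone=forge:10m max_size=1g inactive=10m;

server {
    listen 80;
    server_name git.example.com;          # placeholder hostname

    location / {
        proxy_pass http://127.0.0.1:3000; # Gitea/Forgejo default port
        proxy_cache forge;
        proxy_cache_valid 200 60s;        # short TTL, fresh enough for humans
        proxy_cache_use_stale updating;   # serve stale while revalidating
        # Skip the cache for anything carrying cookies, i.e. logged-in users.
        proxy_no_cache $http_cookie;
        proxy_cache_bypass $http_cookie;
    }
}
```

Even a one-minute TTL collapses thousands of identical crawler fetches into a single backend render.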
> There is a point where your web server becomes fast enough that the scraping problem becomes irrelevant.
It isn't just the volume of requests, but also bandwidth. There have been cases where scraping represents >80% of a forge's bandwidth usage. I wouldn't want that to happen to the one I host at home.
Sure, but how much bandwidth is that actually? Of course, if your normal traffic is pretty low, it's easy for bot traffic to multiply that by 5, but it doesn't mean it's actually a problem.
The market price for bandwidth in a central location (USA or Europe) is around $1-2 per TB, and less if you buy in bulk. I think it's somewhat cheaper in Europe than in the USA due to vastly stronger competition. Hetzner includes 20 TB of outgoing traffic with every European VPS plan, with €1/TB + VAT for overage. Most providers aren't quite so generous, but they're still not that bad. How much are you actually spending?
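To put rough numbers on it (my own illustrative figures, not measurements): even if scrapers added 10 TB/month of traffic, at $1-2/TB that's about $10-20/month, and it would still fit within Hetzner's included 20 TB.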
Maybe it is fast enough, but my objection is mostly to the gross inefficiency of the crawlers. Requesting downloads of whole repositories over and over wastes CPU cycles creating the archives, storage space retaining them, and bandwidth sending them over the wire. Add that to the gross power consumption of AI and its hogging of physical compute hardware, and it is easy to see “AI” as wasteful.