Comment by renewiltord

4 days ago

The answer is right there: use authentication with cost per load, or an IP whitelist.
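A minimal sketch of the allowlist half of that suggestion (the network ranges are hypothetical; a real deployment would lean on the web server's or CDN's built-in access controls rather than application code):

```python
# Sketch: check a client IP against a small allowlist of networks.
# The networks below are placeholder documentation ranges, not real config.
import ipaddress

ALLOWED_NETWORKS = [
    ipaddress.ip_network(n)
    for n in ("192.0.2.0/24", "203.0.113.7/32")
]

def is_allowed(client_ip: str) -> bool:
    """Return True if client_ip falls inside any allowed network."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in ALLOWED_NETWORKS)
```

Anything outside the list simply gets refused before the server spends bandwidth on the response.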

GP is absolutely right. If your server is just going to send me traffic when I ask, I’m just going to ask and do what I want with the response.

Your server will respond fine if I click through with different IPs, and distributing requests across IPs is exactly the kind of menial task we made computers for.

Yeah, you’re right of course that no one has to like the “piracy” or “scraping” or whatever other name you’re giving to a completely normal request-response interaction between machines. They can complain. And I can say they’re silly for complaining. No one has to like anything. Heck you could hate ice cream.

As long as we all understand that this mentality is advocating for the end of an open internet. This is the tragedy of the commons in action: the removal of a common good because the few who would abuse it do. Just because something is programmed as a request-and-response interaction (though the use of blocklists, robots.txt, etc. should reveal that it's not a simple request-and-response interaction) does not mean we have to go all or nothing in ensuring it's not abused. We are still the operators of programs; it's still a social contract. If I block an IP and the same operator shows up with a different IP, it's like getting kicked out of a bar, coming back wearing a fake mustache, and acting confused about why that's wrong just because the bar doesn't keep a members list.

A personal website is like a community cupboard or an open-access water tap: people put it out there for others to enjoy, but when the reseller shows up and takes it all, it's no longer sustainable to provide the service.

Of course, it's all a spectrum: from monster corporations that build the loss into their projections and participate in wholesale data collection and selling, to open websites with no ads or limited ads as a sort of donation box; from a person using CSS/JS to block ads, or software to pirate for cheaper entertainment, to an AI scraper using swathes of IPs and servers to request non-stop all the data you're hosting for their own monetary gain. I have different opinions depending on where on the spectrum you are. But I do think piracy and ad blocking are on the same spectrum, and much closer to acceptable than mass AI scraping.

These responses were more about your comments about AI scraping than about the piracy vs ad blocking conversation, but in my opinion the gap between them and scraping is quite large.

  • Everyone thinks that their specific pet thing is the precious commons and the other guy is the abuser. But in any case, one should be able to follow the reasoning.

    If blocking ads is permissible because the server cannot control the client but can control itself, then so is “scraping”. Both services ask of their clients something they cannot enforce. And both find that the clients refuse.

    If you find the justification valid but decide that the conclusion is nonetheless absurd, you must find which step in the reasoning has a failure. The temptation is epicyclic: corporations vs humans or something of the sort; commercial vs non-commercial.

    But on its own there is no justification. It’s just that your principles lead you to absurdity but you refuse to revisit them because you like taking from others but you don’t like when others take from you. A fairly simple answer. Nothing for Occam’s Razor to divide.

    Particularly believable because the arrival of AI models trained on the world seems to have coincided with some kind of copyright maximalism that this forum has never seen before. Were the advocates of the RIAA simply not users yet?

    Or, more believably, is it just that taking feels good but being taken from feels bad?

    • I don't say this lightly, but I don't think you read my reply, or at least didn't understand its implications, especially since you don't actually argue against anything I say. You only make generic statements about justifications and logical conclusions, then close with assumptions about the RIAA.

      I stated that the open internet as a whole is the commons, not any specific person's pet project, and thus that AI scraping (or any bulk scraping done commonly and wholesale) makes it untenable for most people to keep participating. Twitter, for example, has gone your preferred way, mostly requiring authentication to access. There are many arguments on HN about whether that's a good move, or even a move that others could take and expect to succeed, and that's a huge platform. Just recently there have been front-page posts on HN about bringing back personal blogs, and also posts about how personal blogs not behind the great wall of Cloudflare saw TBs of "false" traffic from scrapers, which costs real money.

      I stated that I think piracy, ad blocking, and AI scraping are part of the same spectrum. But ad blocking carries a much lower burden of justification than AI scraping that is aggressive enough to need multiple IPs and to leave whitelisting as the only option to stop it, because of the sheer scale of the effect you are having.

      Much like how bandwidth is priced differently if you use less than 100 MB versus more than 1 TB, or how delivering a 10 lb package is far cheaper than delivering a 1000 lb one, or how at some level of effort times repetition it makes sense to automate something programmatically rather than doing it manually. There are of course situations where each makes sense, but the expectations can vary, and the results are not always linear in the inputs. All of this completely ignores the social aspect, which adds a whole new layer of complexity with its own logic.

      Scraping (or access without ads, e.g. ad blocking, or outside sharing of data, e.g. piracy) has always drawn complaints from those whose data people want to scrape, e.g. airlines, HBO, or Disney. It's just that now all data is being scraped absolutely non-stop, to the detriment of many and the gain of few, so everyone has a reason to complain. It also explains why people have differing opinions.

    • I think everyone is fine with scraping what is already public. But there are a lot of scrapers that just do denial of service. If I have 1 TB of bandwidth from my provider and only 10% of it is usually consumed, it’s really difficult not to blame someone who slurps it all up in an hour and prevents anyone else from accessing the content.
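To put a rough number on that (back-of-envelope, using the figures above): pulling a full 1 TB allowance in a single hour requires a sustained rate on the order of 2 Gbit/s, far beyond ordinary visitor traffic.

```python
# Back-of-envelope: sustained rate needed to pull 1 TB in one hour.
terabyte_bits = 1e12 * 8   # 1 TB expressed in bits (decimal TB)
seconds = 3600             # one hour
rate_gbps = terabyte_bits / seconds / 1e9
print(f"{rate_gbps:.2f} Gbit/s")  # ≈ 2.22 Gbit/s
```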