Comment by Imustaskforhelp
11 days ago
I am not sure how or why you are throwing shade at Cloudflare. Cloudflare is one of those companies that, in my opinion, is genuinely in some sense "trying" to do a lot of things in consumers' favour, and fwiw they usually aren't charging extra for it.
6-7 years ago the scraping landscape was simple: it was mostly just search engines, and there were very few, well-established ones (DDG and Startpage just proxy results tbh; the ones I think of as actually scraping are Google, Bing, and Brave).
And those did genuinely respect robots.txt and such because, well, ignoring it had more cons than pros. The cons: reputational damage and just a bad image in the media tbh. The pros: what, "better content"? So what. These search engines run on a free, ad-supported model; they want you to use them so they can get more data FROM YOU to sell to advertisers (well, IDK about Brave tbh, they may be private).
And besides, the search results were "good enough" back then (some may argue better pre-AI), so I genuinely can't think of a single good reason anyone had to be a malicious scraper.
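For context, the robots.txt contract being honored here is just a plain-text file at the site root that crawlers are asked, not forced, to obey. A typical one looks something like this (the user-agent tokens are real crawler names, but the paths are made up for illustration):

```
User-agent: GPTBot
Disallow: /

User-agent: Googlebot
Disallow: /private/

User-agent: *
Crawl-delay: 10
```

The whole point of the economics argument is that nothing technically enforces this file; a well-behaved search engine honors it because ignoring it isn't worth the reputational cost.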
Now, why did I just ramble about economics and reputation? Because search engines were a place you would go that would then lead you to the place you actually wanted.
AI has become the place you go that answers directly, and that has shifted the economics. There is now a very huge incentive to not follow good scraping practices and to extract that sweet data.
And like I said earlier, publishers were happy with search engines because they would lead people to their websites, where the visit counts as a view, or users pay, or any number of other monetization strategies kick in.
Now, though, AI has become the final destination, and the websites that produce the content are suffering because they get basically nothing in return; AI just scrapes it. So I guess we now need a better way to deal with the evil scrapers.
Now, there are ways to stop scrapers altogether by making them do a proof of work; some websites do that, and Cloudflare supports it too. But I guess not everyone is happy with such stuff either, because as someone who uses LibreWolf and other non-major browsers, this PoW (especially Cloudflare's) definitely sucks. But sure, we can do proof of work; there's Anubis, which is great at it.
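(For anyone curious, the proof-of-work idea boils down to: the server hands the client a challenge, and the client must burn CPU finding a nonce whose hash meets some target before it gets the page. Here's a minimal sketch; the leading-zero-hex-digits scheme and parameters are my own simplification, not the exact algorithm Anubis or Cloudflare ships.)

```python
import hashlib
import itertools

def solve_pow(challenge: str, difficulty: int) -> int:
    """Client side: find a nonce so that sha256(challenge + nonce)
    starts with `difficulty` zero hex digits. Expected cost grows
    exponentially with difficulty (~16**difficulty hashes)."""
    target = "0" * difficulty
    for nonce in itertools.count():
        digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
        if digest.startswith(target):
            return nonce

def verify_pow(challenge: str, nonce: int, difficulty: int) -> bool:
    """Server side: checking a solution costs a single hash, so the
    asymmetry punishes the requester, not the site."""
    digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
    return digest.startswith("0" * difficulty)

nonce = solve_pow("example-challenge", 4)
print(verify_pow("example-challenge", nonce, 4))  # True
```

That asymmetry (expensive to solve, cheap to verify) is what makes mass scraping costly while a single human pageview barely notices.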
But is that the only option? Why don't we actively hurt the scraper, instead of letting it take literally less than a second to realize "yep, this requires PoW, I'm out of here"? What if we could waste the scraper's time?
Well, that's exactly what Cloudflare did with the thing where, if they detect bots, they serve them AI-generated jargon about science or smth, with more and more links for them to scour, wasting their time in essence.
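(The tarpit mechanic is simple enough to sketch: serve bots a deterministic, endless tree of junk pages where every link leads to another generated page. This toy version is my own invention to illustrate the idea, not Cloudflare's actual implementation.)

```python
import hashlib
import random

def maze_page(path: str, links_per_page: int = 5) -> str:
    """Generate a junk page for `path` whose links all lead to more
    generated pages -- a crawler following links never runs out of
    URLs, while real users never see these routes."""
    # Seed the RNG from the path so the same URL always returns the
    # same page (looks like stable, real content to a crawler).
    rng = random.Random(hashlib.sha256(path.encode()).digest())
    words = ["flux", "quantum", "lattice", "entropy", "manifold", "tensor"]
    body = " ".join(rng.choice(words) for _ in range(50))
    links = "".join(
        f'<a href="{path}/{rng.getrandbits(32):08x}">more</a>\n'
        for _ in range(links_per_page)
    )
    return f"<html><body><p>{body}</p>\n{links}</body></html>"

page = maze_page("/maze/start")
print(page.count("<a href="))  # 5
```

Each page costs the server almost nothing to generate but costs the scraper a fetch, a parse, and five more URLs on its queue.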
I think that's pretty cool. Using AI to defeat AI: it's poetic, and one of the best HN posts I ever saw.
Now, what this does, and what started our whole conversation, is move the incentive lever toward the creator instead of the scrapers, and I think a mechanism where scrapers actively pay the content producer for genuine content is still a move in that direction.
Honestly, we don't fully understand the incentive problem yet, and I think Cloudflare is trying a lot of things to see what sticks best, so I wouldn't necessarily call it unimaginative; that's throwing shade where there is none.
Also, regarding your point that "they should collaborate on shared infrastructure": honestly, I have heard a story about Wikipedia where some scrapers are so aggressive that they still scrape the live site, even though Wikipedia actively provides that data for download, just because scraping is more convenient. There is also Common Crawl, if I remember right, which has terabytes of scraped data.
And we can't ignore that all of these AI labs are actively throwing shade at each other to show they are the SOTA, and benchmark-maxxing is a common method too. I don't think they would be happy working together (though there is MCP, which has become a de facto standard of sorts used by lots of AI models, so it would definitely be interesting if they started cooperating like that, and I want to believe in that future too tbh).
Now, for me, I think using Anubis or Cloudflare's DDoS option is still enough, but I imagine this could be useful for news publications like the NY Times or the Guardian, though they may have their own contracts, as you say. Honestly, I'm not sure; like I said, it's better to see what sticks and what doesn't.