
Comment by JimDabell

11 days ago

This seems like it’s going about things in entirely the wrong way. What this does is say “okay, you still do all the work of crawling, you just pay more now”. There’s no attempt by Cloudflare to offer value for this extra cost.

Crawling the web is not a competitive advantage for any of these AI companies, nor challenger search engines. It’s a cost and a massive distraction. They should collaborate on shared infrastructure.

Instead of all the different companies hitting sites independently, there should be a single crawler they all contribute to. They set up their filters and everybody whose filters match a URL contributes proportionately. They set up their transformations (e.g. HTML to Markdown; text to embeddings), and everybody who shares a transformation contributes proportionately.
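
To make that concrete, here is a minimal sketch of how such cost sharing could work (hypothetical names, nothing any vendor actually ships): one fetch per URL, split among the subscribers whose filters matched it, with each shared transformation split among the subscribers that requested it.

    from collections import defaultdict

    # Hypothetical sketch: split the cost of one crawled URL among the
    # subscribers whose filters matched it, and split each transformation's
    # cost among the subscribers that asked for that transformation.
    def allocate_costs(url, fetch_cost, subscribers, transform_costs):
        """subscribers: {name: {"matches": fn(url) -> bool, "transforms": set}}"""
        bills = defaultdict(float)

        matched = [name for name, sub in subscribers.items() if sub["matches"](url)]
        for name in matched:
            bills[name] += fetch_cost / len(matched)   # one fetch, shared equally

        for transform, cost in transform_costs.items():
            users = [n for n in matched if transform in subscribers[n]["transforms"]]
            for name in users:
                bills[name] += cost / len(users)       # each transform shared by its users

        return dict(bills)

    # Example: two AI companies share one fetch; only one pays for embeddings.
    subs = {
        "company_a": {"matches": lambda u: "/blog/" in u, "transforms": {"markdown"}},
        "company_b": {"matches": lambda u: True, "transforms": {"markdown", "embeddings"}},
    }
    print(allocate_costs("https://example.com/blog/post", 1.0,
                         subs, {"markdown": 0.2, "embeddings": 0.5}))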

This, in turn, would reduce the load on websites massively. Instead of everybody hitting the sites, just one crawler would. And instead of hoping that all the different crawlers obey robots.txt correctly, this can be enforced at a technical and contractual level. The clients just don’t get the blocked content delivered to them – and if they want to get it anyway, the cost of that is to implement and maintain their own crawler instead of using the shared resources of everybody else – something that is a lot more unattractive than just proxying through residential IPs.

And if you want to add payments on, sure, I guess. But I don’t think that’s going to get many people paid at all. Who is going to set up automated payments for content that hasn’t been seen yet? You’ll just be paying for loads of junk pages generated automatically.

There’s a solution here that makes it easier and cheaper to crawl for the AI companies and search engines, while reducing load on the websites and making blocking more effective. But instead, Cloudflare just went “nah, just pay up”. It’s pretty unimaginative and not the least bit compelling.

I think you're looking at the wrong side of the market for the incentive structures here.

Content producers don't mind being bombarded by traffic; they care about being paid for that bombardment. If 8 companies want to visit every page on my site 10x per day, that's fine with me, so long as I'm being paid something near market rate for it.

The 8 companies are then incentivised to collaborate on a unified crawling scheme, because their costs are no longer being externalised to the content producer. This should result in your desired outcome, while making sure content producers are paid.

  • It depends on the content producer. I would argue the best resourced content producers (governments and large companies) are incentivised to give AI bots as much curated content as possible that is favourable to their branding and objectives. Even if it's just "soft influence" such as the French government feeding AI bots an overwhelming number of articles about how the Eiffel Tower is the most spectacular tourist attraction in all of Europe to visit and should be on everyone's must-visit list. Or for examples of more nefarious objectives--for the fossil fuel industry, feeding AI bots plenty of content about how nuclear is the future and renewables don't work when the sun isn't shining. Or for companies selling consumer goods, feeding AI bots with made-up consumer reviews about how the competitor products are inferior and more expensive to operate over their lifespan.

    The BBC recently published its own research on its influence around the world compared to other international media organisations (Al Jazeera, CGTN, CNN, RT, Sky News).[1] If you ignore all the numbers (it doesn't matter if they're accurate or not), the report makes fairly clear some of the BBC's motivation for global reach, which should result in the BBC _wanting_ to make its content available to as many AI bots as possible.

    Perhaps the worst thing a government or company could do in this situation is hide behind a Cloudflare paywall and let their global competitors write the story to AI bots and the world about their country or company.

    I'm mostly surprised at how _little_ effort governments and companies are currently expending to collate all the favourable information they can get their hands on and make it accessible for AI training. Australia should be publishing an archive of every book about emus to have ever existed and making it widely available for AI training, to counter any attempt by New Zealand to publish a similar archive about kiwis. KFC and McDonald's should be publishing data on how many beautiful organic green pastures were lovingly tended by local farmers dedicated to producing the freshest and most delicious lettuce leaves that go into each burger. And so on.

    [1] https://www.bbc.com/mediacentre/2025/new-research-reveals-bb...

    • > It depends on the content producer. I would argue the best resourced content producers (governments and large companies) are incentivised to give AI bots as much curated content as possible that is favourable to their branding and objectives.

      Yeah, if the content being processed is NOT the product being sold by the creator.

      > [..] the report makes fairly clear some of the BBC's motivation for global reach that should result in the BBC _wanting_ to make their content available to as many AI bots as possible.

      What kind of monetization model would this be for BBC?

      "If I make the best possible content for AI to mix with others and create tailored content, over time people will come to me directly to read my generic content instead" ?

      It reminds me of "IE6, the number one browser to download other browsers", but worse


Well, there's Common Crawl, which is supposed to be that. Though ironically it's been under so much load from AI startups greedily gobbling down its data that it was basically inaccessible the last time I tried to use it. Turtles all the way down, it seems.

There's probably a gap in the market for something like this. Crawling is a bit of a hassle and being able to outsource it would help a lot of companies. Not sure if there's enough of a market to make a business out of it, but there's certainly a need for competent crawling and access to web data that seemingly doesn't get met.

  • Common Crawl is great, but it only updates monthly and doesn’t do transformations. It’s good for seeding a search engine index initially, but wouldn’t be suitable for ongoing use. But it’s generally the kind of thing I’m talking about, yeah.

> Crawling the web is not a competitive advantage for any of these AI companies,

?? It's their ability to provide more up-to-date information and ingest specific sources, so having up-to-date information is definitely a competitive advantage.

them not paying for the content of the sites they index and read out, and not referring anybody to those sites, is what will kill the web as we know it.

for a website owner there is zero value in having their content indexed by AI bots. Zilch.

  • > for a website owner there is zero value in having their content indexed by AI bots. Zilch.

    This very much depends on how the site owner makes money. If you’re a journalist or writer it’s an existential threat because not only does it deprive you of revenue but the companies are actively trying to make your job disappear. This is not true of other companies who sell things other than ads (e.g. Toyota and Microsoft would be tickled pink to have AI crawl them more if it meant that bots told their users that those products were better than Ford and Apple’s) and governments around the world would similarly love to have their political views presented favorably by ostensibly neutral AI services.

  • > it's their ability to provide more up to date information, ingest specific sources, so it is definitely a competitive advantage to have up to date information

    My point is that you wouldn’t expect any one of them to be so much better than the others at crawling that it would give them an advantage. It’s just overhead. They all have to do it, but it doesn’t put any of them ahead.

    > for a website owner there is zero value in having their content indexed by AI bots. Zilch.

    Earning money is not the only reason to have a website. Some people just want to distribute information.

    • > Earning money is not the only reason to have a website. Some people just want to distribute information.

      yes, I just want my hosting costs covered, and that is all. Otherwise you are paying for people to steal the info you "just want to share", the info that others make a profit on... that business model is absurd.

    • > My point is that you wouldn’t expect any one of them to be so much better than the others at crawling that it would give them an advantage

      And why not?

If the traffic pays anything at all, it's trivial to fund the infrastructure to handle the traffic. Historically, sites have scaled well under traffic load.

What's happened recently is either:

1. More and more sites simply block bots, scrapers, etc. Cloudflare is quite good at this, or

2. Sites which can't do this for access reasons, or which don't have a monetization model and so can't pay to do it, get barraged.

IF this actually pays, then it solves a lot of the problems above. It may not pay publishers what they would have earned pre-AI, but it should go a long way to covering at the very least the costs of a bot barrage, and then some on top of that.

But don't these new costs create a direct incentive to cooperate?

  • No. Companies don't care about saving money for its own sake. They care about, and would see value in, spending money where they thought their competitors were paying more for the same thing.

    It's similar to this fortune(6):

        It is not enough to succeed.  Others must fail.
          -- Gore Vidal

Although it doesn’t actually build the index, if AI crawlers really do want to save on crawling costs, couldn’t they share a common index? Seems like it’s up to them to build it.

Advantage is - you don't have to run your own Cloudflare solver, which may or may not be more expensive than pay-per-crawl pricing. This is it; this is just "pay to not deal with captchas".

I am not sure how or why you are throwing shade at Cloudflare. Cloudflare is one of those companies which, in my opinion, is genuinely in some sense "trying" to do a lot of things in favour of consumers, and fwiw they aren't usually charging extra for it.

6-7 years ago the scraping mechanic was simple and mostly used only by search engines, and there were very few yet well-established search engines (DDG and Startpage just proxy results tbh; the ones I think of as actually scraping are Google, Bing, and Brave).

And these did genuinely respect robots.txt and such because, well, there were more cons than pros. The cons are reputational harm and just a bad image in the media tbh. The pros are what? "Better content"? So what. These search engines operate on a loss-leader model: they want you to use them to get more data FROM YOU to sell to advertisers (well, IDK about Brave tbh, they may be private).

And besides, the search results were "good enough" (in fact, some may argue better pre-AI), so I genuinely can't think of a single good reason to have been a malicious scraper back then.

Now, why did I just ramble about economics and reputation? Well, because search engines were a place you would go to that would finally lead you to the place you actually wanted.

Now AI has become the place you go to that answers directly, and AI has shifted the economics in that manner. There is a huge incentive not to follow good scraping practices in order to extract that sweet data.

And, like I said earlier, publishers were happy with search engines because they would lead people to their websites, where they could count it as views, have users pay, or use any number of monetization strategies.

Now, though, AI has become the final destination, and websites which build content are suffering because they basically get nothing in return for the content AI scrapes. So, I guess, now we need a better way to deal with the evil scrapers.

Now, there are ways to stop scrapers altogether by having them do a proof of work; some websites do that, and Cloudflare supports it too. But I guess not everyone is happy with such stuff either, because as someone who uses LibreWolf and other non-major browsers, this PoW (especially Cloudflare's) definitely sucks. But sure, we can do proof of work; there's Anubis, which is great at it.
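
For what it's worth, the mechanism behind those challenges is simple. Here's a minimal illustrative sketch of a hash-based proof of work (my own toy version, not the actual Anubis or Cloudflare scheme): the server hands out a random challenge, and the client must find a nonce whose hash has a required number of leading zero bits before it gets the page. Verification costs the server a single hash, but solving costs the client real CPU time.

    import hashlib
    import os

    # Toy hash-based proof of work (illustrative only, not the actual Anubis
    # or Cloudflare scheme): find a nonce so that sha256(challenge + nonce)
    # starts with `difficulty` zero bits.
    def solve(challenge: bytes, difficulty: int) -> int:
        nonce = 0
        while True:
            digest = hashlib.sha256(challenge + nonce.to_bytes(8, "big")).digest()
            if int.from_bytes(digest, "big") >> (256 - difficulty) == 0:
                return nonce
            nonce += 1

    def verify(challenge: bytes, nonce: int, difficulty: int) -> bool:
        digest = hashlib.sha256(challenge + nonce.to_bytes(8, "big")).digest()
        return int.from_bytes(digest, "big") >> (256 - difficulty) == 0

    challenge = os.urandom(16)           # server picks a fresh challenge per visitor
    nonce = solve(challenge, 16)         # client burns CPU: ~2^16 hashes on average
    assert verify(challenge, nonce, 16)  # server-side check is a single hash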

But is that the only option? Why don't we hurt the scraper actively, instead of letting it take literally less than a second to realize "yes, this requires PoW, I'm out of here"? What if we could waste the scraper's time?

Well, that's exactly what Cloudflare did with the thing where, if they detect bots, they serve them AI-generated jargon about science or something, with more and more links for the bot to scour, to waste its time in essence.
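
Conceptually it's a tarpit. A toy sketch of the idea (hypothetical, not Cloudflare's actual implementation): every path deterministically produces a filler page whose links only lead deeper into the maze, so a crawler that ignores the rules can wander indefinitely without ever reaching real content.

    import hashlib

    # Hypothetical tarpit sketch (not Cloudflare's actual implementation):
    # every path yields a junk page that links to `width` more junk pages.
    def junk_page(path: str, width: int = 5) -> str:
        seed = hashlib.sha256(path.encode()).hexdigest()
        links = []
        for i in range(width):
            slug = seed[i * 8:(i + 1) * 8]   # stable per-path child names
            links.append(f'<a href="{path}/{slug}">more</a>')
        return ("<html><body><p>Generated filler for " + path + ".</p>\n"
                + "\n".join(links) + "\n</body></html>")

    print(junk_page("/trap"))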

I think that's pretty cool. Using AI to defeat AI. It is poetic and one of the best HN posts I ever saw.

Now, what this does, and where all of our conversation started, is to move the incentive lever towards the creator instead of the scrapers, and I think having a mechanism where scrapers actively pay the content producer for genuine content is still moving towards that.

Honestly, we don't really understand the incentive problems yet, and I think Cloudflare is trying a lot of things to see what sticks best, so I wouldn't necessarily say it's unimaginative; that's throwing shade where there is none.

Also, regarding your point that "they should collaborate on shared infrastructure": honestly, I have heard that some scrapers are so aggressive that they will still scrape Wikipedia even though it actively provides that data as downloads, just because scraping is more convenient. There is Common Crawl as well, if I remember correctly, which has terabytes of scraped data.

Also, we can't ignore that all of these AI companies are actively trying to throw shade at each other in order to show that they are the SOTA, and benchmark-maxxing is a common method too. And I don't think they would be happy working together (but there is MCP, which has become a de facto standard of sorts used by lots of AI models, so it would definitely be interesting if they started collaborating here too, and I want to believe in that future tbh).

Now, for me, I think using Anubis or Cloudflare's DDoS option is still enough, but I guess I'm imagining this could be used by news publications like the NY Times or the Guardian, though they may have their own contracts, as you say. Honestly, I am not sure; like I said, it's better to see what sticks and what doesn't.