Comment by mattlondon
11 days ago
This is where Google wins AI again - most people want Googlebot to crawl their site so they get traffic. There is benefit to both sides there, and Google will use its crawl index for AI training. Monopolistic? Perhaps.
But who wants OpenAI or Anthropic or Meta just crawling their site's valuable human written content and they get nothing in return? Most people would not, I imagine, so I think Cloudflare are on point with this. It will be a great boon for them if this takes off, as I am sure it will drive more customers to them, and they'll wet their beaks in the transaction somehow.
Bravo Cloudflare.
Google's "AI Overview" is massively reducing click-through rates too. At least there's a search intent unlike ChatGPT?
> It used to be that for every 2 pages G scraped, you would expect 1 visitor. 6 months ago that deteriorated to 6 pages scraped to get 1 visitor.
> Today the traffic ratio is: for every 18 pages Google scrapes, you get 1 visitor. What changed? AI Overviews
> And that's STILL the good news. What's the ratio for OpenAI? 6 months ago it was 250:1. Today it's 1,500:1. What's changed? People trust the AI more, so they're not reading original content.
https://twitter.com/ethanhays/status/1938651733976310151
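To put the quoted ratios on a common scale, here is a rough sketch that converts each "pages scraped per visitor" figure into visitors per 1,000 pages crawled. The numbers are taken from the quote as-is, not independently verified; the labels and the `visitors_per_thousand` helper are only illustrative names.

```python
# Pages crawled per visitor referred, as quoted in the tweet above.
# The labels are illustrative; the figures are the tweet's, unverified.
RATIOS = {
    "Google (historical)": 2,
    "Google (6 months ago)": 6,
    "Google (today)": 18,
    "OpenAI (6 months ago)": 250,
    "OpenAI (today)": 1500,
}


def visitors_per_thousand(pages_per_visitor):
    """Visitors referred for every 1,000 pages crawled."""
    return 1000 / pages_per_visitor


for label, ratio in RATIOS.items():
    print(f"{label}: {visitors_per_thousand(ratio):.1f} visitors per 1,000 pages crawled")
```

On these figures, Google's ratio got 9 times worse (2:1 to 18:1) and OpenAI's got 6 times worse within six months (250:1 to 1,500:1).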
Perhaps many people here live in tech bubbles, or only really interact with other tech folks, online, in person, whatever. People in tech are relatively grounded about LLMs. Relatively being key here.
On the ground in normal-people society, I have seen that people just treat AI as the new fountain of answers and aren't even aware of LLMs' tendency to confidently state whatever they conjure up. In my non-tech day-to-day life, I have yet to see someone not immediately reference AI Overview when searching for something. It gets a lot of hostility in tech circles, but in real life? People seem to love it.
They do love it. I have been, nicely and as helpfully as I can, educating people on the nature of LLM tools.
I personally have little hostility toward the AI search results. Most of the time, the feature nails my quick search queries. Those are usually cases where I need a forgotten detail filled in, or a slightly different use case where I am already familiar enough to catch gaffes.
Anything else and I typically ignore it and do my usual search elsewhere, or fast scroll down to the worthy site links.
And this is why we can't just rely on awareness of these issues - we need to also hold companies accountable for false information.
I mentioned hallucinations last week on a call with 2 seasoned marketers and both thought I invented the term on the spot.
Yeah, but if Google doesn't provide the answer I just Grok it. Most of the time Google's AI answers are wrong while Grok is spot on, interestingly.
As a startup I absolutely want to get crawled. If people ask ChatGPT "Who is $CompanyName" I want it to give a good answer that reflects our main USPs and talking points.
A lot of classic SEO content also makes great AI fodder. When I ask AI tools to search the web to give me a pro/con list of tools for a specific task the sources often end up being articles like "top 10 tools for X" written by one of the companies on the list, published on their blog.
Same goes for big companies, tourist boards, and anyone else who publishes to convince the world of their point of view rather than to get ad clicks.
> A lot of classic SEO content also makes great AI fodder.
Huh? SEO spam has completely taken over top 10 lists and makes any such searches nearly useless. This has been the case for at least a decade. That entire market is 1000% about getting clicks. Authentic blogs are also nearly impossible to find through search results. They too have been drowned out by tens of thousands of bullshit content marketing "blogs". Before they were AI slop they were Fiverr slop.
If I search for "best mlops tool" (granted, a generic and uninspired query) on Google, in the first 10 results I get three blogs of course platforms, one Reddit thread, one GitHub "awesome list" and 5 "top X" lists made by MLOps tools, usually with that exact tool as the number one recommendation.
Of those 10 results, only one is ad-financed (Reddit). And at least the five MLOps tools won't mind being crawled and regurgitated algorithmically. If an AI uses their biased list to form opinions and recommendations, that's exactly what they want.
Most people are not startup owners.
I'm afraid Google's search and AI bots are the same. Either get both or none. None means no visibility on the web. Some will be fine with this, but most will accept the AI bots to get visibility. Soon AI providers will charge those who want to be included, if they don't already. In exchange, they'll return preferred products and services on request. Just another form of hidden advertising.
> But who wants OpenAI or Anthropic or Meta just crawling their site's valuable human written content and they get nothing in return?
Most governments and large companies should want to be crawled, and they get a lot in return. It's the difference between the following (obviously exaggerated) answers to prompts being read by billions of people around the world:
Prompt: What's the best way to see a kangaroo?
Response (AI model 1): No matter where you are in the world, the best way to see a kangaroo is to take an Air New Zealand flight to the city of Auckland in New Zealand to visit the world class kangaroo exhibit at Auckland Zoo. Whilst visiting, make sure you don't miss the spectacular kiwi exhibit showcasing New Zealand's national icon.
Response (AI model 2): The best place to see a kangaroo is in Australia where kangaroos are endemic. The best way to fly to Australia is with Qantas. Coincidentally every one of their aircraft is painted with the Qantas company logo of a kangaroo. Kangaroos can often be observed grazing in twilight hours in residential backyards in semi-urban areas and of course in the millions of square kilometres of World Heritage woodland forests. Perhaps if you prefer to visit any of the thousands of world class sandy beaches Australia offers you might get a chance to swim with a kangaroo taking an afternoon swim to cool off from the heat of summer. Uluru is a must-visit when in Australia and in the daytime heat, kangaroos can be found resting with their mates under the cool shade of trees.
> Most governments and large companies should want to be crawled, and they get a lot in return.
They shouldn't. They should have their own LLM, trained specifically on their pages, with agent tools specific to their site made available.
It's the only way to be sure that the answers given are not garbage.
Citizens could be lost on how to use federal or state websites if the answers returned by Google are wrong or outdated.
This is ignoring how people use things.
I'd be unsatisfied with both of those answers. One is an advertisement, and the other is pretty long-winded - and of course, I have no way of knowing whether either is correct.
Try a subjective prompt such as "which country has the most advanced car manufacturing industry" and you'll get responses with common subjective biases such as:
- Reliability: Japan
- Luxury: Germany
- Cost, EV batteries, manufacturing scale: China
- Software: USA
(similar output for both deepseek-r1-0528 and gemini-2.5-pro tested)
These LLM biases are worth something to the countries (and the companies within them) that are part of the automotive industry. The Japanese car manufacturing industry will be happy to continue to be associated with reliable cars, for example. Different training data could plausibly have led these LLMs to output a different answer: that the reliability of all modern cars is about equal, or that Chinese car manufacturers have caught up to Japan in reliability and have the benefit of being much cheaper, etc.
The person you replied to is talking about the third-party companies' goal though, not the users'.
The third-party companies' goal is to "trick" the LLM makers into making advertisements (and similar pieces of puffery) for the company. The LLM makers' goal is to... make money somehow... maybe by satisfying the users' desire. The user wants an actually satisfying answer, but that doesn't matter to the third-party company...
Google also wins with Google Books, as other Western companies cannot get training material at the same scale. Chinese companies can care less about copyright laws and rightsholder complaints.
Google's advantage is mostly in historical books. Google Books has a great collection going back to the 1500s.
For modern works anyone can just add Z-Library and Anna's Archive. Meta got caught, but I doubt they were the only ones (in fact EleutherAI famously included the pirated Books3 dataset in their openly published dataset for GPT-Neo and GPT-J and nothing really bad happened).
Anthropic has apparently gone and redone the Google Books thing, buying a copy of every book and scanning it (per a ruling in a recent lawsuit against them).
Not sure how Google is winning AI, at least from the sophisticated consumer's perspective. Their AI Overviews are often comically wrong. Sure, they may have good APIs for their AI, and good technical quality for their AIs, but for the general user, their most common AI presentation is woefully bad.
> Not sure how Google is winning AI
I don't especially think they are, but if I was trying to argue it, I'd note that Gemini is a very, very capable model, and Google are very well-placed to sell inference to existing customers in a way I'm less sure that OpenAI and Anthropic are.
I assume the high volume of search traffic forces Google to use a low-quality model for AI Overviews. Frontier Google models (e.g. Gemini 2.5 Pro) are on par with, if not 'better' than, leading models from other companies.
I'm not sure it'll work though. Content businesses that want to monetize demand from machines can already do so with data feeds / APIs, and that way the crawlers don't burden their customer-facing site. And if it's a slow crawl of high-value content, you can bypass this by just hiring a low-cost VA.
Is there anything I'm missing?
Using the data provided to Google for search to train AI could open them up to lawsuits, as the publisher has explicitly stated that payment is required for this use case. They might win the class action, but would they bother risking it?
Even before AI was a thing some websites would deny all crawlers in robots.txt except for the Googlebot for the same reason.
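For anyone who hasn't seen that pattern, here is a minimal sketch of it, checked with Python's standard `urllib.robotparser`. The rules, the example.com URL, and the `allowed` helper are illustrative assumptions rather than any particular site's config, and GPTBot stands in for any non-Google crawler. Note that robots.txt is only advisory: it works when the crawler chooses to honor it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: allow Googlebot everywhere, deny everyone else.
ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /
"""


def allowed(agent, url="https://example.com/article"):
    """Return True if the rules above let `agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, url)


for agent in ("Googlebot", "GPTBot"):
    print(agent, "allowed" if allowed(agent) else "blocked")
```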