Comment by polishdude20
2 months ago
"Google follows industry-standard crawling protocols, and honors websites’ directives over crawling of their content."
Is that true with how they trained Gemini? Doesn't everyone with a foundational model scrape the web relentlessly without regard for robots.txt?
No, but AFAIK they pulled some shenanigans with "bundling" Gemini scraping and search engine scraping.
Almost everybody wants to appear in search, so disallowing the entirety of Google is far more costly than, e.g., disallowing OpenAI, which even differentiates between content scraped for training and content accessed to respond to a user request.
While there isn't a way to differentiate between scraping for training data and content accessed in response to a user request, I think you can block Google-Extended to block training access.
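For illustration, a minimal robots.txt sketch along these lines (assuming the published crawler tokens Googlebot, Google-Extended, GPTBot, and ChatGPT-User behave as documented) would keep a site in search while opting out of AI training:

```
# Allow normal search-engine indexing
User-agent: Googlebot
Allow: /

# Opt out of Google's AI training (Gemini etc.)
User-agent: Google-Extended
Disallow: /

# OpenAI: block the training crawler...
User-agent: GPTBot
Disallow: /

# ...but allow fetches made on behalf of a user request
User-agent: ChatGPT-User
Allow: /
```

Note that Google-Extended is a robots.txt control only; blocking it doesn't affect how Googlebot crawls or ranks the site.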
They're in a unique position where many people allow googlebot but try to block most other bots
Allow for the purpose of indexing, not training models.
Like if you give a friend a key to your house so they can check on your plants when you're out of town but they throw a rager and trash the place.
> throw a rager
That was not a phrase I expected to read on Hacker News! Haven't heard it since I was about 13. I always assumed it was a Scottish phrase.
Google honors robots.txt. They're not "everyone with a foundational model" though.
Scraping for search engines is different. They need to scrape the data to build the index, otherwise the search wouldn’t work.
Being indexed in a search engine benefits the website owner, as it drives traffic towards the website.
Being used as AI training data provides negative value for a website owner, as it takes traffic away.
It's the difference between a movie review and a ripped torrent.
So you mean to say it is different because it needs to be different to exist?
Following that same logic, may I inform you that your income going forward is different: it has to be directed to my bank account, because the account needs the money! :-)