Comment by polishdude20
2 months ago
"Google follows industry-standard crawling protocols, and honors websites’ directives over crawling of their content."
Is that true with how they trained Gemini? Doesn't everyone with a foundational model scrape the web relentlessly without regard for robots.txt?
No, but AFAIK they pulled some shenanigans with "bundling" Gemini scraping and search engine scraping.
Almost everybody wants to appear in search, so disallowing the entirety of Google is far more costly than, e.g., disallowing OpenAI, which even differentiates between content scraped for training and content accessed to respond to a user request.
While there isn't a way to differentiate between scraping for training data and content accessed in response to a user request, I think you can block Google-Extended to block training access.
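For illustration, a minimal robots.txt sketch along these lines (assuming the published crawler tokens Googlebot, Google-Extended, GPTBot, and ChatGPT-User behave as documented) would keep a site in search while opting out of AI training:

```
# Allow normal search-engine indexing
User-agent: Googlebot
Allow: /

# Opt out of Google's AI training (Gemini etc.)
User-agent: Google-Extended
Disallow: /

# OpenAI: block the training crawler...
User-agent: GPTBot
Disallow: /

# ...but allow fetches made on behalf of a user request
User-agent: ChatGPT-User
Allow: /
```

Note that Google-Extended is a robots.txt control only; blocking it doesn't affect how Googlebot crawls or ranks the site.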
They're in a unique position where many people allow googlebot but try to block most other bots
Allow for the purpose of indexing, not training models.
Like if you give a friend a key to your house so they can check on your plants when you're out of town but they throw a rager and trash the place.
> throw a rager
That was not a phrase I expected to read on Hacker News! Haven't heard it since I was about 13. I always assumed it was a Scottish phrase.
Google honors robots.txt. They're not "everyone with a foundational model" though.
Scraping for search engines is different. They need to scrape the data to build the index, otherwise the search wouldn’t work.
Being indexed in a search engine benefits the website owner, as it drives traffic towards the website.
Being used as AI training data provides negative value for a website owner, as it takes traffic away.
It's the difference between a movie review and a ripped torrent.
So you mean to say it is different because it needs to be different to exist?
Following that same logic, may I inform you that your income going forward is different: it has to be directed to my bank account, because the account needs the money! :-)