Comment by creato

1 month ago

The point is, robots.txt was definitely a thing that people expected to be respected before and during google's early existence. This Kagi claim seems to be at least partially false:

> Google built its index by crawling the open web before robots.txt was a widespread norm, often over publishers’ objections.

3 comments

hattmall 1 month ago

Perhaps it wasn't a widespread norm, though. But I don't really see why that matters as much. Is the issue that sites with robots.txt today only allow Googlebot and not other search engines? Or is Google somehow benefiting from having two-decade-old content that is now blocked by robots.txt and that the website operators don't want indexed?
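
To make the first scenario concrete, here is a minimal sketch of a robots.txt that admits Googlebot and turns everyone else away, checked with Python's standard `urllib.robotparser`. The file contents, the `ExampleBot` name, the `example.com` URL, and the `can_crawl` helper are all assumptions for illustration, not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: Googlebot may crawl everything, all other bots nothing.
GOOGLEBOT_ONLY = """\
User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /
"""


def can_crawl(user_agent: str, url: str, robots_txt: str = GOOGLEBOT_ONLY) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # "ExampleBot" stands in for any crawler that is not Googlebot.
    for bot in ("Googlebot", "ExampleBot"):
        print(bot, can_crawl(bot, "https://example.com/page"))
```

A compliant crawler other than Googlebot would skip such a site entirely, which is the asymmetry the question is about.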

ricardo81 1 month ago

Agree. It was not standard in the late 90s or early 00s. Most sites were custom built and relied on the _webmaster_ knowing and understanding how robots.txt worked. I'd heard plenty of examples where people had inadvertently blocked crawlers from their site because they didn't know the syntax properly. CMSs (e.g. WordPress) probably helped drive the widespread adoption.
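
As a sketch of how easy it is to block crawlers by accident, consider one commonly cited slip: writing `Disallow: /` when an empty `Disallow:` was intended. Which mistakes people actually made is not stated above, so this particular one is an assumed example, as are the `is_allowed` helper, the `ExampleBot` name, and the `example.com` URL. The check uses Python's standard `urllib.robotparser`.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical file that blocks every crawler from the whole site.
BLOCKS_EVERYTHING = """\
User-agent: *
Disallow: /
"""

# Hypothetical file one character shorter: an empty Disallow blocks nothing.
BLOCKS_NOTHING = """\
User-agent: *
Disallow:
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # Parse the text directly instead of fetching it over the network.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/index.html"
    print("Disallow: /  ->", is_allowed(BLOCKS_EVERYTHING, "ExampleBot", url))
    print("Disallow:    ->", is_allowed(BLOCKS_NOTHING, "ExampleBot", url))
```

The two files differ by a single slash, yet one hides the whole site from compliant crawlers and the other hides nothing.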

embedding-shape 1 month ago

> robots.txt was definitely a thing that people expected to be respected before and during google's early existence

As someone who was a web developer at that time, I can say robots.txt wasn't a "widespread norm" by a large margin, even if some individuals "expected it to be respected". Google's use of robots.txt, combined with Google's own growth, made robots.txt a "widespread norm", but I don't think many people who were active in the web-dev space at that time would agree that it was a widespread norm before Google.