Comment by haiku2077
3 months ago
The big companies tend to respect robots.txt. The problem is other, unscrupulous actors use fake user agents and residential IPs and don't respect robots.txt or act reasonably.
Big companies have thrown robots.txt to the wind when it comes to their precious AI models.
Yeah, they have openly disregarded copyright law; it's not a puny robots.txt file that's gonna stop them.
robots.txt isn't just an on/off switch. You can set crawler rate limits in there that crawlers may choose to respect, and the big companies respect them, because it's in their interest to reduce their crawling costs and not send more requests than they need to.
However, these smaller companies are doing ridiculous things like scraping the same site many thousands of times a day, far more often than the sites' content actually changes.
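For anyone curious, here's a minimal sketch of how a well-behaved crawler can read those limits using Python's standard library. The domain and user agent are hypothetical, and Crawl-delay/Request-rate are advisory directives a crawler chooses to honor:

  # Minimal sketch of a polite crawler honoring robots.txt rate limits.
  # The domain, paths, and user agent below are hypothetical.
  import time
  from urllib.robotparser import RobotFileParser

  AGENT = "ExampleBot"
  rp = RobotFileParser()
  rp.set_url("https://example.com/robots.txt")
  rp.read()

  # Crawl-delay / Request-rate are advisory; respecting them keeps request
  # volume (and the crawler's own bandwidth cost) down.
  delay = rp.crawl_delay(AGENT)    # seconds between requests, or None
  rate = rp.request_rate(AGENT)    # RequestRate(requests, seconds), or None
  pause = delay or (rate.seconds / rate.requests if rate else 1.0)

  for path in ("/", "/recently-changed"):
      url = f"https://example.com{path}"
      if rp.can_fetch(AGENT, url):
          print(f"fetching {url}, then sleeping {pause}s")
          time.sleep(pause)

That's the whole cost of being polite: one robots.txt fetch and a sleep between requests.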