This is a user agent and I would be incredibly frustrated if they respected robots.txt. Robots.txt was designed to encourage recursive web crawlers to be respectful. It's specifically not meant to exclude agents that are acting on users' direct requests.
Website operators should not get a say in what kinds of user agents I use to access their sites. Terminal? Fine. Regular web browser? Okay. AI-powered web browser? Who cares. The strength of the web lies in the fact that I can access it with many different kinds of tools depending on my use case, and we cannot sacrifice that strength on the altar of hatred of AI tools.
Down that road lies disaster, with the Play Integrity API being just the tip of the iceberg.
https://www.robotstxt.org/faq/what.html
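(For reference, "being respectful" for a program that does identify itself as a recursive crawler looks roughly like this. A minimal Python sketch using the standard-library robotparser; the crawler name and URLs are placeholders, not anything this product actually does:)

    import urllib.robotparser

    # A respectful recursive crawler fetches /robots.txt once and checks
    # every URL against it before requesting the page.
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("https://example.com/robots.txt")
    rp.read()

    # "ExampleCrawler" stands in for whatever name the crawler announces itself under.
    if rp.can_fetch("ExampleCrawler", "https://example.com/some/page"):
        pass  # fetch the page, queue its links, recurse
    else:
        pass  # skip it: the site has asked crawlers to stay out of this path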
No, not today.
But I wonder if it matters when the agent is mostly used for "human" use cases and not scraping?
robotstxt.org [0] is pretty specific in what constitutes a robot for the purposes of robots.txt:
> A robot is a program that automatically traverses the Web's hypertext structure by retrieving a document, and recursively retrieving all documents that are referenced.
This is absolutely not what you are doing, which means what you have here is not a robot. What you have here is a user agent, so you don't need to pay attention to robots.txt.
If what you are doing here counted as robotic traffic, then so would:
* Speculative loading (algorithm guesses what you're going to load next and grabs it for you in advance for faster load times).
* Reader mode (algorithm transforms the website to strip out tons of content that you don't want and present you only with the minimum set of content you wanted to read).
* Terminal-based browsers (they do not render images or run JavaScript, thus bypassing advertising, which by some justifications would make them robots because they bypass monetization).
The fact is that the web is designed to be navigated by a diverse array of different user agents that behave differently. I'd seriously consider imposing rate limits on how frequently your browser acts so you don't knock over a server—that's just good citizenship—but robots.txt is not designed for you and if we act like it is then a lot of dominoes will fall.
[0] https://www.robotstxt.org/faq/what.html
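The self-imposed rate limit suggested above doesn't need to be fancy, either. A minimal Python sketch (the one-request-per-second interval is an arbitrary choice for illustration):

    import time
    import urllib.request

    MIN_INTERVAL = 1.0  # arbitrary: at most one request per second per site
    _last_request = 0.0

    def polite_get(url):
        # Sleep out the remainder of the interval so bursts of agent activity
        # never hit the server faster than a patient human would.
        global _last_request
        wait = MIN_INTERVAL - (time.monotonic() - _last_request)
        if wait > 0:
            time.sleep(wait)
        _last_request = time.monotonic()
        return urllib.request.urlopen(url).read()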
What do you mean? This AI cannot scrape multiple links automatically? Like "make a summary of all the recipes linked in this page" kind of stuff? If it can, it definitely meets the definition of scraping.
I think what he means is that it isn't crawling and scraping indiscriminately; it uses a more targeted approach. It's equivalent to a user going to each of those sites, just more efficiently.
Yes, it would matter. The AI might be I in your eyes, but it is still A.
I'll poke the bear:
As a user, the browser is my agent. If I'm directing an LLM to do something on a page in my browser, it's not that much different than me clicking a button manually, or someone using a screen reader to read the text on a page. The browser is my user agent and the specific tools I choose to use in my browser shouldn't be forbidden by a webpage. (that's why to this day all browsers still claim to be Mozilla...)
(This is very different than mass scraping web pages for training purposes. Those should absolutely respect robots.txt. There's a big difference between a user operated agentic-browser interacting with a web page and mass link crawling.)
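(The Mozilla bit is easy to check: every mainstream browser still leads its User-Agent header with "Mozilla/5.0" for compatibility. A representative Chrome-on-Windows string looks something like this, with the version numbers varying:)

    Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36

Each extra token in there exists because some site once sniffed for it and served broken pages to everything else, which is exactly the failure mode of letting websites pick which user agents are acceptable.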
My understanding of this product is that this isn't an automated AI scraper, it's simply helping the user navigate pages they've already navigated to themselves.
If any type of AI-based assistance is supposed to adhere to robots.txt, would you also say that AI-based accessibility tools should refuse to work on pages blocked by robots.txt?
So is Chrome. Very artificial. It's still not a robot for the purposes of robots.txt.
What coherent definition of robot excludes Chrome but includes this?
There's no reason not to respect it.
If your browser behaves, it's not going to be excluded in robots.txt.
If your browser doesn't behave, you should at least respect robots.txt.
If your browser doesn't behave, and you continue to ignore robots.txt, that's just... shitty.
> If your browser behaves, it's not going to be excluded in robots.txt.
No, it's common practice to allow Googlebot and deny all other crawlers by default [0].
This is within their rights when it comes to true scrapers, but it's part of why I'm very uncomfortable with the idea of applying robots.txt to what are clearly user agents. It sets a precedent where it's not inconceivable that we have websites curating allowlists of user agents like they already do for scrapers, which would be very bad for the web.
[0] As just one example: https://www.404media.co/google-is-the-only-search-engine-tha...
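(For reference, that pattern is only a few lines of robots.txt. A hypothetical example that allows Googlebot and turns every other self-identified crawler away:)

    User-agent: Googlebot
    Disallow:

    User-agent: *
    Disallow: /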
You should, because universities are starting to get legal involved due to mass scraping taking down their systems.