Comment by shakna
13 hours ago
When I was writing a crawler for my search engine (now offline), I found that almost no crawler library actually coped with real-world robots files. So I ended up going to a lot of effort to write one that complied with Amazon's and Google's rather complicated nested robots files, including respecting the cool-off periods they request.
... And then I found that their own crawlers can't parse their own manifests.
Could you link the source of your crawler library?
It's about 700 lines of the worst Python ever. You do not want it. I would be too embarrassed to release it, honestly.
It complied, but it was absolutely not fast or efficient. I aimed at compliance first, good code second, but never got to the second because of more human-oriented issues that killed the project.
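For anyone wondering what "respecting the cool off periods" looks like in practice, here is a rough sketch using Python's standard `urllib.robotparser`. To be clear, this is not the code described above. The robots.txt content, the `mybot` and `examplebot` agent names, and the `PoliteGate` helper are all made up for illustration. It also ignores the nested files mentioned earlier, so treat it as a starting point under those assumptions and nothing more.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt, inlined so the sketch runs without network access.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 2
Disallow: /private/

User-agent: examplebot
Crawl-delay: 5
Disallow: /
"""


def load_rules(text):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


class PoliteGate:
    """Hypothetical helper: checks permission and enforces the crawl delay."""

    def __init__(self, parser, user_agent, default_delay=1.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.parser = parser
        self.user_agent = user_agent
        self.clock = clock
        self.sleep = sleep
        # crawl_delay() returns None when no Crawl-delay applies to this agent.
        delay = parser.crawl_delay(user_agent)
        self.delay = float(delay) if delay is not None else default_delay
        self.last_fetch = None

    def wait_time(self):
        """Seconds still to wait before the next fetch is polite."""
        if self.last_fetch is None:
            return 0.0
        elapsed = self.clock() - self.last_fetch
        return max(0.0, self.delay - elapsed)

    def acquire(self, url):
        """Return False if disallowed; otherwise wait out the delay and return True."""
        if not self.parser.can_fetch(self.user_agent, url):
            return False
        remaining = self.wait_time()
        if remaining > 0:
            self.sleep(remaining)
        self.last_fetch = self.clock()
        return True
```

You would call `gate.acquire(url)` before every request to a host and skip the URL when it returns False. The clock and sleep functions are injectable only so the waiting logic can be tested without actually sleeping.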