Comment by grajaganDev

2 days ago

> I am not sure. How would crawlers filter this?

You limit the crawl time or number of requests per domain for all domains, and set the limit proportional to how important the domain is.
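A minimal sketch of what such a per-domain budget might look like, assuming you already have some "importance" score per domain (all names and numbers here are made up for illustration):

```python
import time
from collections import defaultdict

class DomainBudget:
    """Caps requests and crawl time per domain, scaled by domain importance."""

    def __init__(self, base_requests=500, base_seconds=600):
        self.base_requests = base_requests
        self.base_seconds = base_seconds
        self.requests = defaultdict(int)
        self.started = {}

    def allow(self, domain, importance=1.0):
        # importance ~ 1.0 for an average domain, larger for well-known ones
        now = time.monotonic()
        self.started.setdefault(domain, now)
        over_requests = self.requests[domain] >= self.base_requests * importance
        over_time = (now - self.started[domain]) >= self.base_seconds * importance
        if over_requests or over_time:
            return False  # stop fetching from this domain
        self.requests[domain] += 1
        return True
```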

There are a ton of these kinds of things online; you can't, e.g., exhaustively crawl every Wikipedia mirror someone has put up.

Check whether the response time, the length of the "main text", or other indicators are in the lowest few percentiles -> send to the heap for manual review.
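A rough sketch of that percentile check, assuming each crawled page has been reduced to a dict of metrics (field names and the 2% threshold are hypothetical; a real crawler might also look at the top tail, e.g. very slow responses):

```python
import numpy as np

def flag_for_review(pages, percentile=2.0):
    """pages: list of dicts with 'url', 'response_ms', 'main_text_len'."""
    times = np.array([p["response_ms"] for p in pages])
    lengths = np.array([p["main_text_len"] for p in pages])
    # Cut points for the bottom few percentiles across the whole crawl.
    time_cut = np.percentile(times, percentile)
    len_cut = np.percentile(lengths, percentile)
    # Pages sitting in the bottom tail on either metric go to manual review.
    return [p for p in pages
            if p["response_ms"] <= time_cut or p["main_text_len"] <= len_cut]
```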

Does the inferred "topic" of the domain match the topic of the individual pages? If not -> manual review. And there are many more indicators.
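A toy illustration of that topic-mismatch check: infer a topic per page, take the dominant topic per domain, and flag pages (or whole domains) that don't fit. `classify_topic` is a stand-in for whatever topic model the crawler already uses.

```python
from collections import Counter

def topic_mismatches(pages, classify_topic, min_share=0.6):
    """pages: list of dicts with 'domain' and 'text'."""
    by_domain = {}
    for p in pages:
        p["topic"] = classify_topic(p["text"])
        by_domain.setdefault(p["domain"], []).append(p)

    flagged = []
    for domain, ps in by_domain.items():
        counts = Counter(p["topic"] for p in ps)
        top_topic, top_count = counts.most_common(1)[0]
        if top_count / len(ps) < min_share:
            # No single topic dominates the domain: suspicious, review it all.
            flagged.extend(ps)
        else:
            # Otherwise flag only the pages that disagree with the domain topic.
            flagged.extend(p for p in ps if p["topic"] != top_topic)
    return flagged
```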

Hire a bunch of student jobbers, have them search GitHub for tarpits, and let them write middleware to detect those.

If you are doing broad crawling, you already need to do this kind of thing anyway.

  • > Hire a bunch of student jobbers,

    Do people still do this, or do they just offshore the task?