Comment by quchen

2 days ago

Unless this concept becomes a mass phenomenon with many implementations, isn’t this pretty easy to filter out? And furthermore, since this antagonizes billion-dollar companies that can spin up teams doing nothing but browse Github and HN for software like this to prevent polluting their datalakes, I wonder whether this is a very efficient approach.

Author of a similar tool here[0]. There are a few implementations of this sort of thing that I know of. Mine is different in that the primary purpose is to slightly alter content statically using a Markov generator: mainly to make it useless for content reposters, and secondarily to make it useless to LLM crawlers that ignore my robots.txt file[1]. I assume the generated text is bad enough that the LLM crawlers just throw the result out. Other than the extremely poor quality of the text, my tool doesn't leave any fingerprints (like recursive nonsense links). In any case, it can be run on static sites with no server-side dependencies, so long as you have a way to do content redirection based on User-Agent, IP, etc.
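
For anyone wondering what "slightly alter content statically using a Markov generator" looks like in practice, here is a minimal Python sketch of the general idea (not quixotic's actual code): build a small word-level chain from the page text and re-emit it, so the output is statistically similar to the original but useless as information.

    import random
    from collections import defaultdict

    def build_chain(text, order=2):
        # Map each run of `order` words to the words observed to follow it.
        words = text.split()
        chain = defaultdict(list)
        for i in range(len(words) - order):
            chain[tuple(words[i:i + order])].append(words[i + order])
        return chain

    def mangle(text, order=2, length=300):
        # Re-emit text from the chain: it reads like the original but says nothing true.
        chain = build_chain(text, order)
        if not chain:
            return text
        out = list(random.choice(list(chain)))
        for _ in range(length):
            successors = chain.get(tuple(out[-order:]))
            if not successors:  # dead end: restart from a random state
                out.extend(random.choice(list(chain)))
                continue
            out.append(random.choice(successors))
        return " ".join(out)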

My tool does have a second component, linkmaze, which generates a bunch of nonsense text with a Markov generator and serves infinite links (like Nepenthes does), but I generally only throw incorrigible bots at it (and, as others have noted in-thread, most crawlers already set some kind of limit on how many requests they'll send to a given site, especially a small site). I do use it for PHP-exploit crawlers as well, though I've seen no evidence those fall into the maze -- I think they mostly just look for some string indicating a successful exploit and move on if whatever they're looking for isn't present.
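
The maze half is similarly small in outline. A toy Flask sketch of the pattern (the /maze/ prefix and everything else here is invented for illustration, not how Nepenthes or linkmaze is actually implemented): every request returns cheap filler text plus a handful of links to pages that only come into existence when a crawler asks for them.

    import random
    import string

    from flask import Flask

    app = Flask(__name__)

    def random_slug(n=10):
        return "".join(random.choice(string.ascii_lowercase) for _ in range(n))

    @app.route("/maze/")
    @app.route("/maze/<path:slug>")
    def maze(slug=""):
        # Filler could come from a Markov generator like the sketch above; the
        # links point at pages that don't exist until requested, so the space is endless.
        filler = " ".join(random_slug(random.randint(3, 9)) for _ in range(120))
        links = "".join(
            '<p><a href="/maze/{0}">{0}</a></p>'.format(random_slug()) for _ in range(5)
        )
        return "<html><body><p>{}</p>{}</body></html>".format(filler, links)

    if __name__ == "__main__":
        app.run()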

But, for my use case, I don't really care if someone fingerprints content generated by my tool and avoids it. That's the point: I've set robots.txt to tell these people not to crawl my site.

In addition to Quixotic (my tool) and Nepenthes, I know of:

* https://github.com/Fingel/django-llm-poison

* https://codeberg.org/MikeCoats/poison-the-wellms

* https://codeberg.org/timmc/marko/

0 - https://marcusb.org/hacks/quixotic.html

1 - I use the ai.robots.txt user agent list from https://github.com/ai-robots-txt/ai.robots.txt

It would be more efficient for them to spin up a team to study this robots.txt thing. They've ignored that low hanging fruit, so they won't do the more sophisticated thing any time soon.
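
For a sense of how low that fruit hangs: honoring robots.txt is a handful of lines with nothing but the Python standard library (the URLs and bot name below are placeholders):

    from urllib.robotparser import RobotFileParser

    rp = RobotFileParser("https://example.org/robots.txt")  # placeholder site
    rp.read()

    if rp.can_fetch("ExampleBot/1.0", "https://example.org/some/page"):
        print("allowed to crawl")
    else:
        print("disallowed by robots.txt -- skip it")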

  • You can't make money out of studying robots.txt, but you can avoid costs by skipping bad web sites.

    • Sounds like a benefit for the site owner. lol. It accomplished what they wanted.

I forget which fiction book covered this phenomenon (Rainbows End?), but the moment it becomes the basic default install (a la ad blockers in ordinary people's browsers), it does not matter what the bigger players want to do; they are now actively fighting against determined and possibly radicalized users.

The idea is that you place this in parallel to the rest of your website routes, that way your entire server might get blacklisted by the bot.

Does it need to be efficient if it’s easy? I wrote a similar tool except it’s not a performance tarpit. The goal is to slightly modify otherwise organic content so that it is wrong, but only for AI bots. If they catch on and stop crawling the site, nothing is lost. https://github.com/Fingel/django-llm-poison
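
The general pattern is a middleware that checks the request's user agent against a known AI-crawler list and swaps in corrupted text for those requests only. A toy Django-flavored sketch, not the actual django-llm-poison implementation; the bot list is abbreviated and mangle() stands in for whatever corruption step you prefer:

    # Toy sketch only, assuming a standard Django setup -- not django-llm-poison's real code.
    AI_USER_AGENTS = ["GPTBot", "ClaudeBot", "CCBot"]  # abbreviated example list

    def mangle(text):
        # Stand-in for a real corruption step, e.g. a Markov re-emitter.
        return text

    class PoisonForBotsMiddleware:
        def __init__(self, get_response):
            self.get_response = get_response

        def __call__(self, request):
            response = self.get_response(request)
            ua = request.META.get("HTTP_USER_AGENT", "")
            is_bot = any(bot in ua for bot in AI_USER_AGENTS)
            if is_bot and response.get("Content-Type", "").startswith("text/html"):
                # Humans get the real page; listed crawlers get subtly wrong text.
                response.content = mangle(response.content.decode()).encode()
            return response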

I am not sure. How would crawlers filter this?

  • You limit the crawl time or number of requests per domain for all domains, and set the limit proportional to how important the domain is.

    There's a ton of these types of things online; you can't, e.g., exhaustively crawl every Wikipedia mirror someone's put online.
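
    A crude sketch of that budget logic on the crawler side (the numbers and the importance score are invented, just to make the idea concrete):

        from collections import defaultdict
        from urllib.parse import urlparse

        BASE_BUDGET = 200                 # invented number
        requests_made = defaultdict(int)  # requests issued so far, per domain

        def budget_for(importance):
            # importance in [0, 1], e.g. derived from inbound links; more
            # important domains get a proportionally larger crawl allowance.
            return int(BASE_BUDGET * (1 + 9 * importance))

        def should_fetch(url, importance):
            domain = urlparse(url).netloc
            if requests_made[domain] >= budget_for(importance):
                return False  # budget spent: a tarpit can't soak up more than its share
            requests_made[domain] += 1
            return True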

  • Check if the response time, the length of the "main text", or other indicators are in the lowest few percentiles -> send those to the heap for manual review.

    Does the inferred "topic" of the domain match the topic of the individual pages? If not -> manual review. And there are many more indicators.

    Hire a bunch of student jobbers, have them search github for tarpits, and let them write middleware to detect those.

    If you are doing broad crawling, you already need to do this kind of thing anyway.
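
    A rough sketch of the percentile check (the field names and the 5% threshold are invented):

        import statistics

        def flag_for_review(pages, pct=5):
            # pages: dicts with invented keys "url", "fetch_ms", "main_text_len".
            # Anything in the bottom few percent on either signal is queued for
            # manual review rather than ingested.
            ms_cut = statistics.quantiles([p["fetch_ms"] for p in pages], n=100)[pct - 1]
            len_cut = statistics.quantiles([p["main_text_len"] for p in pages], n=100)[pct - 1]
            return [p["url"] for p in pages
                    if p["fetch_ms"] <= ms_cut or p["main_text_len"] <= len_cut]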

    • > Hire a bunch of student jobbers,

      Do people still do this, or do they just offshore the task?

It's not. It's rather pointless and, frankly, nearsighted. And we can DDoS sites like this just as offensively, simply by making many requests: its own docs say its Markov generation is computationally expensive, but it is NOT expensive for even one person to make many requests. It's just expensive to host. So feel free to use this bash function to defeat these:

    httpunch() {
      local url=$1
      local action=$1
      local connections=${HTTPUNCH_CONNECTIONS:-100}
      local keepalive_time=${HTTPUNCH_KEEPALIVE:-60}
      local silent_mode=false

      # Use the second argument as the connection count only if it is numeric,
      # so "--silent" in that position isn't mistaken for a count
      if [[ $2 =~ ^[0-9]+$ ]]; then
        connections=$2
      fi

      # Check if "kill" was passed as the first argument
      if [[ $action == "kill" ]]; then
        echo "Killing all curl processes..."
        pkill -f "curl --no-buffer"
        return
      fi

      # Parse optional --silent argument
      for arg in "$@"; do
        if [[ $arg == "--silent" ]]; then
          silent_mode=true
          break
        fi
      done

      # Ensure URL is provided if "kill" is not used
      if [[ -z $url ]]; then
        echo "Usage: httpunch [kill | <url>] [number_of_connections] [--silent]"
        echo "Environment variables: HTTPUNCH_CONNECTIONS (default: 100), HTTPUNCH_KEEPALIVE (default: 60)."
        return 1
      fi

      echo "Starting $connections connections to $url..."
      for ((i = 1; i <= connections; i++)); do
        if $silent_mode; then
          curl --no-buffer --silent --output /dev/null --keepalive-time "$keepalive_time" "$url" &
        else
          curl --no-buffer --keepalive-time "$keepalive_time" "$url" &
        fi
      done

      echo "$connections connections started with a keepalive time of $keepalive_time seconds."
      echo "Use 'httpunch kill' to terminate them."
    }

(Generated in a few seconds with the help of an LLM, of course.) Your free speech is also my free speech. LLMs are just a very useful tool, and Llama, for example, is open source and also needs to be trained on data. And I <opinion> just can't stand knee-jerk anticorporate AI doomers who decide to just create chaos instead of using that same energy to try to steer the progress </opinion>.

If it means it makes your own content safe when you deploy it on a corner of your website: mission accomplished!

  • > If it means it makes your own content safe

    Not really? As mentioned by others, such tarpits are easily mitigated by using a priority queue. For instance, crawlers can prioritize external links over internal links, which means if your blog post makes it to HN, it'll get crawled ahead of the tarpit. If it's discoverable and readable by actual humans, AI bots will be able to scrape it.
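
    A minimal sketch of that frontier (the scoring is invented, not any particular crawler's): URLs discovered from another host sort ahead of a site's own internal links, so a same-host maze waits in line behind everything else.

        import heapq
        import itertools
        from urllib.parse import urlparse

        _tie = itertools.count()  # tie-breaker so heapq never compares URLs
        frontier = []

        def push(url, discovered_from):
            # Invented scoring: external discoveries get priority 0, a site's
            # links to itself get 1, so internal mazes can't crowd out the queue.
            internal = urlparse(url).netloc == urlparse(discovered_from).netloc
            heapq.heappush(frontier, (1 if internal else 0, next(_tie), url))

        def pop():
            return heapq.heappop(frontier)[2]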

  • [flagged]

    • You've got to be seriously AI-drunk to equate letting your site be crawled by commercial scrapers with "contributing to humanity".

      Maybe you don't want your stuff to get thrown into the latest Silicon Valley commercial operation without getting paid for it. That seems like a valid position to take. Or maybe you just don't want Claude's ridiculously badly behaved scraper to chew through your entire budget.

      Regardless, scrapers that don't follow rules like robots.txt will pretty quickly discover why those rules exist in the first place, as they receive increasing amounts of garbage.