Comment by frameset

3 months ago

It actually is.

I run a small video game forum with posts going back to 2008. We got absolutely smashed by bots scraping for training data for LLMs.

So I put it behind Cloudflare and now it's down. Ho hum.

Have you tried Anubis or similar tools? I've had similar issues with bots scraping a forum and eating all the server's resources, and using a PoW challenge solved the problem.

https://github.com/TecharoHQ/anubis

  • I've always wondered: has there been any effort to implement a PoW challenge like that at a lower level? E.g., TCP, but where the handshake requires solving a challenge and the connection is otherwise just closed? It seems like something that could benefit from being invisible to the application layer.

    Edit: To answer my own question, yes: http://www.arijuels.com/wp-content/uploads/2013/09/JB99.pdf

    Edit 2: Maybe TLS would be another reasonable place for it?

  • I did! It's very cool tech. However, for our config it was easier to slap CF in front of it.

    I will say, one very appealing use of Anubis I'd love to try is running it as a Traefik middleware to protect services running in Docker containers.
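
For anyone wondering what the PoW challenge mentioned above boils down to, here is a minimal hashcash-style sketch in Python. It is not Anubis's actual code and not the scheme from the linked paper. The function names, the `challenge:nonce` string format, and the difficulty value are all assumptions made for illustration. The idea is that the client has to try many hashes to find a valid nonce, while the server checks the answer with a single hash.

```python
import hashlib
import secrets


def leading_zero_bits(digest: bytes) -> int:
    """Count how many zero bits the digest starts with."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits


def make_challenge() -> str:
    """Server side: issue a random, unpredictable challenge string."""
    return secrets.token_hex(16)


def solve(challenge: str, difficulty: int) -> int:
    """Client side: brute-force a nonce (expected cost ~2**difficulty hashes)."""
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
        if leading_zero_bits(digest) >= difficulty:
            return nonce
        nonce += 1


def verify(challenge: str, nonce: int, difficulty: int) -> bool:
    """Server side: a single hash is enough to check the client's work."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return leading_zero_bits(digest) >= difficulty


if __name__ == "__main__":
    challenge = make_challenge()
    nonce = solve(challenge, difficulty=12)  # hypothetical difficulty
    print(verify(challenge, nonce, difficulty=12))
```

Each extra bit of difficulty roughly doubles the client's expected work, which is negligible for one human page load but adds up quickly for a scraper fetching every thread on the forum.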

Can you please elaborate on “smashed”? I’m very interested.

  • I took a screenshot of the graph in Cloudflare when I switched on the bot challenges.

    https://i.ibb.co/qHCJyY7/image.png

    I wrote the below to explain to our users what was happening, so apologies if the language is too simple for an HN reader.

    - 0630, we switched our DNS to proxy through CF, starting the collection of data, and implemented basic bot protections

    - Unfortunately whatever anti-bot magic they have isn't quite having the effect, even after two hours.

    - 0830, I sign in and take a look at the analytics. It seems like <SITE NAME> is very popular in Vietnam, Brazil, and Indonesia.

    - 0845, I make it so users from those countries have to pass a CF "challenge". This is similar to a CAPTCHA, but CF try to make it so there's no "choosing all the cars in an image" if they can help it.

    - So far 0% of our Asian audience have passed a challenge.

Same problem here. If I didn't use Cloudflare, nearly all of my traffic would be (apparently misconfigured) scraper bots.