Comment by kazinator
7 months ago
I deployed this, instead of my usual honeypot script.
It's not working very well.
In the web server log, I can see that the bots are not downloading the whole ten megabyte poison pill.
They are cutting off at various lengths. I haven't seen anything fetch more than around 1.5 Mb of it so far.
Or is it working? Are they decoding it on the fly as a stream, and then crashing? E.g. if something is recorded as having read 1.5 Mb, could it have decoded it to 1.5 Gb in RAM, on the fly, and crashed?
There is no way to tell.
Try content labyrinth. I.e. infinitely generated content with a bunch of references to other generated pages. It may help against simple wget and till bots adapt.
PS: I'm on the bots side, but don't mind helping.
This doesn't work if you pay bandwidth and CPU usage for your servers though.
The labyrinth doesn't have to be fast, and things like iocaine (https://iocaine.madhouse-project.org/) don't use much CPU if you don't go and give them something like the Complete Works of Ahakespeare as input (Mine is using Moby Dick), and can easily be constrained with cgroups if you're concerned about resource usage.
I've noticed that LLM scrapers tend to be incredibly patient. They'll wait for minutes for even small amounts of text.
1 reply →
That will be your contribution. If others join scrapping will become very pricey. Till bots become smarter. But then they will not download much of generated crap. Which makes it cheaper for you.
Anyway, from bots perspective labyrinths aren't the main problem. Internet is being flooded with quality LLM-generated content.
Wouldn't this just waste your own bandwidth/resources?
Kinda wonder if a "content labyrinth" could be used to influence the ideas / attitudes of bots -- fill it with content pro/anti Communism, or Capitalism, or whatever your thing is, hope it tips the resulting LLM towards your ideas.
Perhaps need to semi-randomize the file size? I'm guessing some of the bots have a hard limit to the size of the resource they will download.
Many of these are annoying LLM training/scraping bots (in my case anyway). So while it might not crash them if you spit out a 800KB zipbomb, at least it will waste computing resources on their end.
Do they comeback? If so then they detect it and avoid it. If not then they crashed and mission accomplished.
I currently cannot tell without making a little configuration change, because as soon as an IP address is logged as having visited the trap URL (honeypot, or zipbomb or whatever), a log monitoring script bans that client.
Secondly, I know that most of these bots do not come back. The attacks do not reuse addresses against the same server in order to evade almost any conceivable filter rule that is predicated on a prior visit.
I may be asking a really silly question here, but
> as soon as an IP address is logged as having visited the trap URL (honeypot, or zipbomb or whatever), a log monitoring script bans that client.
Is this not why they aren’t getting the full file?
2 replies →