Comment by markerz

19 days ago

One of my websites was absolutely destroyed by Meta's AI bot: Meta-ExternalAgent https://developers.facebook.com/docs/sharing/webmasters/web-...

It seems a bit naive and doesn't do the kind of performance back-off I'd expect from Googlebot. It just kept requesting more and more until my server crashed, then it would back off for a minute and then start requesting again.

My solution was to add a Cloudflare rule to block requests from their User-Agent. I also added more nofollow attributes to links, plus a robots.txt, but those are just suggestions and some bots seem to ignore them.

Cloudflare also has a feature to block known AI bots and even suspected AI bots: https://blog.cloudflare.com/declaring-your-aindependence-blo... As much as I dislike Cloudflare centralization, this was a super convenient feature.

> Cloudflare also has a feature to block known AI bots and even suspected AI bots

In addition to the other crushing internet risks, add being wrongly blacklisted as a bot to the list.

  • This is already a thing for basically all of the second[0] and third worlds. A non-trivial amount of Cloudflare's security value is plausibly algorithmic discrimination and collective punishment as a service.

    [0] Previously Soviet-aligned countries, i.e. Russia and Eastern Europe.

  • What do you mean, crushing risk? Just solve these 12 puzzles by moving tiny icons on a tiny canvas while on your phone and you're in the clear for a couple more hours!

    • If you live in a region whose existence it is economically acceptable to ignore (I do), you sometimes get blocked by website r̶a̶c̶k̶e̶t̶ protection for no reason at all, simply because some "AI" model saw a request coming from an unusual place.

    • Sometimes it doesn’t even give you a Captcha.

      I have come across some websites that block me using Cloudflare with no way of solving it. I’m not sure why, I’m in a large first-world country, I tried a stock iPhone and a stock Windows PC, no VPN or anything.

      There's just no way to know.

    • If it clears you at all. I accidentally left a user-agent switcher turned on for every site instead of just the one I needed it for, and Cloudflare gave me an infinite loop of challenges. At least turning it off let me use the internet again.

  • These features are opt-in, and often paid. I struggle to see how this is a "crushing risk," although I don't doubt that sufficiently unskilled shops would be completely crushed by an IP/User-Agent block. Since Cloudflare has a much more informed and broader view of internet traffic than maybe any other company in the world, I'll probably use that feature without any qualms at some point in the future. Right now their normal WAF rules do a pretty good job of not blocking legitimate traffic, at least on enterprise.

    • The risk is not to the company using Cloudflare; the risk is to any legitimate individual who Cloudflare decides is a bot. Hopefully their detection is accurate because a false positive would cause great difficulties for the individual.

  • We're rapidly approaching a login-only internet. If you're not logged in with Google on Chrome, then no website for you!

    Attestation/WEI (Web Environment Integrity) enables this.

    • And not just a login, but soon probably also the real, verified identity tied to it. The internet is becoming a worse place than the real world.

I see a lot of traffic that I can tell is from bots based on the URL patterns they access. They don't include "bot" in their user agent, and often use residential IP pools. I haven't found an easy way to block them. They nearly took out my site a few days ago too.

  • You could run all of your content through an LLM to create a twisted and purposely factually incorrect rendition of your data. Forward all AI bots to the junk copy.

    Everyone should start doing this. Once the AI companies engorge themselves on enough garbage and start to see a negative impact to their own products, they'll stop running up your traffic bills.

    Maybe you don't even need a full LLM. Just a simple transformer that inverts negative and positive statements, changes nouns such as locations, and subtly nudges the content into an erroneous state.
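
    A rough sketch of the routing half of this idea, assuming a static site rendered ahead of time into two directories, one real and one "twisted" (the directory names, bot list, and Flask setup are all illustrative, and bots that spoof their User-Agent will slip through):

      # Hypothetical sketch: serve a pre-poisoned mirror of the site to requests
      # whose User-Agent looks like a known AI crawler. Directory layout and the
      # marker list are assumptions, not anyone's production setup.
      from pathlib import Path
      from flask import Flask, request, send_from_directory

      app = Flask(__name__)

      AI_BOT_MARKERS = ("gptbot", "ccbot", "claudebot", "amazonbot", "meta-externalagent")
      REAL_ROOT = Path("site").resolve()      # the real static pages
      POISON_ROOT = Path("poison").resolve()  # same pages, run through the "twisting" step offline

      @app.route("/", defaults={"page": "index.html"})
      @app.route("/<path:page>")
      def serve(page):
          ua = request.headers.get("User-Agent", "").lower()
          root = POISON_ROOT if any(m in ua for m in AI_BOT_MARKERS) else REAL_ROOT
          return send_from_directory(root, page)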

    • > You could run all of your content through an LLM to create a twisted and purposely factually incorrect rendition of your data. Forward all AI bots to the junk copy.

      > Everyone should start doing this. Once the AI companies engorge themselves on enough garbage and start to see a negative impact to their own products, they'll stop running up your traffic bills.

      I agree, and not just to discourage them from running up traffic bills. The end-state of what they hope to build is very likely to be extremely bad for most regular people [1], so we shouldn't cooperate in building it.

      [1] And I mean end state. I don't care how much value you say you get from some AI coding assistant today; the end state is that your employer happily gets to fire you and replace you with an evolved version of the assistant at a fraction of your salary. The goal is to eliminate the cost that is our livelihoods. And if we're lucky, in exchange we'll get a much-reduced basic income, sufficient to count out the rest of our days in a dense housing project filled with cheap, minimum-quality goods and a machine to talk to if we're sad.

    • > Everyone should start doing this. Once the AI companies engorge themselves on enough garbage and start to see a negative impact to their own products, they'll stop running up your traffic bills

      Or just wait until the AI flood has peaked and most easily scrapable content has been AI-generated (or at least AI-modified).

      We should seriously start discussing the future of the public web and how not to leave it to big tech, before it's too late. It's a small part of something I'm working on, but not central, so I haven't spent enough time to have great answers. If anyone reading this seriously cares, I'm waiting desperately to exchange thoughts and approaches on this.

  • My cheap and dirty way of dealing with bots like that is to block any IP address that accesses any of the URLs disallowed in robots.txt. It's not a perfect strategy, but it gives me pretty good results given how simple it is to implement.
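
    For what it's worth, a minimal Python sketch of that trick, assuming an nginx-style access log and an ipset named "botban" that an iptables rule already drops (the paths, filenames, and set name are all placeholders):

      # Ban any IP that requests one of the "trap" paths that robots.txt disallows.
      # A real setup would tail the log continuously; this one-pass version just
      # shows the matching and banning logic.
      import re
      import subprocess

      ACCESS_LOG = "/var/log/nginx/access.log"
      TRAP_PREFIXES = ("/secret-admin/", "/do-not-crawl/")   # listed as Disallow: in robots.txt
      LINE_RE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "(?:GET|POST|HEAD) (\S+)')

      banned = set()

      def ban(ip):
          if ip not in banned:
              banned.add(ip)
              subprocess.run(["ipset", "add", "botban", ip], check=False)

      with open(ACCESS_LOG) as log:
          for line in log:
              match = LINE_RE.match(line)
              if match and any(match.group(2).startswith(p) for p in TRAP_PREFIXES):
                  ban(match.group(1))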

  • TLS fingerprinting still beats most of them. For really high-compute endpoints I suppose some sort of JavaScript challenge would be necessary, which is quite annoying to set up yourself. I hate Cloudflare as a visitor, but they do make life so much easier for administrators.

  • You rate-limit them and then block the abusers. Nginx supports rate limiting; you can then use fail2ban to block an IP for an hour if it gets rate-limited 3 times, and if it gets banned 5 times, block it forever using the recidive jail.

    I've had massive AI bot traffic from M$ and blocked several IPs by adding manual entries to the recidive jail. If they come back and disregard robots.txt with disallow *, I'll run 'em through fail2ban.
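
    For anyone who'd rather see the logic than the config, here's a toy Python version of the same escalation (nginx's limit_req and fail2ban's recidive jail do this properly; the numbers below are arbitrary):

      # Token-bucket rate limit per IP, a 1-hour ban for offenders, and a
      # permanent ban after repeated short bans (roughly what the recidive
      # jail does). Purely illustrative, not anyone's actual setup.
      import time
      from collections import defaultdict

      RATE = 10.0           # allowed requests per second per IP
      SHORT_BAN = 3600      # seconds
      STRIKES_TO_PERMA = 5

      tokens = defaultdict(lambda: RATE)
      last_seen = defaultdict(time.monotonic)
      strikes = defaultdict(int)
      banned_until = {}

      def allow(ip):
          now = time.monotonic()
          if banned_until.get(ip, 0) > now:
              return False
          # refill the bucket based on elapsed time, then try to spend one token
          tokens[ip] = min(RATE, tokens[ip] + (now - last_seen[ip]) * RATE)
          last_seen[ip] = now
          if tokens[ip] >= 1:
              tokens[ip] -= 1
              return True
          strikes[ip] += 1
          banned_until[ip] = float("inf") if strikes[ip] >= STRIKES_TO_PERMA else now + SHORT_BAN
          return False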

    • Whatever M$ was doing still baffles me. I still have several Azure ranges in my blocklist because whatever this was appeared to change strategy once I implemented a ban method.

  • The amateurs at home are going to give the big companies what they want: an excuse for government regulation.

I wonder if it would work to send Meta's legal department a notice that they are not permitted to access your website.

Would that make subsequent accesses be violations of the U.S.'s Computer Fraud and Abuse Act?

  • Crashing wasn't the intent. And scraping is legal, as I remember from the LinkedIn case.

    • There's a fine line between scraping and DDoSing, I'm sure.

      Just because you manufacture chemicals doesn't mean you can legally dump your toxic waste anywhere you want (well, you shouldn't be allowed to, at least).

      You also shouldn't be able to set your crawlers loose in a way that causes sites to fail.

    • If I make a physical robot and it runs someone over, I'm still liable, even though it was a delivery robot, not a running over people robot.

      If a bot sends so many requests that a site completely collapses, the owner is liable, even though it was a scraping bot and not a denial of service bot.

    • Then you can feed them deliberately poisoned data.

      Send all of your pages through an adversarial LLM to pollute and twist the meaning of the underlying data.

  • > I wonder if it would work to send Meta's legal department a notice that they are not permitted to access your website.

    Depends how much money you are prepared to spend.

You can also block by IP. Facebook traffic comes from a single ASN, so you can kill it all in one go, even before the User-Agent is known (rough sketch below). The only thing I know of that this potentially affects is getting the social card for your site.

If a bot ignores robots.txt that's a paddlin'. Right to the blacklist.
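
A bare-bones sketch of the ASN approach, assuming you have pulled the currently announced prefixes for Meta's AS32934 from whois/BGP data (the two prefixes below are examples only and will go stale):

  # Check a client IP against prefixes announced by AS32934 (Meta/Facebook).
  # The prefix list is illustrative; fetch the live announcements yourself.
  from ipaddress import ip_address, ip_network

  META_PREFIXES = [ip_network(p) for p in ("31.13.64.0/18", "157.240.0.0/16")]

  def is_meta(ip):
      addr = ip_address(ip)
      return any(addr in net for net in META_PREFIXES)

  # e.g. in a request handler:
  # if is_meta(request.remote_addr):
  #     return "", 403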

  • The linked article explains what happens when you block their IP.

    • For reference:

      > If you try to rate-limit them, they’ll just switch to other IPs all the time. If you try to block them by User Agent string, they’ll just switch to a non-bot UA string (no, really).

      It's really absurd that they seem to think this is acceptable.

Silly question, but did you try to email Meta? There's an address at the bottom of that page to contact with concerns.

> webmasters@meta.com

I'm not naive enough to think something would definitely come of it, but it could just be a misconfiguration.

>> One of my websites was absolutely destroyed by Meta's AI bot: Meta-ExternalAgent https://developers.facebook.com/docs/sharing/webmasters/web-...

Are they not respecting robots.txt?

  • Quoting the top-level link to geraspora.de:

    > Oh, and of course, they don’t just crawl a page once and then move on. Oh, no, they come back every 6 hours because lol why not. They also don’t give a single flying fuck about robots.txt, because why should they. And the best thing of all: they crawl the stupidest pages possible. Recently, both ChatGPT and Amazon were - at the same time - crawling the entire edit history of the wiki.

> My solution was to add a Cloudflare rule to block requests from their User-Agent.

Surely if you can block their specific User-Agent, you could also redirect their User-Agent to goatse or something. Give 'em what they deserve.

Can't you just mess with them? Like, accept the connection but send back rubbish data at 1 bps?
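
Something like that is easy enough to hack together. Here's a toy tarpit sketch in Python that accepts the connection and drips a fake response at roughly one byte per second; note that it also ties up a thread and a socket on your side per victim, so connection limits still matter:

  # Toy tarpit: answer with valid-looking headers, then trickle rubbish forever.
  # Point only known-bad clients at this port; it is deliberately slow.
  import socket
  import threading
  import time

  def drip(conn):
      try:
          conn.sendall(b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n")
          while True:
              conn.sendall(b"A")   # rubbish, one byte at a time
              time.sleep(1)        # ~1 byte per second
      except OSError:
          pass
      finally:
          conn.close()

  server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
  server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
  server.bind(("0.0.0.0", 8081))
  server.listen()
  while True:
      conn, _ = server.accept()
      threading.Thread(target=drip, args=(conn,), daemon=True).start()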

Most administrators have no idea how, or no desire, to configure Cloudflare correctly, so they just slap it on the whole site by default and block all legitimate access to e.g. RSS feeds.

Imagine being one of the monsters who works at Facebook and thinking you're not one of the evil ones.

  • Well, Facebook actually releases their models instead of seeking rent off them, so I’m sort of inclined to say Facebook is one of the less evil ones.

    • > releases their models

      Some of them, and initially only by accident. And without the ingredients to create your own.

      Meta is trying to kill OpenAI and any new FAANG contenders. They'll commoditize their complement until the earth is thoroughly salted, and emerge as one of the leading players in the space due to their data, talent, and platform incumbency.

      They're one of the distribution networks for AI, so they're going to win even by just treading water.

      I'm glad Meta is releasing models, but don't ascribe their position as one entirely motivated by good will. They want to win.

[flagged]

  • That's right, getting DDoSed is a skill issue. Just have infinite capacity.

    • A DDoS is different from crashing.

      And I doubt Facebook implemented something that actually saturates the network; usually a scraper sets a limit on concurrent connections and often also a delay between requests (e.g. max 10 concurrent, 100 ms delay).

      Chances are the website operator is running a web server with terrible RAM efficiency that runs out of memory and crashes after 10 concurrent requests, or that saturates the CPU on simple requests, or something like that.

  • Can't every web server crash from being overloaded? There's an upper limit to the performance of everything. My website is a hobby and runs on a $4/mo budget VPS.

    Perhaps I'm saying "crash" and you're interpreting that as a bug, but really it's just an OOM issue caused by too many in-flight requests. IDK, I don't care enough to handle serving my website at Facebook's scale.

    • I suspect if the tables were turned and someone managed to crash FB consistently they might not take too kindly to that.

    • I wouldn't expect it to crash in any case, but I'd generally expect that even an N100 mini PC would bottleneck on the network long before you managed to saturate the CPU or RAM (maybe with 10 Gbit you could do it). The linked post indicates they're getting ~2 requests/second from bots, which might as well be zero. Even low-powered modern hardware can handle thousands to tens of thousands.

  • Yeah, this is the sort of thing that a caching and rate-limiting load balancer (e.g. nginx) could trivially mitigate. Just add a request-limit bucket keyed on the Meta User-Agent, allowing at most 1 qps or whatever (tune it to ~20% of your backend capacity), and return 429 when it's exceeded.

    Of course, Cloudflare can do all of this for you, and they have functionally unlimited capacity.
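
    As a rough illustration of that bucket in application code (the 1 qps figure and the UA substring are placeholders; in practice you'd let nginx's limit_req or Cloudflare do this):

      # Per-User-Agent token bucket that returns 429 once Meta's crawler exceeds
      # ~1 request/second. nginx's limit_req does this natively; this Flask hook
      # just shows the logic.
      import time
      from flask import Flask, request

      app = Flask(__name__)

      QPS = 1.0
      buckets = {}  # user agent -> (tokens, last timestamp)

      @app.before_request
      def throttle_meta_bot():
          ua = request.headers.get("User-Agent", "")
          if "meta-externalagent" not in ua.lower():
              return None                     # everyone else passes through
          tokens, last = buckets.get(ua, (QPS, time.monotonic()))
          now = time.monotonic()
          tokens = min(QPS, tokens + (now - last) * QPS)
          if tokens < 1.0:
              buckets[ua] = (tokens, now)
              return "rate limited", 429      # over budget
          buckets[ua] = (tokens - 1.0, now)
          return None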

    • Read the article, the bots change their User Agent to an innocuous one when they start being blocked.

      And having to use Cloudflare is just as bad for the internet as a whole as bots routinely eating up all available resources.
