Comment by markerz

19 days ago

One of my websites was absolutely destroyed by Meta's AI bot: Meta-ExternalAgent https://developers.facebook.com/docs/sharing/webmasters/web-...

It seems a bit naive and doesn't do the kind of performance back-off I'd expect from Googlebot. It just kept requesting more and more until my server crashed, then it would back off for a minute and then start requesting again.

My solution was to add a Cloudflare rule to block requests from their User-Agent. I also added more nofollow attributes to links, plus a robots.txt, but those are just suggestions and some bots seem to ignore them.

Cloudflare also has a feature to block known AI bots and even suspected AI bots: https://blog.cloudflare.com/declaring-your-aindependence-blo... As much as I dislike Cloudflare centralization, this was a super convenient feature.

> Cloudflare also has a feature to block known AI bots and even suspected AI bots

In addition to the other crushing internet risks, add being wrongly blacklisted as a bot to the list.

  • This is already a thing for basically all of the second[0] and third worlds. A non-trivial amount of Cloudflare's security value is plausibly algorithmic discrimination and collective punishment as a service.

    [0] Previously Soviet-aligned countries, i.e. Russia and Eastern Europe.

  • What do you mean, crushing risk? Just solve these 12 puzzles by moving tiny icons on a tiny canvas while on your phone and you're in the clear for a couple more hours!

    • If you live in a region whose existence it is economically acceptable to ignore (I do), you sometimes get blocked by website r̶a̶c̶k̶e̶t̶ protection for no reason at all, simply because some "AI" model saw a request coming from an unusual place.

    • Sometimes it doesn’t even give you a Captcha.

      I have come across some websites that block me using Cloudflare with no way of solving it. I’m not sure why, I’m in a large first-world country, I tried a stock iPhone and a stock Windows PC, no VPN or anything.

      There's just no way to know.

    • If it clears you at all. I accidentally left a user-agent switcher turned on for every site instead of just the one I needed it for, and Cloudflare gave me an infinite loop of challenges. At least turning it off let me use the internet again.

  • These features are opt-in, and often paid. I struggle to see how this is a "crushing risk," although I don't doubt that sufficiently unskilled shops would be completely crushed by an IP/User-Agent block. Since Cloudflare has a much more informed and broader view of internet traffic than maybe any other company in the world, I'll probably use that feature without any qualms at some point in the future. Right now their normal WAF rules do a pretty good job of not blocking legitimate traffic, at least on enterprise.

    • The risk is not to the company using Cloudflare; the risk is to any legitimate individual who Cloudflare decides is a bot. Hopefully their detection is accurate because a false positive would cause great difficulties for the individual.

  • We're rapidly approaching a login-only internet. If you're not logged in with Google on Chrome, then no website for you!

    Attestation/WEI (Web Environment Integrity) enables this.

    • And not just a login, but soon probably also the real, verified identity tied to it. The internet is becoming a worse place than the real world.

I see a lot of traffic that I can tell is from bots based on the URL patterns they access. They don't include "bot" in their user agent, and often use residential IP pools. I haven't found an easy way to block them. They nearly took out my site a few days ago too.

  • You could run all of your content through an LLM to create a twisted and purposely factually incorrect rendition of your data. Forward all AI bots to the junk copy.

    Everyone should start doing this. Once the AI companies engorge themselves on enough garbage and start to see a negative impact to their own products, they'll stop running up your traffic bills.

    Maybe you don't even need a full LLM. Just a simple transformer that inverts negative and positive statements, changes nouns such as locations, and subtly nudges the content into an erroneous state.
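
    A rough sketch of the routing half of this idea, assuming a static site rendered ahead of time into two directories, one real and one "twisted" (the directory names, bot list, and Flask setup are all illustrative, and bots that spoof their User-Agent will slip through):

      # Hypothetical sketch: serve a pre-poisoned mirror of the site to requests
      # whose User-Agent looks like a known AI crawler. Directory layout and the
      # marker list are assumptions, not anyone's production setup.
      from pathlib import Path
      from flask import Flask, request, send_from_directory

      app = Flask(__name__)

      AI_BOT_MARKERS = ("gptbot", "ccbot", "claudebot", "amazonbot", "meta-externalagent")
      REAL_ROOT = Path("site").resolve()      # the real static pages
      POISON_ROOT = Path("poison").resolve()  # same pages, run through the "twisting" step offline

      @app.route("/", defaults={"page": "index.html"})
      @app.route("/<path:page>")
      def serve(page):
          ua = request.headers.get("User-Agent", "").lower()
          root = POISON_ROOT if any(m in ua for m in AI_BOT_MARKERS) else REAL_ROOT
          return send_from_directory(root, page)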

    • > You could run all of your content through an LLM to create a twisted and purposely factually incorrect rendition of your data. Forward all AI bots to the junk copy.

      > Everyone should start doing this. Once the AI companies engorge themselves on enough garbage and start to see a negative impact to their own products, they'll stop running up your traffic bills.

      I agree, and not just to discourage them from running up traffic bills. The end-state of what they hope to build is very likely to be extremely bad for most regular people [1], so we shouldn't cooperate in building it.

      [1] And I mean end state. I don't care how much value you say you get from some AI coding assistant today; the end state is that your employer happily gets to fire you and replace you with an evolved version of the assistant at a fraction of your salary. The goal is to eliminate the cost that is our livelihoods. And if we're lucky, in exchange we'll get a much-reduced basic income, sufficient to count out the rest of our days in a dense housing project filled with cheap, minimum-quality goods and a machine to talk to if we're sad.

    • > Everyone should start doing this. Once the AI companies engorge themselves on enough garbage and start to see a negative impact to their own products, they'll stop running up your traffic bills

      Or just wait until the AI flood has peaked and most easily scrapable content has been AI-generated (or at least AI-modified).

      We should seriously start discussing the future of the public web and how not to leave it to big tech, before it's too late. It's a small part of something I'm working on, but not central, so I haven't spent enough time to have great answers. If anyone reading this seriously cares, I'm waiting desperately to exchange thoughts and approaches on this.

  • My cheap and dirty way of dealing with bots like that is to block any IP address that accesses any of the URLs disallowed in robots.txt. It's not a perfect strategy, but it gives me pretty good results given how simple it is to implement.
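
    For what it's worth, a minimal Python sketch of that trick, assuming an nginx-style access log and an ipset named "botban" that an iptables rule already drops (the paths, filenames, and set name are all placeholders):

      # Ban any IP that requests one of the "trap" paths that robots.txt disallows.
      # A real setup would tail the log continuously; this one-pass version just
      # shows the matching and banning logic.
      import re
      import subprocess

      ACCESS_LOG = "/var/log/nginx/access.log"
      TRAP_PREFIXES = ("/secret-admin/", "/do-not-crawl/")   # listed as Disallow: in robots.txt
      LINE_RE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "(?:GET|POST|HEAD) (\S+)')

      banned = set()

      def ban(ip):
          if ip not in banned:
              banned.add(ip)
              subprocess.run(["ipset", "add", "botban", ip], check=False)

      with open(ACCESS_LOG) as log:
          for line in log:
              match = LINE_RE.match(line)
              if match and any(match.group(2).startswith(p) for p in TRAP_PREFIXES):
                  ban(match.group(1))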

  • TLS fingerprinting still beats most of them. For really high-compute endpoints I suppose some sort of JavaScript challenge would be necessary, which is quite annoying to set up yourself. I hate Cloudflare as a visitor, but they do make life so much easier for administrators.

  • You rate-limit them and then block the abusers. Nginx supports rate limiting; you can then use fail2ban to block an IP for an hour if it gets rate-limited 3 times, and if it gets banned 5 times, block it forever using the recidive jail.

    I've had massive AI bot traffic from M$ and blocked several IPs by adding manual entries to the recidive jail. If they come back and disregard robots.txt with disallow *, I'll run 'em through fail2ban.
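
    For anyone who'd rather see the logic than the config, here's a toy Python version of the same escalation (nginx's limit_req and fail2ban's recidive jail do this properly; the numbers below are arbitrary):

      # Token-bucket rate limit per IP, a 1-hour ban for offenders, and a
      # permanent ban after repeated short bans (roughly what the recidive
      # jail does). Purely illustrative, not anyone's actual setup.
      import time
      from collections import defaultdict

      RATE = 10.0           # allowed requests per second per IP
      SHORT_BAN = 3600      # seconds
      STRIKES_TO_PERMA = 5

      tokens = defaultdict(lambda: RATE)
      last_seen = defaultdict(time.monotonic)
      strikes = defaultdict(int)
      banned_until = {}

      def allow(ip):
          now = time.monotonic()
          if banned_until.get(ip, 0) > now:
              return False
          # refill the bucket based on elapsed time, then try to spend one token
          tokens[ip] = min(RATE, tokens[ip] + (now - last_seen[ip]) * RATE)
          last_seen[ip] = now
          if tokens[ip] >= 1:
              tokens[ip] -= 1
              return True
          strikes[ip] += 1
          banned_until[ip] = float("inf") if strikes[ip] >= STRIKES_TO_PERMA else now + SHORT_BAN
          return False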

    • Whatever M$ was doing still baffles me. I still have several Azure ranges in my blocklist because whatever this was appeared to change strategy once I implemented a ban method.

  • The amateurs at home are going to give the big companies what they want: an excuse for government regulation.

I wonder if it would work to send Meta's legal department a notice that they are not permitted to access your website.

Would that make subsequent accesses be violations of the U.S.'s Computer Fraud and Abuse Act?

  • Crashing wasn't the intent. And scraping is legal, as I remember from the LinkedIn case.

    • There's a fine line between scraping and DDoSing, I'm sure.

      Just because you manufacture chemicals doesn't mean you can legally dump your toxic waste anywhere you want (well, you shouldn't be allowed to, at least).

      You also shouldn't be able to set your crawlers loose in a way that causes sites to fail.

    • If I make a physical robot and it runs someone over, I'm still liable, even though it was a delivery robot, not a running over people robot.

      If a bot sends so many requests that a site completely collapses, the owner is liable, even though it was a scraping bot and not a denial of service bot.

    • Then you can feed them deliberately poisoned data.

      Send all of your pages through an adversarial LLM to pollute and twist the meaning of the underlying data.

  • > I wonder if it would work to send Meta's legal department a notice that they are not permitted to access your website.

    Depends how much money you are prepared to spend.

You can also block by IP. Facebook traffic comes from a single ASN, so you can kill it all in one go, even before the User-Agent is known (rough sketch below). The only thing I know of that this potentially affects is getting the social card for your site.

If a bot ignores robots.txt that's a paddlin'. Right to the blacklist.
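
A bare-bones sketch of the ASN approach, assuming you have pulled the currently announced prefixes for Meta's AS32934 from whois/BGP data (the two prefixes below are examples only and will go stale):

  # Check a client IP against prefixes announced by AS32934 (Meta/Facebook).
  # The prefix list is illustrative; fetch the live announcements yourself.
  from ipaddress import ip_address, ip_network

  META_PREFIXES = [ip_network(p) for p in ("31.13.64.0/18", "157.240.0.0/16")]

  def is_meta(ip):
      addr = ip_address(ip)
      return any(addr in net for net in META_PREFIXES)

  # e.g. in a request handler:
  # if is_meta(request.remote_addr):
  #     return "", 403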

  • The linked article explains what happens when you block their IP.

    • For reference:

      > If you try to rate-limit them, they’ll just switch to other IPs all the time. If you try to block them by User Agent string, they’ll just switch to a non-bot UA string (no, really).

      It's really absurd that they seem to think this is acceptable.

Silly question, but did you try to email Meta? There's an address at the bottom of that page to contact with concerns.

> webmasters@meta.com

I'm not naive enough to think something would definitely come of it, but it could just be a misconfiguration.

>> One of my websites was absolutely destroyed by Meta's AI bot: Meta-ExternalAgent https://developers.facebook.com/docs/sharing/webmasters/web-...

Are they not respecting robots.txt?

  • Quoting the top-level link to geraspora.de:

    > Oh, and of course, they don’t just crawl a page once and then move on. Oh, no, they come back every 6 hours because lol why not. They also don’t give a single flying fuck about robots.txt, because why should they. And the best thing of all: they crawl the stupidest pages possible. Recently, both ChatGPT and Amazon were - at the same time - crawling the entire edit history of the wiki.

> My solution was to add a Cloudflare rule to block requests from their User-Agent.

Surely if you can block their specific User-Agent, you could also redirect their User-Agent to goatse or something. Give 'em what they deserve.

Can't you just mess with them? Like, accept the connection but send back rubbish data at 1 bps?
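
Something like that is easy enough to hack together. Here's a toy tarpit sketch in Python that accepts the connection and drips a fake response at roughly one byte per second; note that it also ties up a thread and a socket on your side per victim, so connection limits still matter:

  # Toy tarpit: answer with valid-looking headers, then trickle rubbish forever.
  # Point only known-bad clients at this port; it is deliberately slow.
  import socket
  import threading
  import time

  def drip(conn):
      try:
          conn.sendall(b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n")
          while True:
              conn.sendall(b"A")   # rubbish, one byte at a time
              time.sleep(1)        # ~1 byte per second
      except OSError:
          pass
      finally:
          conn.close()

  server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
  server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
  server.bind(("0.0.0.0", 8081))
  server.listen()
  while True:
      conn, _ = server.accept()
      threading.Thread(target=drip, args=(conn,), daemon=True).start()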

Most administrators have no idea how, or no desire, to configure Cloudflare correctly, so they just slap it on the whole site by default and block all legitimate access to e.g. RSS feeds.

Imagine being one of the monsters who works at Facebook and thinking you're not one of the evil ones.

  • Well, Facebook actually releases their models instead of seeking rent off them, so I’m sort of inclined to say Facebook is one of the less evil ones.

    • > releases their models

      Some of them, and initially only by accident. And without the ingredients to create your own.

      Meta is trying to kill OpenAI and any new FAANG contenders. They'll commoditize their complement until the earth is thoroughly salted, and emerge as one of the leading players in the space due to their data, talent, and platform incumbency.

      They're one of the distribution networks for AI, so they're going to win even by just treading water.

      I'm glad Meta is releasing models, but don't ascribe their position as one entirely motivated by good will. They want to win.

[flagged]

  • That's right, getting DDoSed is a skill issue. Just have infinite capacity.

    • A DDoS is different from crashing.

      And I doubt Facebook implemented something that actually saturates the network; usually a scraper sets a limit on concurrent connections and often also a delay between requests (e.g. max 10 concurrent, 100 ms delay).

      Chances are the website operator is running a web server with terrible RAM efficiency that runs out of memory and crashes after 10 concurrent requests, or that saturates the CPU on simple requests, or something like that.

  • Can't every web server crash from being overloaded? There's an upper limit to the performance of everything. My website is a hobby and runs on a $4/mo budget VPS.

    Perhaps I'm saying "crash" and you're interpreting that as a bug, but really it's just an OOM issue caused by too many in-flight requests. IDK, I don't care enough to handle serving my website at Facebook's scale.

    • I suspect if the tables were turned and someone managed to crash FB consistently they might not take too kindly to that.

    • I wouldn't expect it to crash in any case, but I'd generally expect that even an N100 mini PC would bottleneck on the network long before you managed to saturate the CPU or RAM (maybe with 10 Gbit you could do it). The linked post indicates they're getting ~2 requests/second from bots, which might as well be zero. Even low-powered modern hardware can handle thousands to tens of thousands.

  • Yeah, this is the sort of thing that a caching and rate-limiting load balancer (e.g. nginx) could trivially mitigate. Just add a request-limit bucket keyed on the Meta User-Agent, allowing at most 1 qps or whatever (tune it to ~20% of your backend capacity), and return 429 when it's exceeded.

    Of course, Cloudflare can do all of this for you, and they have functionally unlimited capacity.
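
    As a rough illustration of that bucket in application code (the 1 qps figure and the UA substring are placeholders; in practice you'd let nginx's limit_req or Cloudflare do this):

      # Per-User-Agent token bucket that returns 429 once Meta's crawler exceeds
      # ~1 request/second. nginx's limit_req does this natively; this Flask hook
      # just shows the logic.
      import time
      from flask import Flask, request

      app = Flask(__name__)

      QPS = 1.0
      buckets = {}  # user agent -> (tokens, last timestamp)

      @app.before_request
      def throttle_meta_bot():
          ua = request.headers.get("User-Agent", "")
          if "meta-externalagent" not in ua.lower():
              return None                     # everyone else passes through
          tokens, last = buckets.get(ua, (QPS, time.monotonic()))
          now = time.monotonic()
          tokens = min(QPS, tokens + (now - last) * QPS)
          if tokens < 1.0:
              buckets[ua] = (tokens, now)
              return "rate limited", 429      # over budget
          buckets[ua] = (tokens - 1.0, now)
          return None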

    • Read the article, the bots change their User Agent to an innocuous one when they start being blocked.

      And having to use Cloudflare is just as bad for the internet as a whole as bots routinely eating up all available resources.
