Comment by supriyo-biswas

1 day ago

I run a semi-popular website hosting user-generated content, although it's not a search engine; the attacks on it have surprised me, and I've eventually had to put in the same kinds of restrictions on it.

I was initially very hesitant to restrict any kind of traffic, relying on ratelimiting IPs on critical endpoints that needed low friction, and captchas on the higher friction with higher intents, such as signup and password reset pages.

Other than that, I was very liberal with most traffic, making sure that Tor was unblocked, and even ending up migrating off Cloudflare's free tier to a paid CDN due to inexplicable errors that users were facing over Tor that were ultimately related to how they blocked some specific requests over Tor with 403, even though the MVPs on their community forums would never acknowledge such a thing.

Unfortunately, given that Tor is a free rotating proxy, my website got attacked on one of these critical, compute heavy endpoints through multiple exit nodes totaling ~20,000 RPS. I've reluctantly had to block Tor, and a few other paid proxy services discovered through my own research since then.

Another time, a set of human spammers distributed all over the world started sending out a large volume of spam towards my website; with something like 1,000,000 spam messages every day (I still feel this was an attack coordinated by a "competitor" of some sort, especially given a small percentage of messages entitled "I want to get paid for posting" or along those lines).

There was no meaningful differentiator between the spammers and legitimate users, they were using real Gmail accounts to sign up, analysis of their behaviours showed they were real users as opposed to simple or even browser-based automation, and the spammers were based out of the same residential IPs as legitimate users.

I, again, had to reluctantly introduce a spam filter on some common keywords, and although some legitimate users do get trapped from time to time, this was the only way I could get a handle on that problem.

I'm appalled by some of the discussions here. Was I "enshittifying" my website out of unbridled "greed"? I don't think so. But every time I come here, I find these accusations, which makes me think that as a website with technical users, we can definitely do better.

The problem is accountability. Imagine starting a trade show business in the physical world as an example.

One day you start getting a bunch of people come in to mess with the place. You can identify them and their organization, then promptly remove them. If they continue, there are legal ramifications.

On the web, these people can be robots that look just like real people until you spend a while studying their behavior. Worse if they’re real people being paid for sabotage.

In the real world, you arrest them and find the source. Online they can remain anonymous and protected. What recourse do we have beyond splitting the web into a “verified ID” web, and a pseudonymous analog? We can’t keep treating potential computer engagement the same as human forever. As AI agents inevitably get cheaper and harder to detect, what choice will we have?

  • To be honest, I don't like initiatives towards a "verified web" either, and am very scared of the effects on anonymity that stuff like Apple's PAT, Chrome's now deprecated WEI or Cloudflare's similar efforts to that end are aimed at.

    Not to say that these would just cement the position of Google and Microsoft and block off the rest of us from building alternatives to their products.

    I feel that the current state of things are fine; I was eventually able to restrict most abuse in an acceptable way with few false positives. However, what I wished for was that more people would understand these tradeoffs instead of jumping to uncharitable interpretations not backed by real world experience as a conclusion.

> I'm appalled by some of the discussions here. Was I "enshittifying" my website out of unbridled "greed"? I don't think so. But every time I come here, I find these accusations, which makes me think that as a website with technical users, we can definitely do better.

It's if nothing else very evident most people fundamentally don't understand what an adversarial shit show running a public web service is.

There's a certain relatively tiny audience that has congregated on HN for whom hating ads is a kind of religion and google is the great satan.

Threads like this are where they come to affirm their beliefs with fellow adherents.

Comments like yours, those that imply there might be some valid reason for a move like this (even with degrees of separation) are simply heretical. I think these people cling to an internet circa 2002ish and the solution to all problems with the modern internet is to make the internet go back to 2002.

  • The problem isn’t the necessary fluff that must be added, it’s how easy it becomes to keep on adding it after the necessity subsides.

    Google was a more honorable company when the ads were on the right hand side only instead of tricking you in the main results. This is the enshitification people talk about. Decision with no reason other than pure profit at user expense. They were horrendously profitable when they made this dark pattern switch.

    Profits today can’t be distinguished accurately between users who know it’s an ad and those who were tricked into thinking it was organic.

    Not all enshitification is equal.

  • [flagged]

    • Shilling for what?

      I simply observed how these threads always go. The faithful turn up to proudly proclaim how much they don't use google, how they cripple their browsing experience so they may remain pure, to experience the internet as it was intended before the original sin (ads) corrupted it.

      2 replies →

20000 RPS is very little — a web app / database running on an ordinary desktop computer can process up to 10000 RPS on a bare-metal configuration after some basic optimization. If that is half of your total average load, a single co-located server should be enough to eat entire "attack" without flinching. If you have "competitors" and I assume, that this is some kind of commercial product (including running profitable advertising-based business), you should probably have multiple geographically distributed servers and some kind of BGP-based DDoS protection.

Regarding Tor nodes — there is nothing wrong with locking them out, especially if your website isn't geo-blocked by any governments and there are no privacy concerns related to accessing it.

If, like Google, you lock out EVERYONE, even your logged in users, whose identities and payment details you have already confirmed, then... yes you are "enshittifying" or have ulterior motives.

> they were using real Gmail accounts to sign up

Using Gmail should be a red flag on its own. Google accounts can be purchased by millions, and immediately get resold after being blocked by target website. Same for phones. Only your own accounts / captchas / site rep can be treated as basis of trust. Confirmation e-mail is a mere formality to have some way of contacting your human users. By the time Reddit was created it was already useless as security measure.

  • RPS is a bad measure. 20k RPS is a little if you're serving static files, a raspberry pi could probably do that. It's a lot if you're mutating a large database table with each request, which depending on the service, isn't unheard of.

  • This comment is so out of touch I’m almost speechless.

    > > critical, compute heavy endpoints through multiple exit nodes totaling ~20,000 RPS

    > 20000 RPS is very little

    If I had to guess you’ve never hosted non-static websites so you can’t imagine what’s a compute heavy endpoint.

    > Using Gmail should be a red flag on its own.

    Yes, ban users signing up with Gmail then.

    And this is not an isolated case, discussions on DDoS, CAPTCHAs, etc. here always have these out of touch people coming out of the woodwork. Baffling.

    • > you can’t imagine what’s a compute heavy endpoint

      Indeed, I can't. Because "compute heavy" isn't a meaningful description. Is it written in C++? Are results persisted anywhere? Is it behind a queue? What is the caching strategy?

      Given that original post mentions free Cloudflare tier, there is a good chance, that "compute" might mean something like "ordinary Python application, making several hundreds database requests". This is also a kind of high-load, but not the worst one by far.

      1 reply →