
Comment by Alupis

3 years ago

Verification isn't about keeping secrets, obviously; it's about restricting the velocity of bots and their ability (intentional or not) to degrade your site's performance/availability.

There are too many bots out there that are very inconsiderate and do not limit or throttle themselves.

We have one right now that crawls every single webpage (and we have tens of thousands) every couple of days, without any throttle or limit. It's likely somebody's toy scraper, and currently it's doing no harm, but not everyone has the server resources we have.

The point is: if you are dealing with inconsiderate bots, a captcha of some type is pretty much a bulletproof way to stop them.

With that said, Cloudflare is usually smart enough to detect unusual patterns and present a challenge only to those it believes are bots or up to no good. If every person gets a challenge, then the website operator is either experiencing an active attack or has accidentally set their security configuration too high.

I do know the common narrative. FUD -> more snake oil "solutions". I myself rely on a special type of igneous rock that keeps hackers away. In reality:

1. Most sites only have this problem due to inefficient design. You are literally complaining about handling 1 request every 2 seconds! That's like a "C10μ problem."

2. How many IPs are these bots coming from? Rate limiting per source IP wouldn't be nearly as intrusive.

3. There are much less obtrusive ways of imposing resource requirements on a requester, like, say, a computational challenge.
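To make point 3 concrete, here is a minimal hashcash-style proof-of-work sketch in Python. It is a generic illustration of the idea, not any particular anti-bot product's scheme; the difficulty value and function names are made up.

```python
import hashlib
import secrets

# Hashcash-style proof of work: the server hands out a random nonce and a
# difficulty; the client must find a counter such that
# sha256("nonce:counter") starts with `difficulty` zero hex digits.
# Difficulty 4 (~65k hashes on average) is an arbitrary illustrative value.

def issue_challenge(difficulty: int = 4) -> dict:
    return {"nonce": secrets.token_hex(16), "difficulty": difficulty}

def solve_challenge(nonce: str, difficulty: int) -> int:
    counter = 0
    target = "0" * difficulty
    while True:
        digest = hashlib.sha256(f"{nonce}:{counter}".encode()).hexdigest()
        if digest.startswith(target):
            return counter
        counter += 1

def verify_solution(nonce: str, difficulty: int, counter: int) -> bool:
    digest = hashlib.sha256(f"{nonce}:{counter}".encode()).hexdigest()
    return digest.startswith("0" * difficulty)

if __name__ == "__main__":
    challenge = issue_challenge()
    answer = solve_challenge(challenge["nonce"], challenge["difficulty"])
    print("valid:", verify_solution(challenge["nonce"], challenge["difficulty"], answer))
```

The cost is negligible for one human page view but adds up for a crawler making thousands of requests, and no visual puzzle is involved.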

  • Not every website is the same, folks.

    > You are literally complaining about handling 1 request every 2 seconds

    I don't know where this came from. The inconsiderate bots tend to flood your server, likely someone doing some sort of naïve parallel crawl. Not every website has a full-stack in-house team behind it to implement custom server-side throttles and what-not either.

    However, like I mentioned already, if every single visitor is getting the challenge, then either the site is experiencing an attack right now, or the operator has the security settings set too high. Some commonly targeted websites seem to keep security settings high even when not actively experiencing an attack. To those operators, remaining online is more important than slightly annoying some small subset of visitors once.

    • > crawls every single webpage (and we have tens of thousands) every couple of days

      100,000 / (86400 * 2) = 0.58 req/sec.

      I acknowledge that those requests are likely bursty, but you were complaining as if the total amount was the problem. If the instantaneous request rate is the actual problem, you should be able to throttle on that, no? (A rough sketch of what I mean is at the end of this comment.)

      I can totally believe your site has a bunch of accidental complexity that is harder to fix than just pragmatically hassling users. But it'd be better for everyone if this were acknowledged explicitly rather than talked about as an inescapable facet of the web.
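      A rough sketch of that kind of per-IP throttle, as a token bucket in Python; the rate and burst numbers are made up, and in practice this usually lives in the reverse proxy rather than in application code:

      ```python
      import time
      from collections import defaultdict

      # Token-bucket throttle keyed by client IP: each IP can burst up to BURST
      # requests, and its bucket refills at RATE requests per second.
      RATE = 1.0    # sustained requests/second allowed per IP (illustrative)
      BURST = 10.0  # headroom for short bursts (illustrative)

      _buckets = defaultdict(lambda: {"tokens": BURST, "last": time.monotonic()})

      def allow_request(ip: str) -> bool:
          bucket = _buckets[ip]
          now = time.monotonic()
          # Refill tokens in proportion to the time since this IP's last request.
          bucket["tokens"] = min(BURST, bucket["tokens"] + (now - bucket["last"]) * RATE)
          bucket["last"] = now
          if bucket["tokens"] >= 1.0:
              bucket["tokens"] -= 1.0
              return True
          return False  # caller would respond with HTTP 429

      if __name__ == "__main__":
          # A naive parallel crawler hammering from one IP gets cut off quickly.
          allowed = sum(allow_request("203.0.113.7") for _ in range(50))
          print(f"{allowed} of 50 back-to-back requests allowed")
      ```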

      6 replies →

    • In my experience, if bots start flooding a server, it's the ISP/hosting provider that gets angry and contacts the owner first.

> The point is: if you are dealing with inconsiderate bots, a captcha of some type is pretty much a bulletproof way to stop them.

Not any more.

  • Most bots still do not handle JavaScript, even today. They want to scrape HTML and catalog prices, etc.

    At least in our experience.

    • OK, fair enough. But I'd give that about six months to a year. Because publicly available ML can now solve pretty much any CAPTCHA a human can solve, there's now an incentive to start deploying and improving the already existing JavaScript-capable bot frameworks.

      1 reply →