Comment by marginalia_nu

20 hours ago

To be fair, if my search engine is anything to go on, about 0.5-1% of the requests I get are from human sources. The rest are from bots, and not people who simply haven't found my API, but bots that are attempting to poison Google or Bing's query suggestions (even though I'm not backed by either). From what I've heard from other people running search engines, it looks the same everywhere.

I don't know what Google's ratio of human to botspam is, but given how much of a payday it would be if anyone were to succeed, I can imagine they're serving their fair share of automated requests.

Requiring a headless browser to automate the traffic makes the abuse significantly more expensive.

If it's such a common issue, I would've thought Google already ignored searches from clients that do not enable JavaScript when computing results?

Besides, you already got auto-blocked when using it in a slightly unusual way. Google hasn't worked over Tor in forever, and recently I also got blocked a few times just for using it through my text browser, which uses libcurl for its network stack. So I imagine a botnet using curl wouldn't last very long either.

My guess is it had more to do with squeezing out more profit from that supposed 0.1% of users.

  • I have been web searching using Google from the command line, with no Javascript, for decades. Until last week I never sent a User-Agent HTTP header either. After this change I'm still searching from the command line, with no Javascript. Thus "requiring Javascript" is not a correct phrase to describe this change. The requirement is a User-Agent HTTP header with an approved value. The only difference in searching for me as of the last few days is that I now send a User-Agent HTTP header with an "approved" string (a sketch of such a request follows this comment).

    Javascript does not stop bots. At least it does not stop Googlebot.

    IMO, the change is to funnel more people (cf. bots) into seeing AI-generated search results. The "AI" garbage requires Javascript. That is why the spokesperson suggests "degraded" search results for people who are not using Javascript. For me, the results are improved by avoiding the AI garbage, not degraded.
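
    For concreteness, here is a minimal sketch of the kind of no-JavaScript, command-line search described above, using only Python's standard library. The User-Agent string, the query, and the assumption that Google returns a plain HTML results page to it are all illustrative; in practice you may get consent redirects, CAPTCHAs, or blocks instead.

    ```python
    # Sketch: a command-line style Google search with no JavaScript, sending
    # only a mainstream-looking User-Agent header. Whether Google serves a
    # plain HTML results page (rather than a consent page, CAPTCHA, or block)
    # is an assumption here.
    import urllib.parse
    import urllib.request

    query = "example search"
    url = "https://www.google.com/search?" + urllib.parse.urlencode({"q": query})

    request = urllib.request.Request(url, headers={
        # Illustrative "approved" desktop browser string; not an official value.
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:115.0) Gecko/20100101 Firefox/115.0",
    })
    with urllib.request.urlopen(request, timeout=10) as response:
        html = response.read().decode("utf-8", errors="replace")

    print(html[:500])  # raw HTML results, no JavaScript executed
    ```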

  • "Why didn't they do it earlier?" is a fallacious argument.

    If we accepted it, there would basically only be a single point in time where a change like this could be legitimately made. If the change is made before there is a large enough problem, you'll argue the change was unnecessary. If it's made after, you'll argue the change should have been made sooner.

    "They've already done something else" isn't quite as logically fallacious, but shows that you don't experience dealing with adversarial application domains.

    Adversarial problems, which scraping is, are dynamic and iterative games. The attacker and defender are stuck in an endless loop of move and counter-move, unless one side gives up. There's no point in defending against attacks that aren't happening -- it's not just useless, but probably harmful, because every defense has some cost in friction for legitimate users.

    > My guess is it had more to do with squeezing out more profit from that supposed 0.1% of users.

    Yes, that kind of thing is very easy to just assert. But just think about it for like two seconds. How much more revenue are you going to make per user? None. Users without JS are still shown ads. JS is not necessary for ad targeting either.

    It seems just as plausible that this is losing them some revenue, because some proportion of the people using the site without JS will stop using it rather than enable JS.

    • > "Why didn't they do it earlier?" is a fallacious argument.

      I never said that, but admittedly I could have worded my argument better: "In my opinion, shadow banning non-JS clients from result computation would be similarly (if not more) effective at preventing SEO bots from poisoning results, and I would be surprised if they hadn't already done that."

      Naturally, this doesn't fix the problem of having to spend resources on serving unsuccessful SEO bots that the existing blocking mechanisms (which I think are based on IP-address rate limiting and the client's TLS fingerprint) failed to filter out.

      > Yes, that kind of thing is very easy to just assert. But just think about it for like two seconds. How much more revenue are you going to make per user? None. Users without JS are still shown ads. JS is not necessary for ad targeting either.

      Is JS necessary for ads? No. Does JS make it easier to control what the user is seeing? Sure it does.

      If you've been following the developments on YouTube concerning ad-blockers, you should understand my suspicion that Search is going in a similar direction. Of course, it's all speculation; maybe they really just want to make sure we all get to experience the JS-based enhancements they have been working on :)

I run a semi-popular website hosting user-generated content. It's not a search engine, but the attacks on it have surprised me, and I've eventually had to put in the same kinds of restrictions.

I was initially very hesitant to restrict any kind of traffic, relying on rate limiting IPs on critical endpoints that needed low friction, and captchas on the higher-friction, higher-intent ones, such as the signup and password reset pages.
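
As an illustration of the low-friction side of that setup, here is a minimal sketch of per-IP token-bucket rate limiting. It is not the author's actual implementation; the rate, burst size, and in-memory storage are placeholder choices (a real deployment would typically keep this state in something shared like Redis).

```python
# Sketch of per-IP token-bucket rate limiting for a low-friction endpoint.
# Purely in-memory and illustrative; a real service would typically keep
# this state in something shared like Redis.
import time
from collections import defaultdict

RATE = 5    # tokens refilled per second per IP (illustrative)
BURST = 20  # maximum bucket size (illustrative)

_buckets = defaultdict(lambda: {"tokens": BURST, "ts": time.monotonic()})

def allow(ip: str) -> bool:
    bucket = _buckets[ip]
    now = time.monotonic()
    # Refill tokens based on elapsed time, capped at the burst size.
    bucket["tokens"] = min(BURST, bucket["tokens"] + (now - bucket["ts"]) * RATE)
    bucket["ts"] = now
    if bucket["tokens"] >= 1:
        bucket["tokens"] -= 1
        return True
    return False  # caller would respond with HTTP 429 here
```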

Other than that, I was very liberal with most traffic. I made sure that Tor was unblocked, and even ended up migrating off Cloudflare's free tier to a paid CDN because of inexplicable errors users were facing over Tor, which turned out to be caused by Cloudflare blocking some specific requests over Tor with a 403, even though the MVPs on their community forums would never acknowledge such a thing.

Unfortunately, given that Tor is effectively a free rotating proxy, my website got attacked on one of these critical, compute heavy endpoints through multiple exit nodes totaling ~20,000 RPS. I reluctantly had to block Tor, and have since blocked a few other paid proxy services discovered through my own research.
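
For anyone curious what blocking Tor looks like in practice, here is a rough sketch that checks client IPs against the exit-node list the Tor Project publishes. The URL, the one-IP-per-line format, and the simple set lookup are assumptions for illustration, not the author's actual setup.

```python
# Sketch: reject requests from Tor exit nodes using the exit list the
# Tor Project publishes. URL and one-IP-per-line format reflect the list
# as published at the time of writing; refresh it periodically rather
# than fetching it once at startup.
import urllib.request

EXIT_LIST_URL = "https://check.torproject.org/torbulkexitlist"

def load_exit_nodes() -> set:
    with urllib.request.urlopen(EXIT_LIST_URL, timeout=10) as resp:
        lines = resp.read().decode("utf-8", errors="replace").splitlines()
    return {line.strip() for line in lines if line.strip() and not line.startswith("#")}

TOR_EXITS = load_exit_nodes()

def is_tor_exit(ip: str) -> bool:
    return ip in TOR_EXITS

# usage: if is_tor_exit(request_ip): respond with HTTP 403
```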

Another time, a set of human spammers distributed all over the world started sending a large volume of spam to my website, something like 1,000,000 spam messages every day. (I still feel this was an attack coordinated by a "competitor" of some sort, especially given that a small percentage of messages were titled "I want to get paid for posting" or something along those lines.)

There was no meaningful differentiator between the spammers and legitimate users: they were using real Gmail accounts to sign up, analysis of their behaviours showed they were real people rather than simple or even browser-based automation, and the spammers were based out of the same residential IPs as legitimate users.

Again, I reluctantly had to introduce a spam filter on some common keywords, and although some legitimate users do get caught by it from time to time, this was the only way I could get a handle on the problem.

I'm appalled by some of the discussions here. Was I "enshittifying" my website out of unbridled "greed"? I don't think so. But every time I come here, I find these accusations, which makes me think that as a website with technical users, we can definitely do better.

  • The problem is accountability. Imagine starting a trade show business in the physical world as an example.

    One day you start getting a bunch of people coming in to mess with the place. You can identify them and their organization, then promptly remove them. If they continue, there are legal ramifications.

    On the web, these people can be robots that look just like real people until you spend a while studying their behavior. Worse if they’re real people being paid for sabotage.

    In the real world, you arrest them and find the source. Online, they can remain anonymous and protected. What recourse do we have beyond splitting the web into a "verified ID" web and a pseudonymous analog? We can't keep treating potential computer engagement the same as human engagement forever. As AI agents inevitably get cheaper and harder to detect, what choice will we have?

    • To be honest, I don't like initiatives towards a "verified web" either, and I'm very scared of the effects on anonymity of things like Apple's PAT, Chrome's now-deprecated WEI, or Cloudflare's similar efforts in that direction.

      Not to mention that these would just cement the position of Google and Microsoft and block the rest of us from building alternatives to their products.

      I feel that the current state of things is fine; I was eventually able to restrict most abuse in an acceptable way, with few false positives. However, I wish more people would understand these tradeoffs instead of jumping to uncharitable conclusions not backed by real-world experience.

  • > I'm appalled by some of the discussions here. Was I "enshittifying" my website out of unbridled "greed"? I don't think so. But every time I come here, I find these accusations, which makes me think that as a website with technical users, we can definitely do better.

    If nothing else, it's very evident that most people fundamentally don't understand what an adversarial shit show running a public web service is.

  • There's a certain relatively tiny audience that has congregated on HN for whom hating ads is a kind of religion and google is the great satan.

    Threads like this are where they come to affirm their beliefs with fellow adherents.

    Comments like yours, those that imply there might be some valid reason for a move like this (even with degrees of separation), are simply heretical. I think these people cling to the internet of circa 2002, and believe the solution to all the problems of the modern internet is to make it go back to 2002.

    • The problem isn’t the necessary fluff that must be added; it’s how easy it becomes to keep adding it after the necessity subsides.

      Google was a more honorable company when the ads were only on the right-hand side, instead of mixed into the main results to trick you. This is the enshittification people talk about: decisions with no reason other than pure profit at user expense. They were already horrendously profitable when they made this dark-pattern switch.

      Today’s profits can’t accurately be split between users who know it’s an ad and those who were tricked into thinking it was organic.

      Not all enshittification is equal.

  • 20000 RPS is very little — a web app / database running on an ordinary desktop computer can process up to 10000 RPS on a bare-metal configuration after some basic optimization. If that is half of your total average load, a single co-located server should be enough to eat the entire "attack" without flinching. If you have "competitors", and I assume this is some kind of commercial product (including running a profitable advertising-based business), you should probably have multiple geographically distributed servers and some kind of BGP-based DDoS protection.

    Regarding Tor nodes — there is nothing wrong with locking them out, especially if your website isn't geo-blocked by any governments and there are no privacy concerns related to accessing it.

    If, like Google, you lock out EVERYONE, even your logged in users, whose identities and payment details you have already confirmed, then... yes you are "enshittifying" or have ulterior motives.

    > they were using real Gmail accounts to sign up

    Using Gmail should be a red flag on its own. Google accounts can be purchased by the million, and immediately get resold after being blocked by the target website. Same for phones. Only your own accounts / captchas / site reputation can be treated as a basis of trust. A confirmation e-mail is a mere formality to have some way of contacting your human users; by the time Reddit was created, it was already useless as a security measure.

    • RPS is a bad measure on its own. 20k RPS is not much if you're serving static files; a Raspberry Pi could probably do that. It's a lot if you're mutating a large database table with each request, which, depending on the service, isn't unheard of.

    • This comment is so out of touch I’m almost speechless.

      > > critical, compute heavy endpoints through multiple exit nodes totaling ~20,000 RPS

      > 20000 RPS is very little

      If I had to guess, you’ve never hosted non-static websites, so you can’t imagine what a compute-heavy endpoint is.

      > Using Gmail should be a red flag on its own.

      Yes, ban users signing up with Gmail then.

      And this is not an isolated case; discussions on DDoS, CAPTCHAs, etc. here always have these out-of-touch people coming out of the woodwork. Baffling.

My impression is that it's less effort for them to go directly to headless browsers. There are several foot guns in using a raw HTML parsing lib and dispatching HTTP requests yourself. People don't care about resource usage, spammers even less, and many of them lack the skills.

  • Most black hat spammers use botnets, especially against bigger targets which have enough traffic to build statistics to fingerprint clients and map out bad ASNs and so on, and most botnets are low powered. You're not running chrome on a smart fridge or an enterprise router.

    • True, but the bad actor's code doesn't typically run directly on the infected device. Typically the infected router or camera is just acting as a proxy.

    • Chrome is probably the worst browser possible to run for these things, so it's not the basis for comparison.

      There are many smaller browsers that run JavaScript and work on low-powered devices as well.

      If you started from WebKit and stripped out the rendering parts, keeping just enough to execute JavaScript and process the DOM, the RAM usage would be significantly lower.

  • A major player in this space is apparently looking for people experienced in scraping without using browser automation. My guess is that not running a browser results in using far fewer resources, thus reducing their costs heavily.

    Running a headless browser also means that any differences between the headless environment and a "headed" one can be detected, as can any of your JavaScript executing within the page, which makes it significantly more difficult to scale your operation.

      My experience is that headless browsers use about 100x more RAM, at least 10x more bandwidth and 10x more processing power, and page loads take about 10x as long to finish (vs curl). These numbers may even be a bit low; there are instances where you need to add another zero to one or more of them.

      There's also considerably more jank with headless browsers, since you typically want to re-use instances to avoid incurring the cost of spawning a new browser for each retrieval.

    • The change rate for Chromium is also so high that it's hard to spot the addition of code targeting whatever you are doing on the client side.

  • So much more expensive and slow vs just scraping the HTML. It is not hard to scrape raw HTML if the target is well-defined (like Google).
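
    As a rough illustration of how little machinery raw-HTML scraping needs compared to a headless browser, here is a sketch using only the Python standard library; the target URL and the "collect every link" logic are placeholders, not a recipe for any particular site.

    ```python
    # Sketch: scrape a well-defined HTML page with one HTTP request and the
    # standard-library parser -- no headless browser, no JavaScript engine.
    # The target URL and the link-extraction logic are placeholders.
    import urllib.request
    from html.parser import HTMLParser

    class LinkCollector(HTMLParser):
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                href = dict(attrs).get("href")
                if href:
                    self.links.append(href)

    request = urllib.request.Request("https://example.com/",
                                     headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(request, timeout=10) as response:
        page = response.read().decode("utf-8", errors="replace")

    collector = LinkCollector()
    collector.feed(page)
    print(collector.links)
    ```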

I run a not-very-popular site -- at least 50% of the traffic is bots. I can only imagine how bad it would be if the site was a forum or search engine.

Maybe you could require hashcash, so that people who wanted to do automated searches could do it at an expense comparable to the expense of a human doing a search manually. Or a cryptocurrency micropayment, though tooling around that is currently poor.
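
A minimal sketch of the hashcash idea, assuming a SHA-256 puzzle where the client must find a nonce whose digest has a chosen number of leading zero bits; the difficulty, the challenge format, and how the challenge is bound to a session are all illustrative choices.

```python
# Sketch of a hashcash-style proof of work for a search endpoint.
# Server issues a random challenge; client finds a nonce whose hash has
# DIFFICULTY_BITS leading zero bits; server verifies with a single hash.
import hashlib
import itertools
import os

DIFFICULTY_BITS = 20  # roughly a million hash attempts on average; tune to taste

def new_challenge() -> str:
    # Server side: issue a fresh random challenge per search request/session.
    return os.urandom(16).hex()

def _leading_zero_bits(digest: bytes) -> int:
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
        else:
            bits += 8 - byte.bit_length()
            break
    return bits

def solve(challenge: str) -> int:
    # Client side: expensive search for a nonce that meets the difficulty target.
    for nonce in itertools.count():
        digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
        if _leading_zero_bits(digest) >= DIFFICULTY_BITS:
            return nonce

def verify(challenge: str, nonce: int) -> bool:
    # Server side: one hash to check the submitted nonce.
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return _leading_zero_bits(digest) >= DIFFICULTY_BITS

# usage sketch:
# challenge = new_challenge()        # sent to the client with the search form
# nonce = solve(challenge)           # computed by the client before searching
# assert verify(challenge, nonce)    # checked by the server before serving results
```

Verification stays a single hash for the server while each query costs the client a tunable amount of compute; as the reply below notes, this doesn't by itself prevent the work from being farmed out elsewhere, though it still attaches a rental cost to doing so at scale.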

  • The only issue with hashcash is that there’s no way to know whether the user’s browser is the one that computed said proof of work, or whether it has delegated the work to a different system and is simply relaying its results. At scale, you’d end up with a large botnet that receives proof-of-work tokens to solve for the scraping network to use.

    • I don't see that as a problem; it still imposes the botnet rental cost on the spammer or whoever.

> bots that are attempting to poison Google or Bing's query suggestions

This seems like yet another example of Google and friends inviting the problem they're objecting to.