
Comment by alexey-salmin

2 years ago

I wonder how much scraping bots skew the counters. They generate a lot of traffic, and the majority of them run on "Linux Desktop", even though some modify the user agent.

Would be curious to see stats for, e.g., the subset of GPU-accelerated devices, which can be detected in JS. Not as a true bot filter of course, but as a uniform and widely available metric biased towards real users.
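Something like this minimal WebGL sketch is what I have in mind. It's purely illustrative, assuming a standard browser environment; the renderer strings filtered at the end are just a rough heuristic, since headless scrapers often report a software renderer such as SwiftShader or llvmpipe, or expose no WebGL context at all:

```typescript
// Illustrative sketch: read the GPU renderer string via WebGL as a rough
// "real hardware" signal. Not a bot filter, just a coarse metric.
function getGpuRenderer(): string | null {
  const canvas = document.createElement("canvas");
  const gl = canvas.getContext("webgl") as WebGLRenderingContext | null;
  if (!gl) return null; // no GPU-accelerated context available

  // WEBGL_debug_renderer_info exposes the unmasked renderer string in most
  // browsers; fall back to the plain RENDERER parameter otherwise.
  const ext = gl.getExtension("WEBGL_debug_renderer_info");
  return ext
    ? (gl.getParameter(ext.UNMASKED_RENDERER_WEBGL) as string)
    : (gl.getParameter(gl.RENDERER) as string);
}

// Crude heuristic: treat software renderers as "probably not a real desktop".
const renderer = getGpuRenderer();
const looksAccelerated =
  renderer !== null && !/swiftshader|llvmpipe|software/i.test(renderer);
```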

Do you have any source for that? I'd guess most of them just use a Windows user agent to avoid being flagged.

  • I don't, and that's why I say it would be interesting to see numbers that could potentially expose the bots-vs-users discrepancy.

    Without numbers an educated guess looks like this:

    1) Even if, say, 70% of bots set a Windows UA, the remaining 30% with a Linux UA will still skew the numbers noticeably, because 30% of bot traffic is much more than the "natural" Linux market share.

    2) Many bots don't modify the UA simply because they don't care: they aren't blocked often enough, at least not on the domains they scrape.

    3) Many bots don't modify the UA because they care a lot and follow the strategy of emulating a real Chrome desktop user with high fidelity. In that case it's better to keep the real Linux Chrome UA than to risk being detected by discrepancies between the UA and the browser capabilities detected by JS (see the sketch after this list).
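
    A rough sketch of the kind of UA-vs-capabilities cross-check I mean; the specific signals and regexes here are assumptions for illustration, not what any real detector necessarily uses:

    ```typescript
    // Illustrative consistency check: does the OS claimed in the User-Agent
    // string match what JS-visible APIs report?
    function uaLooksConsistent(): boolean {
      const ua = navigator.userAgent;
      const claimsWindows = /Windows NT/.test(ua);
      const claimsLinux = /Linux/.test(ua) && !/Android/.test(ua);

      // navigator.platform is legacy but still widely populated; Chromium
      // additionally exposes navigator.userAgentData.platform.
      const platform =
        (navigator as any).userAgentData?.platform ?? navigator.platform ?? "";

      if (claimsWindows && !/win/i.test(platform)) return false;
      if (claimsLinux && !/linux/i.test(platform)) return false;
      return true;
    }
    ```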

I suppose that would imply either that scrapers are scraping more or that there are more of them.

  • I think they meant that the bots would account for a significant portion of the total measured market share, not necessarily an increasing portion.

  • Both of which seem to be happening due to the current LLM hype. Everyone is locking down their data from scrapers while complaining about increased bot traffic.