Comment by palmfacehn
19 hours ago
My impression is that it's less effort for them to go straight to headless browsers. There are several footguns in using a raw HTML-parsing library and dispatching HTTP requests yourself. People don't care about resource usage, spammers even less, and many of them lack the skills.
Most black-hat spammers use botnets, especially against bigger targets, which have enough traffic to build statistics to fingerprint clients, map out bad ASNs, and so on, and most botnets are low-powered. You're not running Chrome on a smart fridge or an enterprise router.
True, but the bad actor's code doesn't typically run directly on the infected device; the infected router or camera usually just acts as a proxy.
There are ways to detect that, and it will still require a lot of CPU and RAM behind the proxies.
Chrome is probably the worst possible browser to run for these things, so it's not the right basis for comparison.
We have many smaller browsers that run JavaScript and work on low-powered devices as well.
Starting from WebKit and stripping out the rendering parts, keeping just enough to execute JavaScript and process the DOM, the RAM usage would be significantly lower.
A major player in this space is apparently looking for people experienced in scraping without browser automation. My guess is that not running a browser uses far fewer resources, which cuts their costs heavily.
Running a headless browser also means that any differences between the headless environment and a "headed" one can be discovered, as can any of your JavaScript executing within the page, which makes it significantly more difficult to scale your operation.
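For illustration, a minimal sketch (Python with Playwright; the URL is a placeholder) of the kind of in-page probe a target site can run, and what stock headless Chromium reports for it:

    # Assumes: pip install playwright && playwright install chromium
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto("https://example.com")  # placeholder target
        # The same checks a defender's in-page script would run:
        probe = page.evaluate("""() => ({
            webdriver: navigator.webdriver,     // true under automation
            plugins: navigator.plugins.length,  // historically 0 when headless
            languages: navigator.languages,
            outerHeight: window.outerHeight,    // 0 in some headless setups
        })""")
        print(probe)
        browser.close()

Any mismatch between that profile and what a real user's browser reports is a signal the site can act on.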
My experience is that headless browsers use about 100x more RAM, at least 10x more bandwidth, and 10x more processing power, and page loads take about 10x as long to finish (vs. curl). These numbers may even be a bit low; in some cases you need to add another zero to one or more of them.
There's also considerably more jank with headless browsers, since you typically want to reuse instances to avoid incurring the cost of spawning a new browser for each retrieval, and long-lived instances accumulate stale state and leaked memory.
Is it possible to pause a VM just after the browser has started up? Then map it as copy-on-write memory and spin up many VMs from that "image".
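You can get a feel for it at the process level with plain fork(), where the kernel does the same copy-on-write sharing (VM snapshots, e.g. QEMU's savevm/loadvm, are the heavier-weight version of the same idea). A rough Python sketch, assuming a Unix host, with a big buffer standing in for the warmed-up browser state:

    import os
    import time

    big_state = bytearray(200 * 1024 * 1024)  # stand-in for a started-up browser

    children = []
    for _ in range(8):
        pid = os.fork()
        if pid == 0:
            # Child: its memory stays shared with the parent (copy-on-write)
            # until it actually writes into big_state.
            time.sleep(1)  # stand-in for doing one retrieval
            os._exit(0)
        children.append(pid)

    for pid in children:
        os.waitpid(pid, 0)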
On the other hand, you need to get the basics right: match the headers, sometimes request irrelevant resources, handle malformed documents, catch changing form parameters, and other gotchas. Many would just copy the request from the browser console.
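A minimal sketch of those basics with Python's requests; the URL, header values, token name, and form fields are all illustrative:

    import re
    import requests

    session = requests.Session()
    # Match a plausible browser's headers; missing or mismatched headers
    # are an easy fingerprint for the target.
    session.headers.update({
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                      "AppleWebKit/537.36 (KHTML, like Gecko) "
                      "Chrome/120.0.0.0 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml",
        "Accept-Language": "en-US,en;q=0.9",
    })

    resp = session.get("https://example.com/login")
    # Catch the changing form parameter (e.g. a CSRF token) on every
    # fetch; a loose regex tolerates malformed markup that a strict
    # parser would choke on.
    m = re.search(r'name="csrf_token"\s+value="([^"]+)"', resp.text)
    token = m.group(1) if m else ""

    # A real browser would also pull page assets; requesting one or two
    # "irrelevant" resources can matter on targets that watch for that.
    session.get("https://example.com/static/app.css")

    session.post("https://example.com/login",
                 data={"user": "u", "pass": "p", "csrf_token": token})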
The rate of change in Chromium is also so high that it's hard to spot the addition of code targeting whatever you're doing on the client side.
So much more expensive and slow vs. just scraping the HTML. It's not hard to scrape raw HTML if the target is well-defined (like Google).