Comment by evertedsphere
5 days ago
In the past I've considered forking Chromium so every asset that it downloads (images, scripts, etc) is saved somewhere to produce a sort of "passive scraper".
This article made me consider creating a new CDP domain as a possible option, but tbf I haven't thought about this problem in ages so maybe there's something less stupid that I could do.
Ha, I've had the exact same thought before as well, but due to lack of experience and time constraints I ended up using mitmproxy with a small Python script instead. It was slow and buggy, but it served its purpose...
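For anyone curious, the mitmproxy approach needs surprisingly little code. A sketch of what such a script could look like (the filename `save_assets.py` and the `captured/` directory are my own made-up names, not from the original setup): mitmproxy calls a module-level `response()` hook for every completed HTTP response, so you can just dump each body to disk, named by a hash of its URL.

```python
# Hypothetical mitmproxy addon script, run with: mitmdump -s save_assets.py
# For every response that comes through the proxy, the body is written to
# captured/<sha256-of-url>, giving a crude "passive scraper".
import hashlib
from pathlib import Path

OUT_DIR = Path("captured")  # assumed output directory

def asset_path(url: str) -> Path:
    # Hash the URL so the filename is filesystem-safe and deterministic.
    return OUT_DIR / hashlib.sha256(url.encode()).hexdigest()

def response(flow):  # mitmproxy hook: called once per completed HTTP response
    if flow.response and flow.response.content:
        OUT_DIR.mkdir(exist_ok=True)
        asset_path(flow.request.url).write_bytes(flow.response.content)
```

You lose the original filenames this way; a real version would probably also persist a URL-to-hash index so the captures stay searchable.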
While searching for a tool I found several others asking for something similar, so I'm sure there are quite a few who would be interested in the project if you ever do decide to pick it up.
It's not quite the same, but in the past I've written scrapers (in Python) that run off of the browser cache, e.g. extracting recipes from web pages I had visited. The script would walk the cache and dispatch an appropriate scraper based on the URL. I think I also looked for JSON-LD and microdata.
The downsides were that it only worked with cached data, and I had to tweak it a couple of times when the browser changed the format of its cache keys.
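The JSON-LD part of that approach is simple enough to sketch with just the stdlib (this is my own illustration, not the original script; reading the cache itself and the per-site scraper dispatch are out of scope here): pull `application/ld+json` blocks out of a page and keep anything typed as a `Recipe`.

```python
# Hedged sketch: extract schema.org Recipe objects from JSON-LD script tags
# in an HTML page, using only the standard library.
import json
from html.parser import HTMLParser

class LdJsonExtractor(HTMLParser):
    """Collects the raw contents of <script type="application/ld+json"> tags."""

    def __init__(self):
        super().__init__()
        self._in_ldjson = False
        self.blocks = []  # raw JSON-LD payloads found in the page

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_ldjson = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_ldjson = False

    def handle_data(self, data):
        if self._in_ldjson:
            self.blocks.append(data)

def find_recipes(html: str) -> list:
    """Return all JSON-LD objects in `html` whose @type is "Recipe"."""
    parser = LdJsonExtractor()
    parser.feed(html)
    recipes = []
    for block in parser.blocks:
        try:
            data = json.loads(block)
        except json.JSONDecodeError:
            continue  # skip malformed JSON-LD blocks
        items = data if isinstance(data, list) else [data]
        recipes.extend(
            i for i in items if isinstance(i, dict) and i.get("@type") == "Recipe"
        )
    return recipes
```

A real version would also need to handle `@graph` wrappers and microdata, but this covers the common case of a single top-level JSON-LD object or list.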