Comment by ipv6ipv4

1 year ago

This can be done with small AWS Lambda scripts that periodically scrape (or call the API of) whatever sites you want and e-mail you the results. All the credentials needed to log in to those sites can be personal and dedicated to your instance (so no real API limits), and the usage will almost certainly fall within the AWS free tier since it's only for you.

The ideal install workflow would be a repo of AWS CloudFormation templates that automate installing the lambdas for different sites in your account. Anyone can open an AWS account, and using CloudFormation takes a few fields and a button click.

Also, if the scripts are developed properly, they are runnable locally. A sane developer will run them locally during development, and then test the deployed version before releasing.
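To make that concrete, here is a rough Python sketch of one such script, written so the same handler runs both in Lambda and on a laptop. Everything site-specific is an assumption for illustration: the `DIGEST_URL`, `DIGEST_FROM` and `DIGEST_TO` environment variables are made-up names, the page is assumed to put each item title in an `<h2>`, and mail is assumed to go out through SES with already verified addresses. Scheduling and the CloudFormation template are left out.

```python
import os
import urllib.request
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text of every <h2> element (hypothetical page layout)."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_h2 = False

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self._in_h2 = True
            self.titles.append("")

    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_h2 = False

    def handle_data(self, data):
        if self._in_h2:
            self.titles[-1] += data


def extract_titles(html):
    """Return the stripped, non-empty <h2> texts found in the page."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return [t.strip() for t in parser.titles if t.strip()]


def format_digest(titles):
    """Render the titles as a plain-text e-mail body."""
    return "\n".join(f"- {t}" for t in titles) or "(nothing new)"


def fetch(url):
    """Download a page; the User-Agent string is an arbitrary placeholder."""
    req = urllib.request.Request(url, headers={"User-Agent": "personal-digest/0.1"})
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")


def send_email(body):
    """Send the digest through SES (assumes verified sender and recipient)."""
    import boto3  # imported lazily so local runs work without boto3 installed

    boto3.client("ses").send_email(
        Source=os.environ["DIGEST_FROM"],
        Destination={"ToAddresses": [os.environ["DIGEST_TO"]]},
        Message={
            "Subject": {"Data": "Daily digest"},
            "Body": {"Text": {"Data": body}},
        },
    )


def handler(event, context, fetch=fetch, send=send_email):
    """Lambda entry point; fetch and send can be swapped out for local runs."""
    body = format_digest(extract_titles(fetch(os.environ["DIGEST_URL"])))
    send(body)
    return {"body": body}


if __name__ == "__main__":
    # Local run: canned HTML instead of the network, print instead of SES.
    sample = "<h2>First post</h2><p>text</p><h2> Second post </h2>"
    os.environ.setdefault("DIGEST_URL", "https://example.invalid/feed")
    handler({}, None, fetch=lambda url: sample, send=print)
```

Running the file directly exercises the parsing and formatting without touching AWS; once deployed, Lambda calls `handler(event, context)` with the real `fetch` and `send_email`.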

With an AWS IP and a bot usage pattern they’ll surely ban your account pretty quickly or put you in front of a CAPTCHA. I wish it was as easy as a small script. Without anti-bot techniques, sites would be overrun by scraping bots. Try to scrape a Cloudflare-protected site, for example. They’re really good at figuring out whether you’re a human or a bot. IIRC they even fingerprint your TLS handshake or cipher suite, which ultimately made me give up on headless Chrome and Puppeteer, even after proxying through my residential IP, spoofing the user-agent and screen size, and rate limiting. Unfortunately, there’s no way to distinguish good bots for personal usage from bad bots.

In theory, anything is possible with months of developer work. The trouble is, there are billions of people addicted to social media. There aren't many widespread solutions to scrape it. Whenever a scraper becomes even remotely popular, Facebook takes action against it, as accessing posts outside the walled garden is a violation of their terms of service. Currently, I am using a combination of Feedbro and Nitter to scrape all the accounts I want to follow. They currently work with Facebook and have not been blocked.

  • Yes.

    But there is no aggregation: each user runs their own instance. For any site that offers an API, the site would need to make breaking changes to the API to disable this, or block access from AWS.

    It's easy to make this work for a developer-like crowd (it takes very little time to write). It would work just fine for most developers, and could, with considerably more development time, be good for anyone.

    Distributed guerrilla social media deconstruction.