Comment by petercooper
7 months ago
I bet someone like Cloudflare could pull the dataset each day and serve up a plain text/Markdown version of Wikipedia for rounding error levels of spend. I just loaded a random Wikipedia page and it had a weight of 1.5MB in all for what I worked out would be about 30KB of Markdown (i.e. 50x less bandwidth).
Of course, the problem then is getting all these scrapers and bots to actually use the alternative, but Wikimedia could potentially redirect suspected clients in that direction.
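Roughly what I have in mind, as a hypothetical Worker sketch — the bot User-Agent list and the /markdown mirror path are made-up assumptions, not anything Wikimedia actually exposes:

```ts
// Hypothetical Cloudflare Worker: serve a pre-rendered Markdown mirror to
// suspected AI crawlers, pass everyone else through to the normal HTML.
// Both the UA hint list and the "/markdown/..." path are assumptions.
const BOT_UA_HINTS = ["GPTBot", "ClaudeBot", "CCBot", "Bytespider"];

export default {
  async fetch(request: Request): Promise<Response> {
    const ua = request.headers.get("User-Agent") ?? "";
    const looksLikeBot = BOT_UA_HINTS.some((hint) => ua.includes(hint));

    if (looksLikeBot) {
      // Rewrite the path to a (hypothetical) .md rendition of the same page.
      const url = new URL(request.url);
      url.pathname = "/markdown" + url.pathname + ".md";
      const md = await fetch(url.toString());
      if (md.ok) {
        return new Response(md.body, {
          headers: { "Content-Type": "text/markdown; charset=utf-8" },
        });
      }
      // No Markdown version available: fall through to the normal page.
    }
    return fetch(request);
  },
};
```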
Someone suggested to me applying a filter that serves .md or .txt to bots/AI scrapers instead of the regular website. Seems smart if it works, but I hate it when I get captchas, and this could end up similarly misdetecting non-bots as bots.
Maybe a "view full website" link loaded via JS so bots don't see it, idk.
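Something like this toy snippet — the /full route name is made up, the point is just that a scraper that never runs JS never sees the escape hatch:

```ts
// Inject a "view full website" link client-side so non-JS scrapers don't see it.
document.addEventListener("DOMContentLoaded", () => {
  const link = document.createElement("a");
  link.href = "/full" + window.location.pathname; // hypothetical full-HTML route
  link.textContent = "View full website";
  document.body.prepend(link);
});
```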
I would love to see most sites serve me markdown. I'd happily install a browser extension to mask me as an AI bot scraper if it means I can just get the text without all the noise.
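Roughly something like this MV3 background-script sketch — the UA string is just an example, and the manifest would still need the declarativeNetRequest permission plus host permissions:

```ts
// Sketch of an MV3 extension background script that rewrites the User-Agent
// header so top-level page loads look like an AI crawler. Example UA only.
chrome.declarativeNetRequest.updateDynamicRules({
  removeRuleIds: [1],
  addRules: [
    {
      id: 1,
      priority: 1,
      action: {
        type: "modifyHeaders",
        requestHeaders: [
          { header: "User-Agent", operation: "set", value: "GPTBot/1.1" },
        ],
      },
      condition: { urlFilter: "*", resourceTypes: ["main_frame"] },
    },
  ],
});
```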
Someone built a service for AI bots called pure.md. It's been a godsend for curling websites as Markdown, apart from the occasional case where it doesn't work first time, and it works great for occasional use with the free tier.
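If I remember right the usage is just prefixing the target URL, something like the sketch below — treat the exact URL pattern as my assumption and check their docs:

```ts
// Hypothetical example: fetch a page as Markdown via pure.md by prefixing the
// target URL. The exact URL scheme is an assumption -- see pure.md's docs.
const target = "https://en.wikipedia.org/wiki/Markdown";
const res = await fetch(`https://pure.md/${target}`);
console.log(await res.text()); // Markdown instead of the full HTML page
```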
I have good news. It (almost) exists; it's called Gemini [0].
- [0] https://geminiprotocol.net/
lol
me too tbh
Someone pointed out you can enable reader mode by default in Safari under Settings, but even then not all websites' pages are served as reader-mode-enabled pages.