Comment by petercooper
7 months ago
I bet someone like Cloudflare could pull the dataset each day and serve up a plain text/Markdown version of Wikipedia for rounding error levels of spend. I just loaded a random Wikipedia page and it had a weight of 1.5MB in all for what I worked out would be about 30KB of Markdown (i.e. 50x less bandwidth).
Of course, the problem then is getting all these scrapers and bots to actually use the alternative, but Wikimedia could potentially redirect suspected clients in that direction.
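Roughly what I have in mind, as a hypothetical Worker sketch — the bot User-Agent list and the /markdown mirror path are made-up assumptions, not anything Wikimedia actually exposes:

```ts
// Hypothetical Cloudflare Worker: serve a pre-rendered Markdown mirror to
// suspected AI crawlers, pass everyone else through to the normal HTML.
// Both the UA hint list and the "/markdown/..." path are assumptions.
const BOT_UA_HINTS = ["GPTBot", "ClaudeBot", "CCBot", "Bytespider"];

export default {
  async fetch(request: Request): Promise<Response> {
    const ua = request.headers.get("User-Agent") ?? "";
    const looksLikeBot = BOT_UA_HINTS.some((hint) => ua.includes(hint));

    if (looksLikeBot) {
      // Rewrite the path to a (hypothetical) .md rendition of the same page.
      const url = new URL(request.url);
      url.pathname = "/markdown" + url.pathname + ".md";
      const md = await fetch(url.toString());
      if (md.ok) {
        return new Response(md.body, {
          headers: { "Content-Type": "text/markdown; charset=utf-8" },
        });
      }
      // No Markdown version available: fall through to the normal page.
    }
    return fetch(request);
  },
};
```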
Someone suggested to me applying a filter that serves .md or .txt to bots/AI scrapers instead of the regular website. Seems smart if it works, but I hate it when I get captchas, and this could end up similarly misdetecting non-bots as bots.
Maybe a "view full website" link loaded via JS so bots don't see it, idk.
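Something like this toy snippet — the /full route name is made up, the point is just that a scraper that never runs JS never sees the escape hatch:

```ts
// Inject a "view full website" link client-side so non-JS scrapers don't see it.
document.addEventListener("DOMContentLoaded", () => {
  const link = document.createElement("a");
  link.href = "/full" + window.location.pathname; // hypothetical full-HTML route
  link.textContent = "View full website";
  document.body.prepend(link);
});
```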
I would love to see most sites serve me markdown. I'd happily install a browser extension to mask me as an AI bot scraper if it means I can just get the text without all the noise.
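Roughly something like this MV3 background-script sketch — the UA string is just an example, and the manifest would still need the declarativeNetRequest permission plus host permissions:

```ts
// Sketch of an MV3 extension background script that rewrites the User-Agent
// header so top-level page loads look like an AI crawler. Example UA only.
chrome.declarativeNetRequest.updateDynamicRules({
  removeRuleIds: [1],
  addRules: [
    {
      id: 1,
      priority: 1,
      action: {
        type: "modifyHeaders",
        requestHeaders: [
          { header: "User-Agent", operation: "set", value: "GPTBot/1.1" },
        ],
      },
      condition: { urlFilter: "*", resourceTypes: ["main_frame"] },
    },
  ],
});
```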
Someone built a service for AI bots called pure.md. It's been a godsend for curling websites as Markdown, apart from the occasional case where it doesn't work first time, and it works great for occasional use with the free tier.
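If I remember right the usage is just prefixing the target URL, something like the sketch below — treat the exact URL pattern as my assumption and check their docs:

```ts
// Hypothetical example: fetch a page as Markdown via pure.md by prefixing the
// target URL. The exact URL scheme is an assumption -- see pure.md's docs.
const target = "https://en.wikipedia.org/wiki/Markdown";
const res = await fetch(`https://pure.md/${target}`);
console.log(await res.text()); // Markdown instead of the full HTML page
```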
I have good news. It (almost) exists; it's called Gemini [0].
- [0] https://geminiprotocol.net/
lol
me too tbh
Someone pointed out you can enable reader mode by default in Safari under Settings, but even then not all websites' pages are served as reader-mode-enabled pages.