Comment by sunshine-o

7 hours ago

Nice, I understand it is similar to ArchiveBox + its web extension.

Now, to be honest, while archiving pages straight from your browser view is optimal, from a security point of view I am not sure I want a random web extension having access to everything I see.

I would rather have a local proxy doing it. Maybe something like the Internet Archive's warcprox [0]. I haven't tried it yet.

- [0] https://github.com/internetarchive/warcprox
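For what it's worth, wiring a client through a local archiving proxy like warcprox needs no extension at all; a minimal stdlib sketch is below. The port 8000 is warcprox's usual default and is an assumption here; since warcprox is a MITM proxy, HTTPS capture additionally requires trusting its generated CA certificate on the client side.

```python
import urllib.request

def opener_via_proxy(proxy: str = "http://localhost:8000") -> urllib.request.OpenerDirector:
    """Build a urllib opener that routes HTTP and HTTPS through a local
    archiving proxy (e.g. warcprox on its assumed default port 8000)."""
    handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    return urllib.request.build_opener(handler)

# Usage (uncomment once warcprox is actually running):
# html = opener_via_proxy().open("http://example.com/").read()
```

A browser would be pointed at the same address via its proxy settings; the point is that everything flows through one local chokepoint instead of an extension in every page.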

For a short time I had warcprox sitting behind my Firefox, auto-feeding its output to pywb. It seemed to work, but connections started failing randomly once warcprox had been running for more than a few hours to days. I am not sure whether the issue was with pywb or warcprox, but some URLs I did browse in Firefox were missing from the archive, and many dynamic pages couldn't be replayed at all.

  • I am not surprised...

    I am unfamiliar with web caching proxies like squid [0], but I wonder whether that might be the most straightforward way to do this.

    So run squid, then have a batch job that goes through /var/spool/squid every day and updates your web archive according to some defined filters.

    - [0] https://www.squid-cache.org/