Comment by QuantumNomad_
14 hours ago
It is not uncommon for enterprises to intercept HTTPS for inspection and logging. They may or may not also do caching of responses at the point where HTTPS is intercepted.
I previously experimented a bit with Squid Cache on my home network for web archival purposes, and set it up to intercept HTTPS. I then added the proxy's CA certificate to the trust store on my client, and was able to intercept and cache HTTPS responses.
In the end, Squid Cache turned out to be too inflexible for my goal of making sure that the browsed data would be stored forever.
This Christmas I have been playing with mitmproxy instead. I had previously used mitmproxy for some debugging, and have now found that I might be able to use it for archival by adding a custom extension written in Python.
It’s working well so far. As I browse HTTPS pages in Firefox, the extension persists URLs and timestamps in SQLite and writes the request and response headers plus the response body to disk.
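For anyone curious what such an extension might look like, here is a minimal sketch, not my actual code. The class name `Archiver`, the `archive` directory, the table layout and the content-addressed body files are all assumptions made for illustration. The only mitmproxy-specific parts are the `response` hook and the `flow` attributes it reads (`pretty_url`, `headers`, `raw_content`, `status_code`).

```python
import hashlib
import json
import sqlite3
import time
from pathlib import Path


class Archiver:
    """Hypothetical addon: URL/timestamp index in SQLite, payloads on disk."""

    def __init__(self, root="archive"):
        # "archive" is an assumed location, not something mitmproxy defines.
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)
        self.db = sqlite3.connect(self.root / "index.sqlite3")
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS captures ("
            "id INTEGER PRIMARY KEY, url TEXT NOT NULL, "
            "fetched_at REAL NOT NULL, status INTEGER, body_sha256 TEXT)"
        )
        self.db.commit()

    def response(self, flow):
        # mitmproxy calls this hook once a complete response has been read.
        # raw_content is the body as transferred; the saved headers keep
        # Content-Encoding, so it can be decoded later if needed.
        body = flow.response.raw_content or b""
        digest = hashlib.sha256(body).hexdigest()

        # Assumed layout: one file per distinct body, named by its hash,
        # so identical responses are only written once.
        body_path = self.root / "bodies" / digest
        body_path.parent.mkdir(exist_ok=True)
        if not body_path.exists():
            body_path.write_bytes(body)

        cur = self.db.execute(
            "INSERT INTO captures (url, fetched_at, status, body_sha256) "
            "VALUES (?, ?, ?, ?)",
            (flow.request.pretty_url, time.time(),
             flow.response.status_code, digest),
        )
        self.db.commit()

        # Request and response headers go next to the body as JSON,
        # keyed by the row id of the capture.
        meta = {
            "request_headers": list(flow.request.headers.items()),
            "response_headers": list(flow.response.headers.items()),
        }
        meta_path = self.root / "headers" / f"{cur.lastrowid}.json"
        meta_path.parent.mkdir(exist_ok=True)
        meta_path.write_text(json.dumps(meta, indent=2))
```

To load something like this, the file would end with `addons = [Archiver()]` and be passed to mitmproxy with `mitmdump -s archiver.py` (the filename is again just an example). Large streamed responses such as video may need different handling than this sketch provides.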
My main focus at the moment is archiving some video courses that I paid for in the past, so that even if the site I bought the courses from ceases operation I will still have those video courses. After I finish archiving the video courses, I will proceed to archiving other digital things I’ve bought, like VST plugins, sample packs, 3D assets etc.
After that I will take another shot at archiving all the random pages on the open web that I’ve bookmarked etc.
For me, archiving things by using an intercepting proxy is the best way. I have various manually organised copies of files from all over the place, both paid stuff and openly accessible things. But having a sort of Internet Archive of my own with all of the associated pages where I bought things and all the JS and CSS and images surrounding things is the dream. And at the moment it seems to be working pretty well with this mitmproxy + custom Python extension setup.
I am also aware of various existing web scrapers and self-hosted internet archival systems, and have tried a few of them. But for me, the system I am building is the ideal.