← Back to context

Comment by gwern

11 days ago

> Does this verify and/or rewrite the SRI integrity hashes when it inlines resources?

As far as I know, we do not do any hash verification beyond what's built into TCP/IP or HTTPS etc. I included SHA hashes just to be safe and forward-compatible, but they are not checked.
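For reference, an SRI integrity value is just a base64-encoded SHA digest of the resource bytes, so checking one at inline time is cheap; here's a minimal sketch in Python (the `sri_hash`/`verify` helper names are mine, not anything in the tool):

```python
import base64
import hashlib

def sri_hash(data: bytes, algo: str = "sha384") -> str:
    """Compute an SRI-style integrity value, e.g. 'sha384-...', for a resource."""
    digest = hashlib.new(algo, data).digest()
    return f"{algo}-{base64.b64encode(digest).decode('ascii')}"

def verify(data: bytes, expected: str) -> bool:
    """Recompute the hash with the algorithm named in the stored value and compare."""
    algo = expected.split("-", 1)[0]
    return sri_hash(data, algo) == expected

resource = b"console.log('hello');"
h = sri_hash(resource)
assert verify(resource, h)
assert not verify(resource + b"\n", h)  # any bit flip fails the check
```

So storing the hashes unchecked still leaves the door open to verifying them later, which is what "forward compatible" buys you.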

There's something of a question of what hashes are buying you here and what the threat model is. In terms of archiving, we're often dealing with half-broken web pages (any of whose contents may themselves be broken) which may have gone through a chain of a dozen owners, where we have no possible web of trust to the original creator (assuming there even is one in any meaningful sense), and where our major failure modes tend to be total file loss or partial corruption somewhere during storage. A random JPG flipping a bit during the HTTPS range-request download from the most recent server is, in many ways, the least of our problems in terms of availability and integrity.

This is why I spent a lot more time thinking about how to build FEC in, like with appending PAR2. I'm vastly more concerned about files being corrupted during storage or the chain of transmission or damaged by a server rewriting stuff, and how to recover from that instead of simply saying 'at least one bit changed somewhere along the way; good luck!'. If your connection is flaky and a JPEG doesn't look right, refresh the page. If the only Gwtar of a page that disappeared 20 years ago is missing half a file because a disk sector went bad in a hobbyist's PC 3 mirrors ago, you're SOL without FEC. (And even if you can find another good mirror... Where's your hash for that?)
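The difference between detection and recovery is easy to illustrate. PAR2 uses Reed-Solomon codes, but even the simplest FEC scheme, a single XOR parity block, shows the idea: a hash can only tell you a block is gone, while the parity block lets you rebuild it. A toy sketch (not how PAR2 actually encodes anything):

```python
from functools import reduce

def xor_parity(blocks: list[bytes]) -> bytes:
    """One parity block: the XOR of all data blocks (all the same length)."""
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

def recover(blocks: list, parity: bytes) -> list:
    """Rebuild a single missing block (None) by XORing parity with the survivors."""
    missing = [i for i, b in enumerate(blocks) if b is None]
    assert len(missing) == 1, "one XOR parity block can only repair one lost block"
    survivors = [b for b in blocks if b is not None]
    blocks[missing[0]] = xor_parity(survivors + [parity])
    return blocks

data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_parity(data)            # computed once, appended to the archive
damaged = [b"AAAA", None, b"CCCC"]   # a bad sector wipes block 1, mirrors long gone
assert recover(damaged, parity) == data
```

With a bare hash, the damaged archive above is simply dead; with the appended parity, it repairs itself with no reference to any other copy. That's the asymmetry that made FEC worth far more design time than hash verification.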

> Would W3C Web Bundles and HTTP SXG Signed Exchanges solve for this use case?

No idea. It sounds like you know more about them than I do. What threat do they protect against, exactly?

The IETF spec lists a number of justifying use cases. SXG was rejected at the time for a number of reasons, IIUC.

Browsers check SRI integrity hashes if they're present.

There's HTTP-in-RDF, and the Memento protocol. VCR.py and similar tools can replay HTTP sessions, but SSL socket patching, the TLS cookie, or adding a cert for e.g. an archiving HTTPS proxy is necessary.

Browser DevTools can export HAR (HTTP Archive) files.
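A HAR export is plain JSON, so pulling the captured resources back out for archival is straightforward; a minimal sketch (the `resources` helper and the inlined example entry are mine, modeled on the HAR 1.2 `log`/`entries` layout):

```python
import json

# Minimal HAR 1.2 structure as exported by browser DevTools; a real export
# would be loaded with json.load(open("page.har")) instead of inlined here.
har = {
    "log": {
        "version": "1.2",
        "entries": [
            {
                "request": {"method": "GET", "url": "https://example.com/"},
                "response": {
                    "status": 200,
                    "content": {"mimeType": "text/html", "text": "<html>...</html>"},
                },
            }
        ],
    }
}

def resources(har: dict) -> list:
    """Return (url, mimeType, body) for each captured response in a HAR log."""
    out = []
    for entry in har["log"]["entries"]:
        content = entry["response"].get("content", {})
        out.append((entry["request"]["url"],
                    content.get("mimeType", ""),
                    content.get("text", "")))
    return out

assert resources(har)[0][0] == "https://example.com/"
```

The catch for archiving is that DevTools only captures what that one browsing session happened to fetch, and large binary bodies may be omitted or base64-encoded depending on the browser.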

If all of the resource origins are changed to one hostname for archival, doesn't that bypass same-origin controls on JS and cookies, such that the archived page runs all the scripts in the same origin the archive is served from? Also, browsers restrict even inline JS scripts when pages are served from file:/// URLs.

FWIU, Web Bundles and SXG were intended to preserve the unique origins of resources in order to archive pages safely and faithfully for interactive offline review.