Comment by makeitdouble
15 hours ago
Purely anecdotal, but I hoard a lot of personal documents (shopping receipts, confirmation emails, scans etc.) and for stuff I saved only 10 years ago, the toughest to reopen are the pure text files.
You rightly mention Unicode, as before that there was a jungle of formats. I have some files in UTF-16, some in SJIS, a ton in EUC, others were already UTF-8, and many don't have a BOM. I could try each encoding and see what works for each file (except on mobile... it's just a PITA to deal with that there).
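For what it's worth, the "try each encoding" approach can be sketched in a few lines of Python. This is only a rough sketch under my own assumptions: the candidate list and the name `plausible_encodings` are made up for illustration, and a clean decode only means the bytes are valid in that encoding, not that the result is the right text.

```python
import codecs

# Candidate encodings, strictest first. The list is an assumption based on
# the encodings named above; adjust it for your own archive. latin-1 is left
# out on purpose: it accepts any byte sequence, so it would always "match".
CANDIDATES = ["utf-8", "iso2022_jp", "euc_jp", "shift_jis", "utf-16"]

# Byte order marks that settle the question without guessing.
BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def plausible_encodings(data: bytes, candidates=CANDIDATES) -> list:
    """Return the candidate encodings that decode `data` without errors."""
    # A BOM, when present, is taken at face value.
    for bom, name in BOMS:
        if data.startswith(bom):
            return [name]
    matches = []
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue  # invalid byte sequence for this encoding
        matches.append(name)
    return matches
```

Several encodings will often decode the same file without errors (short or ASCII-heavy files especially), so the output is a shortlist to eyeball rather than an answer.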
But in comparison there's a set of files I have never had issues opening: PDFs and JPEGs. All the files that my scanner produced are still readable absolutely everywhere. Even with slight bitrot they're readable, and with current OCR tools I could probably turn it all back into text if ever needed.
If I had to archive more stuff now and could afford the space, I'd go for an image format without hesitation.
PS: I'm surprised you don't mention the Unicode character limitations for minority languages or academic use. There will still be characters that either can't be represented, or don't have an exact 1 to 1 match between the code point and the representation.
> stuff I saved only 10 years ago
There have been a lot of practical options for using Unicode over the last three decades. To name just a few: Unicode has been around since 1991. Windows NT supported 16-bit Unicode (UCS-2, the predecessor of UTF-16) in 1993. XML (1998) was specified in terms of Unicode code points. ...
As with many standards, the question is less what's available/supported and more which format is actually used irl.
Half the mail I received from that period was in iso-2022 (a JIS variant), and most of the rest was latin-1. I have an auto-generated mail from Google Plus(!) from 2015 in iso-2022-jp; I actually wonder when Google decided it was safe to fully move to utf-8.