Comment by makeitdouble

1 month ago

Purely anecdotal, but I hoard a lot of personal documents (shopping receipts, confirmation emails, scans etc.) and for stuff I saved only 10 years ago, the toughest to reopen are the pure text files.

You rightly mention Unicode, as before that there was a jungle of formats. I have some in UTF-16, some in SJIS, a ton in EUC, other were already utf-8, many don't have a BOM. I could try each encoding and see what works for each of the files (except on mobile...it's just a PITA to deal with that on mobile).

But in comparison there's a set of file I never had issues opening now and then: PDFs and jpegs. All the files that my scanner produced are still readable absolutely everywhere. Even with slight bitrot they're readable, and with the current OCR processes I could probably put it all back in text if ever needed.

If I had to archive more stuff now and can afford the space, I'd go for an image format without hesitation.

PS: I'm surprised you don't mention the Unicode character limitations for minority languages or academic use. There will still be characters that either can't be represented, or don't have an exact 1 to 1 match between the code point and the representation.

4 comments

makeitdouble

mcswell 1 month ago

BOM is normally used with UTF-16, not with UTF-8 (both of which, along with UTF-32, are encodings of Unicode).

I've worked with lots of minority languages in academic situations, but I've never run into anything that couldn't be encoded in Unicode. There's a procedure for adding characters (or blocks of characters) for characters or character sets that aren't already included. There are fewer and fewer of those. The main requirement is documentation.

makeitdouble 1 month ago

Thanks!
On adding new characters to Unicode, as for any commitee there will be rejection and cases where going through the whole process is cumbersome/not worth it.
It's more commonly discussed in the CJK circles, it reminded me of the Wikipedia entry (unsurprisingly with no English equivalent)
https://ja.wikipedia.org/wiki/Wikipedia:%E8%A1%A8%E7%A4%BA%E...
> minority languages
More archaic that minority, but one language I had in mind was one using color coded strings and knots representation. There are latin alphabet mappings, so as long as we trust the translation record keeping per se works in Unicode, but if one wanted to keep the exact original writing it would obviously not work out in plain text. I imagined it's not an isolated instance, but I'm also way out of my depth on this one
https://en.wikipedia.org/wiki/Quipu

Archelaos 1 month ago

> stuff I saved only 10 years ago

There have been a lot of practical options around in the last three decades for using Unicode. To name just a few: Unicode is around since 1991. UTF-16 was supported in Windows NT in 1993. XML (1998) was specified based on Unicode code points. ...

makeitdouble 1 month ago

As for many standards, the question is less what's available/supported and more what's the format actually used irl.
Half the mail I received from that period was in iso-2022 (a JIS variant), most of the rest was latin-1. I have an auto-generated mail from google plus(!) from 2015 in iso-2022-jp, I actually wonder when Google decided it was safe to fully move to utf-8.