Comment by mcswell

1 month ago

gnabgib points out that this same article has been posted for comment here three other times since it was written. That said, AFAICT no one has commented in any of those threads on what I'm about to say, so hopefully this will be new.

I'm a linguist, and I've worked on endangered languages and on minority languages (many of which will some day become endangered, in the sense of having no native speakers). The advantage of plain text (Unicode) formats for documenting such languages (as opposed to binary formats like Word's old .doc, or databases, or even PDFs) is that text formats are the only thing that will stand the test of time. The article by Steven Bird and Gary Simons, "Seven Dimensions of Portability for Language Documentation and Description", was the seminal paper on this topic, published in 2002. I've given later conference talks on the topic, pointing out that we can still read grammars of Greek and Latin (and Sanskrit) written thousands of years ago. And while the group I led published our grammars on paper via PDF, we wrote and archived them as XML documents, which (along with JSON) are probably as durable a structured format as you can get. I'm hoping that 2000 years from now, someone will find these documents both readable and valuable.

There is of course no replacement for some binary format when it comes to audio.

(By "binary" format I mean file formats that are not sequential and readily interpretable, whereas text files are interpretable once you know the encoding.)

Purely anecdotal, but I hoard a lot of personal documents (shopping receipts, confirmation emails, scans, etc.), and among the stuff I saved only 10 years ago, the toughest files to reopen are the pure text files.

You rightly mention Unicode, because before it there was a jungle of encodings. I have some files in UTF-16, some in SJIS, a ton in EUC, others already in UTF-8, and many without a BOM. I could try each encoding and see what works for each file (except on mobile... it's just a PITA to deal with that on mobile).
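For what it's worth, on a desktop that trial-and-error is easy to script. A minimal Python sketch of what I mean (the file name and the candidate list are just placeholders, roughly matching the mix above):

    from pathlib import Path

    # Candidate encodings, roughly the mix described above. Order matters:
    # "utf-8-sig" strips a UTF-8 BOM if one is present, and "latin-1" must
    # come last because it never fails (every byte is valid Latin-1).
    CANDIDATES = ["utf-8-sig", "utf-16", "euc_jp", "shift_jis", "latin-1"]

    def guess_decode(path):
        raw = Path(path).read_bytes()
        for enc in CANDIDATES:
            try:
                return enc, raw.decode(enc)
            except UnicodeDecodeError:
                continue
        return None, None  # nothing decoded cleanly

    enc, text = guess_decode("old_note.txt")  # hypothetical file
    print(enc or "no candidate worked")

The ordering matters because a wrong encoding can still decode "successfully" into mojibake; for smarter statistical guessing there are libraries like chardet.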

But in comparison, there's a set of files I've never had issues opening, then or now: PDFs and JPEGs. All the files my scanner produced are still readable absolutely everywhere. Even with slight bitrot they're readable, and with current OCR processes I could probably turn them all back into text if ever needed.

If I had to archive more stuff now and could afford the space, I'd go for an image format without hesitation.

PS: I'm surprised you don't mention Unicode's character limitations for minority languages or academic use. There will still be characters that either can't be represented or don't have an exact one-to-one match between a code point and its representation.

  • A BOM is normally used with UTF-16, not with UTF-8 (both of which, along with UTF-32, are encodings of Unicode); the byte signatures are sketched below.

    I've worked with lots of minority languages in academic settings, but I've never run into anything that couldn't be encoded in Unicode. There's a procedure for adding characters (or blocks of characters) that aren't already included, and there are fewer and fewer of those. The main requirement is documentation.
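    To make the BOM point concrete, here's a minimal sketch of the signatures (Python again; the function name is just illustrative):

        # BOM byte signatures. UTF-8 has one too, but it's optional and
        # uncommon; UTF-16 files are where you normally see a BOM. Check
        # the 4-byte UTF-32 marks first: the UTF-32-LE BOM starts with
        # the same FF FE bytes as UTF-16-LE.
        BOMS = [
            (b"\x00\x00\xfe\xff", "utf-32-be"),
            (b"\xff\xfe\x00\x00", "utf-32-le"),
            (b"\xef\xbb\xbf", "utf-8"),
            (b"\xfe\xff", "utf-16-be"),
            (b"\xff\xfe", "utf-16-le"),
        ]

        def sniff_bom(raw: bytes):
            for bom, enc in BOMS:
                if raw.startswith(bom):
                    return enc
            return None  # no BOM -- the norm for UTF-8 in the wild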

    • Thanks!

      On adding new characters to Unicode: as with any committee, there will be rejections, and cases where going through the whole process is cumbersome or not worth it.

      It's more commonly discussed in CJK circles; it reminded me of this Japanese Wikipedia entry (unsurprisingly, with no English equivalent):

      https://ja.wikipedia.org/wiki/Wikipedia:%E8%A1%A8%E7%A4%BA%E...

      > minority languages

      More archaic than minority, but one language I had in mind used a representation of color-coded strings and knots. There are Latin-alphabet mappings, so as long as we trust the translation, the record keeping per se works in Unicode; but if one wanted to keep the exact original writing, it would obviously not work out in plain text. I imagine it's not an isolated case, but I'm also way out of my depth on this one.

      https://en.wikipedia.org/wiki/Quipu

  • > stuff I saved only 10 years ago

    There have been a lot of practical options for using Unicode over the last three decades. To name just a few: Unicode has been around since 1991. Windows NT supported Unicode (initially UCS-2, later UTF-16) starting in 1993. XML (1998) was specified in terms of Unicode code points. ...

    • As with many standards, the question is less what's available or supported and more what format is actually used IRL.

      Half the mail I received from that period was in ISO-2022 (a JIS variant), and most of the rest was Latin-1. I have an auto-generated mail from Google Plus(!) from 2015 in ISO-2022-JP; I actually wonder when Google decided it was safe to move fully to UTF-8.

This is all true, but I think you're too focused on your own area. If we found musical notation from an ancient civilization that we could interpret correctly, would that be "text" or "binary"? I think it's a false choice.

Similarly, a cave painting expresses what its painter intended better than any textual description of it could.