Comment by syncsynchalt
6 hours ago
I've written some Unicode transcoders; UTF-8 decoding devolves to a quartet of switch statements, and each of the issues you've mentioned ends up being a case statement whose solution is to replace the offending sequence with U+FFFD.
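Roughly what I mean, as a minimal sketch in C (not my actual transcoder code — condensed to a single switch on the lead byte's high nibble rather than the full quartet, with truncation handling simplified to illustrate the U+FFFD pattern):

    #include <stdint.h>
    #include <stdio.h>

    #define REPLACEMENT 0xFFFDu

    /* Decode one code point from s (len > 0 bytes available); returns
     * bytes consumed and writes the code point (or U+FFFD) to *out. */
    static size_t utf8_decode(const uint8_t *s, size_t len, uint32_t *out)
    {
        static const uint32_t min_cp[4] = { 0, 0x80, 0x800, 0x10000 };
        uint32_t cp;
        size_t need, i;

        switch (s[0] >> 4) {       /* classify by lead byte's high nibble */
        case 0x0: case 0x1: case 0x2: case 0x3:
        case 0x4: case 0x5: case 0x6: case 0x7:
            *out = s[0]; return 1;                          /* ASCII */
        case 0xC: case 0xD: cp = s[0] & 0x1F; need = 1; break;
        case 0xE:           cp = s[0] & 0x0F; need = 2; break;
        case 0xF:
            if (s[0] <= 0xF4) { cp = s[0] & 0x07; need = 3; break; }
            /* 0xF5..0xFF can't start a sequence: fall through */
        default:              /* stray continuation byte, etc. */
            *out = REPLACEMENT; return 1;
        }

        if (need >= len) { *out = REPLACEMENT; return 1; }  /* truncated */
        for (i = 1; i <= need; i++) {
            if ((s[i] & 0xC0) != 0x80) {     /* bad continuation byte */
                *out = REPLACEMENT; return i;
            }
            cp = (cp << 6) | (s[i] & 0x3Fu);
        }

        /* reject overlongs, surrogates, and anything past U+10FFFF */
        if (cp < min_cp[need] || (cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF)
            cp = REPLACEMENT;
        *out = cp;
        return need + 1;
    }

    int main(void)
    {
        const uint8_t in[] = { 0x41, 0xC3, 0xA9, 0xFF, 0xF0, 0x9F, 0x98, 0x80 };
        size_t i = 0;
        uint32_t cp;
        while (i < sizeof in) {
            i += utf8_decode(in + i, sizeof in - i, &cp);
            printf("U+%04X\n", (unsigned)cp);  /* 0041, 00E9, FFFD, 1F600 */
        }
        return 0;
    }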
UTF-16 is simple as well but you still need code to absorb BOMs, perform endian detection heuristically if there's no BOM, and check surrogate ordering (and emit a U+FFFD when an illegal pair is found).
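A comparable sketch for the UTF-16 side — the BOM-less endianness heuristic here (sniffing for a 0x00 byte in the first unit, since most text leads with ASCII-range characters) is just one illustrative choice among several:

    #include <stdint.h>
    #include <stdio.h>

    #define REPLACEMENT 0xFFFDu

    static uint16_t read_u16(const uint8_t *p, int big_endian)
    {
        return big_endian ? (uint16_t)(p[0] << 8 | p[1])
                          : (uint16_t)(p[1] << 8 | p[0]);
    }

    static void utf16_decode(const uint8_t *s, size_t len)
    {
        int be = 1;     /* the spec says BOM-less UTF-16 is big-endian */
        size_t i = 0;

        /* absorb the BOM, or fall back to a heuristic without one */
        if (len >= 2 && s[0] == 0xFE && s[1] == 0xFF)      { be = 1; i = 2; }
        else if (len >= 2 && s[0] == 0xFF && s[1] == 0xFE) { be = 0; i = 2; }
        else if (len >= 2 && s[0] == 0x00)                 { be = 1; }
        else if (len >= 2 && s[1] == 0x00)                 { be = 0; }

        /* a trailing odd byte would also rate a U+FFFD in real code */
        while (i + 1 < len) {
            uint16_t u = read_u16(s + i, be);
            i += 2;
            if (u >= 0xD800 && u <= 0xDBFF) {            /* high surrogate */
                if (i + 1 < len) {
                    uint16_t lo = read_u16(s + i, be);
                    if (lo >= 0xDC00 && lo <= 0xDFFF) {  /* well-ordered pair */
                        i += 2;
                        printf("U+%04X\n", (unsigned)
                               (0x10000 + ((u - 0xD800) << 10) + (lo - 0xDC00)));
                        continue;
                    }
                }
                printf("U+%04X\n", REPLACEMENT);  /* lone high surrogate */
            } else if (u >= 0xDC00 && u <= 0xDFFF) {
                printf("U+%04X\n", REPLACEMENT);  /* lone low surrogate */
            } else {
                printf("U+%04X\n", u);
            }
        }
    }

    int main(void)
    {
        /* BE BOM, 'A', a pair for U+1F600, then a lone high surrogate */
        const uint8_t in[] = { 0xFE,0xFF, 0x00,0x41,
                               0xD8,0x3D, 0xDE,0x00, 0xD8,0x00 };
        utf16_decode(in, sizeof in);    /* U+0041, U+1F600, U+FFFD */
        return 0;
    }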
I don't think there's an argument for either being complex; the UTFs are meant to be as simple and algorithmic as possible. -8 has to deal with invalid sequences, -16 has to deal with byte ordering, and other than that it's bit shifting akin to base64. Normalization is much worse by comparison.
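To show what I mean by the bit shifting, the whole UTF-8 encode direction fits in one small function — masking off 6-bit groups, much like base64's 6-bit slicing (utf8_encode is a made-up name; input assumed to be a valid scalar value):

    #include <stdint.h>
    #include <stdio.h>

    /* Encode cp (assumed valid: <= U+10FFFF, not a surrogate) into buf;
     * returns the number of bytes written. */
    static int utf8_encode(uint32_t cp, uint8_t buf[4])
    {
        if (cp < 0x80) {                            /* 7 bits, as-is */
            buf[0] = (uint8_t)cp;
            return 1;
        } else if (cp < 0x800) {                    /* 5 + 6 bits */
            buf[0] = (uint8_t)(0xC0 | (cp >> 6));
            buf[1] = (uint8_t)(0x80 | (cp & 0x3F));
            return 2;
        } else if (cp < 0x10000) {                  /* 4 + 6 + 6 bits */
            buf[0] = (uint8_t)(0xE0 | (cp >> 12));
            buf[1] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
            buf[2] = (uint8_t)(0x80 | (cp & 0x3F));
            return 3;
        } else {                                    /* 3 + 6 + 6 + 6 bits */
            buf[0] = (uint8_t)(0xF0 | (cp >> 18));
            buf[1] = (uint8_t)(0x80 | ((cp >> 12) & 0x3F));
            buf[2] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
            buf[3] = (uint8_t)(0x80 | (cp & 0x3F));
            return 4;
        }
    }

    int main(void)
    {
        uint8_t buf[4];
        int n = utf8_encode(0x1F600, buf), i;
        for (i = 0; i < n; i++) printf("%02X ", buf[i]);
        putchar('\n');                  /* prints: F0 9F 98 80 */
        return 0;
    }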
My preference for UTF-8 isn't one of code complexity; I just like that all my '70s-era text processing tools continue working without too many surprises. Features like self-synchronization are nice too, compared to what we _could_ have gotten as UTF-8.